pandas:如何将Dataframe插入clickhouse

cbeh67ev  于 2021-07-15  发布在  ClickHouse
关注(0)|答案(4)|浏览(3318)

我正在尝试将Pandas数据框插入clickhouse。
这是我的密码

import pandas
import sqlalchemy as sa

uri = 'clickhouse://default:@localhost/default'
ch_db = sa.create_engine(uri)

pdf = pandas.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
])

pdf.to_sql('test_humans', ch_db, if_exists='append', index=False)

这就是我收到的错误。这是否与发动机缺少的额外参数有关?我怎样才能解决这个问题?
异常:代码:62,e.displaytext()=db::异常:语法错误:位置65处失败(第7行,第2列):格式tabseparatedwithnamesandtypes。应为以下之一:引擎、存储定义(版本19.15.2.2(正式构建))
登录中
info:sqlalchemy.engine.base.engine:exists table test\u人类info:sqlalchemy.engine.base.engine:{} info:sqlalchemy.engine.base.engine:create table test\u humans(first\u name text,year bigint)
info:sqlalchemy.engine.base.engine:{} info:sqlalchemy.engine.base.engine:回滚

ctrmrzij

ctrmrzij1#

你也可以使用https://github.com/kszucs/pandahouse 插入Dataframe而无需额外转换。
pip安装pandahouse

import pandahouse as ph

pdf = pandas.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
])

connection = dict(database='default',
                  host='localhost',
                  user='default',
                  password='')

ph.to_clickhouse(pdf, 'test_humans', index=False, chunksize=100000, connection=connection)
r6l8ljro

r6l8ljro2#

你可以不用担心 sqlalchemy .
pip安装clickhouse驱动程序

from clickhouse_driver import Client

client = Client('localhost')
df = pandas.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
])

# df processing blablabla...

client.execute("INSERT INTO your_table VALUES", df.to_dict('records'))

希望这有帮助。

luaexgnf

luaexgnf3#

sqlalchemy clickhouse无法自动创建表(至少在版本0.1.5.post0(含0.1.5.post0)之前),并将创建表的任何sql查询解释为始终以格式tabseparatedwithnamesandtypes子句结尾的select查询。
要解决此问题,需要使用infi.clickhouse\u orm-api手动创建表(此模块随sqlalchemy clickhouse一起提供):

import pandas as pd
from infi.clickhouse_orm.engines import Memory
from infi.clickhouse_orm.fields import UInt16Field, StringField
from infi.clickhouse_orm.models import Model
from sqlalchemy import create_engine

# define the ClickHouse table schema

class Test_Humans(Model):
    year = UInt16Field()
    first_name = StringField()
    engine = Memory()

engine = create_engine('clickhouse://default:@localhost/test')

# create table manually

with engine.connect() as conn:
    conn.connection.create_table(Test_Humans) # https://github.com/Infinidat/infi.clickhouse_orm/blob/master/src/infi/clickhouse_orm/database.py#L142

pdf = pd.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
    # ! sqlalchemy-clickhouse ignores the last item so add fake one
    {}
])

pdf.to_sql('test_humans', engine, if_exists='append', index=False)

考虑到sqlalchemy clickhouse忽略了最后一项,所以添加一个伪项(参见源代码和相关的第10期)。
让我们在db中检查此表:

SELECT *
FROM test.test_humans

┌─year─┬─first_name─┐
│ 1994 │ Vova       │
│ 1995 │ Anja       │
│ 1996 │ Vasja      │
│ 1997 │ Petja      │
└──────┴────────────┘

4 rows in set. Elapsed: 0.003 sec.

* /
3df52oht

3df52oht4#

从版本0.2.0开始,clickhouse\u驱动程序实现方法 insert_dataframe . 请参见:https://clickhouse-driver.readthedocs.io/en/latest/api.html#clickhouse_driver.client.insert_dataframe

相关问题