ClickHouse: how to use a column's default value when inserting data through a pandas DataFrame and the cell value is null

Posted by jfgube3f 10 months ago in ClickHouse


I am trying to insert a pandas DataFrame into ClickHouse, but I have run into a problem. Here is the table schema:

CREATE TABLE IF NOT EXISTS test_table
(
    name String,
    day DateTime64(3) DEFAULT '2020-07-01 00:00:00'
)
engine = MergeTree
ORDER BY (name, day);

The data in the pandas DataFrame looks like this:

   name   day
0  'a'    NaT
1  'b'    NaT
2  'c'    '2019-08-31 00:00:00'

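For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is a minimal sketch: it assumes the `day` column is a datetime column in which the missing cells are `NaT`, which is what `pd.to_datetime` produces for `None`.

```python
import pandas as pd

# Rebuild the example frame; pd.to_datetime turns None into NaT.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "day": pd.to_datetime([None, None, "2019-08-31 00:00:00"]),
})

print(df)
print(df.dtypes)
```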
The Python code for the insert is:

from clickhouse_driver import Client
with Client(host="", port="", password="",
            user="", settings={"use_numpy": True}) as client:
    client.insert_dataframe(
        'INSERT INTO test_table VALUES',
        df)

The result in ClickHouse is:

SELECT *
FROM test_table

┌─name─┬─day─────────────────────┐
│ a    │ 1970-01-01 00:00:00.000 │
│ b    │ 1970-01-01 00:00:00.000 │
│ c    │ 2019-08-31 00:00:00.000 │
└──────┴─────────────────────────┘


What I actually want is for the column default to be used, meaning '1970-01-01 00:00:00.000' should be replaced by '2020-07-01 00:00:00.000'.

I did some experiments and investigation. Here is what I tried:
1. Replaced NaT with None or numpy.NaN:

df.replace({pd.NaT: None}, inplace=True)
# or
df.replace({pd.NaT: np.nan}, inplace=True)

But the result was still the same after these changes.
2. In clickhouse-client, a plain INSERT INTO respects the schema and gives the result I want. For example:

insert into test_table (name,day) values ('test-null',null);
-- or
insert into test_table (name) values ('test-sub');

┌─name──────┬─day─────────────────────┐
│ test-null │ 2020-07-01 00:00:00.000 │
│ test-sub  │ 2020-07-01 00:00:00.000 │
└───────────┴─────────────────────────┘

3. In clickhouse-client, when I use INSERT INTO with an **empty string**, the result is the same as what I get with the pandas DataFrame:

insert into test_table (name,day) values ('test-empty','');

SELECT *
FROM test_table

┌─name───────┬─day─────────────────────┐
│ test-empty │ 1970-01-01 00:00:00.000 │
└────────────┴─────────────────────────┘

So what I do now is split the DataFrame into two parts and insert twice. I don't think this is Pythonic or very efficient, but it does work:

# rows where day has a value
mask1 = ~pd.isna(df.day.values)

# rows where day is missing
mask3 = pd.isna(df.day.values)

with Client(host="", port="", password="",
            user="", settings={"use_numpy": True}) as client:
    # rows with a day value: insert every column
    client.insert_dataframe(
        'INSERT INTO test_table VALUES',
        df.loc[mask1])
    # rows without a day value: leave the column out so its DEFAULT applies
    client.insert_dataframe(
        'INSERT INTO test_table (* EXCEPT(day)) VALUES',
        df.loc[mask3].drop(['day'], axis=1))

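If more than one column has a default, the same two-insert workaround could be generalized by grouping rows according to which defaulted columns are missing. The following is only a sketch of that idea on the pandas side: `split_by_missing` and its `defaulted_columns` parameter are hypothetical names introduced here, and the actual insert call is left commented out because it needs a running server.

```python
import pandas as pd


def split_by_missing(df, defaulted_columns):
    """Yield (columns_to_insert, sub_frame), one pair per pattern of missing values.

    Hypothetical helper: defaulted_columns lists the columns that have a
    DEFAULT in the target table.
    """
    missing = df[defaulted_columns].isna()
    # One tuple of Python bools per row, e.g. (True,) when `day` is missing.
    patterns = [tuple(row) for row in missing.to_numpy().tolist()]
    # dict.fromkeys keeps each distinct pattern once, in first-seen order.
    for pattern in dict.fromkeys(patterns):
        row_mask = [p == pattern for p in patterns]
        to_drop = [c for c, is_missing in zip(defaulted_columns, pattern) if is_missing]
        part = df.loc[row_mask].drop(columns=to_drop)
        yield list(part.columns), part


df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "day": pd.to_datetime([None, None, "2019-08-31 00:00:00"]),
})

for columns, part in split_by_missing(df, ["day"]):
    query = f"INSERT INTO test_table ({', '.join(columns)}) VALUES"
    print(query, "->", len(part), "rows")
    # client.insert_dataframe(query, part)  # assumes a connected Client as above
```

This still issues one insert per distinct pattern of missing values, so it does not remove the extra round trips. It only avoids hand-writing a mask per column.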
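In summary, I want to ask two things:
1. Is there a better way to achieve my goal: **when a cell in the pandas DataFrame is NaT/NaN/None, the column's default value is used after inserting into ClickHouse**, without setting the value on the pandas side?
2. Is there a bug in the ClickHouse DateTime data type? When an empty string is inserted into the column, the declared default is ignored and ClickHouse's own default is used instead.

In my opinion, the second question may be the key to this problem, because the clickhouse-driver client may be converting NaT/NaN/None to an empty string.

Edit: for question 2, I found that in ClickHouse a DateTime column treats an empty string as 0 (zero) or '0' (zero as a string), which explains why the value of day is 1970-01-01 00:00:00.000.

So the question becomes: why does DateTime treat the value this way? Also, I guess the clickhouse-driver client **treats None/NaT/NaN as an empty string and passes that empty string to ClickHouse.** The driver could treat None/NaT/NaN as an empty string (although Python only has NoneType), or it could drop the cell entirely (for example, by passing data row by row, but from reading the clickhouse-driver code I found that it passes each column with all of its values).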
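As a side note on the 1970-01-01 value: a zero timestamp is the Unix epoch, so a stored 0 would display exactly like the rows above. A small sketch in plain Python, assuming UTC (a server in a different time zone may render the same zero value shifted):

```python
from datetime import datetime, timezone

# Timestamp 0 is the Unix epoch.
epoch = datetime.fromtimestamp(0, tz=timezone.utc)

# Format with millisecond precision, similar to a DateTime64(3) column.
print(epoch.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3])
```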


Source: https://stackoverflow.com/questions/69824147/clickhouse-how-to-take-columns-default-value-when-insert-data-through-pandas-d