Pandas DataFrame type datetime64[ns] is not working in Hive/Athena

Posted by jhdbpxl9 on 2021-06-27 in Hive


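I am working on a Python application that simply converts CSV files to a Hive/Athena-compatible Parquet format, using the fastparquet and pandas libraries. The CSV file contains timestamp values such as 2018-12-21 23:45:00, which need to be written as the timestamp type in the Parquet file. Below is the code I am running: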

import io
import boto3
import pandas as pd
import s3fs
from fastparquet import write

columnNames = ["contentid","processed_time","access_time"]
dtypes = {'contentid': 'str'}
dateCols = ['access_time', 'processed_time']

s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucketname, Key=keyname)
# Note: error_bad_lines was replaced by on_bad_lines='skip' in pandas 1.3+
df = pd.read_csv(io.BytesIO(obj['Body'].read()), compression='gzip', header=0, sep=',', quotechar='"', names = columnNames, error_bad_lines=False, dtype=dtypes, parse_dates=dateCols)

s3filesys = s3fs.S3FileSystem()
myopen = s3filesys.open
write('outfile.snappy.parquet', df, compression='SNAPPY', open_with=myopen,file_scheme='hive',partition_on=PARTITION_KEYS)

The code runs successfully. Below are the dtypes of the DataFrame created by pandas:

contentid                 object
processed_time            datetime64[ns]
access_time               datetime64[ns]

Finally, when I query the Parquet file in Hive and Athena, the timestamp value is +50942-11-30 14:00:00.000 instead of 2018-12-21 23:45:00. Any help is much appreciated.
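The thread does not confirm this, but the year 50942 is about what you would get if the nanosecond count stored by datetime64[ns] were read as microseconds. Here is a rough arithmetic sketch of that assumption, using only the standard library and an average Gregorian year length (the variable names are illustrative):

```python
from datetime import datetime, timezone

# Average length of a Gregorian year (365.2425 days), used for a rough estimate only
SECONDS_PER_GREGORIAN_YEAR = 31_556_952

# The sample timestamp from the question, treated as UTC
ts = datetime(2018, 12, 21, 23, 45, tzinfo=timezone.utc)
epoch_seconds = int(ts.timestamp())

# datetime64[ns] stores nanoseconds since 1970-01-01
epoch_ns = epoch_seconds * 1_000_000_000

# Assumption: the reader interprets the same integer as microseconds
misread_seconds = epoch_ns // 1_000_000

approx_year = 1970 + misread_seconds // SECONDS_PER_GREGORIAN_YEAR
print(approx_year)  # 50942
```

The estimate lands in the same year as the bogus value reported above, which is consistent with a unit mismatch rather than corrupted data.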

Hive python pandas amazon-athena fastparquet

Source: https://stackoverflow.com/questions/53919763/pandas-dataframe-type-datetime64ns-is-not-working-in-hive-athena

5 Answers


lhcgjxsq1#

I faced the same problem and, after a lot of research, have now solved it.
When you do

write('outfile.snappy.parquet', df, compression='SNAPPY', open_with=myopen,file_scheme='hive',partition_on=PARTITION_KEYS)

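it uses fastparquet behind the scenes, which encodes datetimes differently from what Athena is compatible with.
The solution is to uninstall fastparquet and install pyarrow:
pip uninstall fastparquet
pip install pyarrow
Run your code again. It should work this time. (If your code calls fastparquet's write directly, as above, you would also need to write the file through pandas instead, for example df.to_parquet(..., engine='pyarrow').)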


uelo1irk2#

You can try:

dataframe.to_parquet(file_path, compression=None, engine='pyarrow', allow_truncated_timestamps=True, use_deprecated_int96_timestamps=True)


50pmv0ei3#

I solved the problem this way.
Convert the DataFrame series with the to_datetime method.
Then use the .dt accessor to select the date part of the datetime64[ns].
Example:

df.field = pd.to_datetime(df.field)
df.field = df.field.dt.date

After that, Athena will recognize the data. Note that this keeps only the date; the time of day is discarded.
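As a small illustration of what this conversion does to the values (a sketch assuming pandas is installed, with a made-up one-element series), the .dt.date accessor yields plain Python date objects:

```python
import datetime

import pandas as pd

# Hypothetical series holding the sample timestamp from the question
series = pd.to_datetime(pd.Series(["2018-12-21 23:45:00"]))

# .dt.date returns an object-dtype series of datetime.date values
date_only = series.dt.date

print(series.iloc[0])
print(date_only.iloc[0])
print(type(date_only.iloc[0]) is datetime.date)
```

This approach fits columns where only the calendar date matters. If the time component is needed, one of the int96-based answers is the better match.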


w8biq8rn4#

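The problem seems to be with Athena: it seems to support only int96 timestamps, whereas a timestamp created in pandas is an int64.
My DataFrame column containing a string date is "sdate". I first convert it to a timestamp: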


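import pandas
import pyarrow
import pyarrow.parquet
import s3fs

# Assumption: s3 is an s3fs filesystem (the original snippet did not show how it was created)
s3 = s3fs.S3FileSystem()

# Add a new column with the parsed timestamp
df["ndate"] = pandas.to_datetime(df["sdate"])

# Intended to convert the timestamp to microseconds
# (coerce_timestamps='us' below also enforces this at write time)
df["ndate"] = pandas.to_datetime(df["ndate"], unit='us')

# Then I convert my DataFrame to a pyarrow Table
table = pyarrow.Table.from_pandas(df, preserve_index=False)

# When writing to Parquet, add coerce_timestamps and
# use_deprecated_int96_timestamps (this also writes to S3 directly).
# partition_cols assumes df has a 'date' column.
OUTBUCKET="my_s3_bucket"
pyarrow.parquet.write_to_dataset(table, root_path='s3://{0}/logs'.format(OUTBUCKET), partition_cols=['date'], filesystem=s3, coerce_timestamps='us', use_deprecated_int96_timestamps=True)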


g6baxovj5#

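I know this question is old, but it is still relevant.
As mentioned before, Athena only supports int96 as timestamps. Using fastparquet, it is possible to generate a Parquet file in the correct format for Athena. The important part is times='int96', as this tells fastparquet to convert pandas datetimes to int96 timestamps.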

from fastparquet import write
import pandas as pd

def write_parquet():
  df = pd.read_csv('some.csv')
  write('/tmp/outfile.parquet', df, compression='GZIP', times='int96')

