我遇到了以下问题:
我在没有分区的hdfs中的emr集群中创建了一个配置单元表,并向其中加载了一个数据。
我基于第1段中的表创建了另一个hiva表,但是使用了datetime列中的分区:partitioned by(year string,month string,day string)。
我将非分区表中的一个数据加载到分区表中,得到了有效的结果。
我创建了一个athena数据库和表,其结构与hive表相同。
我从hdfs本地复制分区文件,并通过awss3sync将所有文件传输到s3空存储桶中。所有文件都以与hdfs中的配置单元目录相同的顺序传输,没有错误。
我通过msck修复表加载分区,在输出中没有得到任何错误。
之后,我发现许多值都有缩进,例如需要在“ip”列中的值在“operating\u sys”列中,等等。
我的脚本是:
-- Hive tables
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_page_part
(
log_DATE STRING,
user_id STRING,
page_path STRING,
referer STRING,
tracking_referer STRING,
medium STRING,
campaign STRING,
source STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
ad_id STRING,
keyword STRING,
user_agent STRING
)
PARTITIONED BY
(
`year` STRING,
`month` STRING,
`day` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/events_partitioned';
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_event_part
(
log_DATE STRING,
user_id STRING,
category STRING,
action STRING,
label STRING,
value STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
extra_data_json STRING
)
PARTITIONED BY
(
`year` STRING,
`month` STRING,
`day` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/pages_partitioned';
INSERT INTO TABLE cloudfront_logs_page_part
PARTITION
(
`year`,
`month`,
`day`
)
SELECT
log_DATE,
user_id,
page_path,
referer,
tracking_referer,
medium,
campaign,
source,
visitor_id,
ip,
session_id,
operating_sys,
ad_id,
keyword,
user_agent,
year(log_DATE) as `year`,
month(log_DATE) as `month`,
day(log_DATE) as `day`
FROM
cloudfront_logs_page;
INSERT INTO TABLE cloudfront_logs_event_part
PARTITION
(
`year`,
`month`,
`day`
)
SELECT
log_DATE,
user_id,
category,
action,
label,
value,
visitor_id,
ip,
session_id,
operating_sys,
extra_data_json,
year(log_DATE) as `year`,
month(log_DATE) as `month`,
day(log_DATE) as `day`
FROM
cloudfront_logs_event;
-- Athena tables
CREATE DATABASE IF NOT EXISTS test
LOCATION 's3://...';
DROP TABLE IF EXISTS test.cloudfront_logs_page_ath;
CREATE EXTERNAL TABLE IF NOT EXISTS powtoon_hive.cloudfront_logs_page_ath (
log_DATE STRING,
user_id STRING,
page_path STRING,
referer STRING,
tracking_referer STRING,
medium STRING,
campaign STRING,
source STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
ad_id STRING,
keyword STRING,
user_agent STRING
)
PARTITIONED BY (`year` STRING,`month` STRING, `day` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://.../';
DROP TABLE IF EXISTS test.cloudfront_logs_event_ath;
CREATE EXTERNAL TABLE IF NOT EXISTS test.cloudfront_logs_event_ath
(
log_DATE STRING,
user_id STRING,
category STRING,
action STRING,
label STRING,
value STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
extra_data_json STRING
)
PARTITIONED BY (`year` STRING,`month` STRING, `day` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://.../';
有什么问题吗?表结构?雅典娜?
1条答案
按热度按时间3wabscal1#
最简单的方法是将原始文件直接转换为分区Parquet柱格式。这样做的好处是分区、列存储、 predicate 下推以及所有其他花哨的词汇。
请参阅:转换为列格式-amazon athena