aws雅典娜创建缩进,并在分区加载后将值移到错误的列中

w8ntj3qf  于 2021-05-29  发布在  Hadoop
关注(0)|答案(1)|浏览(287)

我遇到了以下问题:
我在没有分区的hdfs中的emr集群中创建了一个配置单元表,并向其中加载了一个数据。
我基于第1段中的表创建了另一个hiva表,但是使用了datetime列中的分区:partitioned by(year string,month string,day string)。
我将非分区表中的一个数据加载到分区表中,得到了有效的结果。
我创建了一个athena数据库和表,其结构与hive表相同。
我从hdfs本地复制分区文件,并通过awss3sync将所有文件传输到s3空存储桶中。所有文件都以与hdfs中的配置单元目录相同的顺序传输,没有错误。
我通过msck修复表加载分区,在输出中没有得到任何错误。
之后,我发现许多值都有缩进,例如需要在“ip”列中的值在“operating\u sys”列中,等等。
我的脚本是:

-- Hive tables

SET hive.exec.dynamic.partition = true;  
SET hive.exec.dynamic.partition.mode = nonstrict; 

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_page_part 
    ( 
        log_DATE STRING,  
        user_id STRING, 
        page_path STRING, 
        referer STRING,
        tracking_referer STRING,
        medium STRING,
        campaign STRING,
        source STRING,
        visitor_id STRING,
        ip STRING,
        session_id STRING,
        operating_sys STRING,
        ad_id STRING,
        keyword STRING,
        user_agent STRING
    )
PARTITIONED BY
(
        `year` STRING,
        `month` STRING,
        `day` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/events_partitioned';

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_event_part
    ( 
        log_DATE STRING, 
        user_id STRING, 
        category STRING, 
        action STRING, 
        label STRING, 
        value STRING,
        visitor_id STRING,
        ip STRING,
        session_id STRING,
        operating_sys STRING,
        extra_data_json STRING
    )
PARTITIONED BY
(
        `year` STRING,
        `month` STRING,
        `day` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE
LOCATION '/user/admin/pages_partitioned';

INSERT INTO TABLE cloudfront_logs_page_part
PARTITION 
(
    `year`,
    `month`,
    `day`
)
SELECT
    log_DATE,
    user_id,
    page_path,
    referer,
    tracking_referer,
    medium, 
    campaign, 
    source,
    visitor_id,
    ip,
    session_id,
    operating_sys,
    ad_id,
    keyword,
    user_agent,
    year(log_DATE) as `year`,
    month(log_DATE) as `month`,
    day(log_DATE) as `day`
FROM
    cloudfront_logs_page;

INSERT INTO TABLE cloudfront_logs_event_part
PARTITION 
(
    `year`,
    `month`,
    `day`
)
SELECT
    log_DATE,
    user_id,
    category,
    action,
    label,
    value,
    visitor_id,
    ip,
    session_id,
    operating_sys,
    extra_data_json,
    year(log_DATE) as `year`,
    month(log_DATE) as `month`,
    day(log_DATE) as `day`
FROM
    cloudfront_logs_event;

-- Athena tables

CREATE DATABASE IF NOT EXISTS test
LOCATION 's3://...';

DROP TABLE IF EXISTS test.cloudfront_logs_page_ath;

CREATE EXTERNAL TABLE IF NOT EXISTS powtoon_hive.cloudfront_logs_page_ath ( 
    log_DATE STRING,  
    user_id STRING, 
    page_path STRING, 
    referer STRING,
    tracking_referer STRING,
    medium STRING,
    campaign STRING,
    source STRING,
    visitor_id STRING,
    ip STRING,
    session_id STRING,
    operating_sys STRING,
    ad_id STRING,
    keyword STRING,
    user_agent STRING
)
PARTITIONED BY (`year` STRING,`month` STRING, `day` STRING)
ROW FORMAT DELIMITED
FIELDS   TERMINATED BY ','
LOCATION 's3://.../';

DROP TABLE IF EXISTS test.cloudfront_logs_event_ath;

CREATE EXTERNAL TABLE IF NOT EXISTS test.cloudfront_logs_event_ath 
    ( 
        log_DATE STRING, 
        user_id STRING, 
        category STRING, 
        action STRING, 
        label STRING, 
        value STRING,
        visitor_id STRING,
        ip STRING,
        session_id STRING,
        operating_sys STRING,
        extra_data_json STRING
    )
PARTITIONED BY (`year` STRING,`month` STRING, `day` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://.../';

有什么问题吗?表结构?雅典娜?

3wabscal

3wabscal1#

最简单的方法是将原始文件直接转换为分区Parquet柱格式。这样做的好处是分区、列存储、 predicate 下推以及所有其他花哨的词汇。
请参阅:转换为列格式-amazon athena

相关问题