How to create a table in AWS Athena from multiple CSVs by column name rather than by column order

Posted by tquggr8v on 2021-06-26 in Hive


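I want to create a table in AWS Athena from multiple CSV files stored in S3.
Each CSV has a header row with column names. My problem is that the columns are in a different order in each CSV, and I want to pick up the columns by name.
When I try a normal CREATE TABLE in Athena, I get the first two columns by position.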

CREATE EXTERNAL TABLE `test`(
  `id` string, 
  `name` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
WITH SERDEPROPERTIES ( 
  'escapeChar'='\\', 
  'quoteChar'='\"', 
  'separatorChar'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://...'
TBLPROPERTIES (
  'has_encrypted_data'='false')

For example:
csv 1:

+----+-------+-------+---------+
| id | name  | price | comment |
+----+-------+-------+---------+
|  1 | shirt |   123 | abc     |
|  2 | shoes |   222 | ddd     |
+----+-------+-------+---------+

csv 2:

+----+------+-------+-------+---------+
| id | size | price | color |  name   |
+----+------+-------+-------+---------+
|  5 | L    |   100 | red   | shirt   |
|  6 | S    |    55 | white | t-shirt |
+----+------+-------+-------+---------+

The table I want:

+----+---------+
| id |  name   |
+----+---------+
|  1 | shirt   |
|  2 | shoes   |
|  5 | shirt   |
|  6 | t-shirt |
+----+---------+

The table I get:

+----+-------+
| id | name  |
+----+-------+
|  1 | shirt |
|  2 | shoes |
|  5 | L     |
|  6 | S     |
+----+-------+

Thanks

Hive ddl amazon-athena

Source: https://stackoverflow.com/questions/50367011/how-to-create-a-table-in-aws-athena-from-multiple-csvs-by-column-names-and-not-b

4 Answers


gfttwv5a1#

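In my opinion it is a little crazy that the Glue crawler has no setting to pick up the column names and use them to define the table schema. We ran into this problem (schema changes within the same folder in S3), and here is how we solved it.
Note: the solution below works only if you can map each schema (header order) to specific S3 paths.

Source data

We have four files. a.csv and b.csv share the same schema, while c.csv and d.csv each have a different schema.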

$ cat a.csv
a,b
1,2
3,4
$ cat b.csv
a,b
5,6
3,4
$ cat c.csv
a,b,c
1,2,3
4,5,6
$ cat d.csv
a,c,d,x
6,7,8,9
1,2,3,4

They are stored in S3:

$ aws s3 ls s3://example-s3-bucket/
2019-01-04 09:47:42         12 a.csv
2019-01-04 09:49:49         12 b.csv
2019-01-04 09:49:53         18 c.csv
2019-01-04 09:49:56         24 d.csv
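
The note above assumes you already know which S3 paths share a header order. As a rough sketch of how that mapping could be derived, the following groups files by their header row. The function name `group_paths_by_header` and the in-memory `files` dict are illustrative assumptions; in practice the first line of each object would have to be fetched from S3 first.

```python
import csv
import io
from collections import defaultdict


def group_paths_by_header(files):
    """Group paths by the ordered header row of each CSV.

    `files` maps a path to the CSV text (hypothetical input for illustration).
    """
    groups = defaultdict(list)
    for path, text in files.items():
        # Only the first row matters: it defines the column order.
        header = next(csv.reader(io.StringIO(text)))
        groups[tuple(header)].append(path)
    return dict(groups)


# Contents copied from the four example files above.
files = {
    "s3://example-s3-bucket/a.csv": "a,b\n1,2\n3,4\n",
    "s3://example-s3-bucket/b.csv": "a,b\n5,6\n3,4\n",
    "s3://example-s3-bucket/c.csv": "a,b,c\n1,2,3\n4,5,6\n",
    "s3://example-s3-bucket/d.csv": "a,c,d,x\n6,7,8,9\n1,2,3,4\n",
}

schemas = group_paths_by_header(files)
for header, paths in schemas.items():
    print(header, paths)
```

Each distinct header tuple then corresponds to one table definition, and its path list to that table's filter.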

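Create a table per schema

Create one table for each schema, passing the same S3 location to all of them.
Note that I have left out the delimiter and field separator definitions for brevity.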

create external table athena_testing_ab (
  a int,
  b int
)
LOCATION 's3://example-s3-bucket/'
;

create external table athena_testing_c (
  a int,
  b int,
  c int
)
LOCATION 's3://example-s3-bucket/'
;

create external table athena_testing_d (
  a int,
  c int,
  d int,
  x int
)
LOCATION 's3://example-s3-bucket/'
;

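Query all tables using UNION

We now query the three tables and UNION them together, filtering each table on its corresponding S3 paths.
You may want to use a regex or substring parsing to filter on $PATH more elegantly, especially if your bucket holds hundreds or thousands of files.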

select
  a,
  b,
  null as c,
  null as d,
  null as x
from
  athena_testing_ab
where "$PATH" in  ('s3://example-s3-bucket/a.csv', 's3://example-s3-bucket/b.csv')

union all

select
  a,
  b,
  c,
  null as d,
  null as x
from
  athena_testing_c
where "$PATH" in  ('s3://example-s3-bucket/c.csv')

union all

select
  a,
  null as b,
  c,
  d,
  x
from
  athena_testing_d
where "$PATH" in  ('s3://example-s3-bucket/d.csv')
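
Writing this query by hand gets tedious as the number of schemas grows. Below is a minimal sketch of generating it instead, assuming the table names, columns, and paths from this answer. The helper `build_union_query` is hypothetical, and it does no quoting or escaping of paths, so it is only suitable for trusted input.

```python
def build_union_query(tables, all_columns):
    """Build a UNION ALL query that pads missing columns with NULL.

    `tables` maps a table name to (its columns, the S3 paths it should read).
    """
    selects = []
    for table, (columns, paths) in tables.items():
        # Emit every column in one fixed order so the UNION ALL branches line up.
        select_list = ",\n  ".join(
            col if col in columns else f"null as {col}" for col in all_columns
        )
        # Assumption: paths contain no single quotes, so no escaping is done.
        path_list = ", ".join(f"'{p}'" for p in paths)
        selects.append(
            f'select\n  {select_list}\nfrom\n  {table}\nwhere "$PATH" in ({path_list})'
        )
    return "\n\nunion all\n\n".join(selects)


tables = {
    "athena_testing_ab": (
        ["a", "b"],
        ["s3://example-s3-bucket/a.csv", "s3://example-s3-bucket/b.csv"],
    ),
    "athena_testing_c": (["a", "b", "c"], ["s3://example-s3-bucket/c.csv"]),
    "athena_testing_d": (["a", "c", "d", "x"], ["s3://example-s3-bucket/d.csv"]),
}

query = build_union_query(tables, ["a", "b", "c", "d", "x"])
print(query)
```

The printed text has the same shape as the hand-written query above and could be pasted into the Athena console.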


rseugnpd2#

Put the files in different folders and use a Glue crawler to create the data store.


67up9zun3#

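I would use two different tables for the different CSVs (you need to store the CSVs in different folders).
Then, to get the id, name structure from both CSVs, I would use a view that unions the required columns from the two tables.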
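
As a rough illustration of the view idea, the sketch below assembles the DDL for a view that selects the shared columns by name from each table. The view and table names (`products`, `products_csv1`, `products_csv2`) and the helper `build_view_ddl` are placeholders and do not come from the question.

```python
def build_view_ddl(view_name, tables, columns):
    """Return a CREATE OR REPLACE VIEW statement that unions `columns` from `tables`."""
    column_list = ", ".join(columns)
    # One SELECT per table; columns are chosen by name, so each table's
    # own column order does not matter.
    branches = [f"SELECT {column_list} FROM {table}" for table in tables]
    return f"CREATE OR REPLACE VIEW {view_name} AS\n" + "\nUNION ALL\n".join(branches)


# Hypothetical names: one table per folder of same-layout CSVs.
ddl = build_view_ddl("products", ["products_csv1", "products_csv2"], ["id", "name"])
print(ddl)
```

Each underlying table still has to declare its columns in the order they appear in its own CSV files; the view only hides that difference from downstream queries.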


jgovgodb4#

Use a Glue crawler. That will help.
