Hadoop: how to get the schema (columns and their types) of ORC files stored in HDFS?

dgjrabp2 posted on 2021-05-27 in Hadoop

I have ORC files stored in different folders on HDFS, like this:

/DATA/UNIVERSITY/DEPT/STUDENT/part-00000.orc
/DATA/UNIVERSITY/DEPT/CREDIT/part-00000.orc

I do not know how many columns exist in each table (STUDENT, CREDIT, etc.). Is there a way to obtain the schema from these files themselves? I am looking for a way to get the column names and their data types so that I can write the CREATE statements for Hive external tables.


pgpifvop1#

Found a way to obtain the schema via Spark (note the backticks around the path, which Spark SQL requires for querying files directly):

```
data = sqlContext.sql("SELECT * FROM orc.`<HDFS_path>`")
data.printSchema()
```

This prints the output in the following format, which is exactly the information I wanted to extract from the ORC files on HDFS:

```
root
 |-- <column_name1>: <data_type> (nullable = <true/false>)
 |-- <column_name2>: <data_type> (nullable = <true/false>)
 |-- <column_name3>: <data_type> (nullable = <true/false>)
 |-- <column_name4>: <data_type> (nullable = <true/false>)
 |-- <column_name5>: <data_type> (nullable = <true/false>)
```
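Since the end goal is a CREATE statement for a Hive external table, the inferred schema can be turned into DDL directly. A minimal PySpark sketch, assuming Spark 2.x+ and a hypothetical table name `student` (adjust the path and table name to your layout):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-schema").getOrCreate()

# Read the ORC files directly; the schema is taken from the ORC footer
df = spark.read.orc("/DATA/UNIVERSITY/DEPT/STUDENT/")

# simpleString() on each field's dataType yields Hive-compatible type
# names (string, int, bigint, ...) for primitive types
cols = ",\n  ".join(
    "`{}` {}".format(f.name, f.dataType.simpleString())
    for f in df.schema.fields
)

print(
    "CREATE EXTERNAL TABLE student (\n  {}\n)\n"
    "STORED AS ORC\nLOCATION '/DATA/UNIVERSITY/DEPT/STUDENT/';".format(cols)
)
```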


cedebl8k2#

The Hive orcfiledump command will serve your purpose:

```
hive --orcfiledump /DATA/UNIVERSITY/DEPT/STUDENT/part-00000.orc
```

You will get the columns, their types, min/max values, record counts, and many other statistics, as shown below:

```
Rows: 6
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:string>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 6
    Column 1: count: 6 min: abc max: mno sum: 17
    Column 2: count: 6 min: def max: tre sum: 18

File Statistics:
  Column 0: count: 6
  Column 1: count: 6 min: abc max: mno sum: 17
  Column 2: count: 6 min: def max: tre sum: 18

Stripes:
  Stripe: offset: 3 data: 58 rows: 6 tail: 49 index: 67
    Stream: column 0 section ROW_INDEX start: 3 length 9
    Stream: column 1 section ROW_INDEX start: 12 length 29
    Stream: column 2 section ROW_INDEX start: 41 length 29
    Stream: column 1 section DATA start: 70 length 20
    Stream: column 1 section LENGTH start: 90 length 12
    Stream: column 2 section DATA start: 102 length 21
    Stream: column 2 section LENGTH start: 123 length 5
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
```
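The `Type: struct<_col0:string,_col1:string>` line carries the column names and types needed for the CREATE statement. As a side note, and assuming a reasonably recent Hive build, `--orcfiledump` also accepts a `-j` flag to emit the same metadata as JSON (and `-p` to pretty-print it), which is easier to parse when scripting the DDL generation.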
