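I have ORC files stored in different folders on HDFS, as follows:
/DATA/UNIVERSITY/DEPT/STUDENT/part-00000.orc
/DATA/UNIVERSITY/DEPT/CREDIT/part-00000.orc
I don't know how many columns each of these tables (STUDENT, CREDIT, etc.) has. Is there a way to get the schema from these files? I'm looking for a way to obtain the column names and their data types so that I can write CREATE statements for Hive external tables.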
pgpifvop1#
I found a way to get this through Spark:
```
data = sqlContext.sql("SELECT * FROM orc.`<HDFS_path>`")
data.printSchema()
```
This prints output in the format below, which is exactly the information I wanted to extract from the ORC files on HDFS:
root
 |-- <column_name1>: <data_type> (nullable = <true/false>)
 |-- <column_name2>: <data_type> (nullable = <true/false>)
 |-- <column_name3>: <data_type> (nullable = <true/false>)
 |-- <column_name4>: <data_type> (nullable = <true/false>)
 |-- <column_name5>: <data_type> (nullable = <true/false>)
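Since the goal is a CREATE statement for a Hive external table, here is a minimal sketch of turning column names and types into that DDL. It assumes the columns arrive as (name, type) pairs, which is the shape a Spark DataFrame's `dtypes` attribute returns. The function name `build_create_table` and the example columns are hypothetical. Spark's type strings usually match Hive's, but that is worth checking for any unusual types.

```python
def build_create_table(table_name, columns, location):
    """Build a Hive CREATE EXTERNAL TABLE statement for ORC data.

    columns is a list of (column_name, type_string) pairs, the shape
    that a Spark DataFrame's `dtypes` attribute returns.
    """
    # Backtick-quote each column name so reserved words do not break the DDL.
    column_lines = ",\n".join(
        "  `{}` {}".format(name, dtype) for name, dtype in columns
    )
    return (
        "CREATE EXTERNAL TABLE {} (\n{}\n)\nSTORED AS ORC\nLOCATION '{}'"
    ).format(table_name, column_lines, location)


if __name__ == "__main__":
    # Hypothetical columns; with Spark available you would pass data.dtypes.
    example_columns = [("student_id", "int"), ("student_name", "string")]
    print(build_create_table("student", example_columns,
                             "/DATA/UNIVERSITY/DEPT/STUDENT"))
```

Note that LOCATION points at the folder holding the ORC files, not at an individual part file.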
cedebl8k2#
Hive's ORC file dump command will serve your purpose:
hive --orcfiledump /DATA/UNIVERSITY/DEPT/STUDENT/part-00000
You will get the columns, their types, the min, max, and record count for each column, and many more statistics, as shown below:
Rows: 6
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:string>
Stripe Statistics:
  Stripe 1:
    Column 0: count: 6
    Column 1: count: 6 min: abc max: mno sum: 17
    Column 2: count: 6 min: def max: tre sum: 18
File Statistics:
  Column 0: count: 6
  Column 1: count: 6 min: abc max: mno sum: 17
  Column 2: count: 6 min: def max: tre sum: 18
Stripes:
  Stripe: offset: 3 data: 58 rows: 6 tail: 49 index: 67
    Stream: column 0 section ROW_INDEX start: 3 length 9
    Stream: column 1 section ROW_INDEX start: 12 length 29
    Stream: column 2 section ROW_INDEX start: 41 length 29
    Stream: column 1 section DATA start: 70 length 20
    Stream: column 1 section LENGTH start: 90 length 12
    Stream: column 2 section DATA start: 102 length 21
    Stream: column 2 section LENGTH start: 123 length 5
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
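The `Type:` line in that dump carries the column names and types, so it can be split programmatically. The following is a rough sketch, assuming the type string has the `struct<name:type,...>` shape shown above and that column names contain no colons, commas, or brackets. The function name `parse_orc_struct` is hypothetical and not part of Hive or ORC.

```python
def parse_orc_struct(type_string):
    """Split an ORC type such as 'struct<a:int,b:string>' into (name, type) pairs."""
    prefix, suffix = "struct<", ">"
    if not (type_string.startswith(prefix) and type_string.endswith(suffix)):
        raise ValueError("expected a struct<...> type, got: " + type_string)
    body = type_string[len(prefix):-len(suffix)]

    fields, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch in "<(":
            depth += 1  # entering a nested type or a decimal(p,s)
        elif ch in ">)":
            depth -= 1
        elif ch == "," and depth == 0:
            fields.append(body[start:i])  # a comma at depth 0 ends a field
            start = i + 1
    if body[start:]:
        fields.append(body[start:])

    # Each field is 'name:type'; split only on the first colon, because
    # nested struct types contain colons of their own.
    return [tuple(field.split(":", 1)) for field in fields]


if __name__ == "__main__":
    # The type string from the dump output above.
    for name, dtype in parse_orc_struct("struct<_col0:string,_col1:string>"):
        print(name, dtype)
```

Tracking the nesting depth keeps commas inside nested types such as `map<string,int>` or `decimal(10,2)` from being mistaken for field separators.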