Hive can't find partitioned data written by Spark Structured Streaming

Posted by fv2wmkja on 2021-06-26 in Hive

I have a Spark Structured Streaming job that writes data to IBM Cloud Object Storage (S3):

dataDf.
  writeStream.
  format("parquet").
  trigger(Trigger.ProcessingTime(trigger_time_ms)).
  option("checkpointLocation", s"${s3Url}/checkpoint").
  option("path", s"${s3Url}/data").
  option("spark.sql.hive.convertMetastoreParquet", false).
  partitionBy("InvoiceYear", "InvoiceMonth", "InvoiceDay", "InvoiceHour").
  start()

I can see the data using the HDFS CLI:

[clsadmin@xxxxx ~]$ hdfs dfs -ls s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0 | head
Found 616 items
-rw-rw-rw-   1 clsadmin clsadmin      38085 2018-09-25 01:01 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-1e1dda99-bec2-447c-9bd7-bedb1944f4a9.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      45874 2018-09-25 00:31 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-28ff873e-8a9c-4128-9188-c7b763c5b4ae.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin       5124 2018-09-25 01:10 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-5f768960-4b29-4bce-8f31-2ca9f0d42cb5.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      40154 2018-09-25 00:20 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-70abc027-1f88-4259-a223-21c4153e2a85.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      41282 2018-09-25 00:50 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-873a1caa-3ecc-424a-8b7c-0b2dc1885de4.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      41241 2018-09-25 00:40 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-88b617bf-e35c-4f24-acec-274497b1fd31.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin       3114 2018-09-25 00:01 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-deae2a19-1719-4dfa-afb6-33b57f2d73bb.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      38877 2018-09-25 00:10 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-e07429a2-43dc-4e5b-8fe7-c55ec68783b3.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      39060 2018-09-25 00:20 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00001-1553da20-14d0-4c06-ae87-45d22914edba.c000.snappy.parquet

However, when I try to query the data, no rows come back:

hive> select * from invoiceitems limit 5;
OK
Time taken: 2.392 seconds

My table DDL looks like this:

CREATE EXTERNAL TABLE `invoiceitems`(
  `invoiceno` int,
  `stockcode` int,
  `description` string,
  `quantity` int,
  `invoicedate` bigint,
  `unitprice` double,
  `customerid` int,
  `country` string,
  `lineno` int,
  `invoicetime` string,
  `storeid` int,
  `transactionid` string,
  `invoicedatestring` string)
PARTITIONED BY (
  `invoiceyear` int,
  `invoicemonth` int,
  `invoiceday` int,
  `invoicehour` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://streaming-data-landing-zone-partitioned/data'

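I also tried matching the exact case of the column and partition names, and that didn't work either.
Any idea why my query isn't finding the data?
Update 1:
I tried setting the location to a directory that contains data files with no partition subdirectories below it, and it still doesn't work, so I'm wondering whether this is a data format issue?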

CREATE EXTERNAL TABLE `invoiceitems`(
  `InvoiceNo` int,
  `StockCode` int,
  `Description` string,
  `Quantity` int,
  `InvoiceDate` bigint,
  `UnitPrice` double,
  `CustomerID` int,
  `Country` string,
  `LineNo` int,
  `InvoiceTime` string,
  `StoreID` int,
  `TransactionID` string,
  `InvoiceDateString` string)
PARTITIONED BY (
  `InvoiceYear` int,
  `InvoiceMonth` int,
  `InvoiceDay` int,
  `InvoiceHour` int)
STORED AS PARQUET
LOCATION
  's3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/';

hive> Select * from invoiceitems limit 5;
OK
Time taken: 2.066 seconds
Answer #1 (23c0lvtd)

Reading snappy-compressed Parquet files
The data is stored as snappy-compressed Parquet files.

s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-1e1dda99-bec2-447c-9bd7-bedb1944f4a9.c000.snappy.parquet

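So set the 'parquet.compress'='snappy' table property in the CREATE TABLE DDL statement. Alternatively, you can set parquet.compression=snappy in the "Custom hive-site settings" section of Ambari for either IOP or HDP. (These settings control the codec Hive uses when it writes Parquet; a reader takes the codec from each file's own metadata.)
Here is an example of using the table property in a CREATE TABLE statement in Hive: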

hive> CREATE TABLE inv_hive_parquet( 
   trans_id int, product varchar(50), trans_dt date
    )
 PARTITIONED BY (
        year int)
 STORED AS PARQUET
 TBLPROPERTIES ('PARQUET.COMPRESS'='SNAPPY');

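Updating partition metadata in an external table
Also, for an external partitioned table, the partition metadata has to be updated whenever an external job (the Spark job in this case) writes partitions directly into the data folder. Hive only knows about partitions that are registered in its metastore, so until they are added explicitly, queries return no rows even though the files exist.
This can be done with: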

-- Amazon EMR Hive / Spark SQL form:
ALTER TABLE inv_hive_parquet RECOVER PARTITIONS;
-- or, in standard Apache Hive:
MSCK REPAIR TABLE inv_hive_parquet;
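
Applied to the table from the question, the repair could look roughly like the sketch below. It assumes the table is the `invoiceitems` table from the first DDL (LOCATION ending in `/data`). The explicit `ADD PARTITION` fallback is also an assumption: it may help if, on a given Hive version, `MSCK REPAIR TABLE` does not pick up the mixed-case directory names (`InvoiceYear=...`) that Spark wrote.

```sql
-- Scan the table location and register any partition directories
-- that are missing from the metastore.
MSCK REPAIR TABLE invoiceitems;

-- Check which partitions Hive now knows about.
SHOW PARTITIONS invoiceitems;

-- Fallback (assumption: only needed if the repair above finds nothing):
-- register one partition explicitly, pointing at the directory as Spark named it.
ALTER TABLE invoiceitems ADD IF NOT EXISTS
  PARTITION (invoiceyear=2018, invoicemonth=9, invoiceday=25, invoicehour=0)
  LOCATION 's3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0';

-- The query from the question should return rows once the partition is registered.
SELECT * FROM invoiceitems LIMIT 5;
```

Because the streaming job keeps creating new hourly directories, the repair (or an `ADD PARTITION` per new hour) has to be rerun after each new partition appears; a one-time repair only covers the directories that exist at that moment.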
