parquet上的hive外部表未获取数据

gzszwxb4 于 2021-06-27 发布在 Hive

关注(0)|答案(4)|浏览(364)

我正在尝试创建一个数据管道，其中incomng数据存储到parquet中，我创建和外部配置单元表，用户可以查询配置单元表并检索数据。我可以保存配置单元数据并直接检索它，但当我查询配置单元表时，它不会返回任何行。我做了下面的测试
--create external hive table create external table emp（id double，hire_dt timestamp，user string）存储为parquet location'/test/emp'；
现在在一些数据上创建了dataframe并保存到parquet。
---创建dataframe并插入数据

val employeeDf = Seq(("1", "2018-01-01","John"),("2","2018-12-01", "Adam")).toDF("id","hire_dt","user")
val schema = List(("id", "double"), ("hire_dt", "date"), ("user", "string"))
val newCols= schema.map ( x => col(x._1).cast(x._2)) 
val newDf = employeeDf.select(newCols:_*)
newDf.write.mode("append").parquet("/test/emp")
newDf.show 

--read the contents directly from parquet 
val sqlcontext=new org.apache.spark.sql.SQLContext(sc)
sqlcontext.read.parquet("/test/emp").show 

+---+----------+----+
| id|   hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+

--read from the external hive table 
spark.sql("select  id,hire_dt,user from  emp").show(false)

+---+-------+----+
|id |hire_dt|user|
+---+-------+----+
+---+-------+----+

如上所示，我可以看到数据，如果我从Parquet地板直接读取，但不是从Hive。问题是我在这里做错了什么？我做错了，Hive没有得到数据。我认为msck修复可能是一个原因，但我得到错误，如果我试图做msck修复表说表没有分区。

Hive apache-spark apache-spark-sql parquet hiveql

来源：https://stackoverflow.com/questions/53768358/hive-external-table-on-parquet-not-fetching-data

4条答案

按热度按时间

g0czyy6m1#

根据create table语句，您已经使用location作为/test/emp，但是在写入数据时，您正在/tenants/gwm/idr/emp处写入数据。所以在/test/emp中没有数据。
create external hive table create external table emp（id double，hire_dt timestamp，user string）存储为parquet location'/test/emp'；
请将外部表重新创建为
create external hive table create external table emp（id double，hire\u dt timestamp，user string）存储为parquet location'/tenants/gwm/idr/emp'；

赞(0）回复(0）举报 2021-06-27

5cnsuln72#

我和下面的人一起工作。

val dfTransformed = employeeDf.withColumn("id", employeeDf.col("id").cast(DoubleType))
            .withColumn("hire_dt", employeeDf.col("hire_dt".cast(TimestampType))

所以基本上问题是数据类型不匹配，以及转换的原始代码似乎无法工作。所以我做了一个显式的cast，然后写出来，一切正常，也可以查询回来。

val employeeDf = Seq(("1", "2018-01-01","John"),("2","2018-12-01", "Adam")).toDF("id","hire_dt","user")

val dfTransformed = employeeDf.withColumn("id", employeeDf.col("id").cast(DoubleType))
    .withColumn("hire_dt", employeeDf.col("hire_dt".cast(TimestampType))

dfTransformed.write.mode("append").parquet("/test/emp")
dfTransformed.show 

--read the contents directly from parquet 
val sqlcontext=new org.apache.spark.sql.SQLContext(sc)
sqlcontext.read.parquet("/test/emp").show 

+---+----------+----+
| id|   hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+

--read from the external hive table 
spark.sql("select  id,hire_dt,user from  emp").show(false)
+---+----------+----+
| id|   hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+

赞(0）回复(0）举报 2021-06-27

irtuqstp3#

在sparksession builder（）语句中是否有enablehivesupport（）。你能连接到hive元存储吗？尝试在代码中显示表/数据库，看看是否可以显示配置单元位置上的表？

赞(0）回复(0）举报 2021-06-27

wmtdaxz34#

除了下面ramdev给出的答案外，您还需要谨慎使用日期/时间戳周围的正确数据类型；作为 date '创建配置单元表时，Parquet不支持'type'。
为此你可以改变 date '列的类型' hire_dt '收件人' timestamp '.
否则，通过spark持久化并尝试在hive（或hivesql）中读取的数据将不匹配。在这两个位置保持“timestamp”可以解决问题。希望对你有帮助。

赞(0）回复(0）举报 2021-06-27

我来回答

parquet上的hive外部表未获取数据

4条答案

相关问题

热门标签

最新问答