How to read a Hive table along with its metadata in PySpark?

Posted by nue99wik in Hive, 8 months ago


I am unable to read a Hive table together with its metadata using PySpark.
I believe I created the Hive table correctly.
Setup:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

data1 = [(1,2,3),(3,4,5),(5,6,7)]
df1=spark.createDataFrame(data1,schema = 'a int,b int,c int')
parquet_path = './bucket_test_parquet1'

Now, write the table and inspect it with DESCRIBE:

df1.write.bucketBy(5,"a").format("parquet").saveAsTable('df',path=parquet_path,mode='overwrite')
spark.sql("DESCRIBE EXTENDED df").show(100)
output:
+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|                   a|                 int|   null|
|                   b|                 int|   null|
|                   c|                 int|   null|
|                    |                    |       |
|# Detailed Table ...|                    |       |
|            Database|             default|       |
|               Table|                  df|       |
|               Owner|               nitin|       |
|        Created Time|Tue Feb 01 09:05:...|       |
|         Last Access|             UNKNOWN|       |
|          Created By|         Spark 3.2.0|       |
|                Type|            EXTERNAL|       |
|            Provider|             parquet|       |
|         Num Buckets|                   5|       |
|      Bucket Columns|               [`a`]|       |
|        Sort Columns|                  []|       |
|            Location|file:/home/nitin/...|       |
|       Serde Library|org.apache.hadoop...|       |
|         InputFormat|org.apache.hadoop...|       |
|        OutputFormat|org.apache.hadoop...|       |
+--------------------+--------------------+-------+

read_parquet1 = spark.read.format("parquet").load(parquet_path,header=True)
read_parquet1.createOrReplaceTempView("rp1")
read_parquet1 = spark.table("rp1")
spark.sql("DESCRIBE EXTENDED rp1").show(100)
output:
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|       a|      int|   null|
|       b|      int|   null|
|       c|      int|   null|
+--------+---------+-------+

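As you can see, when I read the table back from disk, the metadata is not read. Can you help me read the table so that its metadata comes along with it?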


Source: https://stackoverflow.com/questions/70935318/how-to-read-hive-table-along-with-metadata-in-pyspark

1 Answer


relj7zay1#

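If what you want is the table schema for a data path, you can also do this:

read_parquet1 = spark.read.format("parquet").load(parquet_path, header=True)
read_parquet1.printSchema()

This will give you the result you want.
The problem with your code is that, in the first snippet, you wrote the data to a location and asked the saved table for its definition, whereas in the second case you are reading from a location, creating a temporary view, and asking that temporary view for its definition. A temporary view only knows the columns of the DataFrame it wraps. Ideally, you should ask for the schema of the data path, as in my code above.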
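The distinction above can be made concrete. Details such as `Num Buckets` and `Bucket Columns` are recorded in the catalog entry created by `saveAsTable`, not in the Parquet files, so a path-based read (and any temporary view built on it) has no way to report them. The following is a minimal sketch, not part of the original answer: `split_describe_rows` and `table_metadata` are hypothetical helper names, and the code assumes that `DESCRIBE EXTENDED` returns rows shaped like the output shown in the question.

```python
def split_describe_rows(rows):
    """Split DESCRIBE EXTENDED rows into (columns, table_properties).

    `rows` is an iterable of (col_name, data_type, comment) tuples, for
    example the result of spark.sql("DESCRIBE EXTENDED df").collect().
    This helper is hypothetical and only mirrors the layout seen above.
    """
    columns = {}
    properties = {}
    in_details = False
    for row in rows:
        name, value = row[0], row[1]
        if name.startswith("#"):
            # Section header row; only the detailed-table section
            # is collected as table properties.
            in_details = name.startswith("# Detailed Table")
            continue
        if not name:
            continue  # blank separator row
        if in_details:
            properties[name] = value
        else:
            columns[name] = value
    return columns, properties


def table_metadata(spark, table_name):
    """Describe a catalog table (or view) by name using an existing session."""
    rows = spark.sql(f"DESCRIBE EXTENDED {table_name}").collect()
    return split_describe_rows(rows)
```

With the session from the question, `table_metadata(spark, "df")` would be expected to return properties that include the bucketing information, while `table_metadata(spark, "rp1")` would return only the columns, matching the two outputs in the question. To read the data itself through the catalog rather than from the path, `spark.table("df")` refers to the saved table by name.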
