Spark: compute statistics for array columns (ArrayType)

3hvapo4f posted on 2021-06-25 in Hive


I am currently using Spark 2.4.3 and have to perform several joins, for example:

val res = df1.join(df2, Seq("x")).join(df3, Seq("y")).join(df4, Seq("z"))

where each DataFrame involved is created from a Hive table that contains ArrayType columns. For example:

val df1 = spark.sql("SELECT x FROM TableWithArrayCols LATERAL VIEW EXPLODE(xArray) explodedX as x")

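As you can see, creating the DataFrames requires exploding the columns of TableWithArrayCols.
I would like the CBO (cost-based optimizer) to apply its join-reordering optimization. However, as far as I understand, this cannot happen because statistics for the array columns are missing. In fact, if I run: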

analyze table TableWithArrayCols compute statistics for columns xArray

I get the following exception:

org.apache.spark.sql.AnalysisException: Column xArray in table `myDB`.`TableWithArrayCols` is of type
ArrayType(StringType,true), and Spark does not support statistics collection on this column type.;
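
For context, the join reordering mentioned above is gated by two settings that are off by default in Spark 2.4. The sketch below assumes an existing SparkSession named `spark` (as in spark-shell) and the `df1` defined earlier. It only shows how to switch the settings on and look at the statistics the optimizer holds for a plan; it does not by itself make statistics available for the array column.

```scala
// Assumes an existing SparkSession `spark` and the DataFrame `df1` from above.

// Enable the cost-based optimizer (default: false in Spark 2.4).
spark.conf.set("spark.sql.cbo.enabled", "true")

// Enable cost-based join reordering (default: false in Spark 2.4).
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// Print the statistics attached to the optimized logical plan
// (size in bytes, and row count / column stats when they are known).
println(df1.queryExecution.optimizedPlan.stats)
```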

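My questions:
1) Can I manually add the statistics (e.g. min, max, num_nulls, distinct_count) to the corresponding metadata? If so, how? If not, is there another solution?
2) Would manually adding the statistics automatically trigger the optimization from the CBO?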
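
One possible direction for the "another solution" part of question 1, sketched under assumptions rather than offered as a confirmed answer: materialize the exploded column into its own table, where `x` is a plain string column, and collect column statistics there. The table name `exploded_x` is hypothetical, and the sketch assumes an existing SparkSession `spark` with write access to the current database. Whether the CBO then actually reorders the joins depends on the settings and on statistics being present for every table in the join, so this is not guaranteed.

```scala
// Assumes an existing SparkSession `spark`; `exploded_x` is a hypothetical table name.

// Explode the array column once and persist the result as a regular table.
val explodedX = spark.sql(
  "SELECT x FROM TableWithArrayCols LATERAL VIEW EXPLODE(xArray) explodedX AS x")
explodedX.write.mode("overwrite").saveAsTable("exploded_x")

// x is now a StringType column, so column-level statistics can be collected.
spark.sql("ANALYZE TABLE exploded_x COMPUTE STATISTICS FOR COLUMNS x")

// Build the DataFrame from the materialized table instead of the LATERAL VIEW query.
val df1 = spark.table("exploded_x")
```

The trade-off is extra storage and a refresh step whenever TableWithArrayCols changes, since the statistics describe the materialized copy, not the source table.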

Hive apache-spark apache-spark-sql query-optimization catalyst

Source: https://stackoverflow.com/questions/59930287/spark-compute-statistics-for-array-columns-arraytype
