Exchange SinglePartition error in a Spark SQL query

Posted by eoxn13cs on 2021-05-27 in Spark

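I am running a query on Spark SQL 2.4, and the SQL below throws an error. The query is large and has several steps, so I am giving a condensed version. When I execute the query from spark-shell, it fails with the error shown further down. The explain plan is quite long, so I have trimmed it to a more manageable size.
I have checked that the column used in partition by encnbr is fairly unique. However, the Stages tab in the Spark UI shows only one very long-running task, which indicates skew. Since the keys are unique, I do not understand why this happens. I have tried cluster by encnbr, to no avail.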
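A column can be "fairly unique" on average and still contain a few heavy keys, and a window partitioned by that key processes all rows of one key in a single task. The sketch below shows the kind of check I mean, in plain Python on made-up data. The names key_skew_report and sample_keys are hypothetical and are not part of the real job; on the actual tables the equivalent check would be a group-by count on encnbr, sorted by the count in descending order.

```python
from collections import Counter


def key_skew_report(keys, top_n=3):
    """Return (distinct_keys, total_rows, heaviest_keys) for a list of partition keys."""
    counts = Counter(keys)
    return len(counts), len(keys), counts.most_common(top_n)


# Hypothetical sample: 1000 distinct keys, one of which repeats 500 extra times.
sample_keys = ["E%04d" % i for i in range(1000)] + ["E0000"] * 500

distinct, total, heaviest = key_skew_report(sample_keys)
print("distinct keys:", distinct)
print("total rows:", total)
print("heaviest key:", heaviest[0])
```

If the heaviest key holds a large share of the rows, the window stage would be expected to show one long task even though most keys are unique.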

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:

Exchange SinglePartition

+-  *(79) LocalLimit 4
    +- *(79) Project [enc_key#976, prsn_key#951, prov_key#952, clm_key#977, clm_ln_key#978... 7 more fields]
       +- Window [lag(non_keys#2862, 1, null) windowspecdefinition(encnbr#2722, eff_dt#713 ASC NULLS FIRST, data_timestamp#2723 ASC NULLS FIRST, specifiedwindowframe(Rowframe, -1, -1)) AS _we0#2868], [encnbr#2722], [eff_dt#713 ASC NULLS FIRST, data_timestamp#2723 ASC NULLS FIRST]......

The query consists of several steps, each of which depends on the result of the previous one. The failing step looks like this:

select
enc_key,
prsn_key,
prov_key,
clm_key,
clm_ln_key,
birth_dt,
case when lag(non_keys) over (partition by encnbr order by eff_dt asc, data_timestamp asc) is null
     then 'Y'
     when lag(non_keys) over (partition by encnbr order by eff_dt asc, data_timestamp asc) <> non_keys
     then 'Y'
     else 'N'
end as mod_flg
FROM (
      select
      enc_key,
      encnbr,
      prsn_key,
      prov_key,
      clm_key,
      clm_ln_key,
      birth_dt,
      eff_dt,
      data_timestamp,
      md5(enc_key || prsn_key || prov_key || clm_key || clm_ln_key) as non_keys
      from 
      table1
      where encnbr is not null

      union all

      select
      enc_key,
      encnbr,
      prsn_key,
      prov_key,
      clm_key,
      clm_ln_key,
      birth_dt,
      eff_dt,
      data_timestamp,
      md5(enc_key || prsn_key || prov_key || clm_key || clm_ln_key) as non_keys
      from 
      table2
      where encnbr is not null
   )
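To make the intent of mod_flg concrete, here is a minimal model of the lag comparison in plain Python. It is only a sketch of the logic, not the Spark job: flag_changes and the sample rows are hypothetical, the ordering columns are assumed to be non-null, and non_keys stands in for the md5 value computed in the subquery.

```python
from itertools import groupby
from operator import itemgetter


def flag_changes(rows):
    """Model of the mod_flg CASE expression.

    rows: iterable of dicts with encnbr, eff_dt, data_timestamp, non_keys.
    Returns the rows ordered within each encnbr partition, with mod_flg added.
    """
    ordered = sorted(rows, key=itemgetter("encnbr", "eff_dt", "data_timestamp"))
    out = []
    for _, group in groupby(ordered, key=itemgetter("encnbr")):
        prev = None  # lag() is NULL on the first row of each partition
        for row in group:
            cur = row["non_keys"]
            if prev is None:
                flag = "Y"  # first branch: lag(...) is null
            elif cur is not None and prev != cur:
                flag = "Y"  # second branch: lag(...) <> non_keys
            else:
                flag = "N"  # includes the case where the comparison is NULL
            out.append({**row, "mod_flg": flag})
            prev = cur
    return out


# Hypothetical rows: key A changes on its third record, key B has one record.
rows = [
    {"encnbr": "A", "eff_dt": "2020-01-01", "data_timestamp": 1, "non_keys": "h1"},
    {"encnbr": "A", "eff_dt": "2020-01-01", "data_timestamp": 2, "non_keys": "h1"},
    {"encnbr": "A", "eff_dt": "2020-02-01", "data_timestamp": 1, "non_keys": "h2"},
    {"encnbr": "B", "eff_dt": "2020-01-01", "data_timestamp": 1, "non_keys": "h1"},
]
flags = [r["mod_flg"] for r in flag_changes(rows)]
print(flags)
```

Each partition is handled independently, which is why the distribution of rows across encnbr values matters for how the work is split into tasks.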

Can you help me mitigate this issue? I have tried cluster by encnbr, but the query still keeps failing.
Please help.
Thanks.

apache-spark apache-spark-sql

Source: https://stackoverflow.com/questions/63323647/exchange-singlepartition-error-in-sparksql-query