Why doesn't the Spark package resolver (`--packages`) copy dependencies to $SPARK_HOME/jars?

Posted by scyqe7ek on 2021-05-18 in Spark

Can someone explain why I had to manually copy com.amazonaws_aws-java-sdk-bundle into my local $SPARK_HOME, even though I am using the automatic package resolver --packages?
What I did was launch spark-shell like this:

$SPARK_HOME/bin/spark-shell \
  --master k8s://https://localhost:6443  \
  --deploy-mode client  \
  --conf spark.executor.instances=1  \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark  \
  --conf spark.kubernetes.container.image=spark:spark-docker  \
  --packages org.apache.hadoop:hadoop-aws:3.2.0,io.delta:delta-core_2.12:0.7.0 \
  --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
  --conf spark.hadoop.fs.path.style.access=true \
  --conf spark.hadoop.fs.s3a.access.key=$MINIO_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=$MINIO_SECRET_KEY \
  --conf spark.hadoop.fs.s3a.endpoint=$MINIO_ENDPOINT \
  --conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
  --conf spark.hadoop.fs.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.driver.port=4040 \
  --name spark-locally

My setup is the latest Spark 3.0.1 with Hadoop 3.2, running against local Kubernetes on Docker Desktop for Mac.
As mentioned, the command above successfully downloads the dependencies from --packages org.apache.hadoop:hadoop-aws:3.2.0, which pulls in com.amazonaws_aws-java-sdk-bundle-1.11.375 as a transitive dependency:

Ivy Default Cache set to: /Users/sspaeti/.ivy2/cache
The jars for the packages stored in: /Users/sspaeti/.ivy2/jars
:: loading settings :: url = jar:file:/Users/sspaeti/Documents/spark/spark-3.0.1-bin-hadoop3.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hadoop#hadoop-aws added as a dependency
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-91fd31e1-0b2a-448c-9c69-fd9dc430d41c;1.0
    confs: [default]
    found org.apache.hadoop#hadoop-aws;3.2.0 in central
    found com.amazonaws#aws-java-sdk-bundle;1.11.375 in central
    found io.delta#delta-core_2.12;0.7.0 in central
    found org.antlr#antlr4;4.7 in central
    found org.antlr#antlr4-runtime;4.7 in central
    found org.antlr#antlr-runtime;3.5.2 in central
    found org.antlr#ST4;4.0.8 in central
    found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central
    found org.glassfish#javax.json;1.0.4 in central
    found com.ibm.icu#icu4j;58.2 in central
:: resolution report :: resolve 376ms :: artifacts dl 22ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.375 from central in [default]
    com.ibm.icu#icu4j;58.2 from central in [default]
    io.delta#delta-core_2.12;0.7.0 from central in [default]
    org.abego.treelayout#org.abego.treelayout.core;1.0.3 from central in [default]
    org.antlr#ST4;4.0.8 from central in [default]
    org.antlr#antlr-runtime;3.5.2 from central in [default]
    org.antlr#antlr4;4.7 from central in [default]
    org.antlr#antlr4-runtime;4.7 from central in [default]
    org.apache.hadoop#hadoop-aws;3.2.0 from central in [default]
    org.glassfish#javax.json;1.0.4 from central in [default]

But why do I still get the error java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException? What I don't understand is: since I am running in deploy-mode client, I assumed Maven/Ivy would resolve all dependencies for my local Spark (driver), wouldn't it? Or where is the missing piece of the puzzle?
I also tried --packages org.apache.hadoop:hadoop-aws:3.2.0,io.delta:delta-core_2.12:0.7.0,com.amazonaws:aws-java-sdk-bundle:1.11.375, with no luck either.
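
For what it's worth, the bundle does seem to get downloaded; a quick check along these lines (my own sanity check, assuming the Ivy cache path reported in the log above) confirms the jar is on disk, so it only appears to be missing from the driver's classpath:

# Hypothetical sanity check (not part of the original run): confirm the bundle
# landed in the Ivy jars directory reported in the resolution log above
ls -l $HOME/.ivy2/jars/ | grep -i aws-java-sdk-bundle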

My workaround, though I don't understand why it is needed

What worked was copying the jar manually (either from Maven or straight from my downloaded .ivy2 folder), like this: cp $HOME/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.375.jar $SPARK_HOME/jars. After that I could read from and write to my local S3 (MinIO) successfully.
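
An alternative I have not actually tried would be to leave $SPARK_HOME/jars untouched and point the shell at the already-downloaded jar with --jars instead (untested sketch only; the jar path is the one from the cp command above, and the remaining --conf flags stay the same as in the full command):

# Untested sketch: pass the cached bundle explicitly instead of copying it
# into $SPARK_HOME/jars; all other --conf flags as in the spark-shell command above
$SPARK_HOME/bin/spark-shell \
  --packages org.apache.hadoop:hadoop-aws:3.2.0,io.delta:delta-core_2.12:0.7.0 \
  --jars $HOME/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.375.jar \
  --name spark-locally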

Working with Jupyter

Another strange thing: I also have a Jupyter notebook running on my local Kubernetes, and there --packages works fine. I am using PySpark in that notebook, so the difference is that it works in PySpark but not in spark-shell?
If so, how would I run the same test locally with PySpark instead of spark-shell?
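
This is roughly what I have in mind for the same test via the pyspark launcher (untested sketch; it just reuses the flags from the spark-shell command above):

# Untested sketch of the equivalent local pyspark session; the remaining
# --conf flags from the spark-shell command above would carry over unchanged
$SPARK_HOME/bin/pyspark \
  --master k8s://https://localhost:6443 \
  --deploy-mode client \
  --packages org.apache.hadoop:hadoop-aws:3.2.0,io.delta:delta-core_2.12:0.7.0 \
  --conf spark.hadoop.fs.s3a.endpoint=$MINIO_ENDPOINT \
  --conf spark.hadoop.fs.s3a.access.key=$MINIO_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=$MINIO_SECRET_KEY \
  --conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
  --name spark-locally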
Many thanks for any explanation; I have already wasted a lot of time on this.
