Re-run Spark jobs on failure or abort

Posted by vs3odd8k on 2021-06-02 in Hadoop


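I am looking for a configuration or parameter that automatically restarts Spark jobs submitted via YARN when they fail. I know that tasks are restarted automatically on failure. What I am looking for is a YARN or Spark configuration that triggers a re-run of the whole job.
Right now, if any of our jobs aborts for any reason, we have to restart it manually. Because these jobs are designed to run in near real time, this leaves a long backlog of queued data to process.
Current configuration: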


#!/bin/bash

export SPARK_MAJOR_VERSION=2

# Minimum TODOs on a per job basis:

# 1. define name, application jar path, main class, queue and log4j-yarn.properties path

# 2. remove properties not applicable to your Spark version (Spark 1.x vs. Spark 2.x)

# 3. tweak num_executors, executor_memory (+ overhead), and backpressure settings

# the two most important settings:

num_executors=6
executor_memory=32g

# 3-5 cores per executor is a good default balancing HDFS client throughput vs. JVM overhead

# see http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

executor_cores=2

# backpressure

receiver_min_rate=1
receiver_max_rate=10
receiver_initial_rate=10

/usr/hdp/2.6.1.0-129/spark2/bin/spark-submit --master yarn --deploy-mode cluster \
  --name br1_warid_ccn_sms_production \
  --class com.spark.main \
  --driver-memory 16g \
  --num-executors ${num_executors} --executor-cores ${executor_cores} --executor-memory ${executor_memory} \
  --queue default \
  --files log4j-yarn-warid-br1-ccn-sms.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-yarn-warid-br1-ccn-sms.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-yarn-warid-br1-ccn-sms.properties" \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer `# Kryo Serializer is much faster than the default Java Serializer` \
  --conf spark.kryoserializer.buffer.max=1g \
  --conf spark.locality.wait=30 \
  --conf spark.task.maxFailures=8 `# Increase max task failures before failing job (Default: 4)` \
  --conf spark.ui.killEnabled=true `# Allow killing of stages and corresponding jobs from the Spark UI (set to false to prevent it)` \
  --conf spark.logConf=true `# Log Spark Configuration in driver log for troubleshooting` \
`# SPARK STREAMING CONFIGURATION` \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.default.parallelism=32 \
  --conf spark.streaming.blockInterval=200 `# [Optional] Tweak to balance data processing parallelism vs. task scheduling overhead (Default: 200ms)` \
  --conf spark.streaming.receiver.writeAheadLog.enable=true `# Prevent data loss on driver recovery` \
  --conf spark.streaming.backpressure.enabled=false \
  --conf spark.streaming.kafka.maxRatePerPartition=${receiver_max_rate} `# [Spark 1.x]: Corresponding max rate setting for Direct Kafka Streaming (Default: not set)` \
`# YARN CONFIGURATION` \
  --conf spark.yarn.driver.memoryOverhead=4096 `# [Optional] Set if --driver-memory < 5GB` \
  --conf spark.yarn.executor.memoryOverhead=4096 `# [Optional] Set if --executor-memory < 10GB` \
  --conf spark.yarn.maxAppAttempts=4 `# Increase max application master attempts (needs to be <= yarn.resourcemanager.am.max-attempts in YARN, which defaults to 2) (Default: yarn.resourcemanager.am.max-attempts)` \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h `# Attempt counter considers only the last hour (Default: (none))` \
  --conf spark.yarn.max.executor.failures=$((8 * ${num_executors})) `# Increase max executor failures (Default: max(numExecutors * 2, 3))` \
  --conf spark.yarn.executor.failuresValidityInterval=1h `# Executor failure counter considers only the last hour` \
  --conf spark.task.maxFailures=8 \
  --conf spark.speculation=false \
/home//runscripts/production.jar

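Note: There are a couple of questions on this topic, but they either have no accepted answer or the answer deviates from the expected solution: "Running a Spark application on YARN, without spark-submit" and "How to configure automatic restart of the application driver on YARN".
This question explores possible solutions from the perspective of YARN and Spark.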

hadoop yarn apache-spark spark-streaming hortonworks-data-platform

Source: https://stackoverflow.com/questions/46999601/re-run-spark-jobs-on-failure-or-abort

2 Answers


rqmkfv5c1#

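Just a thought!
Let us call the script file that contains the script above run_spark_job.sh .
Try adding the following statements at the end of that script, right after the spark-submit command: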

return_code=$?

if [[ ${return_code} -ne 0 ]]; then
    echo "Job failed"
    exit ${return_code}
fi

echo "Job succeeded"
exit 0

Now create a second script file, spark_job_runner.sh , which calls the script above. For example:

./run_spark_job.sh
while [ $? -ne 0 ]; do
    ./run_spark_job.sh
done
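
The loop above retries forever and without any pause, so a job that fails immediately would be resubmitted in a tight loop. A minimal sketch of a bounded variant follows. The function name retry_with_backoff and the attempt and delay values are hypothetical, and it assumes that run_spark_job.sh exits non-zero when the application fails (in cluster mode this relies on spark-submit waiting for the application to finish, which is controlled by spark.yarn.submit.waitAppCompletion and is true by default).

```shell
#!/bin/bash

# Hypothetical helper: run a command up to max_attempts times,
# doubling the pause between attempts after every failure.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command> [args...]
retry_with_backoff() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1 rc=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            echo "Attempt ${attempt} succeeded"
            return 0
        else
            rc=$?
        fi
        echo "Attempt ${attempt} of ${max_attempts} failed with exit code ${rc}" >&2
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
            delay=$((delay * 2))
        fi
        attempt=$((attempt + 1))
    done
    # All attempts failed: hand the last exit code back to the caller.
    return "$rc"
}

# Example with assumed values: at most 5 attempts, first pause of 30 seconds.
# retry_with_backoff 5 30 ./run_spark_job.sh
```

Capping the attempts keeps a permanently broken job from flooding the queue, and the final exit code can still be picked up by a scheduler or an alerting hook.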

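YARN-based approaches:
Update 1: This link is a good read. It discusses the YARN REST API for submitting and tracking applications: https://community.hortonworks.com/articles/28070/starting-spark-jobs-directly-via-yarn-rest-api.html
Update 2: This link shows how to submit a Spark application to a YARN environment from Java: https://github.com/mahmoudparsian/data-algorithms-book/blob/master/misc/how-to-submit-spark-job-to-yarn-from-java-code.md
Spark-based programmatic approach:
How to use the programmatic spark-submit capability
Spark-based configuration approach for YARN:
The only restart-related Spark parameter in YARN mode is spark.yarn.maxAppAttempts , and it should not exceed the YARN ResourceManager parameter yarn.resourcemanager.am.max-attempts . Excerpt from the official documentation (https://spark.apache.org/docs/latest/running-on-yarn.html):
The maximum number of attempts that will be made to submit the application.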
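
Since the Spark value should not exceed the YARN limit, the submit script can clamp it before passing it on. The following is only a sketch: effective_app_attempts is a hypothetical helper, and the numbers are assumptions taken from the question (4 requested attempts against the YARN default of 2).

```shell
#!/bin/bash

# Hypothetical helper: print the smaller of the requested number of
# application attempts and the cluster-wide YARN limit.
effective_app_attempts() {
    local requested=$1 yarn_max=$2
    if [ "$requested" -gt "$yarn_max" ]; then
        echo "$yarn_max"
    else
        echo "$requested"
    fi
}

# Assumed values: the job asks for 4 attempts, the cluster allows 2.
attempts=$(effective_app_attempts 4 2)
echo "--conf spark.yarn.maxAppAttempts=${attempts}"
# prints: --conf spark.yarn.maxAppAttempts=2
```

To actually get 4 attempts in this scenario, yarn.resourcemanager.am.max-attempts would have to be raised on the cluster side as well.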

2021-06-02

ngynwnxp2#

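In YARN mode, you can use yarn.resourcemanager.am.max-attempts, which defaults to 2, to have failed jobs re-run; increase the value as needed. Alternatively, you can use Spark's spark.yarn.maxAppAttempts setting to achieve the same thing.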
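
Before raising either setting it helps to know what the cluster currently allows. A rough sketch for reading the value from yarn-site.xml is shown below. The helper read_hadoop_property is hypothetical, the file location in the example is an assumption, and the parsing is deliberately naive: it expects the usual layout in which a property's <value> follows its <name> and would misread a property that has no value.

```shell
#!/bin/bash

# Hypothetical helper: print the <value> that follows a given <name>
# in a Hadoop-style XML configuration file. Prints nothing if the
# property is not present (YARN then falls back to its default).
read_hadoop_property() {
    local file=$1 name=$2
    awk -v name="$name" '
        index($0, "<name>" name "</name>") { found = 1 }
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$file"
}

# Example with an assumed path; adjust to your cluster's configuration directory.
# read_hadoop_property /etc/hadoop/conf/yarn-site.xml yarn.resourcemanager.am.max-attempts
```

An empty result means the property is not overridden in that file, in which case the default of 2 mentioned above applies.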

2021-06-02