pyspark在执行mapreduce作业时抛出错误

2cmtqfgy  于 2021-05-29  发布在  Hadoop
关注(0)|答案(2)|浏览(376)

我有以下pyspark代码抛出错误

data = sc.textFile("file:///zika-map/cdc_zika/update_clean_zika.csv")
header = data.first()
byCountryNoHeader = data.filter(lambda x: x!=header)
sepColumn = byCountryNoHeader.map(lambda x: x.split(","))
byCountry =sepColumn.map(lambda x: (x[1], x[5])).reduceByKey(lambda x,y: int(x)+int(y))
byCountry.collect()

更新\u clean \u zika.csv的数据如下:

report date country city    location type   data field  value   unit
19/03/2016  Argentina   Buenos Aires    province    cumulative confirmed local cases    0   cases
19/03/2016  Argentina   Buenos Aires    province    cumulative probable local cases 0   cases
19/03/2016  Argentina   Buenos Aires    province    cumulative confirmed imported cases 2   cases
19/03/2016  Argentina   Buenos Aires    province    cumulative probable imported cases  1   cases
19/03/2016  Argentina   Buenos Aires    province    cumulative cases under study    127 cases
19/03/2016  Argentina   Buenos Aires    province    cumulative cases discarded  0   cases
19/03/2016  Argentina   CABA    province    cumulative confirmed local cases    0   cases
19/03/2016  Argentina   CABA    province    cumulative probable local cases 0   cases
19/03/2016  Argentina   CABA    province    cumulative confirmed imported cases 9   cases
19/03/2016  Argentina   CABA    province    cumulative probable imported cases  0   cases
19/03/2016  Argentina   CABA    province    cumulative cases under study    68  cases

基本上,我想做的是,绘制有病例的国家,然后根据国家得出总病例数。Map工作正常,但会减少导致以下错误的关键:

Traceback (most recent call last):

  File "<ipython-input-19-db6ad3fdabe0>", line 16, in <module>
    byCountry.groupByKey().collect()

  File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 771, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())

  File "C:\Spark\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py", line 813, in __call__
    answer, self.gateway_client, self.target_id, self.name)

  File "C:\Spark\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
    format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 46.0 failed 1 times, most recent failure: Lost task 0.0 in stage 46.0 (TID 63, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process
  File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 2346, in pipeline_func
  File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 2346, in pipeline_func
  File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 317, in func
  File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 1776, in combineLocally
  File "C:\Spark\python\lib\pyspark.zip\pyspark\shuffle.py", line 238, in mergeValues
    d[k] = comb(d[k], v) if k in d else creator(v)
  File "<ipython-input-19-db6ad3fdabe0>", line 7, in <lambda>
ValueError: invalid literal for int() with base 10: 'zika confirmed laboratory'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process
  File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 2346, in pipeline_func
  File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 2346, in pipeline_func
  File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 317, in func
  File "C:\Spark\python\lib\pyspark.zip\pyspark\rdd.py", line 1776, in combineLocally
  File "C:\Spark\python\lib\pyspark.zip\pyspark\shuffle.py", line 238, in mergeValues
    d[k] = comb(d[k], v) if k in d else creator(v)
  File "<ipython-input-19-db6ad3fdabe0>", line 7, in <lambda>
ValueError: invalid literal for int() with base 10: 'zika confirmed laboratory'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
1 more

我尝试过stackoverflow的各种方法和不同的主题,但没有成功。任何帮助或建议都将不胜感激。

jucafojl

jucafojl1#

你有一个 Value Error 在lambda函数中执行以下操作时: int(x) + int(y) . 标准显示: ValueError: invalid literal for int() with base 10: 'zika confirmed laboratory' 这意味着 x[5] 无法转换为int,即“zika受限实验室”无法转换为int。您可能只需要修复索引。

9o685dep

9o685dep2#

我基本上已经弄明白了,没有什么空值问题。因此,创建了一个Dataframe,然后使用sparksql就可以忽略这些。

相关问题