Spark: VectorAssembler throws org.apache.spark.SparkException: Job aborted due to stage failure

htzpubme · posted 2021-05-29 in Spark

I have a DataFrame like the one below. The features are F1, F2, F3 and the output variable is Output.
+-----+-----+----+------+
|   F1|   F2|  F3|Output|
+-----+-----+----+------+
|6.575| 4.98|15.3|504000|
|6.421| 9.14|17.8|453600|
|7.185| 4.03|17.8|728700|
|6.998| 2.94|18.7|701400|
|7.147| 5.33|18.7|760200|
+-----+-----+----+------+

For Apache Spark to run any ML algorithm, we need two columns: a features column and an output label. The features column is a vector combining all the feature values. To build it, I use VectorAssembler.

from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import StructField, StringType, IntegerType, StructType

data_schema = [StructField('F1', IntegerType(), True),
               StructField('F2', IntegerType(), True),
               StructField('F3', IntegerType(), True),
               StructField('Output', IntegerType(), True)]
final_struc = StructType(fields=data_schema)

training = spark.read.csv('housing.csv', schema=final_struc)

vectorAssembler = VectorAssembler(inputCols=['F1', 'F2', 'F3'], outputCol='features')
vhouse_df = vectorAssembler.transform(training)
vhouse_df = vhouse_df.select(['features', 'Output'])

But when I try to display the result, I get an error:

vhouse_df.show()

Py4JJavaError: An error occurred while calling o948.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage 22.0 (TID 22, 10.0.2.15, executor driver): org.apache.spark.SparkException: Failed to execute user defined function(VectorAssembler$$Lambda$2720/0x00000008410e0840: (struct<F1_double_VectorAssembler_becd63a80d0f:double,F2_double_VectorAssembler_becd63a80d0f:double,F3_double_VectorAssembler_becd63a80d0f:double>) => struct<type:tinyint,size:int,indices:array,values:array>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)

ehxuflar1#

Change this line in your code:

vhouse_df = vhouse_df.select(['features', 'OUTPUT'])

to this:

vhouse_df = vhouse_df.select('features', 'OUTPUT')

4urapxun2#

Looking at the schema you provided: in your dataset all the input columns F1, F2 and F3 are doubles, so change IntegerType to DoubleType.

from pyspark.sql import types as T

data_schema = T.StructType([
    T.StructField('F1', T.DoubleType(), True),
    T.StructField('F2', T.DoubleType(), True),
    T.StructField('F3', T.DoubleType(), True),
    T.StructField('Output', T.IntegerType(), True),
])
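To see why the original schema fails, here is a minimal sketch (reusing the question's housing.csv and column names): a value like 6.575 cannot be parsed as an integer, so Spark's permissive CSV reader turns it into null, and VectorAssembler, whose handleInvalid option defaults to 'error', then fails inside its UDF exactly as in the stack trace above.

from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.ml.feature import VectorAssembler

# `spark` is the active SparkSession, as elsewhere in this thread.
# The question's schema: double-valued columns declared as integers.
int_schema = T.StructType([
    T.StructField('F1', T.IntegerType(), True),
    T.StructField('F2', T.IntegerType(), True),
    T.StructField('F3', T.IntegerType(), True),
    T.StructField('Output', T.IntegerType(), True),
])

bad = spark.read.csv('housing.csv', schema=int_schema)

# Every F1/F2/F3 value fails to parse as an integer, so those columns come back null:
bad.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in bad.columns]).show()

# VectorAssembler rejects nulls by default (handleInvalid='error'); 'skip' would
# silently drop such rows, but fixing the schema is the real cure:
assembler = VectorAssembler(inputCols=['F1', 'F2', 'F3'], outputCol='features',
                            handleInvalid='skip')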

You can also try something like this, which takes the schema directly from your df:

training = spark.read.csv('housing.csv', schema=df.schema)
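The df here is assumed to be a DataFrame you have already loaded with the correct types; the idea is to infer the schema once and then reuse it, so later reads skip the inference pass. A quick sketch:

# Infer the schema once; inferSchema triggers an extra pass over the file.
df = spark.read.csv('housing.csv', header=True, inferSchema=True)
df.printSchema()  # F1, F2, F3 should now show up as double

# Reuse the inferred schema for subsequent reads, skipping inference entirely.
training = spark.read.csv('housing.csv', header=True, schema=df.schema)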

wgmfuz8q3#


I'm not sure if this is exactly what you're looking for, but the following works fine for me:

from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler

df = spark.read.csv('/FileStore/tables/datasets_1379_2485_housing.csv', header="true", inferSchema="true")
vectorAssembler = VectorAssembler(inputCols = ['RM', 'LSTAT', 'PTRATIO'], outputCol = 'features')
vhouse_df = vectorAssembler.transform(df)
vhouse_df.show()

------- Output:

s4chpxco4#

+-----+-----+-------+--------+------------------+
|   RM|LSTAT|PTRATIO|    MEDV|          features|
+-----+-----+-------+--------+------------------+
|6.575| 4.98|   15.3|504000.0| [6.575,4.98,15.3]|
|6.421| 9.14|   17.8|453600.0| [6.421,9.14,17.8]|
|7.185| 4.03|   17.8|728700.0| [7.185,4.03,17.8]|
|6.998| 2.94|   18.7|701400.0| [6.998,2.94,18.7]|
|7.147| 5.33|   18.7|760200.0| [7.147,5.33,18.7]|
| 6.43| 5.21|   18.7|602700.0|  [6.43,5.21,18.7]|
|6.012|12.43|   15.2|480900.0|[6.012,12.43,15.2]|
|6.172|19.15|   15.2|569100.0|[6.172,19.15,15.2]|
|5.631|29.93|   15.2|346500.0|[5.631,29.93,15.2]|
|6.004| 17.1|   15.2|396900.0| [6.004,17.1,15.2]|
|6.377|20.45|   15.2|315000.0|[6.377,20.45,15.2]|
|6.009|13.27|   15.2|396900.0|[6.009,13.27,15.2]|
|5.889|15.71|   15.2|455700.0|[5.889,15.71,15.2]|
|5.949| 8.26|   21.0|428400.0| [5.949,8.26,21.0]|
|6.096|10.26|   21.0|382200.0|[6.096,10.26,21.0]|
|5.834| 8.47|   21.0|417900.0| [5.834,8.47,21.0]|
|5.935| 6.58|   21.0|485100.0| [5.935,6.58,21.0]|
| 5.99|14.67|   21.0|367500.0| [5.99,14.67,21.0]|
|5.456|11.69|   21.0|424200.0|[5.456,11.69,21.0]|
|5.727|11.28|   21.0|382200.0|[5.727,11.28,21.0]|
+-----+-----+-------+--------+------------------+
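With the features column assembled, the next step the question is building toward might look like this. A minimal sketch, assuming MEDV (the price column in the output above) is the regression label and an ordinary least-squares LinearRegression from pyspark.ml is the goal:

from pyspark.ml.regression import LinearRegression

# Keep only the assembled feature vector and the label column.
model_df = vhouse_df.select('features', 'MEDV')
train, test = model_df.randomSplit([0.8, 0.2], seed=42)

lr = LinearRegression(featuresCol='features', labelCol='MEDV')
lr_model = lr.fit(train)

print(lr_model.coefficients, lr_model.intercept)
print(lr_model.evaluate(test).rootMeanSquaredError)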
