I have a DataFrame like the one shown below. The features are F1, F2 and F3, and the output variable is Output.
```
+-----+----+----+------+
|   F1|  F2|  F3|Output|
+-----+----+----+------+
|6.575|4.98|15.3|504000|
|6.421|9.14|17.8|453600|
|7.185|4.03|17.8|728700|
|6.998|2.94|18.7|701400|
|7.147|5.33|18.7|760200|
+-----+----+----+------+
```
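To run any ML algorithm in Apache Spark, we need two columns: features and an output label. The features column is a vector that combines all the feature values. To build it, I am using VectorAssembler.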
```
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import StructField, StringType, IntegerType, StructType

data_schema = [StructField('F1', IntegerType(), True),
               StructField('F2', IntegerType(), True),
               StructField('F3', IntegerType(), True),
               StructField('Output', IntegerType(), True)]
final_struc = StructType(fields=data_schema)
training = spark.read.csv('housing.csv', schema=final_struc)
vectorAssembler = VectorAssembler(inputCols=['F1', 'F2', 'F3'], outputCol='features')
vhouse_df = vectorAssembler.transform(training)
vhouse_df = vhouse_df.select(['features', 'Output'])
```
When I try to display vhouse_df, I get an error:
```
vhouse_df.show()
Py4JJavaError: An error occurred while calling o948.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage 22.0 (TID 22, 10.0.2.15, executor driver): org.apache.spark.SparkException: Failed to execute user defined function(VectorAssembler$$Lambda$2720/0x00000008410e0840: (struct<F1_double_VectorAssembler_becd63a80d0f:double,F2_double_VectorAssembler_becd63a80d0f:double,F3_double_VectorAssembler_becd63a80d0f:double>) => struct<type:tinyint,size:int,indices:array,values:array>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
```
5 Answers
ehxuflar1#
Change this in your code
to -------
4urapxun2#
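Look at the schema you provided -
In the dataset, the input columns F1, F2 and F3 all hold double values, so change IntegerType to DoubleType for them.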
You can also try something like this, which maps the schema directly from your df -
And one more brief suggestion -
wgmfuz8q3#
I'm not sure if this is what you are looking for, but the following works fine for me -
------- output
s4chpxco4#
pgvzfuti5#
Update