如何将配置单元表转换为mllib标签点？

dffbzjpn 于 2021-06-02 发布在 Hadoop

关注(0)|答案(1)|浏览(237)

我使用impala构建了一个表，其中包含一个目标和数百个特性。我想用spark mllib来训练一个模型。我知道，为了通过spark运行分布式监督模型，数据需要采用以下几种格式之一。标签点似乎是最直观的我。使用pyspark将配置单元表转换为标记点的最有效方法是什么？

hadoop Hive apache-spark pyspark apache-spark-mllib

来源：https://stackoverflow.com/questions/35585765/how-to-convert-a-hive-table-into-a-mllib-labeledpoint

1条答案

按热度按时间

jum4pzuy1#

这个问题的最佳解决方案可能是使用ml库及其模型，因为它们直接作用于Dataframe。
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=ml#module-pyspark.ml.分类
然而，mlapi还没有达到与mllib的功能奇偶性，您需要的东西可能丢失了。因此，我们在工作流中通过调用hive上下文检索到的Dataframe上的Map来解决这个问题。

from pyspark import SparkContext, HiveContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

table_name = "MyTable"
target_col = "MyTargetCol"

sc = SparkContext()
hc = HiveContext(sc)

# get the table from the hive context

df = hc.table(table_name) 

# reorder columns so that we know the index of the target column

df = df.select(target_col, *[col for col in dataframe.columns if col != target_col])

# map through the data to produce an rdd of labeled points

rdd_of_labeled_points = df.map(lambda row: LabeledPoint(row[0], row[1:]))

# use the rdd as input to a model

model = LogisticRegressionWithLBFGS.train(rdd_of_labeled_points)

请记住，无论何时使用python进行Map，都需要将数据从jvm封送到python vm，因此性能会受到影响。我们发现使用Map对性能的影响对于我们的数据来说可以忽略不计，但是您的里程数可能会有所不同。

赞(0）回复(0）举报 2021-06-02

我来回答

如何将配置单元表转换为mllib标签点？

1条答案

相关问题

热门标签

最新问答