Spark randomSplit: training and testing data row counts always give different results

Posted by aoyhnmkz on 2021-05-22 in Spark


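I am testing some binary classification machine learning problems in PySpark and want to get the typical evaluation metrics for a classification model (recall, F1 score, and precision). I am doing this in a Jupyter notebook. To train and test my model, I use the randomSplit() function.
While doing so, I got inconsistent results for all of those metrics. I dug a little deeper and realized that even calling count() on the training and testing datasets gave inconsistent results: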


# Split data into training and testing sets

(training_data, test_data) = eq_df.randomSplit([0.75, 0.25])

# This was printing inconsistent results!

print("Size of training set:", training_data.count())
print("Size of testing set:", test_data.count())

Does anyone know what is going on?

apache-spark pyspark

Source: https://stackoverflow.com/questions/64332759/spark-randomsplit-training-and-testing-data-row-count-always-giving-different-re

1 Answer


au9on6nz1#

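After some further investigation, I found this article:
https://medium.com/udemy-engineering/pyspark-under-the-hood-randomsplit-and-sample-inconsistencies-examined-7c6ec62644bc
It explains that each time count() is called on the training_data and test_data DataFrames, randomSplit() runs again. That is why I was getting different results for the counts and other computations on these DataFrames: randomSplit() was being recomputed in the background every time.
To solve this, and thanks to Steven's comment below, I cached the testing and training datasets and got consistent results.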


# Split data into training and testing sets

(training_data, test_data) = eq_df.randomSplit([0.75, 0.25])

# Cache result so that these datasets remain constant throughout the code

training_data.cache()
test_data.cache()
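
To illustrate why caching helps, here is a minimal pure-Python sketch of lazy re-evaluation. It is a toy model only and does not reproduce how Spark implements randomSplit(): the LazyFrame class and the make_split helper are hypothetical names invented for this illustration, and the random filter merely stands in for an upstream computation that is not deterministic between evaluations.

```python
import random


class LazyFrame:
    """Toy stand-in for a lazily evaluated dataset (NOT a Spark API)."""

    def __init__(self, compute):
        self._compute = compute      # function that produces the rows
        self._use_cache = False
        self._cached_rows = None
        self.evaluations = 0         # how many times compute() actually ran

    def cache(self):
        # Only marks the frame for caching; nothing is computed
        # until the first action (here: count()).
        self._use_cache = True
        return self

    def count(self):
        # Reuse the materialized rows if caching is on and they exist.
        if self._use_cache and self._cached_rows is not None:
            return len(self._cached_rows)
        # Otherwise re-run the whole upstream computation.
        self.evaluations += 1
        rows = self._compute()
        if self._use_cache:
            self._cached_rows = rows
        return len(rows)


def make_split(rng, n_rows=1000, weight=0.75):
    # Hypothetical nondeterministic upstream: each evaluation draws anew.
    return LazyFrame(lambda: [i for i in range(n_rows) if rng.random() < weight])


rng = random.Random(0)

# Without cache(): every count() re-runs the random selection.
uncached = make_split(rng)
uncached_counts = [uncached.count(), uncached.count()]

# With cache(): the first count() materializes the rows, later ones reuse them.
cached = make_split(rng).cache()
cached_counts = [cached.count(), cached.count()]

print("uncached:", uncached_counts, "evaluations:", uncached.evaluations)
print("cached:  ", cached_counts, "evaluations:", cached.evaluations)
```

In this toy model the uncached frame is evaluated twice, so its two counts will usually differ, while the cached frame is evaluated once and reports the same count both times. As in the sketch, cache() in PySpark is lazy: the data is materialized by the first action that runs after it.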

Posted 2021-05-23
