PySpark: can't train a SageMaker model using RecordIO-protobuf data saved in S3

Posted by rqqzpn5f 6 months ago in Spark


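I want to preprocess data with PySpark on Amazon EMR and then train a machine learning model in SageMaker using pipe mode. The problem I am facing is saving the data to S3 and feeding it to the model.
The SageMaker model accepts the application/x-recordio-protobuf content type, so I save the data as follows: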

output_path = f"s3://my_path/output_processed"
df_transformed.write.format("sagemaker").mode("overwrite").save(output_path)

where df_transformed is a PySpark DataFrame.
When I try to feed my data to the model:

records = RecordSet(s3_data=train_path, s3_data_type='S3Prefix', num_records=-1, feature_dim=50)
rcf.fit(records)

I get this error:

Failed. Reason: ClientError: Unable to read data channel 'train'. Requested content-type is 'application/x-recordio-protobuf'. Please verify the data matches the requested content-type. (caused by MXNetError)

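Do you know what I am doing wrong? Is it necessary to preprocess the data separately in EMR and train in SageMaker, or can I do everything in SageMaker (taking cost into account)?
I followed: https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/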

pyspark

Source: https://stackoverflow.com/questions/77518065/cant-train-sagemaker-model-using-recordio-protobuf-data-saved-in-s3

1 Answer


wvt8vs2t1#

Change the format used when saving the data to RecordIO:

output_path = f"s3://my_path/output_processed"
df_transformed.write.format("recordio").mode("overwrite").save(output_path)

And use:

records = RecordSet(s3_data=train_path, s3_data_type='S3RecordIO', num_records=-1, feature_dim=50)
rcf.fit(records)
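
The error message asks you to verify that the data matches the requested content type. One way to do that is to download a single output object and check whether its bytes are RecordIO-framed. The sketch below is not part of the original answer. It assumes the framing used by MXNet-style RecordIO (a little-endian magic number 0xCED7230A, a 32-bit length field whose low 29 bits hold the payload length, and zero padding to a 4-byte boundary). The helper names `frame_record` and `iter_recordio_payloads` are hypothetical, and multi-part records are not handled.

```python
import struct
from typing import Iterator

# Assumption: magic number used by MXNet-style RecordIO framing.
RECORDIO_MAGIC = 0xCED7230A
# Assumption: the low 29 bits of the length field hold the payload length.
LENGTH_MASK = (1 << 29) - 1


def frame_record(payload: bytes) -> bytes:
    """Wrap one payload in a RecordIO frame (used here to build test data)."""
    padding = (-len(payload)) % 4  # pad the payload to a 4-byte boundary
    header = struct.pack("<II", RECORDIO_MAGIC, len(payload))
    return header + payload + b"\x00" * padding


def iter_recordio_payloads(buf: bytes) -> Iterator[bytes]:
    """Yield each payload in buf, raising ValueError if the framing is wrong."""
    offset = 0
    while offset < len(buf):
        if offset + 8 > len(buf):
            raise ValueError(f"truncated header at byte {offset}")
        magic, length_field = struct.unpack_from("<II", buf, offset)
        if magic != RECORDIO_MAGIC:
            raise ValueError(f"bad magic number at byte {offset}: {magic:#x}")
        length = length_field & LENGTH_MASK
        offset += 8
        payload = buf[offset:offset + length]
        if len(payload) != length:
            raise ValueError(f"truncated payload at byte {offset}")
        yield payload
        # Skip the payload plus its padding to reach the next header.
        offset += (length + 3) & ~3


if __name__ == "__main__":
    sample = frame_record(b"abc") + frame_record(b"hello")
    print(list(iter_recordio_payloads(sample)))
```

If the bytes of a downloaded object raise `ValueError` on the very first header, the object is not RecordIO-framed and the training channel cannot read it as application/x-recordio-protobuf. This checks only the outer framing, not whether each payload is a valid protobuf record.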

References:
Using pipe input mode for Amazon SageMaker algorithms: https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/
Preprocessing data with Amazon SageMaker: https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html
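
One more thing that may be worth checking, although the original answer does not mention it: with a prefix-based channel such as the `S3Prefix` setting in the question, SageMaker reads every object under the prefix, and Spark jobs commonly leave marker objects such as `_SUCCESS` next to the data files. The sketch below flags such keys. The function name `non_data_keys` and the example keys are hypothetical, and the "starts with `_` or `.`" rule is the usual Hadoop convention for hidden files, assumed here rather than confirmed for this job.

```python
from typing import Iterable, List


def non_data_keys(keys: Iterable[str]) -> List[str]:
    """Return the S3 keys whose file name looks like a marker, not a data file."""
    suspicious = []
    for key in keys:
        name = key.rsplit("/", 1)[-1]
        # Assumption: marker and hidden files start with "_" or ".";
        # an empty name means the key is a folder placeholder ending in "/".
        if name == "" or name.startswith("_") or name.startswith("."):
            suspicious.append(key)
    return suspicious


if __name__ == "__main__":
    # Hypothetical listing of the output prefix.
    listing = [
        "output_processed/part-00000",
        "output_processed/part-00001",
        "output_processed/_SUCCESS",
    ]
    print(non_data_keys(listing))
```

Any keys this returns would need to be deleted or excluded before training, because they are not RecordIO-protobuf data.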
If cost is the main concern, it is better to go with SageMaker.

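However, if the dataset is large and the preprocessing logic is complex, and the code is already implemented on EMR infrastructure, then moving it to SageMaker will cost more.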

References:
Amazon SageMaker cost optimization guide: https://aws.amazon.com/blogs/machine-learning/optimizing-costs-for-machine-learning-with-amazon-sagemaker/
Amazon EMR cost optimization guide: https://aws.amazon.com/emr/pricing/

Answered 6 months ago
