Error when converting a PySpark DataFrame to a pandas DataFrame with pyarrow

jgovgodb · posted 2021-05-27 in Spark

When converting a PySpark DataFrame to a pandas DataFrame with toPandas(), I get the following error:

File "/usr/local/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2121, in toPandas batches = self._collectAsArrow()

File "/usr/local/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2179, in _collectAsArrow return list(_load_from_socket(sock_info, ArrowStreamSerializer()))

File "/usr/local/lib/python3.7/site-packages/pyspark/rdd.py", line 144, in _load_from_socket (sockfile, sock) = local_connect_and_auth(*sock_info)

TypeError: local_connect_and_auth() takes 2 positional arguments but 3 were given
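
From the traceback it looks like the JVM side hands back three values for the socket handshake while the local_connect_and_auth() in my installed rdd.py accepts only two, so I suspect a mismatch between the pip-installed pyspark package and the Spark runtime actually executing the job. This is only a diagnostic sketch for comparing the two versions (spark below is a placeholder session name):

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print(pyspark.__version__)  # version of the pip-installed pyspark package
print(spark.version)        # version of the Spark runtime executing the job
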

I am using pyarrow, and to enable it I added the following settings to my SparkConf:

.set("spark.sql.execution.arrow.enabled", "true")
.set("spark.sql.execution.arrow.fallback.enabled", "true")
.set("spark.sql.execution.arrow.maxRecordsPerBatch", 5000)

This step reads about 80 million rows.
Note: the error only occurs when I enable Arrow.
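
Since the conversion works with Arrow off, my stopgap is to disable it at runtime so toPandas() falls back to the ordinary collect-based conversion (much slower at this row count, but it avoids the failing handshake). A sketch with placeholder names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).toDF("id")  # stand-in for the real 80-million-row DataFrame

# Disable the Arrow code path; toPandas() then uses the slow
# row-by-row conversion instead of the Arrow socket transfer
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
pdf = df.toPandas()
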

No answers yet.

