PySpark: creating a DataFrame from a pandas DataFrame takes too long, even with Apache Arrow enabled

Posted by mpgws1up on 2021-05-27 in Spark


I am trying to create a PySpark DataFrame from a pandas DataFrame on Databricks (AWS/EC2).
Environment:

pyarrow: 0.13.0
pandas: 1.1.1
spark: 2.4.5
scala: 2.11

Based on https://bryancutler.github.io/createdataframe/

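import numpy as np
import pandas as pd

# `spark` is the SparkSession that Databricks notebooks provide.
# 100 million rows x 10 float64 columns
data = np.random.rand(100000000, 10)

pdf = pd.DataFrame(data, columns=list("abcdefghij"))

# Enable Arrow-based conversion (Spark 2.4 configuration key)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.createDataFrame(pdf)

# In-memory size of the pandas DataFrame, in MB
pdf.memory_usage(deep=True).sum() / (1024**2)
# 7629.3946533203125 MB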

It takes

32 seconds on AWS/EC2 i3.4 (16 cores, 122 GB memory)
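
As a sanity check on the size reported by memory_usage, the figure can be reproduced by hand from the array shape, without allocating anything. This is only a sketch: the helper name dense_size_mib is made up for illustration, and it assumes 8-byte float64 values, which is what np.random.rand returns.

```python
# Expected in-memory size of a dense block of fixed-width values.
ROWS = 100_000_000
COLS = 10
FLOAT64_BYTES = 8  # assumption: float64, 8 bytes per value


def dense_size_mib(rows: int, cols: int, itemsize: int) -> float:
    """Return the size of a rows x cols block of fixed-width values in MiB."""
    return rows * cols * itemsize / 1024**2


data_mib = dense_size_mib(ROWS, COLS, FLOAT64_BYTES)
print(f"{data_mib:.8f} MiB")
```

The result, 7629.39453125, is 128 bytes below the reported 7629.3946533203125. That small difference is plausibly the index, which memory_usage includes by default. In other words, the first DataFrame is essentially one contiguous block of fixed-width numbers with no per-value Python objects.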

However, on the same EC2 instance, for another pandas DataFrame

u_id (string)   p_id (string)   val (int8)   review (string)
taa-ssca        fasc-wdavsd     8            I like this because ... (could be as long as 300+ words)

u_id and p_id are no longer than 15 characters; review can be very long.
The size of the pandas DataFrame is

18 GB

It takes more than 1 hour to run

df = spark.createDataFrame(pdf)

on the same EC2 instance.
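
To make the slow case reproducible without the original data, a small synthetic stand-in with the same column layout could be generated as sketched below. Everything in it is an assumption for illustration: the function name make_rows, the random lowercase "words", the ID shape, and the default cap of 300 words per review (the real reviews may be longer).

```python
import random
import string


def make_rows(n_rows, max_review_words=300, seed=0):
    """Build synthetic (u_id, p_id, val, review) tuples resembling the post."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same rows
    alphabet = string.ascii_lowercase

    def short_id():
        # Hyphenated ID of at most 15 characters, e.g. 'taa-ssca'
        left = "".join(rng.choices(alphabet, k=rng.randint(3, 5)))
        right = "".join(rng.choices(alphabet, k=rng.randint(4, 9)))
        return f"{left}-{right}"

    def review():
        # Free text of 1..max_review_words random lowercase "words"
        n_words = rng.randint(1, max_review_words)
        return " ".join(
            "".join(rng.choices(alphabet, k=rng.randint(2, 8)))
            for _ in range(n_words)
        )

    # val is drawn from the int8 range [-128, 127]
    return [
        (short_id(), short_id(), rng.randint(-128, 127), review())
        for _ in range(n_rows)
    ]


rows = make_rows(1_000)
# Rough size of the string data alone (ASCII, so characters == bytes)
payload_bytes = sum(len(u) + len(p) + len(r) for u, p, _, r in rows)
print(f"{len(rows)} rows, about {payload_bytes / 1024:.0f} KiB of string payload")
```

Passing rows to pd.DataFrame(rows, columns=["u_id", "p_id", "val", "review"]) and casting val with .astype("int8") should give a frame with the same schema. Timing spark.createDataFrame on it at a few increasing row counts would show whether the conversion time grows with the amount of string data.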
I have checked the following:
https://bryancutler.github.io/createdataframe/
https://gist.github.com/bryancutler/bc73d573b7e46a984ff8b6edf228e298
https://docs.databricks.com/spark/latest/spark-sql/spark-pandas.html
None of them worked.
Can anyone help me?

python DataFrame apache-spark pyspark pandas

Source: https://stackoverflow.com/questions/63692441/pyspark-create-dataframe-from-pandas-dataframe-too-long-time-even-though-apache