如何用分布式副本将Dataframe转换成javardd？

iqxoj9l9 于 2021-05-29 发布在 Hadoop

关注(0)|答案(0)|浏览(230)

我是一个新的Spark优化。我正在尝试将配置单元数据读入Dataframe。然后我将dataframe转换为javardd并在其上运行map函数。我面临的问题是，运行在javardd之上的转换是用单个任务运行的。此外，在这个javardd之上运行的转换是使用单个任务运行的。为了并行化，我重新划分了javardd。有没有更好的方法，因为重新分区需要更多的时间来洗牌数据。

DataFrame tempDf = df.sqlContext().sql("SELECT * FROM my_table");

// without repartition, the next transformation will run with 1 task only.
        JavaRDD<IMSSummaryPOJO> inputData = tempDf.toJavaRDD().flatMap(new FlatMapFunction<Row, IMSSummaryPOJO>() {
        //map operation
        }).repartition(repartition);

// Even though i've extra executors, if the previous transformation(inputData) is not repartitioned, then this transformation runs with single task.
    JavaPairRDD<Text,IMSMetric> inputRecordRdd = inputData.flatMapToPair(new IMSInputRecordFormat(dimensionName,hllCounterPValue,hllCounterKValue,dimensionConfigMapBroadCast));

Java hadoop rdd apache-spark spark-dataframe

来源：https://stackoverflow.com/questions/48114349/how-to-convert-dataframe-to-javardd-with-distibuted-copy