I found that Spark 2 loads ORC files much more slowly than Spark 1. I tried several ways to speed up Spark 2, but none of them worked. The code is as follows:
Spark 1.5
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("LoadOrc")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.akka.frameSize", "512")
  .set("spark.akka.timeout", "800s")
  .set("spark.storage.blockManagerHeartBeatMs", "300000")
  .set("spark.kryoserializer.buffer.max", "1024m")
  .set("spark.executor.extraJavaOptions", "-Djava.util.Arrays.useLegacyMergeSort=true")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
// Time only the ORC read plus count(); args(0) is the input path.
val start = System.nanoTime()
val ret = hiveContext.read.orc(args(0)).count()
val end = System.nanoTime()
println(s"count: $ret")
println(s"Time taken: ${(end - start) / 1000 / 1000} ms")
sc.stop()
[Screenshot: Spark 1 UI]
Result
count: 2290811187
Time taken: 401063 ms
Spark 2
import org.apache.spark.sql.SparkSession

// Same settings as the Spark 1.5 job, carried over unchanged.
val spark = SparkSession.builder()
  .appName("LoadOrc")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.akka.frameSize", "512")
  .config("spark.akka.timeout", "800s")
  .config("spark.storage.blockManagerHeartBeatMs", "300000")
  .config("spark.kryoserializer.buffer.max", "1024m")
  .config("spark.executor.extraJavaOptions", "-Djava.util.Arrays.useLegacyMergeSort=true")
  .enableHiveSupport()
  .getOrCreate()
// spark.time prints "Time taken: ... ms" and returns the count, which println then prints.
println(spark.time(
  spark.read.format("org.apache.spark.sql.execution.datasources.orc")
    .load(args(0)).count()))
spark.close()
[Screenshot: Spark 2 UI]
Result
Time taken: 1384464 ms
2290811187
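
The two runs above are timed in slightly different ways (manual System.nanoTime calls in Spark 1.5, spark.time in Spark 2). To keep the comparison on the same footing, both jobs could in principle share one small timing helper. The following is only a sketch: the Timing object and its timed method are hypothetical names, not part of any Spark API, and the workload in main is a stand-in for the ORC count().

```scala
// Hypothetical helper (not a Spark API): wraps a block with System.nanoTime
// so that Spark 1.x and Spark 2.x jobs report elapsed time the same way.
object Timing {
  // Runs `body` once and returns its result together with elapsed milliseconds.
  def timed[T](body: => T): (T, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000 / 1000
    (result, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; in the jobs above this would be the
    // read.orc(...).count() or read.format(...).load(...).count() call.
    val (ret, ms) = timed { (1L to 1000000L).sum }
    println(s"count: $ret")
    println(s"Time taken: $ms ms")
  }
}
```

With this, the Spark 1.5 job would wrap hiveContext.read.orc(args(0)).count() and the Spark 2 job would wrap the spark.read call in the same timed { ... } block, so any difference in the numbers comes from the read itself rather than from how it is measured.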