I'm able to query the REST API data and convert it to an RDD and a DataFrame. But when I query a column, I get the values back comma-separated in a single cell instead of as individual rows.
Am I missing something?
Code:
package Stream

import org.apache.spark.sql._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.DefaultHttpClient
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._
import org.apache.spark.storage.StorageLevel

object SparkRestApi {
  def main(args: Array[String]): Unit = {
    val logger = Logger.getLogger("blah")
    Logger.getLogger("org").setLevel(Level.WARN)
    Logger.getLogger("akka").setLevel(Level.WARN)

    val spark = SparkSession.builder()
      .appName("blah")
      .config("spark.driver.memory", "2g")
      .master("local[*]")
      //.enableHiveSupport()
      .getOrCreate()

    import spark.implicits._

    val url = "https://platform-api.opentargets.io/v3/platform/public/association/filter"
    val result2 = List(scala.io.Source.fromURL(url).mkString)
    val githubRdd2 = spark.sparkContext.makeRDD(result2)
    val gitHubDF2 = spark.read.json(githubRdd2)
    gitHubDF2.show()

    val mediandf = gitHubDF2.select(col("data.association_score.overall").as("association_score"))
    mediandf.show()

    spark.stop()
  }
}
But the result is not well formatted. The values come back comma-separated inside one cell rather than as one value per row:
+--------------------+
|   association_score|
+--------------------+
|[1.0, 1.0, 1.0, 1...|
+--------------------+
But I was expecting:
1.0
1.0
1.0
1 Answer
Check the code below.
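The answer's original snippet is not reproduced here; a minimal sketch of the usual approach is to explode the data array first and then select the nested score field. It assumes the gitHubDF2 DataFrame built in the question, and the name explodedDf is just an illustrative choice.

import org.apache.spark.sql.functions.{col, explode}

// Explode the "data" array so each association becomes its own row,
// then pull the nested overall score out of the resulting struct.
val explodedDf = gitHubDF2
  .select(explode(col("data")).as("data"))
  .select(col("data.association_score.overall").as("association_score"))

explodedDf.show()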
Note:
If you explode the array column values, you will get duplicate rows.
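If those duplicates are unwanted, a distinct() after the explode (a suggestion not present in the original answer) keeps only the unique score values:

// Hypothetical follow-up: drop repeated score values after exploding.
explodedDf.distinct().show()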