Problem reading a Kudu table with Spark (Jupyter notebook with the Apache Toree Scala kernel)

olhwl3o2 posted on 2021-07-14 in Spark

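I am trying to read a Kudu table with Apache Spark from a Jupyter notebook running the Apache Toree Scala kernel.
Spark version: 2.2.0, Scala version: 2.11, Apache Toree version: 0.3
This is the code I use to read the Kudu table: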

val kuduMasterAddresses = KUDU_MASTER_ADDRESSES_HERE
val kuduMasters: String = Seq(kuduMasterAddresses).mkString(",")

val kuduContext = new KuduContext(kuduMasters, spark.sparkContext)

val table = TABLE_NAME_HERE

// The `.kudu` reader method comes from `import org.apache.kudu.spark.kudu._`
def readKudu(table: String) = {
    val tableKuduOptions: Map[String, String] = Map(
        "kudu.table"  -> table,
        "kudu.master" -> kuduMasters
    )
    spark.sqlContext.read.options(tableKuduOptions).kudu
}

val kuduTableDF = readKudu(table)

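Calling kuduContext.tableExists(table) returns true, and kuduTableDF.columns returns an array of strings with the correct column names.
The problem appears when I try to run an action such as count or show. The following exception is thrown: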
Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: java.lang.ClassNotFoundException: org.apache.kudu.spark.kudu.KuduContext$Time[...]
StackTrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1567)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1555)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1554)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1554)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:803)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1782)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1737)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1726)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:619)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2031)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2052)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2071)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2865)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154)
  at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2846)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2845)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2154)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2367)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:600)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:609)
I have already added the dependencies in the Apache Toree notebook with the AddDeps magic, as follows:

%AddDeps org.apache.kudu kudu-spark2_2.11 1.6.0 --transitive --trace
%AddDeps org.apache.kudu kudu-client 1.6.0 --transitive --trace

The following import runs without any problem:

import org.apache.kudu.spark.kudu._
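
A successful import only shows that the notebook's interpreter can see the Kudu classes, while the exception above is raised while fetching task results. One way to narrow this down is to check which class loaders can actually resolve the class. The following is a minimal sketch under an assumption the trace does not confirm: that the jars pulled in by %AddDeps are visible to some class loaders in the JVM but not to others. The object name ClassVisibilityCheck and the helper canLoad are hypothetical names introduced for illustration; they are not part of Spark, Kudu, or Toree.

```scala
object ClassVisibilityCheck {
  // Returns true if `className` can be resolved through `loader`.
  // `initialize = false` avoids running the class's static initializers.
  def canLoad(className: String, loader: ClassLoader): Boolean =
    try {
      Class.forName(className, false, loader)
      true
    } catch {
      case _: ClassNotFoundException => false
      case _: NoClassDefFoundError   => false
    }

  def main(args: Array[String]): Unit = {
    // Class name taken from the package imported in the notebook.
    val kuduClass = "org.apache.kudu.spark.kudu.KuduContext"

    // Assumption: these two loaders are the interesting ones to compare.
    val loaders = Seq(
      "thread context class loader" -> Thread.currentThread().getContextClassLoader,
      "system class loader"         -> ClassLoader.getSystemClassLoader
    )

    for ((label, loader) <- loaders)
      println(s"$label can load $kuduClass: ${canLoad(kuduClass, loader)}")
  }
}
```

In a notebook cell, the body of canLoad can be pasted and called directly instead of going through main. If the loaders disagree, that would point to a classpath visibility problem rather than a problem with the read code itself.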
