我想把cassandra表加载到spark中的datafram中,我遵循了下面的示例程序(在这个答案中找到),但是我得到了下面提到的一个execption,我尝试先将表加载到rdd,然后将其转换为datafrme,加载rdd是成功的,但是当我尝试将它转换为Dataframe时,我得到了与第一种方法相同的执行选项,有什么建议吗?我使用的是spark 2.0.0、cassandra 3.7和Java8。
public class SparkCassandraDatasetApplication {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("SparkCassandraDatasetApplication")
.config("spark.sql.warehouse.dir", "/file:C:/temp")
.config("spark.cassandra.connection.host", "127.0.0.1")
.config("spark.cassandra.connection.port", "9042")
.master("local[2]")
.getOrCreate();
//Read data to dataframe
// this is throwing an exception
Dataset<Row> dataset = spark.read().format("org.apache.spark.sql.cassandra")
.options(new HashMap<String, String>() {
{
put("keyspace", "mykeyspace");
put("table", "mytable");
}
}).load();
//Print data
dataset.show();
spark.stop();
}
}
提交时,我收到以下例外情况:
Exception in thread "main" java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3.S3FileSystem not found
at java.util.ServiceLoader.fail(ServiceLoader.java:239)
at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2623)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2634)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:115)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
使用rdd方法从cassandra读取数据是成功的(我已经用count()调用对其进行了测试),但是将rdd转换为df会引发与第一个方法相同的异常。
public class SparkCassandraRDDApplication {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("App")
.config("spark.sql.warehouse.dir", "/file:/opt/spark/temp")
.config("spark.cassandra.connection.host", "127.0.0.1")
.config("spark.cassandra.connection.port", "9042")
.master("local[2]")
.getOrCreate();
SparkContext sc = spark.sparkContext();
//Read
JavaRDD<UserData> resultsRDD = javaFunctions(sc).cassandraTable("mykeyspace", "mytable",CassandraJavaUtil.mapRowTo(UserData.class));
//This is again throwing an exception
Dataset<Row> usersDF = spark.createDataFrame(resultsRDD, UserData.class);
//Print
resultsRDD.foreach(data -> {
System.out.println(data.id);
System.out.println(data.username);
});
sc.stop();
}
}
1条答案
按热度按时间izkcnapc1#
请检查类路径中是否有“hadoop-common-2.2.0.jar”。您可以通过创建一个包含所有依赖项的jar来测试您的应用程序。在pom.xml下面使用maven shade plugin来包含创建uber jar的所有依赖项。
你可以像下面那样运行jar