sqlContext.createDataFrame is generating an error

iq3niunx posted on 2021-06-02 in Hadoop

I am very new to the Spark environment and I am trying to import a CSV file into Spark 2.0.2. I am using PySpark on Windows 10. Here is my code so far:

    from pyspark.sql.types import *
    import csv

    # read the CSV as a text RDD with 4 partitions
    projectFile = sc.textFile("bankfull.csv", 4)
    schema = StructType([StructField("int_field", IntegerType()),
                         StructField("string_field", StringType())])

    # drop the header row (the header contains "age") before parsing
    header = projectFile.first()
    projectHeader = projectFile.filter(lambda l: "age" in l)
    projectNoHeader = projectFile.subtract(projectHeader)

    # parse each partition with the csv module, then apply the schema
    project_rdd = projectNoHeader.mapPartitions(lambda x: csv.reader(x, delimiter=","))
    project_df = sqlContext.createDataFrame(project_rdd, schema)

At this point I get the following error message:

An error occurred while calling o23.applySchemaToPythonRDD.
: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: ---------
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
        at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:189)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.lang.reflect.Constructor.newInstance(Unknown Source)
        at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
        at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
        at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
        at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
        at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
        at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
        at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
        at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
        at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
        at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
        at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
        at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
        at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
        at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:666)
        at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:656)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: ---------
        at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
        at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
        ... 32 more

How do I fix this? Thank you.

zf9nrax1 #1

When running Spark on Windows, you can simulate the Hive scratch database by creating a tmp/hive folder on the C: drive. For this you need winutils.exe placed in the bin folder of the HADOOP_HOME you have set; winutils.exe can be downloaded from this link. If you still run into problems, try granting full access to the /tmp/hive directory, or create the C:/tmp/hive directory as an administrator, as sketched below.
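A minimal sketch of the usual commands, run from an elevated command prompt; C:\hadoop is a hypothetical install path, so substitute wherever winutils.exe actually lives on your machine:

    :: point HADOOP_HOME at the folder whose bin\ holds winutils.exe (C:\hadoop is hypothetical)
    set HADOOP_HOME=C:\hadoop
    :: create the scratch directory and open up its permissions via winutils
    mkdir C:\tmp\hive
    %HADOOP_HOME%\bin\winutils.exe chmod -R 777 \tmp\hive
    :: verify: the permissions column should now read drwxrwxrwx rather than ---------
    %HADOOP_HOME%\bin\winutils.exe ls \tmp\hive

Once /tmp/hive is writable, restart the PySpark shell; the sqlContext.createDataFrame call from the question should then get past the Hive SessionState error.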
