How do I provide the full Kerberos authentication details in PySpark code?

Posted by kt06eoxx on 2021-05-27 in Spark


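I am trying to read a file on HDFS using PySpark. The file lives on a different HDFS cluster, on different servers, from the one where the PySpark job runs; in other words, my PySpark job has to read a file that exists on a different Hadoop cluster. To access the file I have to authenticate with a keytab, so I wrote the following code: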

from pyspark import SparkConf, SparkContext

# Build the full configuration before creating the context
sc_conf = SparkConf()

sc_conf.setAppName("check_conn_cross_cluster")
sc_conf.setMaster("yarn")
sc_conf.set('spark.executor.memory', "2g")
sc_conf.set('spark.executor.cores', "2")
sc_conf.set('spark.yarn.keytab', "/home/devusr/devusr.keytab")
sc_conf.set('spark.yarn.principal', "devusr@HADOOP.COMPANY.COM")
sc_conf.set('spark.executor.instances', "1")

# Stop a context left over from an earlier run (e.g. in a notebook), if any
try:
    sc.stop()
except NameError:
    pass

sc = SparkContext(conf=sc_conf)

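The keytab and principal belong to the cluster where the file resides. What I don't understand is where in the code I should provide the remaining details, such as the KDC and realm, which are listed below: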

1. "hadoop.security.authentication", "kerberos"
2. System.setProperty("java.security.krb5.kdc", kdc);
   System.setProperty("java.security.krb5.realm", realm);
3. UserGroupInformation.setConfiguration(conf);
   UserGroupInformation.loginUserFromKeytab(user, keyPath);
4. If I have to pass config files such as core-site.xml and hdfs-site.xml, do I have to pass the ones from the server where the file exists?
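
One way the first two items in the list above might be carried over to PySpark is as Spark configuration properties. The sketch below is only an illustration under stated assumptions, not a confirmed solution: the helper names `kerberos_spark_settings` and `to_spark_conf` are hypothetical, the KDC host and remote NameNode URI are placeholders, and `spark.yarn.access.hadoopFileSystems` is assumed to be the right key for Spark 2.x on YARN (Spark 3 names it `spark.kerberos.access.hadoopFileSystems`). Keys prefixed with `spark.hadoop.` are copied by Spark into the Hadoop configuration.

```python
def kerberos_spark_settings(principal, keytab, kdc, realm, remote_fs):
    """Return Spark settings that mirror items 1 and 2 of the list.

    Every argument is a placeholder value supplied by the caller.
    """
    # java.security.krb5.kdc and java.security.krb5.realm are JVM system
    # properties, so they are passed as -D flags; both must be set together.
    jvm_flags = (
        f"-Djava.security.krb5.kdc={kdc} "
        f"-Djava.security.krb5.realm={realm}"
    )
    return {
        "spark.yarn.principal": principal,
        "spark.yarn.keytab": keytab,
        # Item 1: forwarded to the Hadoop Configuration by Spark.
        "spark.hadoop.hadoop.security.authentication": "kerberos",
        # Item 2: system properties for the driver and executor JVMs.
        "spark.driver.extraJavaOptions": jvm_flags,
        "spark.executor.extraJavaOptions": jvm_flags,
        # Assumption: Spark 2.x key asking YARN for tokens to the remote HDFS.
        "spark.yarn.access.hadoopFileSystems": remote_fs,
    }


def to_spark_conf(settings):
    """Copy the settings into a SparkConf (requires pyspark)."""
    # Imported lazily so the helper above works without pyspark installed.
    from pyspark import SparkConf
    return SparkConf().setAll(list(settings.items()))
```

Note that in client mode the driver JVM is already running by the time application code sets `spark.driver.extraJavaOptions`, so the driver flags would more likely need to be passed at launch, for example with `spark-submit --driver-java-options`.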

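I have already written plain Java code, without Spark, that accesses a file on a Hadoop cluster and supplies all the necessary details in the code. Now I need to do the same in PySpark, but I don't know how to pass some of these details through the Spark configuration so that a Spark program running on one server can read a file that lives on another. Can anyone tell me how to add the necessary details to the code so that I can access a file on HDFS from a different server?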
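
The `UserGroupInformation` calls from the Java version can in principle be reached from PySpark through the Py4J gateway. The following is a minimal sketch under assumptions, not a verified fix: `login_from_keytab` is a hypothetical helper, `sc._jvm` is a private PySpark attribute that may change between versions, and the login happens only in the driver JVM, so whether executors can then read the remote cluster still depends on Spark obtaining delegation tokens for it.

```python
def login_from_keytab(sc, principal, keytab):
    """Run the Java keytab login inside the driver JVM.

    `sc` is an active SparkContext; `principal` and `keytab` are the same
    values the Java code passes to loginUserFromKeytab.
    """
    jvm = sc._jvm  # Py4J view of the driver JVM (private attribute)

    # Equivalent of conf.set("hadoop.security.authentication", "kerberos")
    hadoop_conf = jvm.org.apache.hadoop.conf.Configuration()
    hadoop_conf.set("hadoop.security.authentication", "kerberos")

    # Equivalent of the two UserGroupInformation calls in the Java code
    ugi = jvm.org.apache.hadoop.security.UserGroupInformation
    ugi.setConfiguration(hadoop_conf)
    ugi.loginUserFromKeytab(principal, keytab)
```

It would be called once after the context is created, for example `login_from_keytab(sc, "devusr@HADOOP.COMPANY.COM", "/home/devusr/devusr.keytab")`, before reading the file through a fully qualified `hdfs://` URI that names the remote NameNode.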

hadoop apache-spark pyspark

Source: https://stackoverflow.com/questions/62916251/how-to-give-total-kerberos-authentication-details-in-pyspark-code