pyspark，在git中使用多个源文件

eit6fx6z 于 2021-05-27 发布在 Spark

关注(0)|答案(0)|浏览(205)

我正在与databricks合作，并试图为我的项目找到一个好的设置。基本上，我必须处理许多文件，因此我编写了相当多的python代码。我现在面临的问题是如何在spark集群上运行这些代码，使我的源代码文件也位于相应的节点上。到目前为止，我的方法（甚至可以）是在rdd上调用一个Map，然后为rdd的每个元素运行parse\u file函数。有没有更好的方法，这样我就不必每次克隆整个git了？

def parse_file(filename, string):
  os.system("rm -rf source_code_folder")
  os.system("git clone https://user:password@dev.azure.com/company/project/_git/repo source_code_folder")
  sys.path.append(os.path.join(subprocess.getoutput("pwd"),"source_code_folder/"))
  from my_module1.whatever import my_function
  from my_module2.whatever import some_other_function
  result = (my_function(string), some_other_function(string))
  return result

my_rdd = sc.wholeTextFiles("abfss://raw@storage_account.dfs.core.windows.net/files/*.txt")
processed_rdd = my_rdd(lambda x: (x[0], parse_file(x[0],x[1])))

谢谢！

python apache-spark pyspark databricks azure-databricks

来源：https://stackoverflow.com/questions/62700559/pyspark-on-databricks-using-multiple-source-files-with-git