How do I process XML in PySpark?

Posted by 93ze6v8z on 2021-05-24 in Spark


I am using the Databricks spark-xml package to read and process Wikipedia XML data, but I don't know how to handle the nested elements.

df = spark.read.format('com.databricks.spark.xml').options(rowTag='page').load(data)

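I read the data with the command above, setting rowTag to page so that each row holds all the information for one page.
The data for a single page looks like this: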

<page>
<title>User talk:Jroset24</title>
<ns>3</ns>
<id>63975912</id>
<revision>
  <id>957023082</id>
  <timestamp>2020-05-16T16:11:47Z</timestamp>
  <contributor>
    <username>HostBot</username>
    <id>16596082</id>
  </contributor>
  <comment>/* Jroset24, you are invited to the Teahouse! */ new section</comment>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text bytes="1237" id="968918613" />
  <sha1>isokyoojfzhgql1po9r1qmctdfscv59</sha1>
</revision>
<revision>
  <id>959350629</id>
  <parentid>957023082</parentid>
  <timestamp>2020-05-28T10:41:03Z</timestamp>
  <contributor>
    <username>TheImaCow</username>
    <id>38905475</id>
  </contributor>
  <comment>Your draft page has been moved ([[WP:DFY|DFY]])</comment>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text bytes="1987" id="971274481" />
  <sha1>bm5i6af24vvqq1vk0ajnc1wrnvp7ix3</sha1>
</revision>
<revision>
  <id>970319036</id>
  <parentid>959350629</parentid>
  <timestamp>2020-07-30T16:16:22Z</timestamp>
  <contributor>
    <username>Rich Smith</username>
    <id>13314572</id>
  </contributor>
  <comment>declined ([[WP:AFCH|AFCH]] 0.9.1)</comment>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text bytes="5059" id="982346934" />
  <sha1>15rvufxszc80p75iwuzysyfxq2hz9u3</sha1>
</revision>
</page>
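To make the nesting concrete, here is a minimal plain-Python sketch (standard library only, not Spark) of the flattening I have in mind: one output row per `<revision>`, with the page-level fields repeated on each row. The function name `flatten_page`, the output column names, and the trimmed two-revision sample are assumptions made for illustration only. In Spark, I would expect the analogous step to be exploding the `revision` array column with `pyspark.sql.functions.explode`, assuming spark-xml infers the repeated `<revision>` elements as an array of structs.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample page (two revisions, fewer fields), for illustration.
SAMPLE_PAGE = """
<page>
  <title>User talk:Jroset24</title>
  <id>63975912</id>
  <revision>
    <id>957023082</id>
    <timestamp>2020-05-16T16:11:47Z</timestamp>
    <contributor><username>HostBot</username><id>16596082</id></contributor>
  </revision>
  <revision>
    <id>959350629</id>
    <parentid>957023082</parentid>
    <timestamp>2020-05-28T10:41:03Z</timestamp>
    <contributor><username>TheImaCow</username><id>38905475</id></contributor>
  </revision>
</page>
"""


def flatten_page(page_xml):
    """Return one dict per <revision>, repeating the page-level fields.

    The column names below are hypothetical, not taken from the post.
    """
    page = ET.fromstring(page_xml)
    rows = []
    for rev in page.findall("revision"):  # direct children only
        rows.append({
            "page_title": page.findtext("title"),
            "page_id": page.findtext("id"),          # the page's own <id>
            "revision_id": rev.findtext("id"),       # the revision's <id>
            "parent_id": rev.findtext("parentid"),   # None when <parentid> is absent
            "timestamp": rev.findtext("timestamp"),
            "username": rev.findtext("contributor/username"),
        })
    return rows


rows = flatten_page(SAMPLE_PAGE)
for row in rows:
    print(row)
```

Each dictionary corresponds to one row of the flattened table. The three different `<id>` elements (page, revision, contributor) are kept apart by looking them up relative to their own parent element.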

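This one page has three revisions (note the repeated <revision> tags). My goal is to end up with a table like the following, stored in a database: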