如何使用Python将嵌套的XML转换为CSV

2sbarzqh  于 5个月前  发布在  Python
关注(0)|答案(2)|浏览(95)

我正在尝试写一个python程序,它可以将任何xml文件作为输入,并将其转换为csv文件,而不会丢失任何xml标签/元素。我愿意使用任何选项,只要它使用python。
我尝试使用xmltodict、json、csv和pandas python模块,能够读取xml并将其转换为字典。但我无法将此字典转换为可以写入csv文件的列表,以确保捕获所有xml字段。
我的示例xml文件:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  <tag_1>
    <tag_2>
      <date value="06-30-2023">
        <data>
          <tag_3>val_3</tag_3>
          <tag_4>val_4</tag_4>
          <tag_5>val_5_1 &amp; val_5_2</tag_5>
          <tag_6>-0.157</tag_6>
        </data>
        <data>
          <tag_3>val_3</tag_3>
          <tag_4>val_4_2</tag_4>
          <tag_5>val_5_1</tag_5>
          <tag_6>-0.173</tag_6>
        </data>
      </date>
    </tag_2>
    <tag_7>
      <date value="06-30-2023">
        <data><tag_3>val_3</tag_3><tag_4>val_4</tag_4><tag_5>val_5_1 &amp; val_5_2</tag_5><tag_6>-0.157</tag_6>
        </data>
        <data><tag_3>val_3</tag_3><tag_4>val_4_2</tag_4><tag_5>val_5_1</tag_5><tag_6>-0.173</tag_6>
        </data>
      </date>
    </tag_7>
  </tag_1>

字符串
在阅读了上面的xml之后,我能够将其转换为字典:

{'tag_1': 
  {'tag_2': 
    {'date': 
      {'@value': '06-30-2023', 
       'data': [{'tag_3': 'val_3', 'tag_4': 'val_4', 'tag_5': 'val_5_1 & val_5_2', 'tag_6': '-0.157'}, 
                {'tag_3': 'val_3', 'tag_4': 'val_4_2', 'tag_5': 'val_5_1', 'tag_6': '-0.173'}
           ]
      }
     }, 
   'tag_7': 
    {'date': 
      {'@value': '06-30-2023', 
       'data': [{'tag_3': 'val_3', 'tag_4': 'val_4', 'tag_5': 'val_5_1 & val_5_2', 'tag_6': '-0.157'}, 
                {'tag_3': 'val_3', 'tag_4': 'val_4_2', 'tag_5': 'val_5_1', 'tag_6': '-0.173'}
               ]
      }
    }
  }
}


我的预期输出(在csv文件中)是:

tag_1,tag_2,date,data,tag_3,tag_4,tag_5,tag_6
tag_1,tag_2,06-30-2023,data,val_3,val_4,val_5_1 & val_5_2,-0.157
tag_1,tag_2,06-30-2023,data,val_3,val_4_2,val_5_1,-0.173
tag_1,tag_7,06-30-2023,data,val_3,val_4,val_5_1 & val_5_2,-0.157
tag_1,tag_7,06-30-2023,data,val_3,val_4_2,val_5_1,-0.173


到目前为止,我所尝试的:

import xmltodict
import json
import csv
import pandas as pd

with open("file_01.xml", "r", encoding="utf-8") as xml_fh:
    str_xml = xml_fh.read()

print(f"str_xml={type(str_xml)}={str_xml}")

dict_xml = xmltodict.parse(str_xml)
print(f"dict_xml={type(dict_xml)}={dict_xml}")
df = pd.DataFrame.from_dict(dict_xml, orient='index')
df.to_csv('file_01.csv', index = False)


我得到的实际结果是:

tag_2,tag_7
"{'date': {'@value': '06-30-2023', 'data': [{'tag_3': 'val_3', 'tag_4': 'val_4', 'tag_5': 'val_5_1 & val_5_2', 'tag_6': '-0.157'}, {'tag_3': 'val_3', 'tag_4': 'val_4_2', 'tag_5': 'val_5_1', 'tag_6': '-0.173'}]}}","{'date': {'@value': '06-30-2023', 'data': [{'tag_3': 'val_3', 'tag_4': 'val_4', 'tag_5': 'val_5_1 & val_5_2', 'tag_6': '-0.157'}, {'tag_3': 'val_3', 'tag_4': 'val_4_2', 'tag_5': 'val_5_1', 'tag_6': '-0.173'}]}}"


我错过了什么?

x6yk4ghg

x6yk4ghg1#

我们可以使用pd.json_normalize()来扁平化从XML创建的字典。然而,由于记录位于两个不同的键下:tag_2tag_7,我们需要循环遍历这些特定的标记以获取所有记录,然后连接这些字符串。

import pandas as pd
import xmltodict

with open("file_01.xml", "r", encoding="utf-8") as xml_fh:
    str_xml = xml_fh.read()

dict_xml = xmltodict.parse(str_xml)

df = pd.concat(
    [
        pd.json_normalize(
            dict_xml, 
            record_path=['tag_1', tag, 'date', 'data'],            # path to record list
            meta=[['tag_1', tag, 'date', '@value']])               # path to date
        .pipe(lambda x: x.rename(columns={x.columns[-1]: 'date'})) # rename date column
        .assign(tag_1='tag_1', tag_2=tag, data='data')             # add meta columns
        for tag in ('tag_2', 'tag_7')                              # loop over tags
    ]
)[['tag_1', 'tag_2', 'date', 'data', 'tag_3', 'tag_4', 'tag_5', 'tag_6']]
df.to_csv('file_01.csv', index=False)

字符串
这将创建以下CSV文件:

tag_1,tag_2,date,data,tag_3,tag_4,tag_5,tag_6
tag_1,tag_2,06-30-2023,data,val_3,val_4,val_5_1 & val_5_2,-0.157
tag_1,tag_2,06-30-2023,data,val_3,val_4_2,val_5_1,-0.173
tag_1,tag_7,06-30-2023,data,val_3,val_4,val_5_1 & val_5_2,-0.157
tag_1,tag_7,06-30-2023,data,val_3,val_4_2,val_5_1,-0.173


也许更主要的方法是规范化每个二级键下的相关子字典。请注意,在下面的代码中,record_pathmeta路径不再是列表。

def flatten_dict(dict_xml, level_2_tags):
    df = (
        pd.concat([
            pd.json_normalize(dict_xml['tag_1'][tag]['date'], 'data', '@value')
            .assign(tag_2=tag)
            for tag in level_2_tags
        ])
        .rename(columns={'@value': 'date'})
        .assign(tag_1='tag_1', data='data')
        .get(['tag_1', 'tag_2', 'date', 'data', 'tag_3', 'tag_4', 'tag_5', 'tag_6'])
    )
    return df

# test run
flatten_dict(dict_xml, ['tag_2'])           # when there is only tag_2 in level=2

flatten_dict(dict_xml, ['tag_2', 'tag_7'])  # when there are 2 tags in level=2

lvjbypge

lvjbypge2#

给定自定义格式,看起来最好的选择是使用嵌套列表解析:

df = pd.DataFrame([{'tag_1': k1, 'tag_2': k2, k3: d3['@value'], **d4}
                   for k1, d1 in dict_xml.items()
                   for k2, d2 in d1.items()
                   for k3, d3 in d2.items()
                   for d4 in d3['data']])

字符串
输出量:

tag_1  tag_2        date  tag_3    tag_4              tag_5   tag_6
0  tag_1  tag_2  06-30-2023  val_3    val_4  val_5_1 & val_5_2  -0.157
1  tag_1  tag_2  06-30-2023  val_3  val_4_2            val_5_1  -0.173
2  tag_1  tag_7  06-30-2023  val_3    val_4  val_5_1 & val_5_2  -0.157
3  tag_1  tag_7  06-30-2023  val_3  val_4_2            val_5_1  -0.173


CSV输出:

# df.to_csv('file_01.csv', index=False)

tag_1,tag_2,date,tag_3,tag_4,tag_5,tag_6
tag_1,tag_2,06-30-2023,val_3,val_4,val_5_1 & val_5_2,-0.157
tag_1,tag_2,06-30-2023,val_3,val_4_2,val_5_1,-0.173
tag_1,tag_7,06-30-2023,val_3,val_4,val_5_1 & val_5_2,-0.157
tag_1,tag_7,06-30-2023,val_3,val_4_2,val_5_1,-0.173

相关问题