PDF MCQ to pandas DataFrame?

Posted by wpcxdonn on 2021-07-13 in Java

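Is there a way to convert text like this, extracted from a PDF, into a DataFrame? Text:
The theory of comparative cost advantage theory was Introduced by----- a) Alfred Marshall b) David Ricardo c) Taussig d) Heberler
The Ricardo’s comparative cost theory is based on which of the following assumption a) Common Market b) Equal cost c) Monopoly d) Free Trade
Expected df: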

The theory of comparative cost advantage theory was Introduced by-----                  Alfred Marshall     David Ricardo     Taussig     Heberler
The Ricardo’s comparative cost theory is based on which of the following assumption     Common Market       Equal cost        Monopoly    Free Trade

oxf4rvwz1#

Split the text into lines on the newline character.
Split each line into columns with a regular expression.

import pandas as pd

rawtxt = """The theory of comparative cost advantage theory was Introduced by----- a) Alfred Marshall b) David Ricardo c) Taussig d) Heberler
The Ricardo’s comparative cost theory is based on which of the following assumption a) Common Market b) Equal cost c) Monopoly d) Free Trade"""

# One row per question (split on newlines)
df = pd.DataFrame({"rawtxt": rawtxt.split("\n")})
# Split each row at the option markers a) .. d), trimming the spaces around them
df.rawtxt.str.split(r"\s*[a-d]\)\s*", expand=True)

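Output

   0                                                                                    1                2              3         4
0  The theory of comparative cost advantage theory was Introduced by-----               Alfred Marshall  David Ricardo  Taussig   Heberler
1  The Ricardo’s comparative cost theory is based on which of the following assumption  Common Market    Equal cost     Monopoly  Free Trade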
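
To see what the two steps do without pandas, here is a minimal sketch in plain Python. The pattern `\s*[a-d]\)\s*` is an assumption on my part: it expects exactly the markers a) to d) and also swallows the spaces around them.

```python
import re

rawtxt = """The theory of comparative cost advantage theory was Introduced by----- a) Alfred Marshall b) David Ricardo c) Taussig d) Heberler
The Ricardo’s comparative cost theory is based on which of the following assumption a) Common Market b) Equal cost c) Monopoly d) Free Trade"""

# Step 1: one entry per question, split on newlines
lines = rawtxt.split("\n")

# Step 2: split each line at the option markers; the \s* on both sides
# drops the surrounding spaces so every cell comes out trimmed
rows = [re.split(r"\s*[a-d]\)\s*", line) for line in lines]

for row in rows:
    print(row)
```

Each inner list is one row of the frame, so `pd.DataFrame(rows, columns=["Question", "A", "B", "C", "D"])` should give labeled columns instead of 0 to 4.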

kulphzqa2#

Assuming you can extract the text from the PDF with each sentence/question separated by a newline, you can use a regex like this:

import re

import pandas as pd

# Group 1 is the question; groups 2-5 are the text after the a) .. d) markers.
# The lazy .+? groups and the \s* around each marker keep the cells trimmed.
regex = r"^\s*(.+?)\s*a\)\s*(.+?)\s*b\)\s*(.+?)\s*c\)\s*(.+?)\s*d\)\s*(.+?)\s*$"
pdf_txt = """The theory of comparative cost advantage theory was Introduced by----- a) Alfred Marshall b) David Ricardo c) Taussig d) Heberler
The Ricardo’s comparative cost theory is based on which of the following assumption a) Common Market b) Equal cost c) Monopoly d) Free Trade"""

# re.MULTILINE makes ^ and $ match at every line, so each question line is one match
rows = [match.groups() for match in re.finditer(regex, pdf_txt, re.MULTILINE)]

df = pd.DataFrame(rows, columns=["Question", "A", "B", "C", "D"])
print(df.head())
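
Real PDF extractions often contain lines that are not questions at all (page numbers, headers). A possible variation, sketched below, matches line by line with named groups and collects anything that does not fit so it can be inspected. The names `MCQ_PATTERN` and `parse_mcq` and the stray "Page 3" line are hypothetical, made up for this illustration.

```python
import re

# Hypothetical pattern: named groups become the column names later on
MCQ_PATTERN = re.compile(
    r"^\s*(?P<Question>.+?)\s*a\)\s*(?P<A>.+?)\s*b\)\s*(?P<B>.+?)"
    r"\s*c\)\s*(?P<C>.+?)\s*d\)\s*(?P<D>.+?)\s*$"
)


def parse_mcq(text):
    """Return (records, unmatched): parsed questions and lines that did not fit."""
    records, unmatched = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines left over from extraction
        match = MCQ_PATTERN.match(line)
        if match:
            records.append(match.groupdict())
        else:
            unmatched.append(line)
    return records, unmatched


# Sample text; the "Page 3" line is invented to show the unmatched case
sample = """The theory of comparative cost advantage theory was Introduced by----- a) Alfred Marshall b) David Ricardo c) Taussig d) Heberler
Page 3
The Ricardo’s comparative cost theory is based on which of the following assumption a) Common Market b) Equal cost c) Monopoly d) Free Trade"""

records, unmatched = parse_mcq(sample)
print(records)
print(unmatched)
```

Since `records` is a list of dicts keyed by the group names, `pd.DataFrame(records)` should produce the same five columns as above, and `unmatched` shows which lines were skipped.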
