How to scrape a table from a web page using Python (BeautifulSoup)?

Posted by zbsbpyhn on 2021-09-29 in Java


I am trying to extract a table from a website. I have been using BeautifulSoup, but I end up with no rows for the table I am trying to scrape.


# import package

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# to get the html of the page

req = Request('https://covid19.go.id/peta-risiko', headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html, 'lxml')
type(soup)

# Get the title

title = soup.title
print(title)

# Print out the text

text = soup.get_text()
print(text)

# to extract all the hyperlinks within the webpage

soup.find_all('a')

# Use a for loop and the get("href") method to extract and print only the hyperlinks

all_links = soup.find_all("a")
for link in all_links:
    print(link.get("href"))

# To get table rows only, pass the 'tr' argument to soup.find_all()
# Print the first 10 rows for checking

rows = soup.find_all('tr')
print(rows[:10])

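I get [] when I print the first 10 rows, and I don't know why this happens. Is it because the table spans multiple pages (page 1, page 2, page 3, next, etc.)?
Is there any way to scrape the table from this web page? I want to get a table with the following columns: provinsi, kota/kabupaten, status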
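
For reference, here is a minimal sketch of the result I am after, run against a small hand-written HTML snippet rather than the real page. The sample markup, the row values, and the helper name table_to_records are all assumptions made for illustration. It also shows that find_all('tr') returns an empty list whenever the downloaded HTML contains no <tr> elements. One possible cause, not confirmed for this site, is a table that is filled in by JavaScript after the page loads.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML standing in for a page whose table is in the source.
SAMPLE_HTML = """
<table>
  <tr><th>Provinsi</th><th>Kota/Kabupaten</th><th>Status</th></tr>
  <tr><td>Provinsi A</td><td>Kota X</td><td>Risiko Rendah</td></tr>
  <tr><td>Provinsi B</td><td>Kabupaten Y</td><td>Risiko Sedang</td></tr>
</table>
"""


def table_to_records(html):
    """Return one dict per data row, keyed by the lower-cased header cells."""
    # "html.parser" is the parser bundled with Python, so lxml is not required.
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tr")
    if not rows:
        # No <tr> in the HTML that was passed in: nothing to parse.
        return []
    headers = [cell.get_text(strip=True).lower()
               for cell in rows[0].find_all(["th", "td"])]
    records = []
    for row in rows[1:]:
        values = [cell.get_text(strip=True) for cell in row.find_all("td")]
        # Skip rows whose cell count does not match the header row.
        if len(values) == len(headers):
            records.append(dict(zip(headers, values)))
    return records


records = table_to_records(SAMPLE_HTML)
for record in records:
    print(record)

# HTML that only ships an empty container yields no rows at all.
print(table_to_records("<div id='app'></div>"))
```

If the HTML returned by urlopen behaves like the second call, the rows are probably not in the static page at all, and paging through the table would not change that.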

python beautifulsoup web-scraping html-table scrape

Source: https://stackoverflow.com/questions/68543231/how-to-scrape-table-from-a-web-using-python-beautifulsoup