Python: extracting titles with their full URLs using BeautifulSoup

xzlaal3s  posted on 2021-07-14  in  Java

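I'm a beginner Python programmer. As practice, I'm trying to get a list of article titles and their URLs from a web page. So far I've come up with the following code: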

import requests
from bs4 import BeautifulSoup as BS

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'}

with requests.Session() as session:
    # verify=False disables TLS certificate verification
    r = session.get('https://0xdf.gitlab.io', verify=False, headers=headers)
    response = r.text

soup = BS(response, 'html.parser')
tags = soup.find_all('a')

for tag in tags:
    link = tag.get('href')
    if not link or link[0] == '#':
        # skip anchors without an href and in-page fragment links
        continue
    if link[0] == '/':
        # site-relative link: prepend the site root
        print('https://0xdf.gitlab.io' + link)
    else:
        print(link)

However, it doesn't extract what I'm interested in. I want each article's title next to its full URL.
Thanks

d6kp6zgx1#

You can use the following example to extract the titles and URLs from that page:

import requests
from bs4 import BeautifulSoup

url = "https://0xdf.gitlab.io/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for link in soup.select(".post-link"):
    print(
        "{:<40} {}".format(
            link.get_text(strip=True), "https://0xdf.gitlab.io" + link["href"]
        )
    )

Prints:

HTB: Toolbox                             https://0xdf.gitlab.io/2021/04/27/htb-toolbox.html
HTB: Bucket                              https://0xdf.gitlab.io/2021/04/24/htb-bucket.html
HTB: Laboratory                          https://0xdf.gitlab.io/2021/04/17/htb-laboratory.html
HTB: APT                                 https://0xdf.gitlab.io/2021/04/10/htb-apt.html
HTB: Time                                https://0xdf.gitlab.io/2021/04/03/htb-time.html
HTB: Luanne                              https://0xdf.gitlab.io/2021/03/27/htb-luanne.html
HTB: CrossFit                            https://0xdf.gitlab.io/2021/03/20/htb-crossfit.html
HTB: Optimum                             https://0xdf.gitlab.io/2021/03/17/htb-optimum.html
Reel2: Root Shell                        https://0xdf.gitlab.io/2021/03/15/reel2-root-shell.html
HTB: Reel2                               https://0xdf.gitlab.io/2021/03/13/htb-reel2.html
HTB: Sense                               https://0xdf.gitlab.io/2021/03/11/htb-sense.html
HTB: Passage                             https://0xdf.gitlab.io/2021/03/06/htb-passage.html
HTB: Sneaky                              https://0xdf.gitlab.io/2021/03/02/htb-sneaky.html
HTB: Academy                             https://0xdf.gitlab.io/2021/02/27/htb-academy.html
HTB: Beep                                https://0xdf.gitlab.io/2021/02/23/htb-beep.html
HTB: Feline                              https://0xdf.gitlab.io/2021/02/20/htb-feline.html
HTB: Charon                              https://0xdf.gitlab.io/2021/02/16/htb-charon.html
HTB: Jewel                               https://0xdf.gitlab.io/2021/02/13/htb-jewel.html
HTB: Apocalyst                           https://0xdf.gitlab.io/2021/02/09/htb-apocalyst.html
HTB: Doctor                              https://0xdf.gitlab.io/2021/02/06/htb-doctor.html
HTB: Europa                              https://0xdf.gitlab.io/2021/02/02/htb-europa.html
HTB: Worker                              https://0xdf.gitlab.io/2021/01/30/htb-worker.html
HTB: Compromised                         https://0xdf.gitlab.io/2021/01/23/htb-compromised.html
HTB: RopeTwo                             https://0xdf.gitlab.io/2021/01/16/htb-ropetwo.html

...and so on.
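
The example above builds each full URL by string concatenation, which works when every href starts with "/". As a minimal sketch of a more forgiving alternative, the standard library's urljoin resolves an href against a base URL whether it is root-relative, page-relative, or already absolute. The helper name to_absolute and the sample href values below are made up for illustration; they are not taken from the page.

```python
from urllib.parse import urljoin

BASE_URL = "https://0xdf.gitlab.io/"  # the page being scraped


def to_absolute(href, base_url=BASE_URL):
    """Resolve href against base_url (hypothetical helper for illustration)."""
    return urljoin(base_url, href)


# Sample href values, made up for illustration.
samples = [
    "/2021/04/27/htb-toolbox.html",   # root-relative
    "tags.html",                      # relative to the base page
    "https://example.com/post.html",  # already absolute, left unchanged
]

for href in samples:
    print("{:<32} {}".format(href, to_absolute(href)))
```

In the loop over soup.select(".post-link"), this would mean passing urljoin(url, link["href"]) to format() in place of the concatenated string.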
