Scraping emails from 100 websites in Google search results with Chrome

aoyhnmkz posted on 2023-04-27 in Go

I have code that saves the emails from a single website perfectly:

import re
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.powercurbers.com/dealers/?region=13&area=318')
padla = driver.page_source
suka = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
gnida = []
for huj in re.finditer(suka, padla):
    gnida.append(huj.group())

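I now want to combine this code with code that collects all the websites from a Google search. I am facing two problems. First: the browser shows up to 100 results, and the page really does contain 100 results, but the code below returns only 10 sites: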

import validators
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.google.com/search?q=%D0%B3%D1%80%D1%83%D0%B7%D0%BE%D0%BF%D0%B5%D1%80%D0%B5%D0%B2%D0%BE%D0%B7%D0%BA%D0%B8+%D0%B0%D1%80%D1%85%D0%B0%D0%BD%D0%B3%D0%B5%D0%BB%D1%8C%D1%81%D0%BA+%D0%B0%D1%81%D1%82%D1%80%D0%B0%D1%85%D0%B0%D0%BD%D1%8C&ei=AQhEZNPgBtSF9u8Pop-4qA4&ved=0ahUKEwiT5YuZ8b3-AhXUgv0HHaIPDuUQ4dUDCBA&uact=5&oq=%D0%B3%D1%80%D1%83%D0%B7%D0%BE%D0%BF%D0%B5%D1%80%D0%B5%D0%B2%D0%BE%D0%B7%D0%BA%D0%B8+%D0%B0%D1%80%D1%85%D0%B0%D0%BD%D0%B3%D0%B5%D0%BB%D1%8C%D1%81%D0%BA+%D0%B0%D1%81%D1%82%D1%80%D0%B0%D1%85%D0%B0%D0%BD%D1%8C&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIHCCEQoAEQCjIHCCEQoAEQCjoECAAQRzoFCAAQgAQ6BggAEBYQHjoFCCEQoAE6BAghEBVKBAhBGABQmAZYthhgjRxoAHACeACAAaEHiAHOHJIBDTAuMS4wLjEuMi4xLjKYAQCgAQHIAQjAAQE&sclient=gws-wiz-serp")
results_list = driver.find_elements(By.TAG_NAME, 'cite')

for i in range(len(results_list)):
    results_list[i] = results_list[i].text.replace(">", "/").replace("›", "/").replace(" ", "")
    if not validators.url(results_list[i]):
        results_list[i] = ''

results_list = list(filter(None, results_list))

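The length of the list is 10. Why? Is there a way to get all of the sites?
Second: how can I write a loop that runs the email scraping on each site? When I write: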

gnida = []
import re
for h in results_list:
    padla = driver.page_source
    for huj in re.finditer(suka, padla):
        gnida.append(huj.group())

the gnida list is empty. Any help is greatly appreciated.

31moq8wy1#

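Google is harder to scrape than Yahoo, and scraping it is technically against its policy. If Yahoo works just as well for you, here is one option for getting the links of the top results: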

import requests
from bs4 import BeautifulSoup

# Yahoo search URL
url = 'https://search.yahoo.com/search?p={}&fr=yfp-t-s&fp=1&toggle=1&cop=mss&ei=UTF-8&b={}'

# Search query
query = 'грузоперевозки архангельск астрахань'

# Number of results to fetch
num_results = 1000

# Results per page
results_per_page = 10

# Number of pages to fetch
num_pages = num_results // results_per_page

# List to store URLs
urls = []

# Loop over search result pages
for i in range(num_pages):
    # Calculate start index
    start = i * results_per_page + 1

    # Send GET request
    res = requests.get(url.format(query, start))
    #print(res.text)

    # Parse HTML
    soup = BeautifulSoup(res.content, 'lxml')

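    # Collect links from anchors that carry a target attribute
    theas = soup.find_all("a", attrs={"target": True})
    for a in theas:
        href = a.get("href")
        # Skip anchors without an href; a.get() returns None for them
        if href and "http" in href:
            urls.append(href)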

# Print URLs
print(urls)
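
As for the second problem, the loop in the question reads driver.page_source on every iteration but never navigates to the site, so it keeps scanning the same page. Below is a minimal sketch of a per-site loop. The names extract_emails, scrape_emails and fetch are hypothetical helpers, not part of Selenium or requests, and EMAIL_RE is a deliberately simplified pattern, so it is an assumption that it is good enough here; the longer regex from the question can be swapped in.

```python
import re

# Simplified email pattern (an assumption, much looser than the regex in the question).
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+", re.IGNORECASE)


def extract_emails(html):
    """Return the email-like strings found in html, without exact duplicates."""
    seen = []
    for match in EMAIL_RE.finditer(html):
        email = match.group()
        if email not in seen:
            seen.append(email)
    return seen


def scrape_emails(urls, fetch):
    """Load every URL with fetch(url) and map each URL to the emails found on it.

    fetch is any callable that takes a URL and returns the page HTML as a string.
    """
    found = {}
    for url in urls:
        try:
            html = fetch(url)
        except Exception as exc:
            # One unreachable site should not stop the whole run.
            print(f"skipping {url}: {exc}")
            continue
        found[url] = extract_emails(html)
    return found
```

With Selenium, fetch could be a small function that calls driver.get(url) and then returns driver.page_source; the driver.get call is the step missing from the loop in the question. With requests, something like lambda u: requests.get(u, timeout=10).text should work for pages that do not need JavaScript.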
