Selenium: XPath can be found in the browser DevTools but cannot be scraped

4smxwvx5  posted on 2023-02-08  in Other

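I want to scrape articles from the news site Al Jazeera. I wrote relative XPaths that locate the sentences correctly in the browser's developer tools. Strangely, scraping the text with exactly the same XPaths fails. For example, take this news article (URL: https://www.aljazeera.com/economy/2023/2/6/who-is-gautam-adani-and-why-is-he-controversial).
The XPaths: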

//header[@class="article-header"]/h1
//header[@class="article-header"]//em
//main[@id="main-content-area"]/div[2]/p[1]
//main[@id="main-content-area"]/div[2]/p[2]
//main[@id="main-content-area"]/div[2]/p[3]
//main[@id="main-content-area"]/div[2]/p[4]

...and so on, but nothing gets scraped.
I have tried both:

.text
.get_attribute('textContent')

Both failed, even though none of this text is invisible.
Please help me scrape these paragraphs.

d4so4syb  #1

All of your locators are correct. To print the text from the website, you should ideally induce WebDriverWait for visibility_of_element_located(), and you can use the following locator strategies:

  • Code block:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("start-maximized")
# A second add_experimental_option("excludeSwitches", ...) call would overwrite
# the first one, so both switches are passed in a single list.
options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
driver.get('https://www.aljazeera.com/economy/2023/2/6/who-is-gautam-adani-and-why-is-he-controversial')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//header[@class='article-header']/h1"))).text)
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//header[@class='article-header']//em"))).text)
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//main[@id='main-content-area']/div[2]/p[1]"))).text)
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//main[@id='main-content-area']/div[2]/p[2]"))).text)
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//main[@id='main-content-area']/div[2]/p[3]"))).text)
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//main[@id='main-content-area']/div[2]/p[4]"))).text)
  • Console output:
Who is Gautam Adani and why is he controversial?
The Indian entrepreneur has seen his wealth plummet after a research firm accused him of ‘brazen stock manipulation’.
Allegations of stock market manipulation and fraud have halved the net worth of Indian tycoon Gautam Adani, one of the wealthiest people in the world, in less than two weeks and wiped more than $110bn from his listed firms in India.
With investor confidence shaken, legislators have demanded an investigation into his businesses. Here’s a look at who Adani is, what concerns have been raised and what has happened since.
Who is Gautam Adani?
He is the founder and chairman of the Adani Group, one of the largest business conglomerates in India. A native of Gujarat — the same state where India’s Prime Minister Narendra Modi is from — Adani, 60, is a college dropout. He walked away from his father’s textile shop to set up a commodities trading business in 1988, his entry into the world of business.
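
The six nearly identical wait calls in the code block above could be folded into a loop that reuses one WebDriverWait. The following is a minimal sketch rather than the answerer's code: `build_xpaths` and `print_article` are hypothetical helper names, and it assumes Selenium 4 is installed and that the cookie banner has already been dismissed.

```python
def build_xpaths(paragraph_count):
    """Return the title, summary and paragraph XPaths used in the question."""
    xpaths = [
        "//header[@class='article-header']/h1",
        "//header[@class='article-header']//em",
    ]
    for i in range(1, paragraph_count + 1):
        xpaths.append(f"//main[@id='main-content-area']/div[2]/p[{i}]")
    return xpaths


def print_article(driver, paragraph_count=4, timeout=20):
    """Print the text of each located element once it is visible."""
    # Imported here so build_xpaths() stays usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)  # one wait object, reused for every XPath
    for xpath in build_xpaths(paragraph_count):
        element = wait.until(EC.visibility_of_element_located((By.XPATH, xpath)))
        print(element.text)
```

With a `driver` created as in the answer, `print_article(driver)` would print the title, the summary and the first four paragraphs in order.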

ekqde3dh  #2

I hope this works for you. Please add the option I have marked in the code.

from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
options = Options()
# options.add_argument('--disable-blink-features=AutomationControlled')
service = ChromeService(executable_path=ChromeDriverManager().install())
options.add_experimental_option('excludeSwitches', ['enable-logging']) # KINDLY ADD THIS OPTION
driver = webdriver.Chrome(service=service, options=options)
URL = 'https://www.aljazeera.com/economy/2023/2/6/who-is-gautam-adani-and-why-is-he-controversial'
driver.get(URL)
# Define your code here
# //header[@class="article-header"]/h1
# //header[@class="article-header"]//em
# //main[@id="main-content-area"]/div[2]/p[1]
# //main[@id="main-content-area"]/div[2]/p[2]
# //main[@id="main-content-area"]/div[2]/p[3]
# //main[@id="main-content-area"]/div[2]/p[4]
h1_tag = driver.find_elements(By.XPATH, '//header[@class="article-header"]/h1')[0]
print(f'h1: {h1_tag.text}')
em_tag = driver.find_elements(By.XPATH, '//header[@class="article-header"]//em')[0]
print(f'em: {em_tag.text}')
for i in range(1, 5):
    p_tag = driver.find_elements(By.XPATH, f'//main[@id="main-content-area"]/div[2]/p[{i}]')[0]
    print(f'p{i}: {p_tag.text}')
driver.quit()
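
Indexing `find_elements(...)[0]` raises an IndexError when nothing matches. As an alternative sketch (an assumption, not part of the answer above), dropping the `[i]` index from the paragraph XPath lets one `find_elements` call return every paragraph, and an empty result simply yields an empty list. `collect_texts` and `scrape_paragraphs` are hypothetical names.

```python
# No [i] index at the end: this matches every <p> in the article body.
PARAGRAPHS_XPATH = '//main[@id="main-content-area"]/div[2]/p'


def collect_texts(elements):
    """Return the stripped, non-empty .text of each element-like object."""
    texts = []
    for element in elements:
        text = (element.text or "").strip()
        if text:
            texts.append(text)
    return texts


def scrape_paragraphs(driver):
    """Fetch all article paragraphs with a single find_elements call."""
    # "xpath" is the string value behind Selenium's By.XPATH constant.
    elements = driver.find_elements("xpath", PARAGRAPHS_XPATH)
    return collect_texts(elements)  # [] when nothing matched, no IndexError
```

Calling `scrape_paragraphs(driver)` after `driver.get(URL)` would return the paragraphs as a list of strings, however many the article has.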

w1jd8yoj  #3

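I rewrote the code and it works now. It failed before because I was trying to drop the code below into another, larger script, and something probably went wrong during the merge.
Combining several different functions (defs) is hard. Thanks for the answers provided.
The following code works: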

# import library
import os
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# default parameters
desktop_path = os.path.join(os.environ['USERPROFILE'], 'Desktop')
edge_driver_path = os.path.join(desktop_path, "msedgedriver.exe")

# page url
url = "https://www.aljazeera.com/economy/2023/2/6/who-is-gautam-adani-and-why-is-he-controversial"

# xpath
new_title = "//header[@class='article-header']/h1"
new_brief = "//header[@class='article-header']//em"
new_par01 = "//main[@id='main-content-area']/div[2]/p[1]"
new_par02 = "//main[@id='main-content-area']/div[2]/p[2]"
new_par03 = "//main[@id='main-content-area']/div[2]/p[3]"
new_par04 = "//main[@id='main-content-area']/div[2]/p[4]"
new_par05 = "//main[@id='main-content-area']/div[2]/p[5]"
new_par06 = "//main[@id='main-content-area']/div[2]/p[6]"
new_par07 = "//main[@id='main-content-area']/div[2]/p[7]"
new_par08 = "//main[@id='main-content-area']/div[2]/p[8]"
new_par09 = "//main[@id='main-content-area']/div[2]/p[9]"
new_par10 = "//main[@id='main-content-area']/div[2]/p[10]"
xpath_list = [new_title, new_brief,
              new_par01, new_par02, new_par03, new_par04, new_par05,
              new_par06, new_par07, new_par08, new_par09, new_par10]

def paragraph_scraping(url, xpath_list):
    # the Edge driver
    s = Service(edge_driver_path)
    driver = webdriver.Edge(service=s)

    # open url
    driver.get(url)

    # manipulate browser windows to load information on page
    driver.set_window_size(1024, 600)
    driver.maximize_window()
    driver.execute_script("window.scrollTo(0, 1000)")
    time.sleep(0.5)
    driver.execute_script("window.scrollTo(0, 500)")
    time.sleep(0.5)
    driver.execute_script("window.scrollTo(0, 300)")
    time.sleep(0.5)
    driver.execute_script("window.scrollTo(0, 100)")
    time.sleep(1)

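    # create paragraph container
    news_sentences = []
    # one short wait, reused for every XPath
    wait = WebDriverWait(driver, 0.5)
    for xpath in xpath_list:
        try:
            # wait until the element is present in the DOM, then read its text
            element = wait.until(EC.presence_of_element_located((By.XPATH, xpath)))
            news_sentences.append(element.get_attribute('textContent'))
        except Exception:
            # skip XPaths that cannot be read (e.g. no match within the timeout)
            pass

    # close the browser once scraping is done
    driver.quit()

    # join sentences
    news_paragraph = "\n".join(news_sentences)

    return news_paragraph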

print(paragraph_scraping(url, xpath_list))
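
The code above reads `textContent` after a presence wait, whereas the first answer reads `.text` after a visibility wait. In Selenium, `.text` returns only the text that is rendered as visible, while the `textContent` attribute returns the node's text whether or not it is displayed. A minimal sketch of a fallback between the two follows; `read_text` is a hypothetical helper that works on any object exposing `.text` and `get_attribute`, such as a Selenium WebElement.

```python
def read_text(element):
    """Prefer the visible text; fall back to textContent when it is empty."""
    visible = (element.text or "").strip()
    if visible:
        return visible
    # textContent also covers text that is in the DOM but not displayed.
    raw = element.get_attribute("textContent") or ""
    return raw.strip()
```

Inside the loop above, `news_sentences.append(read_text(element))` would be one way to use it.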

3j86kqsm  #4

Try using the full XPath:

driver.find_element("xpath", "/html/body/div[1]/div/div[3]/div/div/div/div[1]/main/div[2]/p[1]")
