Using Scrapy and Selenium to crawl a static HTML page, but getting nothing

Posted by ntjbwcob on 2021-08-20 in Java


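I'm new to Scrapy and Selenium. I want to scrape all the paper titles from https://thewebconf.org/www2019/accepted-papers/. I believe this is just a static HTML page, because when I "view page source", all of the content appears in the source. My code is below.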

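import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class Spider_WWW19(scrapy.Spider):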
    name = "www19"
    start_urls = [
        'https://thewebconf.org/www2019/accepted-papers/'
    ]

    def __init__(self):
        # add chrome driver to win10 PATH
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//li//p[contains(@class, 'name')]"))
        )

        selenium_response_text = self.driver.page_source
        hxs = Selector(text=selenium_response_text)
        articles = hxs.xpath("//li//p[contains(@class, 'name')]/text()")
        for article in articles:
            yield {
                # each match is a Selector; .get() returns the text node as a string
                'title': article.get().strip(),
                'year': '2020',
                'conf': 'WWW',
                'conf_long': 'International World Wide Web Conference'
            }

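But I'm running into problems.
Fetching the page with Scrapy (scrapy fetch https://thewebconf.org/www2019/accepted-papers/ > out.html) produces an empty HTML file. Running the spider with Scrapy alone, without Selenium, yields no output and the run simply finishes.
With Selenium, the browser that gets launched does not navigate to https://thewebconf.org/www2019/accepted-papers/ or anywhere else. It just closes.
Thanks for your help.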
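
One way to narrow this down is to test the extraction step on its own, separately from fetching. The sketch below roughly mirrors the XPath //li//p[contains(@class, 'name')] using only Python's standard-library html.parser, swapped in here for Scrapy's Selector so it runs without Scrapy or Selenium. The names TitleParser, extract_titles and SAMPLE_HTML are hypothetical, and the sample markup is an assumption inferred from the XPath, not copied from the real page.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of <p> elements whose class contains 'name', inside an <li>."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0       # number of <li> elements currently open
        self.in_name_p = False  # True while inside a matching <p>
        self.buffer = []        # text fragments of the current <p>
        self.titles = []        # collected, stripped titles

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "p" and self.li_depth > 0:
            # Substring match on the class attribute, like contains(@class, 'name').
            css_class = dict(attrs).get("class") or ""
            if "name" in css_class:
                self.in_name_p = True
                self.buffer = []

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1
        elif tag == "p" and self.in_name_p:
            self.in_name_p = False
            title = "".join(self.buffer).strip()
            if title:
                self.titles.append(title)

    def handle_data(self, data):
        # Unlike /text() in the XPath, this also keeps text of nested inline tags.
        if self.in_name_p:
            self.buffer.append(data)


def extract_titles(html_text):
    """Return the list of titles found in an HTML string."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.titles


# Hypothetical markup: the class names and structure are assumptions.
SAMPLE_HTML = """
<ul>
  <li><p class="paper-name"> First Paper Title </p><p class="authors">A. Author</p></li>
  <li><p class="paper-name">Second Paper Title</p></li>
</ul>
<p class="paper-name">Not inside a list item</p>
"""

titles = extract_titles(SAMPLE_HTML)
print(titles)
```

The same function could be pointed at a copy of the page saved from the browser, for example extract_titles(open("out.html", encoding="utf-8").read()). If titles come back from the saved copy, the extraction logic is probably fine and the problem more likely lies in how the page is being fetched.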

python selenium web-crawler scrapy

Source: https://stackoverflow.com/questions/68308868/using-scrapy-and-selenium-to-crawl-a-static-html-but-get-nothing