Getting data from two different URLs into the same Scrapy Item

mnemlml8 posted 5 months ago in Other

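I'm new to Scrapy and I've been trying to scrape this website: https://quotes.toscrape.com/
The data I want is:

  • the quote;
  • the author;
  • the date of birth and
  • the place of birth.

To get the first two (quote and author), I have to scrape
https://quotes.toscrape.com/
but to get the other two (date of birth and place of birth), I have to go to each author's "about" page, which is linked from the quote.
My items.py code is: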

import scrapy

class QuotesItem(scrapy.Item):
    quote = scrapy.Field()
    author = scrapy.Field()
    date_birth = scrapy.Field()
    local_birth = scrapy.Field()

The code of quotespider.py is:

import scrapy
from ..items import QuotesItem  

class QuotespiderSpider(scrapy.Spider):
    name = "quotespider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        all_items= QuotesItem()

        quotes = response.xpath("//div[@class='row']/div[@class='col-md-8']/div")

        for quote in quotes:
            all_items['quote'] = quote.xpath("./span[@class='text']/text()").get()
            all_items['author'] = quote.xpath("./span[2]/small/text()").get()
            # Here we get the first 2 datas.

            about = quote.xpath("./span[2]/small/following-sibling::a/@href").get()
            url_about = 'https://quotes.toscrape.com' + about  # URL to go to 'about author'.

            yield response.follow(url_about, callback=self.about_autor,
                                  cb_kwargs={'items': all_items})
            
            yield item

    def about_autor(self, response, items):  # Should get the other two datas (date_birth, local_bith)

        item['date_birth '] = response.xpath("/html/body/div/div[2]/p[1]/span[1]/text()").get()
        item['local_bith '] = response.xpath("/html/body/div/div[2]/p[1]/span[2]/text()").get()

        yield item


I tried using the cb_kwargs argument, as in the quotespider.py code above, but it doesn't work.

This is what I get:

[
{"quote": "quote1", 
"autor": "author1",  
"date_birth": "",
 "local_birth": ""}, # Empty for the first 10 items
...
{"quote":"quote10", 
"autor": "author10", 
"date_birth": "", 
"local_birth": ""}, # 10th element also empty

{"quote":"quote10", 
"autor": "author10", 
"date_birth": "December 16, 1775", 
"local_birth": "in Steventon"},  # 10th element *repeated* with wrong date_birth and local_birth 

{"quote": "quote10", 
"autor": "author10", 
"date_birth": "June 01, 1926", 
"local_birth": "United States"}, # 10th element *repeated* with wrong date_birth and local_birth


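No local_birth or date_birth is added to the first 10 quotes (the ones yielded in the parse function), but the last quote is repeated once for every local_birth and date_birth value.

What I expect to get is something like this: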

[{'quote': 'quote1',
'author': 'author1',
'date_birth': 'date_birth1',
'local_birth': 'local_birth1'},

{'quote': 'quote2',
'author': 'author2',
'date_birth': 'date_birth2',
'local_birth': 'local_birth2'},

{'quote': 'quote3',
'author': 'author3',
'date_birth': 'date_birth3',
'local_birth': 'local_birth3'},
]

Answer #1 (zfciruhq)

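There are a few errors in your code that need fixing. In the about_autor method you accept a parameter named items, but the method body uses the variable item. There is also a yield item statement in your parse method right below the yield response.follow call; it will raise a NameError, because no variable named item exists in parse.
Beyond that, I have a few additional notes: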

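  • When iterating over a group of selectors, move the item initialization inside the loop so that each request carries its own item, rather than every callback mutating the same shared object and overwriting its earlier values.
  • cb_kwargs stands for callback keyword arguments: Scrapy passes each entry to the callback as a keyword argument, so the second parameter of about_autor must be named after the key used in cb_kwargs.
  • Because the quotes site features multiple quotes from the same author, add dont_filter=True to the response.follow call so that repeated requests for the same author page are not dropped by the duplicate filter.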
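
The first two notes can be illustrated without Scrapy at all. The following is a minimal plain-Python sketch, not Scrapy's real machinery: run is a hypothetical stand-in for the engine, QUOTES is made-up data, and it assumes the loop in parse finishes before any callback runs (which is typically the case, since downloading an author page takes far longer than looping over ten quotes).

```python
# Made-up data standing in for the quotes found on one page.
QUOTES = [("quote1", "author1"), ("quote2", "author2"), ("quote3", "author3")]


def run(requests):
    """Hypothetical stand-in for the crawler engine (an assumption, not Scrapy's API).

    Every request is collected first; callbacks only run afterwards.
    """
    pending = list(requests)  # the parse loop finishes here
    results = []
    for callback, cb_kwargs in pending:
        # The keys of cb_kwargs become keyword arguments of the callback.
        for item in callback("fake response", **cb_kwargs):
            results.append(dict(item))  # snapshot, like an exporter writing the item
    return results


def about_author(response, item):
    item["date_birth"] = "birth date of " + item["author"]  # placeholder value
    yield item


def parse_shared():
    item = {}  # one object shared by the whole loop
    for quote, author in QUOTES:
        item["quote"], item["author"] = quote, author
        yield about_author, {"item": item}


def parse_fresh():
    for quote, author in QUOTES:
        item = {"quote": quote, "author": author}  # new object per quote
        yield about_author, {"item": item}


shared = run(parse_shared())
fresh = run(parse_fresh())
print([i["quote"] for i in shared])  # ['quote3', 'quote3', 'quote3']
print([i["quote"] for i in fresh])   # ['quote1', 'quote2', 'quote3']

# A key that does not match the parameter name fails at call time.
try:
    about_author("fake response", **{"items": {}})
    mismatch_error = None
except TypeError as exc:
    mismatch_error = exc
print(type(mismatch_error).__name__)  # TypeError
```

With the shared object, every callback sees the values written by the last loop iteration, which matches the repeated "quote10" entries in the question. Creating the item inside the loop gives each callback its own data.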

This seems to work fine for me.
For example:

import scrapy

class QuotesItem(scrapy.Item):
    quote = scrapy.Field()
    author = scrapy.Field()
    date_birth = scrapy.Field()
    local_birth = scrapy.Field()

class QuotespiderSpider(scrapy.Spider):
    name = "quotespider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        quotes = response.xpath("//div[@class='row']/div[@class='col-md-8']/div")
        for quote in quotes:
            item = QuotesItem()  # a fresh item for every quote
            item['quote'] = quote.xpath("./span[@class='text']/text()").get()
            item['author'] = quote.xpath("./span[2]/small/text()").get()
            # The first two fields come from the listing page.
            about = quote.xpath("./span[2]/small/following-sibling::a/@href").get()
            url_about = 'https://quotes.toscrape.com' + about  # URL of the 'about author' page.
            yield response.follow(url_about, callback=self.about_autor,
                                  cb_kwargs={'item': item}, dont_filter=True)

    def about_autor(self, response, item):  # Fills in the other two fields (date_birth, local_birth)
        item['date_birth'] = response.xpath("/html/body/div/div[2]/p[1]/span[1]/text()").get()
        item['local_birth'] = response.xpath("/html/body/div/div[2]/p[1]/span[2]/text()").get()
        yield item

Output:

{'author': 'Steve Martin',
 'date_birth': 'August 14, 1945',
 'local_birth': 'in Waco, Texas, The United States',
 'quote': '“A day without sunshine is like, you know, night.”'}
2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/J-K-Rowling/>
{'author': 'J.K. Rowling',
 'date_birth': 'July 31, 1965',
 'local_birth': 'in Yate, South Gloucestershire, England, The United Kingdom',
 'quote': '“It is our choices, Harry, that show what we truly are, far more '
          'than our abilities.”'}
2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Eleanor-Roosevelt/>
{'author': 'Eleanor Roosevelt',
 'date_birth': 'October 11, 1884',
 'local_birth': 'in The United States',
 'quote': '“A woman is like a tea bag; you never know how strong it is until '
          "it's in hot water.”"}
2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Albert-Einstein/>
{'author': 'Albert Einstein',
 'date_birth': 'March 14, 1879',
 'local_birth': 'in Ulm, Germany',
 'quote': '“The world as we have created it is a process of our thinking. It '
          'cannot be changed without changing our thinking.”'}
2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Marilyn-Monroe/>
{'author': 'Marilyn Monroe',
 'date_birth': 'June 01, 1926',
 'local_birth': 'in The United States',
 'quote': "“Imperfection is beauty, madness is genius and it's better to be "
          'absolutely ridiculous than absolutely boring.”'}
2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Thomas-A-Edison/>
{'author': 'Thomas A. Edison',
 'date_birth': 'February 11, 1847',
 'local_birth': 'in Milan, Ohio, The United States',
 'quote': "“I have not failed. I've just found 10,000 ways that won't work.”"}
2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Albert-Einstein/>
{'author': 'Albert Einstein',
 'date_birth': 'March 14, 1879',
 'local_birth': 'in Ulm, Germany',
 'quote': '“Try not to become a man of success. Rather become a man of value.”'}
2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Albert-Einstein/>
{'author': 'Albert Einstein',
 'date_birth': 'March 14, 1879',
 'local_birth': 'in Ulm, Germany',
 'quote': '“There are only two ways to live your life. One is as though '
          'nothing is a miracle. The other is as though everything is a '
          'miracle.”'}
2023-11-22 15:56:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Andre-Gide/>
{'author': 'André Gide',
 'date_birth': 'November 22, 1869',
 'local_birth': 'in Paris, France',
 'quote': '“It is better to be hated for what you are than to be loved for '
          'what you are not.”'}
2023-11-22 15:56:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Jane-Austen/>
{'author': 'Jane Austen',
 'date_birth': 'December 16, 1775',
 'local_birth': 'in Steventon Rectory, Hampshire, The United Kingdom',
 'quote': '“The person, be it gentleman or lady, who has not pleasure in a '
          'good novel, must be intolerably stupid.”'}
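
The output above contains three items scraped from the Albert Einstein page, which is only possible because of dont_filter=True. As a rough illustration of why, here is a simplified sketch of a duplicate filter. The schedule function and the URL list are hypothetical, and Scrapy's real filter works on request fingerprints rather than raw URL strings.

```python
def schedule(urls, dont_filter=False):
    """Return the URLs that would actually be requested.

    Hypothetical, simplified stand-in for a duplicate filter (an assumption,
    not Scrapy's implementation).
    """
    seen = set()
    scheduled = []
    for url in urls:
        if not dont_filter and url in seen:
            continue  # duplicate dropped: its callback never runs, so its item is lost
        seen.add(url)
        scheduled.append(url)
    return scheduled


# One author page per quote; Einstein is quoted three times.
author_pages = [
    "/author/Albert-Einstein/",
    "/author/J-K-Rowling/",
    "/author/Albert-Einstein/",
    "/author/Jane-Austen/",
    "/author/Albert-Einstein/",
]

filtered = schedule(author_pages)
unfiltered = schedule(author_pages, dont_filter=True)
print(len(filtered))    # 3
print(len(unfiltered))  # 5
```

Since each item is only yielded from the author-page callback, a dropped request means a missing quote. The trade-off of dont_filter=True is that the same author page is downloaded once per quote.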
