Getting `scrapy` to produce a nested data structure

hzbexzde · asked 8 months ago · in: Other

I am using scrapy to crawl this website and scrape its data.
I want the scraped data to come out in a nested structure, something like this:

{
    denomination1: {
        date1: { bondNumbers: [...] },
        date2: { bondNumbers: [...] },
        ...
    },
    denomination2: {
        date1: { bondNumbers: [...] },
        date2: { bondNumbers: [...] },
        ...
    },
    ...
}

Here is the spider I wrote:

import scrapy

class Savings(scrapy.Spider):
    name        = 'savings'
    start_urls  = [
        'https://savings.gov.pk/download-draws/',
    ]

    def parse(self, response):

        for option in response.css('select option'):
            denomination = option.css('::text').get()
            url          = option.css('::attr(value)').get()
            yield {
                denomination: response.follow(url, self.parseDrawList)
            }

    def parseDrawList(self, response):

        for a in response.css('select option'):            
            date = a.css('::text').get()
            url  = a.css('::attr(value)').get()
            yield {
                date: response.follow(url, self.parseDraw)
            }

    def parseDraw(self, response):
        yield {
            'bondNumbers': response.selector.re(r'\d{6}'),
        }

Each callback scrapes a different page in the site's page hierarchy (if we can call it that), so each level of the nested data structure would be filled with data from a page at a different level.
This code does not work and gives me an error.
None of the tutorials or documentation I have seen use scrapy to produce nested data structures.
Is there a way to get nested data out of scrapy? I would also like to know whether such a solution would sacrifice scrapy's concurrent request execution.


hgncfbus1#

You need to take the information gathered in each callback and pass it along to the next one, using either the request meta dict or the cb_kwargs argument of response.follow. In the final callback you can then assemble the fully nested structure and yield it as a single item.
For example:

import scrapy

class Savings(scrapy.Spider):
    name        = 'savings'
    start_urls  = [
        'https://savings.gov.pk/download-draws/',
    ]

    def parse(self, response):
        for option in response.css('select option'):
            denomination = option.css('::text').get()
            url          = option.css('::attr(value)').get()
            yield response.follow(url, self.parseDrawList, cb_kwargs={'denomination': denomination})

    def parseDrawList(self, response, denomination=None):
        for a in response.css('tr td a'):
            date = a.css('::text').get()
            url  = a.css('::attr(href)').get()
            yield response.follow(url, self.parseDraw, cb_kwargs={'denomination': denomination, "date": date})

    def parseDraw(self, response, denomination=None, date=None):
        yield {
            denomination: {
                date: {
                    'bondNumbers': response.selector.re(r'\d{6}')
                }
            }
        }
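The final callback pulls the six-digit bond numbers out of the page with a regex; response.selector.re() behaves like re.findall() over the response text. A quick standalone sketch of that extraction (the sample text here is made up for illustration):

```python
import re

# Made-up stand-in for the text of a draw-results page.
sample = "Draw results: 749492 457346 692793"

# Same pattern the spider uses: any run of exactly six digits.
bond_numbers = re.findall(r'\d{6}', sample)
print(bond_numbers)  # ['749492', '457346', '692793']
```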

Example output

2023-08-29 15:40:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk:443/wp-content/uploads/2017/03/16-02-2015Rs.1500.txt>
{'Rs. 1500/- Draws': {'16-02-2015': {'bondNumbers': ['749492', '457346', '692793', '914362', '000535', ...]}}}
2023-08-29 15:40:50 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://savings.gov.pk:443/wp-content/uploads/10-03-2021-Rs-25000-Premium.txt> from <GET http://savings.gov.pk/wp-content/uploads/10-03-2021-Rs-25000-Premium.txt>
2023-08-29 15:40:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://savings.gov.pk:443/wp-content/uploads/10-12-2021-Rs-25000-Premium.txt> (referer: None)
2023-08-29 15:40:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk:443/wp-content/uploads/10-12-2021-Rs-25000-Premium.txt>
{'Rs 25000/- Premium Bonds Draws': {'10-12-2021': {'bondNumbers': ['016253', '067408', '038203', '171265', '551833', '655804', '916353', '001858', '064668', '149237', '220908', '293362', '361338', '447697', '512113', '610773', ... ]}}}
2023-08-29 15:40:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://savings.gov.pk:443/wp-content/uploads/10-03-2021-Rs-25000-Premium.txt> (referer: None)
2023-08-29 15:40:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk:443/wp-content/uploads/10-03-2021-Rs-25000-Premium.txt>
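Note that this spider yields one small nested dict per draw, which preserves scrapy's concurrency (every request is still scheduled independently). If you want a single combined structure keyed by denomination and then date, you can merge the items after the crawl, for example in an item pipeline's close_spider or over an exported JSON lines file. A minimal merge sketch (merge_items and the sample items are hypothetical names for illustration):

```python
def merge_items(items):
    """Merge per-draw items shaped like
    {denomination: {date: {'bondNumbers': [...]}}}
    into one dict keyed by denomination, then by date."""
    merged = {}
    for item in items:
        for denomination, dates in item.items():
            # Create the denomination bucket if needed, then fold in
            # this item's date -> bondNumbers mapping.
            merged.setdefault(denomination, {}).update(dates)
    return merged

# Two items as the spider would yield them, for the same denomination.
items = [
    {'Rs. 1500/- Draws': {'16-02-2015': {'bondNumbers': ['749492']}}},
    {'Rs. 1500/- Draws': {'15-05-2015': {'bondNumbers': ['123456']}}},
]
print(merge_items(items))
# {'Rs. 1500/- Draws': {'16-02-2015': {'bondNumbers': ['749492']},
#                       '15-05-2015': {'bondNumbers': ['123456']}}}
```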
