Scrapy: how do I export the cleaned_data variable to the next stage using an in-memory data structure?


I built a scraper that scrapes a website, and it works perfectly fine. I used to store the data in two separate JSON files, raw_data.json and cleaned_data.json. But I am currently trying to merge the scraper into my company's framework, and there is no local storage in the framework's process. So I am trying to export the data through an in-memory data structure that passes the raw_data and cleaned_data variables on to the next step. I have turned the scraper into a package, so importing works fine, but when I do a test run the output is just empty. I have been stuck on this for a few days now; is there a way to do this?
All I want to do is export the two variables raw_data and cleaned_data into run_spider.py once they are ready, using the dispatcher.
Here is part of my pipelines.py:

from itemadapter import ItemAdapter


class RawDataPipeline:

    def __init__(self):
        self.raw_data = []
    
    def process_item(self, item, spider):
        # Basic data validation: Check if the scraped item is not empty
        adapter = ItemAdapter(item)
        if adapter.get('project_source'):
            self.raw_data.append(adapter.asdict())
        return item
    
    def close_spider(self, spider):
        """
        with open('raw_data.json', 'w',encoding='utf-8') as file:
            json.dump(self.raw_data, file, indent=2, ensure_ascii=False)
        """
        spider.crawler.signals.send_catch_log(signal=spider.custom_close_signal, raw_data=self.raw_data)
        return self.raw_data

class CleanedDataPipeline:

    def __init__(self):
        self.cleaned_data = []
        self.list_dic = {}
    
    def process_item(self, item, spider):
        cleaned_item = self.clean_item(item)
        self.cleaned_data.append(cleaned_item)
        return item
    
    def close_spider(self, spider):
        # Convert values to list for keys in list_dic
        for key in self.list_dic:
            for cleaned_item in self.cleaned_data:
                self.convert_to_list(cleaned_item, key)
        
        #with open('cleaned_data.json', 'w', encoding='utf-8') as file:
        #    json.dump(self.cleaned_data, file, indent=2, ensure_ascii=False)
    
        # Log list_dic
        #spider.log("List_dic: %s" % json.dumps(self.list_dic, indent=2, ensure_ascii=False))
    
        spider.crawler.signals.send_catch_log(signal=spider.custom_close_signal, cleaned_data=self.cleaned_data)
    
        return self.cleaned_data

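For context, the pipelines above send spider.custom_close_signal, which is not one of Scrapy's built-in signals, so it has to be defined somewhere and exposed on the spider. A minimal sketch of what that could look like (the module name, spider name, and layout are assumptions based on the code above):

# spiders/nieuwbouwspider.py (sketch)
import scrapy

# Any unique object can serve as a signal for Scrapy's SignalManager
custom_close_signal = object()

class NieuwbouwspiderSpider(scrapy.Spider):
    name = "nieuwbouwspider"

    # Exposed as a class attribute so the pipelines can call
    # spider.crawler.signals.send_catch_log(signal=spider.custom_close_signal, ...)
    custom_close_signal = custom_close_signal

    # ... start_urls, parse(), etc. omitted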
Here is the Python script where I launch the spider and try to get the data when the spider closes:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from pydispatch import dispatcher
# (project-specific imports for settings, the spider class and the pipelines omitted)

def spider_closed(signal, sender, **kwargs):
    # Access the data after the spider is closed
    raw_data = RawDataPipeline().raw_data
    cleaned_data = CleanedDataPipeline().cleaned_data

    print("Raw Data:", raw_data)
    print("Cleaned Data:", cleaned_data)

def run_spider():
    # Create a CrawlerProcess
    process = CrawlerProcess(settings)

    # Connect the spider_closed signal to your callback function
    dispatcher.connect(spider_closed, signal=signals.spider_closed)

    # Add your spider to the process
    process.crawl(NieuwbouwspiderSpider)

    # Start the crawling process
    process.start()

run_spider()


This does not work; the printed output is just empty. Is there a solution?
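For comparison, here is a sketch of what the script could look like if it connected directly to the custom signal that the pipelines send, so the data comes from the live pipeline instances instead of the freshly created, empty ones above (the import path for the spider module is hypothetical):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical import path; adjust to wherever the spider and signal object live
from nieuwbouw.spiders.nieuwbouwspider import NieuwbouwspiderSpider, custom_close_signal

collected = {}

def on_scraped_data(raw_data=None, cleaned_data=None, **kwargs):
    # Called once per pipeline by send_catch_log() when the spider closes
    if raw_data is not None:
        collected['raw_data'] = raw_data
    if cleaned_data is not None:
        collected['cleaned_data'] = cleaned_data

def run_spider():
    process = CrawlerProcess(get_project_settings())

    # Create the crawler explicitly so its SignalManager is available
    # before the crawl starts, then listen for the custom signal
    crawler = process.create_crawler(NieuwbouwspiderSpider)
    crawler.signals.connect(on_scraped_data, signal=custom_close_signal)

    process.crawl(crawler)
    process.start()

    return collected.get('raw_data'), collected.get('cleaned_data')

raw_data, cleaned_data = run_spider()
print("Raw Data:", raw_data)
print("Cleaned Data:", cleaned_data)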

bqucvtff1#

I found a solution using dispatcher.send and signals from within close_spider.
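Below is a rough sketch of how that dispatcher-based approach can look; the names scraped_data_signal and handle_scraped_data are illustrative, and pydispatch refers to the PyDispatcher package:

from pydispatch import dispatcher
from itemadapter import ItemAdapter

# A plain object is enough to act as a custom signal; both the pipeline and
# the script must refer to this same object (e.g. import it from one module)
scraped_data_signal = object()

class CleanedDataPipeline:
    """Pipeline side: keep items in memory and push them out on close."""

    def __init__(self):
        self.cleaned_data = []

    def process_item(self, item, spider):
        self.cleaned_data.append(ItemAdapter(item).asdict())
        return item

    def close_spider(self, spider):
        # Hand the collected list to any connected receiver, no file involved
        dispatcher.send(signal=scraped_data_signal, sender=spider,
                        cleaned_data=self.cleaned_data)

# Script side: connect the receiver BEFORE process.start(), so it is called
# with the data gathered by the live pipeline instance
def handle_scraped_data(cleaned_data=None, **kwargs):
    print("Cleaned Data:", cleaned_data)

dispatcher.connect(handle_scraped_data, signal=scraped_data_signal)

The raw_data side can work the same way, either with a second signal or by sending both keyword arguments on the one signal.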
