TripAdvisor scraping Python script is exporting rows in multiple different formats

wgeznvg7  posted on 2021-08-20 in Java

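I'm putting this script together for an academic research paper. I'm an absolute novice, self-taught, and I've cobbled this together!
What I want: a CSV of roughly 560 rows, with one column each for date (mdyyyy), review, rating, and username (the username isn't in the script yet, just FYI).
I've gotten it to run without errors, but the output is wrong. I have thousands of rows. The script is looping and writing the data out in multiple formats: 1) 500ish rows with month/date and review, 2) 500ish rows with rating and review, 3) 500ish rows with name, date, and review all in the same column... and so on.
I spent hours trying to fix that, and now I have another problem:
Traceback (most recent call last): line 49, in date = " ".join(date[j].text.split(" ")[-2:]) IndexError: list index out of range
I'm running this in Python 3.9.6, if that makes a difference.
I have three questions:
How do I fix this date index-out-of-range error?
Is there any obvious error in the script that is causing it to create thousands of rows in different formats?
How do I add the username? I've tried, but I can't seem to find the right XPath. Here is the site I'm scraping: https://www.tripadvisor.com/showuserreviews-g189447-d207187-r773649540-monastery_of_st_john-patmos_dodecanese_south_aegean.html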

import csv
import sys
from selenium import webdriver
import time

# default path to file to store data

path_to_file = r"D:\Documents\Archaeology\Projects\Patmos\scraped\monastery6.csv"

# default number of scraped pages

num_page = 5

# default tripadvisor website of hotel or things to do (attraction/monument)

url = "https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html"

# url = "https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html"

# if you pass the inputs in the command line

if (len(sys.argv) == 4):
    path_to_file = sys.argv[1]
    num_page = int(sys.argv[2])
    url = sys.argv[3]

# import the webdrive -- NMS 20210705

driver = webdriver.Chrome("C:/Users/nsusm/AppData/Local/Programs/Python/Python39/webdriver/bin/chromedriver.exe")
driver.get("https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html")

# open the file to save the review

csvFile = open(path_to_file, 'a', encoding="utf-8")
csvWriter = csv.writer(csvFile, delimiter=',')
csvWriter.writerow([str ('title'), str ('rating'), str ('review'), str ('date')])

# change the value inside the range to save more or less reviews

for i in range(0, 48, 1):

    # expand the review 
    time.sleep(2)

# define container (this is the whole box of the Trip Advisor review, excluding the date of the review)

    container = driver.find_elements_by_xpath(".//div[@class='review-container']")

# grab also the date of review

    date = driver.find_elements_by_xpath(".//class[@class='prw_reviews_stay_date_hsx']")

    for j in range(len(container)):

        rating = container[j].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class").split("_")[3]
        title = container[j].find_element_by_xpath(".//div[contains(@class, noQuotes)]").text.replace("\n", "  ")
        review = container[j].find_element_by_xpath(".//p[@class='partial_entry']").text.replace("\n", "  ")
        date = " ".join(date[j].text.split(" ")[-2:])

# write data into csv

        csvWriter.writerow([title, rating, review, date])

# change the page

    driver.find_element_by_xpath('.//a[@class="nav next ui_button primary"]').click()

# quit selenium

driver.quit()

# FYI you need to close all windows for the file to write
Answer #1 (ibps3vxo)

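That date lookup comes back empty, so date[j] has nothing to index. The review date is inside the container, so you can pull it from there along with everything else.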
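To illustrate the failure without a browser, here is a minimal sketch. Selenium's `find_elements_*` methods return an empty list when nothing matches (the question's XPath asks for an element literally named `class`, which is probably why), and indexing an empty list raises the same `IndexError`. The `safe_get` helper is hypothetical and only for illustration.

```python
# Stand-in for what find_elements_by_xpath returns when the XPath matches nothing.
dates = []


def safe_get(items, index, default=""):
    """Return items[index], or default when the index is out of range."""
    return items[index] if 0 <= index < len(items) else default


# Indexing the empty list reproduces the error from the traceback.
try:
    dates[0]
    error_name = None
except IndexError as exc:
    error_name = type(exc).__name__

print(error_name)                 # IndexError
print(repr(safe_get(dates, 0)))   # ''
```

Checking `len(date)` against `len(container)` before the inner loop would surface the mismatch early instead of failing partway through a page.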

rating = container[j].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class").split("_")[3]
person = container[j].find_element_by_class_name('info_text').text.split("\n")[0]  # username only, not the location
title = container[j].find_element_by_css_selector('span.noQuotes').text.replace("\n", "  ")
review = container[j].find_element_by_xpath(".//p[@class='partial_entry']").text.replace("\n", "  ")
review_date = container[j].find_element_by_class_name('ratingDate').text[9:]  # drop the leading "Reviewed "

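Changes: the title lookup now targets just the title span rather than the whole div. Added code to find the person (dropping the location, which is on the second line). The date is now found inside the container, with the leading "Reviewed " removed.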
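The string handling in those lines can be tried out without Selenium. The sketch below is only an illustration: the helper names and the sample strings are made up, and the real text on the page may differ. It shows the three parsing steps and writes one row per review, with the header written once.

```python
import csv
import io


def rating_from_class(class_attr):
    """'ui_bubble_rating bubble_50' -> '50' (the fourth piece when split on '_')."""
    return class_attr.split("_")[3]


def first_line(text):
    """Keep the username line and drop the location line below it."""
    return text.split("\n")[0]


def strip_reviewed_prefix(text):
    """Drop a leading 'Reviewed ' (9 characters) if present, like text[9:]."""
    prefix = "Reviewed "
    return text[len(prefix):] if text.startswith(prefix) else text


# Made-up sample values standing in for what the element lookups would return.
sample_class = "ui_bubble_rating bubble_50"
sample_info = "SampleUser\nAthens, Greece"
sample_date = "Reviewed October 8, 2020"

# Write to an in-memory buffer so the sketch runs without touching the disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["username", "rating", "date"])  # header written once
writer.writerow([
    first_line(sample_info),
    rating_from_class(sample_class),
    strip_reviewed_prefix(sample_date),
])

rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
# [['username', 'rating', 'date'], ['SampleUser', '50', 'October 8, 2020']]
```

Because every field comes from the same container, each review ends up as exactly one row with the columns in a fixed order.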
