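TripAdvisor scraping Python script is exporting multiple different versions of rows

I am putting this script together for an academic research paper. I am an absolute beginner, self-taught, and have cobbled this together!
What I want: a CSV of about 560 rows, with one column each for date (mdyyyy), review, rating, and username (username is not yet included in the script, FYI).
I have gotten it to run without errors, but the output is wrong. I get thousands of rows. The script is looping and outputting the data in multiple formats: 1) 500-ish rows with month/date and review, 2) 500-ish rows with rating and review, 3) 500-ish rows with name, date, and review all in the same column, and so on.
I spent hours trying to fix this, and now I have another error:
Traceback (most recent call last):
line 49, in
date = " ".join(date[j].text.split(" ")[-2:])
IndexError: list index out of range
I am running this in Python 3.9.6, if that makes a difference.
I have three questions:
1) How do I fix this date index-out-of-range error?
2) Is there anything obviously wrong in the script that makes it create thousands of rows in different formats?
3) How do I add the username? I have tried, but I cannot seem to find the right XPath. Here is the site I am scraping: https://www.tripadvisor.com/showuserreviews-g189447-d207187-r773649540-monastery_of_st_john-patmos_dodecanese_south_aegean.html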

import csv
import sys
from selenium import webdriver
import time

# default path to file to store data

path_to_file = r"D:\Documents\Archaeology\Projects\Patmos\scraped\monastery6.csv"

# default number of scraped pages

num_page = 5

# default tripadvisor website of hotel or things to do (attraction/monument)

url = "https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html"

# url = "https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html"

# if you pass the inputs in the command line

if (len(sys.argv) == 4):
    path_to_file = sys.argv[1]
    num_page = int(sys.argv[2])
    url = sys.argv[3]

# import the webdrive -- NMS 20210705

driver = webdriver.Chrome("C:/Users/nsusm/AppData/Local/Programs/Python/Python39/webdriver/bin/chromedriver.exe")
driver.get("https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html")

# open the file to save the review

csvFile = open(path_to_file, 'a', encoding="utf-8", newline="")
csvWriter = csv.writer(csvFile, delimiter=',')
csvWriter.writerow([str ('title'), str ('rating'), str ('review'), str ('date')])

# change the value inside the range to save more or less reviews

for i in range(0, 48, 1):

    # expand the review 
    time.sleep(2)

# define container (this is the whole box of the Trip Advisor review, excluding the date of the review)

    container = driver.find_elements_by_xpath(".//div[@class='review-container']")

# grab also the date of review

    date = driver.find_elements_by_xpath(".//class[@class='prw_reviews_stay_date_hsx']")

    for j in range(len(container)):

        rating = container[j].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class").split("_")[3]
        title = container[j].find_element_by_xpath(".//div[contains(@class, noQuotes)]").text.replace("\n", "  ")
        review = container[j].find_element_by_xpath(".//p[@class='partial_entry']").text.replace("\n", "  ")
        date = " ".join(date[j].text.split(" ")[-2:])

# write data into csv

        csvWriter.writerow([title, rating, review, date])

# change the page

    driver.find_element_by_xpath('.//a[@class="nav next ui_button primary"]').click()

# quit selenium

driver.quit()

# FYI you need to close all windows for the file to write

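That date lookup comes back empty, so date[j] has nothing to index. The review date is inside the container, so you can read it from there along with everything else.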
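The failure can be reproduced without a browser. The sketch below is only an illustration: `containers`, `dates`, and `date_for` are hypothetical stand-ins, with an empty list playing the role of a `find_elements_*` result that matched nothing.

```python
# Hypothetical stand-ins: no browser involved.
containers = ["review box 1", "review box 2"]  # pretend two reviews matched
dates = []  # what a find_elements_* call returns when nothing matches

# Indexing the empty list raises the same error as in the traceback.
try:
    dates[0]
except IndexError as exc:
    message = str(exc)

print(message)


def date_for(j):
    """Return the j-th date, or an empty string when the lookup found nothing."""
    return dates[j] if j < len(dates) else ""


# One value per container, even though the date lookup was empty.
results = [date_for(j) for j in range(len(containers))]
print(results)
```

A guard like `date_for` only hides the symptom. Reading the date from each container, as in the lines that follow, avoids keeping two separate lists in step. Using a separate name such as `review_date` also avoids overwriting the `date` list with a string inside the loop.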

rating = container[j].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class").split("_")[3]
person = container[j].find_element_by_class_name('info_text').text.split("\n")[0]  # person but not place
title = container[j].find_element_by_css_selector('span.noQuotes').text.replace("\n", "  ")
review = container[j].find_element_by_xpath(".//p[@class='partial_entry']").text.replace("\n", "  ")
review_date = container[j].find_element_by_class_name('ratingDate').text[9:]  # drop the leading "Reviewed "

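Changes: narrowed the title lookup to just the title span rather than the whole div. Added code to find the person (the split on line 2 strips the place). Found the date inside the container and removed "Reviewed".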
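The string handling in those lines can be checked on plain strings, without Selenium. This is a minimal sketch: the sample values (`JaneD`, the review text, the date) are invented for illustration, the helper names are hypothetical, and an in-memory buffer stands in for the CSV file.

```python
import csv
import io


def parse_rating(class_attr):
    """'ui_bubble_rating bubble_50' -> '50' (fourth piece after splitting on '_')."""
    return class_attr.split("_")[3]


def parse_person(info_text):
    """Keep the first line (the username) and drop the place on the next line."""
    return info_text.split("\n")[0]


def parse_date(rating_date_text):
    """Drop the leading 'Reviewed ' (9 characters)."""
    return rating_date_text[len("Reviewed "):]


# Invented sample of the text each element might hold.
sample = {
    "rating_class": "ui_bubble_rating bubble_50",
    "info_text": "JaneD\nLondon, UK",
    "title": "Worth the climb",
    "review": "Beautiful frescoes.",
    "rating_date": "Reviewed July 5, 2021",
}

row = [
    parse_person(sample["info_text"]),
    sample["title"],
    parse_rating(sample["rating_class"]),
    sample["review"],
    parse_date(sample["rating_date"]),
]

# Write the header once, then one row per review.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["username", "title", "rating", "review", "date"])
writer.writerow(row)
print(buffer.getvalue())
```

On the question of thousands of mixed rows: the script opens the file in append mode (`'a'`), so every run adds another header and another batch of rows to the same file. If earlier versions of the script wrote different columns, that may be one source of the mixed formats. Opening the file with `'w'` or writing to a fresh filename would rule this out.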
