R code to download all PDF files available on a website: web scraping

x8goxv8g  posted 2023-05-08 in Other

I want to write R code that downloads all the PDFs at this URL: https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook%20of%20Statistics%20on%20Indian%20Economy and saves them into a folder. I tried the following code with the help of https://towardsdatascience.com, but it errors out:

library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
page <- read_html("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook%20of%20Statistics%20on%20Indian%20Economy") %>%

raw_list <- page %>% # takes the page above for which we've read the html
html_nodes("a") %>%  # find all links in the page
html_attr("href") %>% # get the url for these links
str_subset("\\.pdf") %>% # find those that end in pdf only
str_c("https://rbi.org.in", .) %>% # prepend the website to the url
map(read_html) %>% # take previously generated list of urls and read them
map(html_node, "#raw-url") %>% # parse out the 'raw' url - the link for the download button
map(html_attr, "href") %>% # return the set of raw urls for the download buttons
str_c("https://www.rbi.org.in", .) %>% # prepend the website again to get a full url
for (url in raw_list)
{ download.file(url, destfile = basename(url), mode = "wb") 
}

I can't work out why the code errors. I'd appreciate it if someone could help me.


axzmvihb1#

When I tried to run your code, I was met with "verify you are human" and "please make sure Javascript is enabled in your browser" dialogs. This indicates that you can't open the page with rvest and instead need RSelenium browser automation.
Below is a modified version using RSelenium:

library(tidyverse)
library(stringr)
library(purrr)
library(rvest)

library(RSelenium)

rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]

remDr$navigate("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook%20of%20Statistics%20on%20Indian%20Economy")
page <- remDr$getPageSource()[[1]]
read_html(page) -> html

html %>%
  html_nodes("a") %>%   # find all links in the page
  html_attr("href") %>% # get the url for these links
  str_subset("\\.PDF") -> urls  # this site uses uppercase .PDF

# keep only the file-name part of each url to save under
urls %>% str_split(., '/') %>% unlist() %>% str_subset("\\.PDF") -> filenames

for(u in 1:length(urls)) {
  cat(paste('downloading: ', u, ' of ', length(urls), '\n'))
  download.file(urls[u], filenames[u], mode='wb')
  Sys.sleep(1)
}
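
One detail the snippet above leaves out is shutting down the Selenium session once the downloads finish; a minimal addition, assuming the remDr and rD objects created earlier:

```r
# close the browser window and stop the background Selenium server
remDr$close()
rD$server$stop()
```

Without this, the Firefox instance and the server process keep running, and a later rsDriver() call on the same port will fail.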

uxhixvfz2#

There are a few small errors. This site uses uppercase letters for the PDF extension, and you don't need str_c("https://rbi.org.in", .). Finally, I think it's smoother to use purrr's walk2 function (probably in the original code as well).
I haven't executed the code, since I don't need that many PDFs, so please report back whether it works.

library(tidyverse)
library(rvest)
library(stringr)
library(purrr)
page <- read_html("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook%20of%20Statistics%20on%20Indian%20Economy")

raw_list <- page %>%          # takes the page above for which we've read the html
  html_nodes("a") %>%         # find all links in the page
  html_attr("href") %>%       # get the url for these links
  str_subset("\\.PDF") %>%    # keep only those ending in .PDF (uppercase on this site)
  walk2(., basename(.), download.file, mode = "wb") # download each url to its base file name
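
To see what the walk2 step does without actually hitting the RBI site, here is a small dry-run sketch; the URLs are made up purely for illustration:

```r
library(purrr)

# hypothetical urls, just to show how walk2 pairs its two inputs
urls <- c("https://example.com/docs/TABLE1.PDF",
          "https://example.com/docs/TABLE2.PDF")

# basename() strips the path, leaving the file name to save under
filenames <- basename(urls)

# walk2 calls the function once per pair: f(urls[i], filenames[i]),
# which is exactly how download.file(url, destfile) is invoked above
walk2(urls, filenames, function(u, f) cat("would download", u, "->", f, "\n"))
```

So walk2(., basename(.), download.file, mode = "wb") downloads each URL to a file named after its last path component, passing mode = "wb" through to every call.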
