
The goal of heapsofpapers is to make it easy to
respectfully get, well, heaps of papers (and CSVs, and websites, and
similar). For instance, you may want to understand the state of open
code and open data across a bunch of different pre-print repositories
(e.g. Collins and Alexander, 2021), and in that case you need a way to
quickly download thousands of PDFs.
Essentially, the main function in the package,
heapsofpapers::get_and_save(), is a wrapper around a
for loop and utils::download.file(), but there
are a bunch of small things that make it handy to use instead of rolling
your own each time. For instance, the package automatically slows down
your requests, lets you know where it is up to, and adjusts for papers
that you’ve already downloaded.
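To make the "rolling your own" comparison concrete, here is a minimal base-R sketch of such a loop. It is not the package's implementation: download_politely(), its arguments, and the delay value are hypothetical choices made for illustration, and unlike the package it creates the folder without asking.

```r
# Hypothetical helper, not part of heapsofpapers: a hand-rolled download loop
# that pauses between requests, reports progress, and skips existing files.
download_politely <- function(urls, save_names, dir = "heaps_of", delay = 5,
                              download_fun = utils::download.file) {
  stopifnot(length(urls) == length(save_names))
  # Assumption: create the folder silently (the package asks first).
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)

  status <- character(length(urls))
  for (i in seq_along(urls)) {
    destination <- file.path(dir, save_names[i])

    # Skip anything that is already on disk.
    if (file.exists(destination)) {
      status[i] <- "skipped"
      next
    }

    message("Getting file ", i, " of ", length(urls))
    # Record errors instead of letting one bad link stop the whole loop.
    result <- tryCatch(
      download_fun(urls[i], destination, mode = "wb", quiet = TRUE),
      error = function(e) e
    )
    status[i] <- if (inherits(result, "error")) "failed" else "downloaded"

    # Pause between requests so the server is not hammered.
    Sys.sleep(delay)
  }
  status
}
```

Even this short version has to handle the folder, existing files, failures, and pacing, which is the repetitive part the package is meant to take care of.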
You can install heapsofpapers from GitHub with:
# install.packages("devtools")
devtools::install_github("RohanAlexander/heapsofpapers")
Here is an example of getting two papers from SocArXiv, using the main
function heapsofpapers::get_and_save():
library(heapsofpapers)
two_pdfs <-
  tibble::tibble(
    locations_are = c("https://osf.io/preprints/socarxiv/z4qg9/download",
                      "https://osf.io/preprints/socarxiv/a29h8/download"),
    save_here = c("competing_effects_on_the_average_age_of_infant_death.pdf",
                  "cesr_an_r_package_for_the_canadian_election_study.pdf")
    )
heapsofpapers::get_and_save(
  data = two_pdfs,
  links = "locations_are",
  save_names = "save_here"
)
By default, the papers are downloaded into a folder called ‘heaps_of’. You could also specify the directory, for instance, if you would prefer a folder called ‘inputs’. Regardless, if the folder doesn’t exist then you’ll be asked whether you want to create it.
heapsofpapers::get_and_save(
  data = two_pdfs,
  links = "locations_are",
  save_names = "save_here",
  dir = "inputs"
)
Let’s say that you had already downloaded some PDFs, but weren’t sure
and didn’t want to download them again. You could use
heapsofpapers::check_for_existence() to check.
heapsofpapers::check_for_existence(data = two_pdfs, 
                                   save_names = "save_here")
If you already have some of the files then
heapsofpapers::get_and_save() allows you to ignore those
files, and not download them again, by specifying that
dupe_strategy = "ignore".
heapsofpapers::get_and_save(
  data = two_pdfs,
  links = "locations_are",
  save_names = "save_here",
  dupe_strategy = "ignore"
)
There are many packages that are designed for scraping websites, for
instance polite and rvest. Those
packages are more general and more useful in a wider range of scenarios
than ours is. Ours is focused on the specific use-case where you have a
large list of items that you need to download.
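For that large-list use case, the table of links and file names is usually built programmatically rather than typed by hand. The sketch below does this in base R from a vector of identifiers, reusing the two SocArXiv IDs and the URL pattern from the example above. The object names paper_ids and many_pdfs, and the choice to name each file after its ID, are assumptions made for illustration; the example above uses a tibble, and whether a plain data frame is accepted in its place is also an assumption.

```r
# Identifiers taken from the two example links above; in practice this vector
# could hold thousands of IDs read from a file.
paper_ids <- c("z4qg9", "a29h8")

# One row per item: where to get it and what to call it once saved.
many_pdfs <- data.frame(
  locations_are = paste0("https://osf.io/preprints/socarxiv/", paper_ids, "/download"),
  save_here = paste0(paper_ids, ".pdf"),
  stringsAsFactors = FALSE
)
```

The resulting columns have the same names as in two_pdfs, so the same links and save_names arguments would apply.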
Please cite the package if you use it: Alexander, Rohan, and A Mahfouz, 2021, ‘heapsofpapers: Easily get heaps of papers’, 24 April, https://github.com/RohanAlexander/heapsofpapers.
We thank Alex Luscombe, Amy Farrow, Edward Morgan, Monica Alexander, Paul A. Hodgetts, Sharla Gelfand, Thomas William Rosenthal, and Tom Cardoso for their help.