The package htmldf
contains a single function
html_df()
which accepts a vector of urls as an input and
from each will attempt to download each page, extract and parse the
html. The result is returned as a tibble
where each row
corresponds to a document, and the columns contain page attributes and
metadata extracted from the html, including:
To install the CRAN version of the package:
install.packages('htmldf')
To install the development version of the package:
::install_github('alastairrushworth/htmldf') remotes
First define a vector of URLs you want to gather information from.
The function html_df()
returns a tibble
where
each row corresponds to a webpage, and each column corresponds to an
attribute of that webpage:
library(htmldf)
library(dplyr)
# An example vector of URLs to fetch data for
<- c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
urlx "https://medium.com/dair-ai/pytorch-1-2-introduction-guide-f6fa9bb7597c",
"https://www.tensorflow.org/tutorials/images/cnn",
"https://www.analyticsvidhya.com/blog/2019/09/introduction-to-pytorch-from-scratch/")
# use html_df() to gather data
<- html_df(urlx, show_progress = FALSE)
z
# have a quick look at the first page
glimpse(z[1, ])
## Rows: 1
## Columns: 17
## $ url <chr> "https://alastairrushworth.github.io/Visualising-Tour-de-Fra…
## $ title <chr> "Visualising Tour De France Data In R -"
## $ lang <chr> "en"
## $ url2 <chr> "https://alastairrushworth.github.io/Visualising-Tour-de-Fra…
## $ links <list> [<tbl_df[27 x 2]>]
## $ rss <chr> "https://alastairrushworth.github.io/feed.xml"
## $ tables <list> NA
## $ images <list> [<tbl_df[8 x 3]>]
## $ social <list> [<tbl_df[3 x 3]>]
## $ code_lang <dbl> 1
## $ size <int> 38445
## $ server <chr> "GitHub.com"
## $ accessed <dttm> 2022-07-09 16:01:32
## $ published <dttm> 2019-11-24
## $ generator <chr> NA
## $ status <int> 200
## $ source <chr> "<!DOCTYPE html>\n<!--\n Minimal Mistakes Jekyll Theme 4.4.…
To see the page titles, look at the titles
column.
%>% select(title, url2) z
## # A tibble: 4 × 2
## title url2
## <chr> <chr>
## 1 Visualising Tour De France Data In R - https://al…
## 2 A Gentle Introduction to PyTorch 1.2 | by elvis | DAIR.AI | Medium https://me…
## 3 Convolutional Neural Network (CNN) | TensorFlow Core https://ww…
## 4 Pytorch | Getting Started With Pytorch https://ww…
Where there are tables embedded on a page in the
<table>
tag, these will be gathered into the list
column tables
. html_df
will attempt to coerce
each table to tibble
- where that isn’t possible, the raw
html is returned instead.
$tables z
## [[1]]
## [1] NA
##
## [[2]]
## [1] NA
##
## [[3]]
## [[3]]$`no-caption`
## # A tibble: 0 × 0
##
##
## [[4]]
## [[4]]$`no-caption`
## # A tibble: 11 × 2
## X1 X2
## <chr> <chr>
## 1 Label Description
## 2 0 T-shirt/top
## 3 1 Trouser
## 4 2 Pullover
## 5 3 Dress
## 6 4 Coat
## 7 5 Sandal
## 8 6 Shirt
## 9 7 Sneaker
## 10 8 Bag
## 11 9 Ankle boot
html_df()
does its best to find RSS feeds embedded in
the page:
$rss z
## [1] "https://alastairrushworth.github.io/feed.xml"
## [2] NA
## [3] NA
## [4] NA
html_df()
will try to parse out any social profiles
embedded or mentioned on the page. Currently, this includes profiles for
the sites
bitbucket
, dev.to
, discord
,
facebook
, github
, gitlab
,
instagram
, kakao
, keybase
,
linkedin
, mastodon
, medium
,
orcid
, patreon
, researchgate
,
stackoverflow
, reddit
, telegram
,
twitter
, youtube
$social z
## [[1]]
## # A tibble: 3 × 3
## site handle profile
## <chr> <chr> <chr>
## 1 twitter @rushworth_a https://twitter.com/rushworth_a
## 2 github @alastairrushworth https://github.com/alastairrushworth
## 3 linkedin @in/alastair-rushworth-253137143 https://linkedin.com/in/alastair-ru…
##
## [[2]]
## # A tibble: 3 × 3
## site handle profile
## <chr> <chr> <chr>
## 1 twitter @dair_ai https://twitter.com/dair_ai
## 2 twitter @omarsar0 https://twitter.com/omarsar0
## 3 github @omarsar https://github.com/omarsar
##
## [[3]]
## # A tibble: 2 × 3
## site handle profile
## <chr> <chr> <chr>
## 1 twitter @tensorflow https://twitter.com/tensorflow
## 2 github @tensorflow https://github.com/tensorflow
##
## [[4]]
## # A tibble: 4 × 3
## site handle profile
## <chr> <chr> <chr>
## 1 twitter @analyticsvidhya https://twitter.com/analyticsvidhya
## 2 facebook @analyticsvidhya https://facebook.com/analyticsvidhya
## 3 linkedin @company/analytics-vidhya https://linkedin.com/company/analytics-vid…
## 4 youtube UCH6gDteHtH4hg3o2343iObA https://youtube.com/channel/UCH6gDteHtH4hg…
Code language is inferred from <code>
chunks using
a preditive model. The code_lang
column contains a numeric
score where values near 1 indicate mostly R code, values near -1
indicate mostly Python code:
%>% select(code_lang, url2) z
## # A tibble: 4 × 2
## code_lang url2
## <dbl> <chr>
## 1 1 https://alastairrushworth.github.io/Visualising-Tour-de-France-data…
## 2 -0.860 https://medium.com/dair-ai/pytorch-1-2-introduction-guide-f6fa9bb75…
## 3 -0.983 https://www.tensorflow.org/tutorials/images/cnn
## 4 -1 https://www.analyticsvidhya.com/blog/2019/09/introduction-to-pytorc…
Publication dates
%>% select(published, url2) z
## # A tibble: 4 × 2
## published url2
## <dttm> <chr>
## 1 2019-11-24 00:00:00 https://alastairrushworth.github.io/Visualising-Tour-de-F…
## 2 2019-09-01 18:03:22 https://medium.com/dair-ai/pytorch-1-2-introduction-guide…
## 3 2022-01-26 00:00:00 https://www.tensorflow.org/tutorials/images/cnn
## 4 2019-09-17 03:09:28 https://www.analyticsvidhya.com/blog/2019/09/introduction…
Any feedback is welcome! Feel free to write a github issue or send me a message on twitter.