Getting data into R by scraping the web using the rvest package

In this section, we will focus on web scraping and how to implement it using the rvest package.

Web scraping is the process of extracting data from web pages, where it is embedded in loosely structured HTML, and converting it into a structured format such as a data frame. Structured data is easy to access and analyze. We will use R to scrape data on the most popular feature films from the IMDb website.
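As a minimal sketch of what "unstructured to structured" means in practice, the following parses a small, made-up HTML fragment and turns it into a data frame. The film titles, the class names .title and .year, and the films data frame are illustrative assumptions, not taken from IMDb; the HTML is inlined so that the example runs without a network connection.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a real web page
html <- '<html><body>
  <div class="film"><span class="title">Film A</span><span class="year">2017</span></div>
  <div class="film"><span class="title">Film B</span><span class="year">2016</span></div>
</body></html>'

# Parse the markup into an XML document
page <- read_html(html)

# Select nodes by CSS class and keep only their text content
films <- data.frame(
  title = html_text(html_nodes(page, ".title")),
  year  = as.integer(html_text(html_nodes(page, ".year"))),
  stringsAsFactors = FALSE
)

print(films)
```

The same three calls, read_html(), html_nodes(), and html_text(), are the ones used against the live IMDb page in the steps that follow.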

Follow these steps to get data into R using the rvest package:

  1. Install the rvest package. It is not part of base R, so it must be installed before first use:
> install.packages('rvest')
package 'rvest' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Radhika\AppData\Local\Temp\RtmpMvNUA5\downloaded_packages
  2. Load the installed package into the current R session:
> library(rvest)
  3. Read the HTML of the IMDb search page that lists the most popular feature films released in 2017:
> url <- 'https://www.imdb.com/search/title?count=100&release_date=2017,2017&title_type=feature'
> # Reading the HTML code from the URL
> webpage <- read_html(url)
> webpage
{xml_document}
<html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<script type="text/ ...
[2] <body id="styleguide-v2" class="fixed">\n\n <img height="1" width="1" style="display: ...
  4. Use CSS selectors to pick out the parts of the page that hold the required data. Here, the .text-primary class selects the ranking of each film:
> # Using CSS selectors to scrape the rankings section
> rank_data_html <- html_nodes(webpage,'.text-primary')
> rank_data_html
{xml_nodeset (100)}
 [1] <span class="lister-item-index unbold text-primary">1.</span>
 [2] <span class="lister-item-index unbold text-primary">2.</span>
 [3] <span class="lister-item-index unbold text-primary">3.</span>
 [4] <span class="lister-item-index unbold text-primary">4.</span>
 [5] <span class="lister-item-index unbold text-primary">5.</span>
 [6] <span class="lister-item-index unbold text-primary">6.</span>
 [7] <span class="lister-item-index unbold text-primary">7.</span>
 [8] <span class="lister-item-index unbold text-primary">8.</span>
 [9] <span class="lister-item-index unbold text-primary">9.</span>
[10] <span class="lister-item-index unbold text-primary">10.</span>
[11] <span class="lister-item-index unbold text-primary">11.</span>
[12] <span class="lister-item-index unbold text-primary">12.</span>
[13] <span class="lister-item-index unbold text-primary">13.</span>
[14] <span class="lister-item-index unbold text-primary">14.</span>
[15] <span class="lister-item-index unbold text-primary">15.</span>
[16] <span class="lister-item-index unbold text-primary">16.</span>
[17] <span class="lister-item-index unbold text-primary">17.</span>
[18] <span class="lister-item-index unbold text-primary">18.</span>
[19] <span class="lister-item-index unbold text-primary">19.</span>
[20] <span class="lister-item-index unbold text-primary">20.</span>
...
  5. Extract the text of each selected node to get the rank of each film:
> rank_data <- html_text(rank_data_html)
> head(rank_data)
[1] "1." "2." "3." "4." "5." "6."
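The ranks come back as character strings with a trailing period, so a natural follow-up is to convert them to numbers before analysis. The following is a minimal sketch of one way to do that, not a step from the original walkthrough: the vector is copied from the head(rank_data) output above so that the example runs without a network connection, and rank_num is a hypothetical variable name.

```r
# Values copied from the head(rank_data) output shown above
rank_data <- c("1.", "2.", "3.", "4.", "5.", "6.")

# Strip the trailing period, then convert the remaining digits to integers
rank_num <- as.integer(sub("\\.$", "", rank_data))

print(rank_num)
# [1] 1 2 3 4 5 6
```

With the full scrape, the same two lines could be applied to all 100 values in rank_data, assuming every entry follows the "digits plus period" pattern seen in the output.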

In the next section, we will look at importing data into R from databases using the appropriate packages.