February 14, 2020

Web Scraping

Web scraping, also called web data extraction, is the process of extracting data from websites.

With web scraping, you can collect data that is not readily available in a downloadable form. For example, you can:

  • Scrape movie rating data to build movie recommendation engines.
  • Scrape text data from Wikipedia and other sources to build NLP-based systems or to train deep learning models for tasks such as topic recognition.
  • Scrape data from social media sites like Facebook and Twitter for tasks such as sentiment analysis and opinion mining.
  • Scrape user reviews and feedback from e-commerce sites like Amazon and Flipkart.

There are browser extensions that help with web scraping, but here we focus on web scraping with R.

R Package rvest

First, we load the rvest package, along with stringr for string handling.

library('rvest')
## Loading required package: xml2
library('stringr')

The rvest package provides the following functions.

  • read_html(url): reads and parses the HTML content of a given URL

  • html_nodes(): selects HTML nodes (elements) matching a CSS selector or XPath expression

  • html_nodes(".class"): selects nodes by CSS class

  • html_nodes("#id"): selects the node with a given id

  • html_text(): strips the HTML tags and extracts only the text

  • html_attrs(): returns the attributes of each node (useful for debugging)

  • html_table(): turns HTML tables into data frames
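
The functions listed above can be tried without a network connection by parsing a small HTML string. The sketch below is only illustrative: the HTML fragment, its id, and its class names are made up for this example and do not come from any real page.

```r
library(rvest)

# Hypothetical HTML fragment, invented for this illustration
page <- read_html('
  <html><body>
    <div id="main">
      <p class="note">First <b>note</b></p>
      <p class="note">Second note</p>
      <a href="/about" title="About us">About</a>
    </div>
    <table>
      <tr><th>x</th><th>y</th></tr>
      <tr><td>1</td><td>2</td></tr>
    </table>
  </body></html>')

# Select by CSS class, then strip the tags to keep only the text
notes <- page %>% html_nodes(".note") %>% html_text()

# Select by id, descend to the link, and read a single attribute
link_href <- page %>% html_nodes("#main a") %>% html_attr("href")

# html_attrs() returns all attributes of each node as a named character vector
link_attrs <- page %>% html_nodes("#main a") %>% html_attrs()

# html_table() converts each <table> node into a data frame
tbl <- (page %>% html_nodes("table") %>% html_table())[[1]]
```

Note the difference between html_attr(), which reads one named attribute, and html_attrs(), which returns all of them.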

Example 1: Lego Movie

Let's scrape data from the IMDb page for The Lego Movie, http://www.imdb.com/title/tt1490017/.

lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
lego_movie
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n            <img height="1" widt ...

Example 1: Lego Movie Rating

lego_movie %>% html_nodes("strong")
## {xml_nodeset (13)}
##  [1] <strong title="7.8 based on 312,995 user ratings"><span itemprop="rating ...
##  [2] <strong><a href="/list/ls022124273?ref_=tt_rels_1">\nTop 25 Highest Gros ...
##  [3] <strong><a href="/list/ls023629787?ref_=tt_rels_2">\n100 of the Best Ani ...
##  [4] <strong><a href="/list/ls079321224?ref_=tt_rels_3">\n10 Best Action Hero ...
##  [5] <strong><a href="/list/ls079144244?ref_=tt_rels_4">\nTop 10 Top-Rated An ...
##  [6] <strong><a href="/list/ls073975282?ref_=tt_rels_5">\n2015 Oscar Snubs\n< ...
##  [7] <strong><a href="/list/ls097972770?ref_=tt_rls_1">\nComedies\n</a></strong>
##  [8] <strong><a href="/list/ls047958418?ref_=tt_rls_2">\nBUL IZLE\n</a></strong>
##  [9] <strong><a href="/list/ls069329917?ref_=tt_rls_3">\nPrimeiro\n</a></strong>
## [10] <strong><a href="/list/ls073192787?ref_=tt_rls_4">\nBest Movies of 2014\ ...
## [11] <strong><a href="/list/ls093802703?ref_=tt_rls_5">\nAnimated\n</a></strong>
## [12] <strong>The Lego Movie</strong>
## [13] <strong>Everything is indeed awesome</strong>

Example 1: Lego Movie Rating

Take the first node, which wraps the rating in a span:

<strong title="7.8 based on 312,983 user ratings"><span itemprop="ratingValue">7.8</span></strong>
lego_movie %>% html_nodes("strong span")
## {xml_nodeset (1)}
## [1] <span itemprop="ratingValue">7.8</span>
lego_movie %>% html_nodes("strong span") %>% html_text()
## [1] "7.8"

Example 1: Lego Movie Rating

rating <- lego_movie %>% 
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
rating
## [1] 7.8

Example 1: Lego Movie Rating

lego_movie %>% html_nodes("div")
## {xml_nodeset (785)}
##  [1] <div id="sis_pixel">\n                <!-- begin sis pixel slot -->\n\n\ ...
##  [2] <div id="sis_pixel_r2" style="height:1px; position: absolute; left: -100 ...
##  [3] <div id="9e48b356-a010-4789-b300-dfa51c861fdc">\n       <nav id="imdbHea ...
##  [4] <div id="nblogin" class="imdb-header__login-state-node"></div>\n
##  [5] <div class="ipc-page-content-container ipc-page-content-container--cente ...
##  [6] <div class="ipc-button__text">Menu</div>
##  [7] <div class="iRO9SK-8q3D8_287dhn28" role="presentation" aria-hidden="true ...
##  [8] <div class="_3rHHDKyPLOjL8tGKHWMRza" role="presentation" data-testid="pa ...
##  [9] <div class="_3bRJYEaOz1BKUQYqW6yb29" role="presentation" data-testid="pa ...
## [10] <div role="presentation" class="_3wpok4xkiX-9E61ruFL_RA sc-jzJRlG RJOHx" ...
## [11] <div class="_33PK8nBHiT1fGjnfXwum3v sc-cSHVUG kSadNP"><svg class="ipc-lo ...
## [12] <div class="_2BpsDlqEMlo9unX-C84Nji sc-fjdhpX gVXRSl" data-testid="nav-l ...
## [13] <div class="_1S9IOoNAVMPB2VikET3Lr2" aria-hidden="true" aria-expanded="f ...
## [14] <div class="_1IQgIe3JwGh2arzItRgYN3" role="presentation"><ul class="ipc- ...
## [15] <div class="_2BpsDlqEMlo9unX-C84Nji sc-fjdhpX gVXRSl" data-testid="nav-l ...
## [16] <div class="_1S9IOoNAVMPB2VikET3Lr2" aria-hidden="true" aria-expanded="f ...
## [17] <div class="_1IQgIe3JwGh2arzItRgYN3" role="presentation"><ul class="ipc- ...
## [18] <div class="_2BpsDlqEMlo9unX-C84Nji sc-fjdhpX gVXRSl" data-testid="nav-l ...
## [19] <div class="_1S9IOoNAVMPB2VikET3Lr2" aria-hidden="true" aria-expanded="f ...
## [20] <div class="_1IQgIe3JwGh2arzItRgYN3" role="presentation"><ul class="ipc- ...
## ...

Example 1: Lego Movie Rating

<div class="imdbRating" itemtype="http://schema.org/AggregateRating" itemscope="" itemprop="aggregateRating">
  <div class="ratingValue">
    <strong title="7.8 based on 312,983 user ratings"><span itemprop="ratingValue">7.8</span></strong>
    <span class="grey">/</span><span class="grey" itemprop="bestRating">10</span>
  </div>
    <a href="/title/tt1490017/ratings?ref_=tt_ov_rt"><span class="small" itemprop="ratingCount">312,983</span></a>
  <div class="hiddenImportant">
    <span itemprop="reviewCount">546 user</span>
    <span itemprop="reviewCount">462 critic</span>
  </div>
</div>
lego_movie %>% html_nodes("div.imdbRating")
## {xml_nodeset (1)}
## [1] <div class="imdbRating" itemtype="http://schema.org/AggregateRating" item ...

Example 1: Lego Movie Rating

lego_movie %>% html_nodes("div.imdbRating span")
## {xml_nodeset (6)}
## [1] <span itemprop="ratingValue">7.8</span>
## [2] <span class="grey">/</span>
## [3] <span class="grey" itemprop="bestRating">10</span>
## [4] <span class="small" itemprop="ratingCount">312,995</span>
## [5] <span itemprop="reviewCount">546 user</span>
## [6] <span itemprop="reviewCount">462 critic</span>

Example 1: Lego Movie Rating

lego_movie %>% html_nodes("div.imdbRating span") %>% html_text()
## [1] "7.8"        "/"          "10"         "312,995"    "546 user"  
## [6] "462 critic"

Example 1: Lego Movie Rating

rating2 <- lego_movie %>% 
  html_nodes("div.imdbRating span") %>% 
  html_text()
rating2[1]
## [1] "7.8"
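
Indexing with rating2[1] relies on the rating being the first span. A possibly more robust alternative is to select on the itemprop attribute that appears in the markup shown earlier. The sketch below runs on an abridged copy of that fragment rather than on the live page, whose layout may have changed.

```r
library(rvest)

# Abridged copy of the IMDb rating markup shown earlier
fragment <- read_html('
  <div class="imdbRating">
    <div class="ratingValue">
      <strong title="7.8 based on 312,983 user ratings"><span itemprop="ratingValue">7.8</span></strong>
      <span class="grey">/</span><span class="grey" itemprop="bestRating">10</span>
    </div>
    <a href="/title/tt1490017/ratings?ref_=tt_ov_rt"><span class="small" itemprop="ratingCount">312,983</span></a>
  </div>')

# A CSS attribute selector picks the span by its itemprop value
rating_value <- fragment %>%
  html_nodes("span[itemprop='ratingValue']") %>%
  html_text() %>%
  as.numeric()

# The count contains a thousands separator, so drop the comma before converting
rating_count <- fragment %>%
  html_nodes("span[itemprop='ratingCount']") %>%
  html_text() %>%
  gsub(",", "", .) %>%
  as.numeric()
```

Selecting by attribute keeps working if extra spans are added before the rating, which positional indexing would not.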

Example 1: Lego Movie Cast

lego_movie %>%
  html_nodes("div#titleCast.article td a") %>%
  html_text()
##  [1] ""                   " Will Arnett\n"     "Batman"            
##  [4] "Bruce Wayne"        ""                   " Elizabeth Banks\n"
##  [7] "Wyldstyle"          "Lucy"               ""                  
## [10] " Craig Berry\n"     ""                   " Alison Brie\n"    
## [13] "Unikitty"           ""                   " David Burrows\n"  
## [16] ""                   " Anthony Daniels\n" ""                  
## [19] " Charlie Day\n"     "Benny"              ""                  
## [22] " Amanda Farinos\n"  ""                   " Keith Ferguson\n" 
## [25] ""                   " Will Ferrell\n"    "Lord Business"     
## [28] "President Business" "The Man Upstairs"   ""                  
## [31] " Will Forte\n"      "Abraham Lincoln"    ""                  
## [34] " Dave Franco\n"     ""                   " Morgan Freeman\n" 
## [37] "Vitruvius"          ""                   " Todd Hansen\n"    
## [40] "Gandalf"            "Additional Voices"  ""                  
## [43] " Jonah Hill\n"      "Green Lantern"

Example 1: Lego Movie Cast

cast <- lego_movie %>%
  html_nodes("div#titleCast.article td a") %>%
  html_text() %>%
  trimws()
cast
##  [1] ""                   "Will Arnett"        "Batman"            
##  [4] "Bruce Wayne"        ""                   "Elizabeth Banks"   
##  [7] "Wyldstyle"          "Lucy"               ""                  
## [10] "Craig Berry"        ""                   "Alison Brie"       
## [13] "Unikitty"           ""                   "David Burrows"     
## [16] ""                   "Anthony Daniels"    ""                  
## [19] "Charlie Day"        "Benny"              ""                  
## [22] "Amanda Farinos"     ""                   "Keith Ferguson"    
## [25] ""                   "Will Ferrell"       "Lord Business"     
## [28] "President Business" "The Man Upstairs"   ""                  
## [31] "Will Forte"         "Abraham Lincoln"    ""                  
## [34] "Dave Franco"        ""                   "Morgan Freeman"    
## [37] "Vitruvius"          ""                   "Todd Hansen"       
## [40] "Gandalf"            "Additional Voices"  ""                  
## [43] "Jonah Hill"         "Green Lantern"

Example 1: Lego Movie Cast

cast <- cast[str_length(cast) > 1]; cast
##  [1] "Will Arnett"        "Batman"             "Bruce Wayne"       
##  [4] "Elizabeth Banks"    "Wyldstyle"          "Lucy"              
##  [7] "Craig Berry"        "Alison Brie"        "Unikitty"          
## [10] "David Burrows"      "Anthony Daniels"    "Charlie Day"       
## [13] "Benny"              "Amanda Farinos"     "Keith Ferguson"    
## [16] "Will Ferrell"       "Lord Business"      "President Business"
## [19] "The Man Upstairs"   "Will Forte"         "Abraham Lincoln"   
## [22] "Dave Franco"        "Morgan Freeman"     "Vitruvius"         
## [25] "Todd Hansen"        "Gandalf"            "Additional Voices" 
## [28] "Jonah Hill"         "Green Lantern"

Example 1: Lego Movie Cast

<div class="article" id="titleCast">
  <span class=rightcornerlink >
    <a href="/register/login?why=edit&ref_=tt_cl" rel="login">Edit</a>
  </span>
  <h2>Cast</h2>
  <table class="cast_list">
    <tr><td colspan="4" class="castlist_label">Cast overview, first billed only:</td></tr>
  <tr class="odd">
    <td class="primary_photo">
      <a href="/name/nm0004715/?ref_=tt_cl_i1"><img height="44" width="32" alt="Will Arnett" title="Will Arnett" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB468460248_.png" class="loadlate hidden " loadlate="https://m.media-amazon.com/images/M/MV5BNDkzMjEzNDMyN15BMl5BanBnXkFtZTcwNTk3ODEyOQ@@._V1_UY44_CR0,0,32,44_AL_.jpg" /></a>
    </td>
    <td><a href="/name/nm0004715/?ref_=tt_cl_t1"> Will Arnett </a></td>
 .....
cast <- lego_movie %>%
  html_nodes("#titleCast .primary_photo img") %>%
  html_attr("alt")
cast
##  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"     "Alison Brie"    
##  [5] "David Burrows"   "Anthony Daniels" "Charlie Day"     "Amanda Farinos" 
##  [9] "Keith Ferguson"  "Will Ferrell"    "Will Forte"      "Dave Franco"    
## [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"
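
The same nodes also carry a link to each actor's page. As a rough sketch, the names and links can be collected side by side in a data frame. The fragment below is an abridged copy of the single table row shown above, and the base URL passed to url_absolute() is assumed from the page address used in this example.

```r
library(rvest)

# Abridged copy of the cast table markup shown above (one row only)
cast_html <- read_html('
  <div class="article" id="titleCast">
    <table class="cast_list">
      <tr class="odd">
        <td class="primary_photo">
          <a href="/name/nm0004715/?ref_=tt_cl_i1"><img alt="Will Arnett" title="Will Arnett" /></a>
        </td>
        <td><a href="/name/nm0004715/?ref_=tt_cl_t1"> Will Arnett </a></td>
      </tr>
    </table>
  </div>')

# One <a> per photo cell; each wraps exactly one <img> in this markup
photo_links <- cast_html %>% html_nodes("#titleCast .primary_photo a")

cast_df <- data.frame(
  actor = photo_links %>% html_nodes("img") %>% html_attr("alt"),
  # url_absolute() from xml2 resolves the relative href against a base URL
  profile = xml2::url_absolute(html_attr(photo_links, "href"), "http://www.imdb.com/"),
  stringsAsFactors = FALSE
)
```

This assumes every photo link contains an image; if some did not, the two columns would differ in length and data.frame() would raise an error.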

Example 1: Lego Movie Poster

<div class="poster">
  <a href="/title/tt1490017/mediaviewer/rm1316605952?ref_=tt_ov_i"> 
    <img alt="The Lego Movie Poster" title="The Lego Movie Poster"
src="https://m.media-amazon.com/images/M/MV5BMTg4MDk1ODExN15BMl5BanBnXkFtZTgwNzIyNjg3MDE@._V1_UX182_CR0,0,182,268_AL_.jpg" />
  </a>    
</div>
poster <- lego_movie %>%
  html_nodes(".poster img") %>%
  html_attr("src")
poster
## [1] "https://m.media-amazon.com/images/M/MV5BMTg4MDk1ODExN15BMl5BanBnXkFtZTgwNzIyNjg3MDE@._V1_UX182_CR0,0,182,268_AL_.jpg"

Example 2: Social Security Holders

Scrape the table 'Number of Social Security card holders born in the U.S. by year of birth and sex' from https://www.ssa.gov/oact/babynames/numberUSbirths.html.

births_html <- read_html("https://www.ssa.gov/oact/babynames/numberUSbirths.html")
<table class="t-stripe">
<thead>
<tr >
 <th style="background-color:#eeeeee;color:black" scope="col">Year of<br /> birth</td>
 <th style="background-color:#99ccff;color:black" scope="col">Male</th>
 <th style="background-color:pink;color:black" scope="col">Female</th>
 <th style="background-color:#eeeeee;color:black" scope="col">Total</th>
</tr>
</thead>
<tbody>
<tr><td>1880</td>
<td>118,399</td><td>97,606</td><td>216,005</td></tr>
<tr><td>1881</td>
<td>108,282</td><td>98,855</td><td>207,137</td></tr>
<tr><td>1882</td>
....

Example 2: Social Security Holders

births <- html_table(html_nodes(births_html, "table"))[[1]]
births

Example 2: Social Security Holders

births <- births %>% 
  apply(., c(1,2), str_replace_all, ",","") %>%
  apply(., c(1,2), as.numeric) %>%
  as.data.frame()
births
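
The cleaning step can be checked offline on a small copy of the table. The sketch below uses the two data rows shown above with a simplified header, and a column-wise lapply() as one possible alternative to the cell-wise apply() calls; it is not meant as the only way to do this.

```r
library(rvest)

# Two rows copied from the SSA table shown above; header simplified
births_html <- read_html('
  <table>
    <tr><th>Year</th><th>Male</th><th>Female</th><th>Total</th></tr>
    <tr><td>1880</td><td>118,399</td><td>97,606</td><td>216,005</td></tr>
    <tr><td>1881</td><td>108,282</td><td>98,855</td><td>207,137</td></tr>
  </table>')

births <- html_table(html_nodes(births_html, "table"))[[1]]

# Remove thousands separators column by column, then convert to numeric
births <- as.data.frame(
  lapply(births, function(col) as.numeric(gsub(",", "", col)))
)
```

Working column by column keeps the result a data frame throughout, instead of passing through a character matrix.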

Example 3: Yelp Review

Check how many pages of reviews there are. Here, there are 8 pages in total.

Let's automate finding the maximum page number. Note that the page number sits in a div whose class name contains both lemon and pagination.

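max_page <- function(html){
  # Text of every span inside the pagination div
  out_of <- html %>% html_nodes(xpath="//div[contains(@class, 'lemon') and contains(@class, 'pagination')]") %>% 
    html_nodes("span") %>% 
    html_text()
  # Keep the "Page X of Y" label and take the number after "of "
  out_of <- out_of[str_detect(out_of, "Page") & str_detect(out_of, "of")]
  out_of <- str_split(out_of, "of ")[[1]][2] %>% as.numeric()

  return(out_of)
}
# Setup assumed here: the business page URL and its parsed HTML
hunan_gardens_url <- "https://www.yelp.com/biz/hunan-gardens-kalamazoo?osq=hunan+gardens"
hunan_gardens <- read_html(hunan_gardens_url)
max_page(hunan_gardens)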
## [1] 8
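
The parsing step can be isolated and tested on plain strings. The sketch below uses a hypothetical helper, parse_page_count(), and assumes the label has the form "Page X of Y"; the first sample string is made up.

```r
# Hypothetical helper: extract Y from a label of the form "Page X of Y"
parse_page_count <- function(labels) {
  pattern <- "^.*Page [0-9]+ of ([0-9]+).*$"
  hits <- labels[grepl(pattern, labels)]
  if (length(hits) == 0) {
    return(NA_real_)  # no pagination label found
  }
  # Replace the whole label with the captured page count
  as.numeric(sub(pattern, "\\1", hits[1]))
}

n_pages <- parse_page_count(c("Start your review", "Page 1 of 8"))
```

Returning NA when no label matches makes a changed page layout easier to notice than an indexing error would.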

Example 3: Yelp Review

Each time we click through to the next page, the URL gains the ending &start=20, &start=40, &start=60, and so on.

get_page_url <- function(url, pagenum){
  if (pagenum == 1) {
    return(url)
  } else {
    return( paste0(url, "&start=", (pagenum-1)*20) )
  }
}
get_page_url(hunan_gardens_url,1)
## [1] "https://www.yelp.com/biz/hunan-gardens-kalamazoo?osq=hunan+gardens"
get_page_url(hunan_gardens_url,3)
## [1] "https://www.yelp.com/biz/hunan-gardens-kalamazoo?osq=hunan+gardens&start=40"
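
If all page URLs are needed at once, the same rule can be applied to a whole vector of page numbers. The helper name all_page_urls() and its per_page argument are hypothetical; the default of 20 mirrors the &start=20, &start=40 pattern described above.

```r
# Hypothetical helper: build the URL of every review page in one call
all_page_urls <- function(url, n_pages, per_page = 20) {
  offsets <- (seq_len(n_pages) - 1) * per_page
  # The first page has no start parameter; later pages append one
  ifelse(offsets == 0, url, paste0(url, "&start=", offsets))
}

urls <- all_page_urls(
  "https://www.yelp.com/biz/hunan-gardens-kalamazoo?osq=hunan+gardens", 3
)
```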

Example 3: Yelp Review

Let's try obtaining the HTML of the second page and extracting the user information.

page_html <- read_html(get_page_url(hunan_gardens_url,2))

user_info <- page_html %>% 
  html_nodes(xpath="//div[contains(@class, 'lemon') and contains(@class, 'user-passport-info')]") %>% 
  html_nodes("span") %>% 
  html_text()
user_info
##  [1] "Tyler S."                "Kalamazoo, MI"          
##  [3] "Shannon I."              "Kalamazoo, MI"          
##  [5] "Jessica V."              "Kalamazoo, MI"          
##  [7] "Heather J."              "Westwood, Kalamazoo, MI"
##  [9] "Anne Marie H."           "Kalamazoo, MI"          
## [11] "Jordan Paige D."         "Wayland, MI"            
## [13] "Jessica S."              "Chicago, IL"            
## [15] "Susan S."                "Livonia, MI"            
## [17] "Mike S."                 "Kalamazoo, MI"          
## [19] "Jamie L."                "Portage, MI"            
## [21] "A G."                    "Kalamazoo, MI"          
## [23] "Bob S."                  "Oshtemo, MI"            
## [25] "Albert K."               "Troy, MI"               
## [27] "Shane B."                "Kalamazoo, MI"          
## [29] "Kate C."                 "Charlotte, NC"          
## [31] "Andrew M."               "Kalamazoo, MI"          
## [33] "Sarah S."                "Kalamazoo, MI"          
## [35] "Melanie S."              "Kalamazoo, MI"          
## [37] "Jamie K."                "Kalamazoo, MI"          
## [39] "Erika S."                "Colorado Springs, CO"
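
In the output above, names and locations alternate. As a minimal sketch, and assuming that alternation holds for every reviewer, the vector can be reshaped into one row per reviewer. The sample values are the first six entries of the output.

```r
# First six entries of the user_info output shown above
user_info <- c("Tyler S.", "Kalamazoo, MI",
               "Shannon I.", "Kalamazoo, MI",
               "Jessica V.", "Kalamazoo, MI")

# An odd length would mean a name or location is missing somewhere
stopifnot(length(user_info) %% 2 == 0)

# Fill a two-column matrix row by row so each row is one reviewer
users <- as.data.frame(matrix(user_info, ncol = 2, byrow = TRUE),
                       stringsAsFactors = FALSE)
names(users) <- c("name", "location")
```

If a reviewer had no location listed, the pairs would shift, so the length check is only a partial safeguard.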

Example 3: Yelp Review

Now let's check the ratings. Ratings are a bit harder because they are displayed as images, so we read the text from the aria-label attribute instead.

ratings <- page_html %>% 
  html_nodes(xpath="//div/span[contains(@class, 'lemon') and contains(@class, 'display--inline')]") %>% 
  html_nodes('div') %>% 
  html_attr('aria-label') %>% 
  trimws() %>% 
  na.omit()
head(ratings)
## [1] "4 star rating" "1 star rating" "4 star rating" "5 star rating"
## [5] "5 star rating" "5 star rating"
ratings <- ratings[str_detect(ratings, "star rating")] %>%
  sapply(., function(x) {str_split(x, " star rating")[[1]][1]}) %>%
  as.numeric()
head(ratings)
## [1] 4 1 4 5 5 5
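
The conversion from label to number can also be written as a single vectorized call. This is a sketch of an alternative, not a replacement for the code above; the sample labels follow the format shown in the output.

```r
# Sample labels in the format shown in the output above
labels <- c("4 star rating", "1 star rating", "5 star rating")

# Drop the fixed suffix and convert what remains to a number
stars <- as.numeric(sub(" star rating", "", labels, fixed = TRUE))

# Average rating over the sample
mean_stars <- mean(stars)
```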

Example 3: Yelp Review

Let's also get the review comments.

comment <- page_html %>% 
  html_nodes(xpath="//p[contains(@class, 'lemon') and contains(@class, 'comment')]") %>% 
  html_nodes('span') %>% 
  html_text() %>% 
  trimws()
head(comment)
## [1] "I am currently here, while everyone at my table has gotten their food but my mother and now most of our food is gone and to this moment she has still not received her food. It's noodles bro...why is it taking 2 hours?? Didn't even ask us what we wanted to drink. Just assumed. This place sucks bro."                                                                                                                                                                                                                                                                                                     
## [2] "The beef fried rice is where it's at! $8 for a take away container seemed a bit steep, but they did a good job of packing it in there and it really was tasty. The chicken lo mein is good, but for $8, I'll pass next time. I had an egg roll which was also good. They make their own hot mustard, which I found really impressive. Just to be clear, its really hot! But good. All in all, $20 was a bit more than I wanted to pay for dinner, but I had leftovers so I'm calling it a win."                                                                                                                 
## [3] "Update: try their boba tea - dark, smoky, and delicious! best deal on boba tea in town."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [4] "We've come here for years with my parents, who live in the area. Now that we're living here, it's on our regular roll call for eating out. Why?* Quiet, relaxed atmosphere* Delicious meals* A wide range of options for both lunch and dinner* An inexpensive lunch (meal, fried rice, soup, and egg roll or 2 crabmeat rangoons)* Interesting and polite wait staff* Lovely welcoming neko (cat) at the entrance****** Amazing crabmeat rangoons - they are huge!Lunch is a deal - $6.50-7.50 - for a wide range of items. But if you're there for dinner, try the noodles or the enormous soup bowls. Swoon."
## [5] "This is our go to place for Chinese takeout. Great options and friendly staff. We did order takeout tonight and they got one of the items wrong. When I called they were very nice and sent a new order out."                                                                                                                                                                                                                                                                                                                                                                                                   
## [6] "My husband, daughter and I went to Hunan Gardens for dinner last night before the theater and my daughter and I both ordered the Kung Pao Chicken. I am NOT kidding when I say the peanuts in the dish were larger than the pieces of chicken. We were both extremely disappointed and absolutely would have returned the dishes if it were not for the fact we were almost late for our show. It was like having ground chicken in the dish. I will not return to Hunan Gardens EVER again as it happened with both of our dishes."

Example 3: Yelp Review

We can wrap all of this in a function and retrieve the information from other pages too.

get_review <- function(url, pagenum) {
  page_html <- read_html(get_page_url(url, pagenum))
  
  user_info <- page_html %>% html_nodes(xpath="//div[contains(@class, 'lemon') and contains(@class, 'user-passport-info')]") %>% html_nodes("span") %>% html_text()

  ratings <- page_html %>% html_nodes(xpath="//div/span[contains(@class, 'lemon') and contains(@class, 'display--inline')]") %>% html_nodes('div') %>% html_attr('aria-label') %>% trimws() %>% na.omit()
  ratings <- ratings[str_detect(ratings, "star rating")] %>% sapply(., function(x) {str_split(x, " star rating")[[1]][1]}) %>% as.numeric()
  
  comment <- page_html %>% html_nodes(xpath="//p[contains(@class, 'lemon') and contains(@class, 'comment')]") %>% html_nodes('span') %>% html_text() %>% trimws()
  
  return(list(user_info = user_info, ratings = ratings, comment = comment))
}
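
To gather every page, a function like this can be called in a loop. The sketch below uses a hypothetical helper, collect_reviews(), which takes the page-fetching function as an argument so that it can be tried offline with a stub. In real use, fetch_page might be something like function(p) get_review(hunan_gardens_url, p), with a delay between requests to avoid overloading the site.

```r
# Hypothetical helper: fetch pages 1..n_pages and combine the results
collect_reviews <- function(n_pages, fetch_page, delay = 0) {
  pages <- lapply(seq_len(n_pages), function(p) {
    if (p > 1 && delay > 0) Sys.sleep(delay)  # pause between requests
    fetch_page(p)
  })
  list(
    ratings = unlist(lapply(pages, function(x) x$ratings)),
    comment = unlist(lapply(pages, function(x) x$comment))
  )
}

# Stub fetcher with made-up data so the sketch runs without a network
fake_fetch <- function(p) {
  list(ratings = c(p, 5), comment = paste("page", p, c("a", "b")))
}

all_reviews <- collect_reviews(2, fake_fetch)
```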

Example 3: Yelp Review

Let's check page 3.

review_pg3 <- get_review(hunan_gardens_url, 3)
head(review_pg3$ratings)
## [1] 4 5 4 4 4 4
head(review_pg3$comment)
## [1] "Been my favorite since 1992! The flavors are fantastic and the host Mike is a dear old friend!"
## [2] "I actually created a Yelp account just so I can review this place. My family and I moved to Kzoo a few months back, so I am not too familiar with the restaurants in the area. We had tried one other Chinese restaurant a few weeks ago and it was alright. So we were pleasantly surprised with the quality at Hunan Gardens. When I order Chinese I look for freshness of vegetables and overall cleanliness of food quality. Hunan Gardens delivered in these (and literally delivers too).We decided on the place because they deliver and are located less than 2 miles from our house. We had ordered the dinner size of the following items: Mongolian Beef, Empress Chicken, and Shrimp and cashews. Also, we had a small Chicken Lo Mein and Crab Ragoons. Our order came out just under 40 dollars.-The vegetable quality was excellent. Everything was fresh and cooked perfectly. Nothing was soggy. We requested our food to be cooked with mild spiciness and they listened. Some places can't even so that much.- The beef in the Mongolian beef was tender. Usually I am adamant about ordering beef from small places, as it tends to be dry and tough. but it was pretty good at HG.- the shrimp from the Shrimp and cashews was a little over cooked and smelled a *little* fishy but I got past it because the sauce was yummy and the veggies were tender.- the chicken was a little dry but that's what I'd expect for something to be fried then added to a wok stir fry. I ordered this thinking it was only going to be sauteed chicken (maybe I will request it that way next time), so when I saw it was fried in a batter I was disappointed. But it was alright with me in the end because it tasted good. Maybe next time I'll order the chicken on the side and mix at home because it got a little soggy.- lo mein was good too. Not salty and pasta was not over cooked.- the deal breaker was the crab ragoons. There was flavor in the filling: more than just cream cheese and crab meat. It was well seasoned and generously filled too. I am used to places skimping on the filling and not adding any seasoning so they usually taste bland. But at hunan Gardens I was delighted. I would just order a separate container of sweet and sour sauce versus what they gave us. I like that thick red stuff more.- delivery was estimated to be between 1 hour to 1 hour and 15 minutes. Our delivery guy was here promptly within 1 hour exactly. He was kind and courteous even for the snowy weather conditions. For dinner portions I was kind of expecting slightly more food, but we all ate to our fill with pleasure. My husband enjoyed it even as left overs later that night. I definitely will be ordering from them in the future. Maybe I will also go and dine- in to enjoy the whole Asian fusion cuisine."
## [3] "Very good food.  Great lunch place: soup, egg roll and entree for an affordable price.  Good service and good portions.  I would highly recommend..."
## [4] "Was pretty good they gave me chick instead of steak probably because I asked for steak and the rice was okay"
## [5] "Really great! Everything was seasoned perfectly. We ordered at an off time, so I expected the food would be somewhat subpar. It was actually hot and very fresh! And they deliver! I will be a regular for sure. I had the shrimp lo mein and my aunt had the Hunan Chicken and Shrimp."
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [6] "I typically do not like Chinese food because the way they fix food is dirty, greasy and sickening at times, not to mention questionable ingredients that violates our vegan state of mind. As we ordered, it was encouraging to know they do not use MSG and they were sensitive to our needs...the girl who got my order double checked with us to see if we wanted to omit the egg from our dishes which we appreciated. The service was good and the food was pleasantly delicious and my wife and I could tell the food was clean, well prepared (free of oyster or fish sauce, MSG, etc.) with good ole wholesome natural goodness! My 10 month baby boy was able to eat the rice, so we were all happy!"

Example 3: Yelp Review

Since we already know how to get the maximum page number, we can loop over every page, from page 1 to the last, and collect the reviews from each one.
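One way to organize such a loop is sketched below. This is only an illustration under stated assumptions: the helper names (build_page_urls, scrape_reviews, scrape_all_pages), the `start` offset query parameter, the page size of 20, and the CSS selector are all hypothetical placeholders, so check them against the real page before use.

```r
library('rvest')

# Hypothetical helper: build one URL per results page.
# ASSUMPTION: pages are addressed by a `start` offset query parameter,
# with `per_page` reviews on each page. Adjust to the real site.
build_page_urls <- function(base_url, max_page, per_page = 20) {
  offsets <- (seq_len(max_page) - 1) * per_page
  sprintf("%s?start=%d", base_url, as.integer(offsets))
}

# Hypothetical helper: extract the review text from a single page.
# ASSUMPTION: the CSS selector is a placeholder, not the site's real markup.
scrape_reviews <- function(url, selector = ".review-content p") {
  page <- read_html(url)
  html_text(html_nodes(page, selector))
}

# Visit every URL in turn and combine the results into one character vector.
scrape_all_pages <- function(urls, scrape_page = scrape_reviews, delay = 1) {
  reviews <- character(0)
  for (u in urls) {
    reviews <- c(reviews, scrape_page(u))
    Sys.sleep(delay)  # pause between requests to avoid hammering the server
  }
  reviews
}
```

With max_page taken from the earlier step, a call such as scrape_all_pages(build_page_urls(base_url, max_page)) would return all reviews as a single character vector. Passing the page scraper in as an argument makes it easy to swap in a different selector or to try the loop on a stub function first.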

Conclusion

  • Web scraping is a powerful tool for obtaining large amounts of data available on the web.

  • But setting up an automated process requires an understanding of HTML structure and takes practice.

  • Familiarize yourself with the R package rvest. It is also worth looking at XPath, a query language for selecting nodes from an XML document.

  • Some websites require you to provide an API key (e.g. Google, Twitter). An API key is a unique identifier used to authenticate a user, developer, or calling program.
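As a small illustration of the XPath point, the sketch below selects the same nodes once with a CSS selector and once with XPath. The HTML snippet is made up for this example and does not come from any real site.

```r
library('rvest')

# Tiny inline HTML document standing in for a scraped page (made-up content).
page <- read_html('<html><body>
  <p class="review">Great food</p>
  <p class="note">Closed on Mondays</p>
  <p class="review">Fast delivery</p>
</body></html>')

# CSS selector: every node with class "review".
css_text <- page %>% html_nodes(".review") %>% html_text()

# XPath: every <p> element whose class attribute is exactly "review".
xpath_text <- page %>% html_nodes(xpath = "//p[@class='review']") %>% html_text()
```

Both calls pick out the two review paragraphs here. XPath becomes more useful when a selection depends on document structure that CSS selectors cannot express, such as choosing a node based on its text or moving up to a parent.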

Thank You