Web scraping libraries and methodology
While discussing NLTK, we highlighted the significance of a corpus, or large repository of text, for NLP research. While the available corpora are quite useful, NLP researchers may require text on a particular subject. For example, someone trying to build a sentiment analyzer for financial markets may not find the available corpora (presidential speeches, movie reviews, and so on) particularly useful. Consequently, NLP researchers may have to get data from other sources. Web scraping is an extremely useful tool in this regard, as it lets users retrieve information from web sources programmatically.
Before we start discussing web scraping, we wish to underscore the importance of complying with the respective website policies on web scraping. Most websites allow web scraping for individual non-commercial use, but you must always confirm the policy before scraping a website.
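Many sites publish machine-readable scraping rules in a robots.txt file. As a minimal sketch (using Python's standard urllib.robotparser module; note that robots.txt captures only part of a site's policy, so you should still read the terms of use), you can check whether a URL may be fetched:
from urllib import robotparser

# Parse the site's robots.txt and ask whether the given user agent
# ('*' means any agent) is allowed to retrieve the URL
rp = robotparser.RobotFileParser()
rp.set_url('https://webscraper.io/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://webscraper.io/test-sites/e-commerce/allinone'))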
To perform web scraping, we will use a test website (https://webscraper.io/test-sites/e-commerce/allinone) to implement our scraping script. The test website is that of a fictitious e-commerce company that sells computers and phones.
Here's a screenshot of the website:

The website lists the products that it sells and each product has price and user rating information. Let's say we want to extract the price and user ratings of every laptop listed on the website. You can do this task manually, but that would be very time-consuming and inefficient. Web scraping helps us perform tasks like this much more efficiently and elegantly.
Now, let's get into how the preceding task could be carried out using web scraping tools in Python. First, we need to install the Requests and BeautifulSoup libraries, which are the most commonly used Python libraries for web scraping. The documentation for Requests can be accessed at https://requests.readthedocs.io/en/master/, while the documentation for BeautifulSoup can be accessed at https://www.crummy.com/software/BeautifulSoup/:
pip install requests
pip install beautifulsoup4
Once installed, we will import the Requests and BeautifulSoup libraries. The pandas library will be used to store all the extracted data in a data frame and to export the data to a CSV file:
import requests
from bs4 import BeautifulSoup
import pandas as pd
When we type a URL into our web browser and hit Enter, a set of events is triggered before the web page gets rendered in our browser. These events include our browser looking up the IP address of the website, performing a handshake to establish a connection with the server hosting it, sending an HTTP request, and receiving an HTTP response from the server that contains the page's data. The Requests library helps us perform all these steps using Python scripts.
The following code snippet shows how we can programmatically connect to the website using the Requests library:
url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
request = requests.get(url)
Running the preceding commands establishes a connection with the given website and retrieves the HTML code of the page. Everything we see on a website (text, images, layout, links to other web pages, and so on) can be found in the HTML code of the page. Using the .text attribute of the response object, we can output the entire HTML script of the web page, as shown here:
request.text
Here's the output:

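Before parsing the response, it is good practice to confirm that the request actually succeeded. The response object exposes a status_code attribute, and raise_for_status() raises an error for failed requests:
# 200 means the request succeeded; 4xx/5xx codes indicate a problem
print(request.status_code)

# raise_for_status() raises an HTTPError for 4xx/5xx responses, which is a
# convenient guard before attempting to parse the page
request.raise_for_status()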
If you want to see the HTML code of the page on your browser, simply right-click anywhere on the page and select Inspect, as shown here:

This will open a panel containing the HTML code of the page. If you hover your mouse over any part of the HTML code, the corresponding section of the web page will be highlighted. This tells us that the code for the highlighted portion of the web page can be found by expanding that section of the HTML code:

HTML code is generally divided into sections, with a typical page having a header section and a body section. The body section is further divided into elements, with each element having attributes that are represented by a specific tag. In the preceding screenshot, we can see the various elements, classes, and tags of the HTML code. We will need to navigate through this complex-looking code and extract the relevant information (in our case, the product title, price, and rating). This seemingly complex task can be carried out quite conveniently using any of the web scraping libraries available. Beautiful Soup is one of the most popular HTML parsing libraries out there, so we will see how it can help us parse the intimidating HTML text. We strongly encourage you to visit Beautiful Soup's documentation page (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to gain a better understanding of this fascinating library.
We use the BeautifulSoup module and pass the HTML code (request.text) and the "html.parser" argument to it, which creates a BeautifulSoup HTML parser object. We can now apply many of BeautifulSoup's versatile functions to this object and extract the information that we seek. But before we start doing that, we will have to familiarize ourselves with the web page we are trying to scrape and identify where on the page the elements that we are interested in can be found. In the e-commerce website's HTML code, we can see that each product's details are coded within a <div> tag (div refers to division in HTML) with col-sm-4 col-lg-4 col-md-4 as the class. If you expand the <div> tag by clicking on the arrow, you will see that, within the <div> tag, there are other tags and elements as well that store various pieces of information.
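Concretely, the parser object can be created as follows (a minimal sketch that reuses the response we fetched earlier; counting the matching <div> tags is simply a quick check that parsing worked):
soup = BeautifulSoup(request.text, "html.parser")

# Each product card lives in a <div> with this class; counting the matches
# confirms the parser object is working as expected
products = soup.find_all('div', {'class': 'col-sm-4 col-lg-4 col-md-4'})
print(len(products))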
To begin with, we are interested in getting the list of product names. To find out where in the HTML code the product names are incorporated, we have to hover the cursor over any of the product names, right-click, and then click on Inspect.
This will open a panel containing the web page's HTML code, as shown in the following screenshot:

As we can see, the name of the product can be extracted from the title attribute of the <a> tag, which is within the caption subdivision of the code. Likewise, we can also find price information within the same caption subdivision, but under the pull-right price class. Lastly, rating information can be extracted from the subdivision with the ratings class:

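To make this structure concrete, here is a minimal, self-contained sketch of one product card (the markup below is simplified and the title and price values are illustrative; the real page contains additional tags and attributes):
sample_html = """
<div class="col-sm-4 col-lg-4 col-md-4">
  <div class="caption">
    <h4 class="pull-right price">$295.99</h4>
    <h4><a class="title" title="Asus VivoBook X441NA-GA190">Asus VivoBook...</a></h4>
  </div>
  <div class="ratings">
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
  </div>
</div>
"""

card = BeautifulSoup(sample_html, "html.parser")
print(card.find('a', {'class': 'title'}).get('title'))        # product name
print(card.find('h4', {'class': 'pull-right price'}).text)    # price
print(len(card.find_all('span',
                        {'class': 'glyphicon glyphicon-star'})))  # star count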
We can now start formulating our web scraping strategy, which will involve iterating over all the code divisions with the col-sm-4 col-lg-4 col-md-4 class and extracting the relevant information in each iteration. We'll use Beautiful Soup's find_all() function to identify all the <div> tags of the col-sm-4 col-lg-4 col-md-4 class. This function returns an iterable object, and we use a for loop to search each subdivision. We can extract the text from a BeautifulSoup object by using the .text attribute and extract the value of a tag's attribute (such as title) by using the .get() method. Please refer to the following scraping code:
titles = []
prices = []
ratings = []
url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
request = requests.get(url)
soup = BeautifulSoup(request.text, "html.parser")
# Each product card sits in a <div> with the 'col-sm-4 col-lg-4 col-md-4' class
for product in soup.find_all('div', {'class': 'col-sm-4 col-lg-4 col-md-4'}):
    # The caption subdivision holds the product price and name
    for pr in product.find_all('div', {'class': 'caption'}):
        for p in pr.find_all('h4', {'class': 'pull-right price'}):
            prices.append(p.text)
        # The product name is stored in the title attribute of the <a> tag
        for title in pr.find_all('a', {'class': 'title'}):
            titles.append(title.get('title'))
    # The rating is the number of star icons in the ratings subdivision
    for rt in product.find_all('div', {'class': 'ratings'}):
        ratings.append(len(rt.find_all('span',
                                       {'class': 'glyphicon glyphicon-star'})))
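As an aside, the same traversal can be written more compactly with Beautiful Soup's CSS selector interface (select() and select_one() are part of the standard bs4 API). The following sketch assumes the same page structure as above:
# Equivalent extraction using CSS selectors instead of nested find_all() calls
for card in soup.select('div.col-sm-4.col-lg-4.col-md-4'):
    name = card.select_one('a.title').get('title')
    price = card.select_one('h4.price').text
    stars = len(card.select('div.ratings span.glyphicon-star'))
    print(name, price, stars)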
As the last step, we pass the extracted information to a data frame and export the final result to a CSV file or another file type:
product_df = pd.DataFrame(zip(titles, prices, ratings),
                          columns=['Titles', 'Prices', 'Ratings'])
product_df.to_csv("ecommerce.csv",index=False)
The following is a partial screenshot of the file that was created:

Likewise, you can extract text information, such as user reviews and product descriptions, for NLP-related projects. Please note that scraped data may require further processing based on requirements.
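For example, the scraped prices are strings such as $295.99. A minimal cleanup sketch (assuming this format) converts them to numeric values before any analysis:
# Strip the currency symbol and convert the scraped prices to floats so
# that numeric operations (sorting, averaging, and so on) are possible
product_df['Prices'] = (product_df['Prices']
                        .str.replace('$', '', regex=False)
                        .astype(float))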
The preceding steps demonstrate how we can programmatically extract relevant information from web sources with relative ease using the applicable Python libraries. The more complex the structure of a web page, the more difficult it is to scrape. Websites also change the structure and format of their pages from time to time, which alters the underlying HTML code, and any such change necessitates a review of your scraping code. You are encouraged to practice scraping other websites to gain a better understanding of HTML code structure. We would like to reiterate that it is imperative that you comply with any web scraping restrictions or limits put in place by the website in question.