Parsing HTML with lxml

Another powerful, fast, and flexible parser is the HTML parser that comes with lxml. Because lxml is an extensive library for parsing both XML and HTML documents, it can also cope with malformed markup, such as unclosed or badly nested tags.

Let's start with an example.

Here, we will use the requests module to retrieve the web page and parse it with lxml:

#Importing modules 
from lxml import html 
import requests 
 
response = requests.get('http://packtpub.com/') 
tree = html.fromstring(response.content) 

Now the whole HTML document is stored in tree as an element tree that we can query in two different ways: with XPath or with CSS selectors. XPath is used to navigate through elements and attributes to find information in structured documents such as HTML or XML.

We can use any of the page inspection tools, such as Firebug or Chrome developer tools, to get the XPath of an element.

If we want to get the book names and prices from the list, find the following section in the source:

<div class="book-block-title" itemprop="name">Book 1</div> 

From this, we can create the XPath expression as follows:

#Create the list of Books: 
 
books = tree.xpath('//div[@class="book-block-title"]/text()') 

Then we can print the list using the following code:

print(books)
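The same query can be tried offline against an inline snippet. The following is a minimal sketch, assuming markup modelled on the div shown above; the book-block wrapper and the book-block-price class are hypothetical names introduced only for illustration, and the real page may be structured differently.

```python
from lxml import html

# Hypothetical markup modelled on the snippet above.
# The "book-block-price" class name is an assumption, not taken from the real site.
sample = """
<html><body>
  <div class="book-block">
    <div class="book-block-title" itemprop="name">Book 1</div>
    <div class="book-block-price">$10</div>
  </div>
  <div class="book-block">
    <div class="book-block-title" itemprop="name">Book 2</div>
    <div class="book-block-price">$20</div>
  </div>
</body></html>
"""

tree = html.fromstring(sample)

# Each query returns a list of text nodes in document order
books = tree.xpath('//div[@class="book-block-title"]/text()')
prices = tree.xpath('//div[@class="book-block-price"]/text()')

# Pair each title with its price
catalog = list(zip(books, prices))
print(catalog)
```

Pairing two independent lists with zip only works when every book has exactly one price; if some entries can lack a price, querying each book block separately is safer.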

Note

Learn more on lxml at http://lxml.de.

Scrapy

Scrapy is an open-source framework for web scraping and web crawling. It can be used to crawl and parse a whole website, and as a framework it helps us build spiders for specific requirements. Besides Scrapy, we can use mechanize to write scripts that fill in and submit forms.

We can utilize the command line interface of Scrapy to create the basic boilerplate for new spidering scripts. Scrapy can be installed with pip.

To create a new spider, we have to run the following command in the terminal after installing Scrapy:

 $ scrapy startproject testSpider

This will generate a project folder named testSpider in the current working directory, along with a basic structure and files inside the folder for our spider.

Scrapy has CLI commands to create a spider. To create a spider, we have to enter the folder generated by the startproject command:

 $ cd testSpider

Then we have to enter the generate spider command:

 $ scrapy genspider pactpub pactpub.com

This will create the spider file pactpub.py inside the spiders folder of the project.

Now open the items.py file and define a new field in the Item subclass called TestspiderItem:

from scrapy.item import Item, Field 
class TestspiderItem(Item): 
    # define the fields for your item here: 
    book = Field() 

Scrapy provides most of the crawling logic, and the genspider command has already generated a skeleton spider in the pactpub.py file inside the spiders folder, so we only have to extend it to write our spider. To do this, we have to edit the pactpub.py file in the spiders folder.

Inside the pactpub.py file, first we import the required modules:

from scrapy.spiders import Spider 
from scrapy.selector import Selector 
from pprint import pprint 
from testSpider.items import TestspiderItem 

Then, we have to extend Scrapy's Spider class to define our PactpubSpider class. Here we can define the domain and initial URLs for crawling:

# Extend  Spider Class 
class PactpubSpider(Spider): 
    name = "pactpub" 
    allowed_domains = ["pactpub.com"] 
    start_urls = ( 
        'https://www.pactpub.com/all', 
    ) 

After that, we have to define the parse method, which creates an instance of the TestspiderItem class that we defined in the items.py file for each book block and collects these instances in the items list.

Then we can add the fields to extract, which can be done with XPath or CSS-style selectors.

Here, we are using an XPath selector:

    # Define parse
    def parse(self, response):
        res = Selector(response)
        items = []
        for sel in res.xpath('//div[@class="book-block"]'):
            item = TestspiderItem()
            # The leading dot keeps the query relative to the current book block
            item['book'] = sel.xpath('.//div[@class="book-block-title"]/text()').extract()
            items.append(item)
        return items
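When an XPath query runs inside a loop over selected elements, an expression that starts with // searches the whole document, while one that starts with .// searches only below the current element. The difference can be seen in a small standalone sketch; it uses lxml rather than Scrapy's Selector so that it runs outside a Scrapy project, and the markup is a hypothetical sample.

```python
from lxml import html

# Hypothetical markup with two book blocks
sample = """<div id="shelf">
  <div class="book-block"><div class="book-block-title">Book 1</div></div>
  <div class="book-block"><div class="book-block-title">Book 2</div></div>
</div>"""

tree = html.fromstring(sample)

absolute, relative = [], []
for block in tree.xpath('//div[@class="book-block"]'):
    # '//' starts from the document root, so every title matches each time
    absolute.append(block.xpath('//div[@class="book-block-title"]/text()'))
    # './/' starts from the current block, so only its own title matches
    relative.append(block.xpath('.//div[@class="book-block-title"]/text()'))

print(absolute)
print(relative)
```

With the absolute form, every item would carry the titles of all books on the page; the relative form gives one title per block.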

Now we are ready to run the spider. We can run it using the following command:

 $ scrapy crawl pactpub --output results.json

This will start Scrapy with the URLs we defined. Each downloaded response is passed to the parse method, which creates a new TestspiderItem instance for each matching book block, and the collected items are written to results.json.
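The JSON feed written by the --output option is an array with one object per item, so it can be read back with the standard json module. The following is a minimal sketch; the sample items are hypothetical stand-ins for real crawl output and are written to a temporary file so that the example runs on its own.

```python
import json
import os
import tempfile

# Hypothetical sample of what results.json might contain:
# one object per item, with 'book' holding the list returned by extract()
sample_items = [{"book": ["Book 1"]}, {"book": ["Book 2"]}]

# Write the sample to a temporary file so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "results.json")
with open(path, "w") as fh:
    json.dump(sample_items, fh)

# Load the feed and flatten the per-item lists into one list of titles
with open(path) as fh:
    items = json.load(fh)

titles = [title.strip() for item in items for title in item.get("book", [])]
print(titles)
```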

E-mail gathering

Using the Python modules discussed previously, we can gather e-mails and other information from the web.

To get e-mail addresses from a website, we may have to write customized scraping scripts.

Here, we discuss a common method of extracting e-mails from a web page with Python.

Let's go through an example. Here, we are using BeautifulSoup and the requests module:

# Importing Modules  
from bs4 import BeautifulSoup 
import requests 
import requests.exceptions 
import urllib.parse as urlparse  # Python 3; on Python 2 use "import urlparse"
from collections import deque 
import re 

Next, we will provide the list of URLs to crawl:

# List of urls to be crawled 
urls = deque(['https://www.packtpub.com/']) 

Next, we store the processed URLs in a set so as not to process them twice:

# URLs that we have already crawled 
scraped_urls = set() 

Collected e-mails are also stored in a set:

# Crawled emails 
emails = set() 

When we start scraping, we take a URL from the queue, process it, and add it to the set of processed URLs. We repeat this until the queue is empty:

# Scrape urls one by one until the queue is empty
while len(urls):
    # move next url from the queue to the set of scraped urls
    url = urls.popleft()
    scraped_urls.add(url)

With the urlparse module we will get the base URL. This will be used to convert relative links to absolute links:

    # Get  base url 
    parts = urlparse.urlsplit(url) 
    base_url = "{0.scheme}://{0.netloc}".format(parts) 
    path = url[:url.rfind('/')+1] if '/' in parts.path else url 
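To see what these values look like, the same expressions can be run on a sample URL. This is a standalone sketch written for Python 3, where the function lives in urllib.parse; the URL and the two links are hypothetical. It also compares the results with urljoin from the standard library, which is an alternative way to resolve relative links and is not what the script in this section uses.

```python
from urllib.parse import urlsplit, urljoin

# Hypothetical page URL used only for illustration
url = "https://www.packtpub.com/books/index.html"

parts = urlsplit(url)
base_url = "{0.scheme}://{0.netloc}".format(parts)
path = url[:url.rfind('/') + 1] if '/' in parts.path else url

print(base_url)  # scheme and host only
print(path)      # URL up to and including the last slash

# Resolving a root-relative and a document-relative link by hand
root_link = base_url + "/all"
relative_link = path + "python.html"

# urljoin gives the same answers for these two simple cases
print(urljoin(url, "/all") == root_link)
print(urljoin(url, "python.html") == relative_link)
```

String concatenation covers the common cases, while urljoin also handles links such as ../other.html or ones that carry their own scheme.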

We fetch the content of the URL inside a try-except block. If the request fails, we skip to the next URL:

    # get url's content 
    print("Scraping %s" % url) 
    try: 
        response = requests.get(url) 
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError): 
        # ignore  errors 
        continue 
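The same error handling can be wrapped in a small helper. The following is a sketch under a few assumptions: the fetch name and the 10-second timeout are illustrative choices, and it catches requests.exceptions.RequestException, the base class of the two exceptions above, so that timeouts and HTTP error statuses are skipped as well.

```python
import requests


def fetch(url, timeout=10):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        # A timeout stops an unresponsive server from stalling the whole crawl
        response = requests.get(url, timeout=timeout)
        # Treat 4xx and 5xx responses as failures too
        response.raise_for_status()
    except requests.exceptions.RequestException:
        return None
    return response.text


# A URL without a scheme fails before any network access takes place
print(fetch("not-a-valid-url"))
```

Inside the crawl loop, a None result would play the role of the continue statement.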

Inside the response, we will search for the e-mails and add the e-mails found to the e-mails set:

    # Search e-mail addresses and add them into the output set 
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I)) 
    emails.update(new_emails) 
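The regular expression can be checked on its own against a short piece of text. In this standalone sketch the sample text and addresses are made up for illustration.

```python
import re

# Same pattern as in the scraping loop
EMAIL_PATTERN = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

# Hypothetical page text: one address appears twice, one string has no domain suffix
text = (
    "Contact sales@example.com or Support.Team@mail.example.org today. "
    "Write to sales@example.com again. Not an address: user@localhost"
)

# findall returns every match; the set removes the duplicate
found = set(re.findall(EMAIL_PATTERN, text, re.I))
print(sorted(found))
```

The pattern is deliberately loose: it requires a dot in the domain part, so user@localhost is skipped, but it does not validate addresses against the full e-mail syntax.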

After scraping the page, we parse it with BeautifulSoup, collect all the links to other pages, and update the URL queue:

    # create a BeautifulSoup object for the HTML document
    soup = BeautifulSoup(response.text, "html.parser")

    # find and process all the anchors
    for anchor in soup.find_all("a"):
        # extract the link url (empty string if there is no href)
        link = anchor.attrs.get("href", "")
        # resolve relative links
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # add the new url to the queue if we have not seen it yet
        if link not in urls and link not in scraped_urls:
            urls.append(link)
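The queue-and-set pattern that drives this loop can be tried in isolation, without any network access. The following is a minimal sketch in which a hypothetical in-memory link graph stands in for real pages; the crawl function and its max_pages cap are illustrative additions and not part of the script above, which keeps running until the queue is empty.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it
site = {
    "/": ["/books", "/about"],
    "/books": ["/", "/books/python"],
    "/about": ["/"],
    "/books/python": ["/books"],
}


def crawl(start, max_pages=100):
    """Visit pages breadth-first, each at most once, up to max_pages."""
    queue = deque([start])
    seen = {start}   # every URL that has ever been queued
    order = []       # URLs in the order they were processed
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in site.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


print(crawl("/"))
print(crawl("/", max_pages=2))
```

Recording a URL in the set when it is queued, rather than when it is processed, makes the duplicate check a single set lookup instead of a scan of the deque. A page cap is worth considering for a real crawl, because following every link on a public site can grow without bound.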