Web scraping

Even though some sites offer APIs, most websites are designed mainly for human eyes and only provide HTML pages formatted for humans. If we want a program to fetch some data from such a website, we have to parse the markup to get the information we need. Web scraping is the method of using a computer program to analyze a web page and get the data needed.

There are many methods to fetch the content from the site with Python modules:

  • Use urllib/urllib2 to create an HTTP request that fetches the web page, and use BeautifulSoup to parse the HTML
  • To crawl an entire website, we can use Scrapy (http://scrapy.org), which helps to create web spiders
  • Use the requests module to fetch the page and lxml to parse it

urllib / urllib2 module

Urllib is a high-level module that allows us to script different services such as HTTP, HTTPS, and FTP.

Useful methods of urllib/urllib2

Urllib/urllib2 provide methods for getting resources from URLs, including opening web pages, encoding arguments, and creating and manipulating headers. Some of the most useful methods are as follows:

  • Open a web page using urlopen(). When we pass a URL to the urlopen() method, it returns a file-like object. We can call the read() method of this object to get the data as a string, as follows:
        import urllib 
 
        url = urllib.urlopen("http://packtpub.com/") 
 
        data = url.read() 
 
        print data 
  • The next method is parameter encoding: urlencode(). It takes a dictionary of fields as input and creates a URL-encoded string of parameters:
        import urllib 
 
        fields = { 
          'name' : 'Sean', 
          'email' : 'Sean@example.com' 
        } 
 
        parms = urllib.urlencode(fields) 
        print parms 
  • We can also send requests with parameters. For a GET request, the URL is built by appending the URL-encoded parameters to it:
        import urllib 
        fields = { 
          'name' : 'Sean', 
          'email' : 'Sean@example.com' 
        } 
        parms = urllib.urlencode(fields) 
        u = urllib.urlopen("http://example.com/login?"+parms) 
        data = u.read() 
 
        print data 
  • Using the POST request method, the URL-encoded parameters are passed to the method urlopen() separately:
        import urllib 
        fields = { 
          'name' : 'Sean', 
          'email' : 'Sean@example.com' 
        } 
        parms = urllib.urlencode(fields) 
        u = urllib.urlopen("http://example.com/login", parms) 
        data = u.read() 
        print data 
  • The HTTP response headers can be retrieved using the info() method, which returns a dictionary-like object:
        u = urllib.urlopen("http://packtpub.com", parms) 
        response_headers = u.info() 
        print response_headers 
  • We can also use keys() to get all the response header keys:
>>> print response_headers.keys() 
['via', 'x-country-code', 'age', 'expires', 'server', 'connection', 'cache-control', 'date', 'content-type']
  • We can access each entry as follows:
>>> print response_headers['server'] 
nginx/1.4.5 
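
The examples above target Python 2. As a minimal sketch, assuming Python 3 (where urlencode() lives in urllib.parse), the parameter encoding used for the GET and POST requests can be checked without touching the network. The example.com URL is the same placeholder used above, and parse_qs() appears here only to show that the encoding can be reversed:

```python
from urllib.parse import parse_qs, urlencode

# Same illustrative fields as in the examples above
fields = {
    'name': 'Sean',
    'email': 'Sean@example.com',
}

# URL-encode the dictionary; reserved characters such as '@' are escaped
query = urlencode(fields)
print(query)  # name=Sean&email=Sean%40example.com

# GET: the encoded parameters are appended to the URL after '?'
get_url = "http://example.com/login?" + query
print(get_url)

# POST: the same encoded string is sent as the request body (bytes in Python 3)
post_body = query.encode("ascii")
print(post_body)

# Decoding the query string gives back the original values
decoded = parse_qs(query)
print(decoded)
```

Nothing is sent here; the sketch only shows what would be placed in the URL or in the request body.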

Note

Urllib does not support cookies and authentication. Also, it only supports GET and POST requests. Urllib2 is built upon urllib and has many more features.

  • We can get the status code from the code attribute of the response (the getcode() method returns the same value):
        u = urllib.urlopen("http://packtpub.com", parms) 
        response_code = u.code 
        print response_code 
  • We can modify the request headers with urllib2 as follows:
        import urllib2

        # The User-Agent value must stay on one line; a quoted string cannot span lines
        headers = { 
          'User-Agent' : 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0' 
        }
        request = urllib2.Request("http://packtpub.com/", headers=headers)
        url = urllib2.urlopen(request)
        response = url.read()
  • Cookies can be used as follows:
        import urllib
        import urllib2

        fields = {
          'name' : 'sean',
          'password' : 'password!',
          'login' : 'LogIn'
        }

        # Create a custom opener with cookie handling enabled
        opener = urllib2.build_opener(
          urllib2.HTTPCookieProcessor()
        )

        # Build the login request; passing data makes it a POST
        request = urllib2.Request(
          "http://example.com/login",
          urllib.urlencode(fields))

        # Send the login request
        url = opener.open(request)
        response = url.read()

        # The opener keeps the cookies set by the login response,
        # so we can now access pages that require the session
        url = opener.open("http://example.com/dashboard")
        response = url.read()
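
For readers on Python 3, where urllib2 was merged into urllib.request and urlencode() moved to urllib.parse, the header and cookie steps above can be sketched roughly as follows. This is an assumption-laden illustration rather than the chapter's own code: the login URL and field names are the same placeholders as above, and the actual network calls are left commented out so that the sketch only builds and inspects the request.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

fields = {"name": "sean", "password": "password!", "login": "LogIn"}
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) "
                  "Gecko/20100101 Firefox/41.0"
}

# In Python 3 the POST body must be bytes, so encode the URL-encoded string
body = urlencode(fields).encode("ascii")

# Supplying data turns the request into a POST; headers are sent as given
request = Request("http://example.com/login", data=body, headers=headers)

# The cookie jar collects cookies set by responses and replays them later
jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(jar))

print(request.get_method())              # POST
print(request.get_header("User-agent"))  # Request stores keys capitalized this way
print(len(jar))                          # 0, nothing has been fetched yet

# Sending the requests (not executed in this sketch):
# response = opener.open(request).read()
# dashboard = opener.open("http://example.com/dashboard").read()
```

Keeping the CookieJar in a variable makes it possible to inspect which cookies the server set after the login request.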

Requests module

We can also use the requests module instead of urllib/urllib2. It is often the better option: it supports the full set of HTTP methods used by REST APIs, and it accepts a plain dictionary of parameters, so we do not have to URL-encode them ourselves:

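import requests 

# A plain dictionary; requests URL-encodes it for us
fields = {'name': 'Sean', 'email': 'Sean@example.com'}
response = requests.get("http://packtpub.com", params=fields) 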
 
# Response 
print response.status_code # Response Code   
print response.headers # Response Headers   
print response.content # Response Content 
 
# Request 
print response.request.headers # Headers we sent 

Parsing HTML using BeautifulSoup

The preceding modules only fetch content. To parse the HTML obtained via urlopen, we can use the BeautifulSoup module. BeautifulSoup takes raw HTML or XML, such as the data returned by urlopen, and pulls data out of it. With an event-driven parser, such as the standard library's HTMLParser, we create a parser object and feed it some data; it scans through the data and triggers the various handler methods. BeautifulSoup instead builds a tree of objects from the markup, which we can then navigate and search. Beautiful Soup 4 works on both Python 2.6+ and Python 3.
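
The handler-driven model mentioned above can be sketched with the standard library's HTMLParser, assuming Python 3 (the module is html.parser there). The LinkCollector class and the sample markup are hypothetical names introduced only for this illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser for each opening tag; attrs is a list of (name, value)
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


sample = ('<html><body><p id="para1">This is a paragraph'
          '<a href="http://example1.com">Example Link 1</a></p>'
          '<p id="para2">This is a paragraph'
          '<a href="http://example2.com">Example Link 2</a></p></body></html>')

collector = LinkCollector()
collector.feed(sample)   # scan the data and trigger the handlers
collector.close()
print(collector.links)   # ['http://example1.com', 'http://example2.com']
```

BeautifulSoup hides this handler plumbing: instead of writing callbacks, we query the finished tree.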

The following are some simple examples:

  • To prettify the HTML, use the following code:
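         from bs4 import BeautifulSoup

         # Adjacent string literals are joined, so the markup stays one unbroken string
         html = ('<html><head><title>Title of the page</title></head>'
                 '<body><p id="para1" align="center">This is a paragraph<b>one</b>'
                 '<a href="http://example1.com">Example Link 1</a> </p>'
                 '<p id="para2">This is a paragraph<b>two</b>'
                 '<a href="http://example2.com">Example Link 2</a></p></body></html>')

         # Name the parser explicitly so that the result does not depend on what is installed
         parse = BeautifulSoup(html, 'html.parser')

         print parse.prettify()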
  • Some example ways to navigate through the HTML with BeautifulSoup are as follows:
>>> parse.contents[0].name
u'html'
>>> parse.contents[0].contents[0].name
u'head'
>>> head = parse.contents[0].contents[0]
>>> head.parent.name
u'html'
>>> head.next
<title>Title of the page</title>
>>> head.nextSibling.name
u'body'
>>> head.nextSibling.contents[0]
<p id="para1" align="center">This is a paragraph<b>one</b><a href="http://example1.com">Example Link 1</a> </p>
>>> head.nextSibling.contents[0].nextSibling
<p id="para2">This is a paragraph<b>two</b><a href="http://example2.com">Example Link 2</a></p>
  • Some ways to search through the HTML for tags and properties are as follows:
>>> parse.find_all('a')
[<a href="http://example1.com">Example Link 1</a>, <a href="http://example2.com">Example Link 2</a>]
>>> parse.find(id="para2")
<p id="para2">This is a paragraph<b>two</b><a href="http://example2.com">Example Link 2</a></p>

Download all images on a page

Now we can write a script to download all images on a page and save them in a specific location:

# Importing required modules
import os
import sys
import requests
from bs4 import BeautifulSoup
import urlparse  # renamed to urllib.parse in Python 3

# Get the page with requests
response = requests.get('http://www.freeimages.co.uk/galleries/food/breakfast/index.htm')

# Parse the page with BeautifulSoup
parse = BeautifulSoup(response.text, 'html.parser')

# Get all image tags
image_tags = parse.find_all('img')

# Get the URLs of the images, skipping tags without a src attribute
images = [tag.get('src') for tag in image_tags if tag.get('src')]

# Exit if no images were found in the page
if not images:
    sys.exit("Found No Images")

# Convert relative URLs to absolute URLs, if any
images = [urlparse.urljoin(response.url, url) for url in images]
print 'Found %s images' % len(images)

# Create the download folder if it does not exist yet
if not os.path.isdir('downloaded'):
    os.makedirs('downloaded')

# Download the images; 'wb' writes the raw bytes unchanged
for url in images:
    r = requests.get(url)
    f = open('downloaded/%s' % url.split('/')[-1], 'wb')
    f.write(r.content)
    f.close()
    print 'Downloaded %s' % url
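
The script takes the file name from the last segment of the URL, which can include a query string or be empty when the URL ends in a slash. One possible refinement, sketched here under the assumption of Python 3 (urllib.parse instead of urlparse), is to resolve each src value and derive the local path from the URL's path component only. The image_targets() helper, its folder argument, and the "index" fallback name are hypothetical and not part of the original script:

```python
import os
from urllib.parse import urljoin, urlparse


def image_targets(page_url, srcs, folder="downloaded"):
    """Return (absolute_url, local_path) pairs for the given src values."""
    targets = []
    for src in srcs:
        if not src:
            continue  # <img> tag without a usable src attribute
        # Resolve relative references against the page URL
        absolute = urljoin(page_url, src)
        # Use only the path component, so query strings do not end up in file names
        name = os.path.basename(urlparse(absolute).path) or "index"
        targets.append((absolute, os.path.join(folder, name)))
    return targets


# Example with made-up URLs; no request is sent
pairs = image_targets(
    "http://example.com/gallery/index.htm",
    ["images/a.jpg", "/static/b.png?size=2", None, "http://cdn.example.com/c.gif"],
)
for absolute, path in pairs:
    print(absolute, "->", path)
```

Each pair could then be passed to requests.get() and written to disk in binary mode, as in the script above.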