Web-scraping: retrieving and processing web resources

There are many reasons why we might want to retrieve data from the internet, particularly as many academic research resources are becoming increasingly available online.

Examples include gene sequence databases for a range of organisms (flybase, fishbase FIND URLS), image databases for computer vision benchmarking and/or training machine-learning based systems (URLS), and academic publication databases such as the arXiv preprint server (URL), to name but a few.

Retrieving a web page

Python makes retrieving online data relatively simple, even when just using the Standard Library! In fact, we did just that for one of the introductory exercises, though you were instructed to “blindly” copy and paste the relevant (two!) lines of code:

import urllib.request
text = urllib.request.urlopen("http://www.textfiles.com/etext/AUTHORS/SHAKESPEARE/shakespeare-romeo-48.txt").read().decode('utf8')

The Standard Library module in question, urllib, contains a submodule request whose urlopen function returns a file-like object (i.e. one that has a read member function) that we can use to access the web resource; in the case of the exercise this was a remotely hosted text file.
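
Besides read, the returned object also exposes the HTTP status code and response headers, which can be handy for checking that a request succeeded; a small sketch, reusing the URL from above:

import urllib.request

response = urllib.request.urlopen(
    "http://www.textfiles.com/etext/AUTHORS/SHAKESPEARE/shakespeare-romeo-48.txt"
)
print(response.getcode())                    # HTTP status code, e.g. 200 for success
print(response.getheader("Content-Type"))    # e.g. "text/plain"
text = response.read().decode("utf8")        # read() works just as for a local file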

No robots…

Servers have the ability to deny access to their resources to web robots, i.e. programs that scour the internet for data, such as spiders/web crawlers. One way of doing this is via the robots.txt file located at the root of the site's URL.
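
The Standard Library's urllib.robotparser module can read such a file and tell us whether a given page may be fetched; a minimal sketch (the textfiles.com address is just an illustrative choice):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.textfiles.com/robots.txt")  # robots.txt sits at the site root
rp.read()
# True if a generic robot ("*") may fetch the given page
print(rp.can_fetch("*", "http://www.textfiles.com/etext/"))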

For the purpose of our academic exercise, we're going to identify ourselves as a web browser, so that we can still access such resources.

To do so, we need to set the user-agent associated with our request, as follows:

import urllib.request
# url is the address of the resource to retrieve, e.g. the text file used above
req = urllib.request.Request(
    url,
    data=None,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
    }
)
text = urllib.request.urlopen(req).read().decode("utf-8")

Finding data in a page

When interfacing with web pages, i.e. HTML documents, things get a little trickier, as the raw data contains information about the web page's layout, design, interactive (Javascript) libraries, and more, all of which makes the page source harder to understand.

Fortunately there are Python modules for “parsing” such information into helpful structures.

Here we will use the lxml module.

To turn this hard-to-read text into a structured object, we can use:

from lxml import html
import urllib.request
url = "http://www.example.com"  # placeholder; replace with the page you want to parse
text = urllib.request.urlopen(url).read().decode('utf8')
htmltree = html.fromstring(text)

The htmltree variable holds a tree structure that represents the web-page.
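
For example, we can already pull simple pieces of information out of the tree, such as the page title or the top-level elements it contains (a small sketch, assuming htmltree was created as above):

# The page title, if the document has a <title> element
print(htmltree.findtext(".//title"))

# The top-level children of the <html> element (typically <head> and <body>)
for child in htmltree:
    print(child.tag)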

Background

For more information about the node structure of HTML documents, see e.g. http://www.w3schools.com/js/js_htmldom_navigation.asp

For more general information about HTML see https://en.wikipedia.org/wiki/HTML

Now that the raw text has been parsed into a tree, we can query the tree.

Query expressions

To find elements in the html tree, there are several mechanisms we can use with lxml.

CSS selector style expressions

CSS selector expressions are applied using cssselect; they correspond to the part of a web page's CSS (style) file that determines which elements the subsequent style specifications are applied to.

For example, to extract all divs, we would use

divs = htmltree.cssselect("div")

or to get all hyperlinks on a web page,

hrefs = [a.attrib['href'] for a in htmltree.cssselect("a")]
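
cssselect understands the usual CSS selector syntax, so classes and ids can be used to narrow things down; a small sketch (the class and id names here are purely hypothetical):

# All paragraphs inside divs with class "content" (hypothetical class name)
paragraphs = htmltree.cssselect("div.content p")

# The element with id "main-table" (hypothetical id)
tables = htmltree.cssselect("#main-table")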

XPath

XPath (from XML Path Language) is a language for selecting nodes from an XML tree.

A simple example to retrieve all hyperlinks (as above) would be:

hrefs = htmltree.xpath("//a/@href")
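
XPath expressions can also filter on attributes and pull out text content directly; for example (the class name here is hypothetical):

# The text content of every paragraph on the page
paragraph_text = htmltree.xpath("//p/text()")

# Hyperlinks only within divs of a particular (hypothetical) class
links = htmltree.xpath('//div[@class="content"]//a/@href')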

A quick and simple way to determine the XPath expression for a given element is to inspect it using e.g. Google Chrome's Inspect option when right-clicking on a web-page element.

In the element view, right click and select Copy > Copy XPath.

More information about the XPath syntax can be found here: https://en.wikipedia.org/wiki/XPath

Exercise : Pulling tabulated data from a website

Write a script that

  • Pulls the text from https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/timeseries/ukpop/pop
  • Turns it into an element tree
  • Extracts the table on the page
  • Plots the first column vs the second column using Matplotlib

Important notes

  • When viewing the website, click on “table” to make the table visible; the table is present in the HTML, so this isn’t an issue when scraping!
  • You will need to change the User-Agent as described above.
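
If you get stuck, one possible starting point is sketched below; the expression for extracting the table rows, and the conversion of the two columns to numbers, are deliberately left for you to work out:

import urllib.request
from lxml import html
import matplotlib.pyplot as plt

url = ("https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration"
       "/populationestimates/timeseries/ukpop/pop")
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0 ..."})  # full User-Agent string as above
text = urllib.request.urlopen(req).read().decode("utf-8")
htmltree = html.fromstring(text)

# TODO: pull the rows out of the table with cssselect or xpath,
# convert the first and second columns to numbers, then e.g.
# plt.plot(years, populations)
# plt.show()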

Passing simple data to the server: URL encoded arguments

Passing very simple data to the server is often done in the form of URL parameters.

Whenever you have seen a url that looks like, e.g.

https://www.google.co.uk/webhp?ie=UTF-8#q=python

(feel free to click on the link!) you are passing information to the server.

In this case the information is in the form of two variables, ie=UTF-8 and q=python. The first is presumably information about the encoding that my browser requested, and the second is the query term I asked for, in this case from Google.

Simple URL arguments follow an initial question mark, ?, and are ampersand (&) separated key=value pairs.

Google uses a fragment identifier, which starts at the hash symbol (#), for the actual query part of the request, q=python.

For such websites, passing data to the server is a relatively simple task of determining which query parameters the site takes, and then encoding your own.

For example if we create a list of search terms

terms = [
    "python image processing",
    "python web scraping",
    "python data analysis",
    "python graphical user interface",
    "python web applications",
    "python plotting",
    "python numerical analysis",
    "python generators",
]

we could generate valid search URLs using

baseurl = "https://www.google.co.uk/webhp?ie=UTF-8#q="
urls = [baseurl + t.replace(" ", "+") for t in terms]

for url in urls:
    # Perform request, parse the data, etc.
    ...

Here we simply determined the “base” part of the URL and then added on the search terms, replacing the human-readable spaces with “+” separators.
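
Incidentally, the Standard Library can do this space-to-plus replacement (and escape any other special characters) for us, via urllib.parse.quote_plus:

import urllib.parse

print(urllib.parse.quote_plus("python image processing"))
# -> python+image+processing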

urllib also provides a function to perform this parameter encoding for us,

import urllib.parse

data = {'name': 'Joe Bloggs', 'age': 42}
url_values = urllib.parse.urlencode(data)
print(url_values)

generates

name=Joe+Bloggs&age=42

For complicated encodings, this is certainly better than manually generating the parameter string.
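
The encoded parameters can then simply be appended to the base URL after the question mark; a small sketch using Google's standard search endpoint:

import urllib.parse

params = urllib.parse.urlencode({"q": "python web scraping", "ie": "UTF-8"})
url = "https://www.google.co.uk/search?" + params
print(url)  # https://www.google.co.uk/search?q=python+web+scraping&ie=UTF-8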

Making POST requests

So far we have been dealing with what are known as GET requests.

GET was the first method used by the HTTP protocol (i.e. the Web).

Nowadays there are several more methods that can be used:

  • GET
  • HEAD
  • POST
  • PUT
  • DELETE
  • TRACE
  • OPTIONS
  • CONNECT
  • PATCH

Most of these are relatively technical methods; the POST method, however, is in widespread use as the standard mechanism for sending data to the server.

POST requests are commonly associated with forms on web pages. Whenever a form is submitted, a POST request is made to a specific URL, which then elicits a response.

Many web-based resources use forms to retrieve data.

Luckily for us, making a POST request is relatively straightforward in Python; the tricky part is usually determining the POST parameters.

For example, to POST data to Exeter University’s phone directory (http://www.exeter.ac.uk/phone/), we could use

import urllib.request
import urllib.parse

url = "http://www.exeter.ac.uk/phone/results.php"
# The form parameters, URL-encoded and converted to bytes
data = urllib.parse.urlencode({"page": "1", "fuzzy": "0", "search": "Jeremy Metz"}).encode("utf-8")
# Passing data= makes urllib issue a POST rather than a GET request
req = urllib.request.Request(url, data=data)
text = urllib.request.urlopen(req).read().decode("utf-8")

Alternatives to urllib

urllib comes with Python and is relatively straightforward to use. There are, however, alternative libraries that achieve the same tasks, such as requests, which aims to be a simpler alternative to urllib.
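
As a flavour of the requests API, the GET and POST examples from above might look roughly as follows (a sketch; requests must be installed separately, e.g. with pip):

import requests

# GET with a custom User-Agent (url and full User-Agent string as above)
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 ..."})
text = r.text

# POST form data (the phone-directory example from above)
r = requests.post("http://www.exeter.ac.uk/phone/results.php",
                  data={"page": "1", "fuzzy": "0", "search": "Jeremy Metz"})
text = r.text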

For full web crawling tasks, scrapy is a popular web spider/crawler library, but is beyond the scope of this section.

Dealing with Javascript rendered objects

Many modern websites use Javascript to generate content dynamically.

This makes things trickier when we try to scrape data from such pages, as the raw page text doesn’t contain the data we want!

Ultimately, we will need to run the Javascript to generate the data we want, and there are several options for doing so, including

  • Selenium : A full web automation and testing framework (see the sketch at the end of this section)
  • ghost.py : A webkit client
  • dryscrape : A lightweight webkit client
  • Build your own! : The PyQt5 library includes a web renderer!

Webkit, mentioned above, is the layout engine component used in many modern browsers such as Safari, Google Chrome, and Opera (Firefox, on the other hand, uses an equivalent engine called Gecko).

As an example, the site http://pycoders.com/archive/ contains a Javascript-generated list of archives.

Using PyQt5 we can pull this list using

import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
from lxml import html

# Take this class for granted; just use the result of rendering.
class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    self.app.exec_()

  def _loadFinished(self, result):
    self.frame = self.mainFrame()
    self.app.quit()

url = 'http://pycoders.com/archive/'
r = Render(url) 
result = r.frame.toHtml()
htmltree = html.fromstring(result)
archive_links = htmltree.xpath('//div[2]/div[2]/div/div[@class="campaign"]/a/@href')
print(archive_links)
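
For comparison, Selenium (mentioned in the list above) drives a real browser and therefore renders the Javascript for us; a rough sketch, assuming a suitable driver such as chromedriver is installed:

from selenium import webdriver
from lxml import html

driver = webdriver.Chrome()  # requires chromedriver to be available on the PATH
driver.get("http://pycoders.com/archive/")
htmltree = html.fromstring(driver.page_source)
archive_links = htmltree.xpath('//div[@class="campaign"]/a/@href')  # the same "campaign" divs as above
driver.quit()
print(archive_links)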