
Use parslepy with ScraperWiki


Francis et al. at ScraperWiki very kindly agreed to include parslepy as an available package in the "Code in your browser" tool if you code in Python.

Here I will show you how to use parslepy for your scrapers, and hopefully you'll find it as useful and easy to use as I think it is :) Full documentation is here.

Example website to scrape: Domino Records

As an example, I'll work with the Domino Records UK website http://www.dominorecordco.com/, in particular the album releases pages. At the end of the tutorial we should have a list of Domino Records releases, with artist name, album name, catalog number, cover image and release date.

Your first scraping rules

With parslepy, you must know some CSS or XPath syntax. CSS is probably easier for most of us, but sometimes XPath is the only option. A good introduction to CSS selectors can be found here on W3Schools.

For each piece of information you're interested in on a web page, you should determine the containing HTML element, with its class and/or id to be as unambiguous as possible, and translate that into CSS or XPath. You can use your favorite browser developer tool, such as Firefox's Firebug or Chrome's "Inspect element" (or even the quite amazing SelectorGadget).

Then, you decide on an output name for this piece of data and you have one of your extraction rules for parslepy:

# the page title as a CSS selector: a TITLE tag inside a HEAD tag
{
"page_title": "head title"
}

If I also want to get the logo on Domino's homepage, I would add this rule:

{
    "page_title": "head title",
    "logo_url": "#logolink img @src"
}

This second rule is a mix of CSS (#logolink img) and an attribute (@src), with the syntax borrowed from XPath. Indeed, CSS syntax alone cannot point to an attribute, only to elements.

By default, parslepy extracts the text content of an element, or the attribute's value if you use the @ syntax at the end of your rule's selector.
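
For instance, here is a minimal sketch contrasting the two (the HTML snippet and output keys are made up for illustration; it relies on the same parse_fromstring() call used later in this tutorial):

import parslepy

# made-up HTML snippet, just to illustrate text vs. attribute extraction
snippet = ('<html><head><title>Hello</title></head>'
           '<body><a id="home" href="/index.html">Home</a></body></html>')

parselet = parslepy.Parselet({
    "link_text": "a#home",        # no "@": the element's text content, u'Home'
    "link_href": "a#home @href",  # "@href": the attribute's value, '/index.html'
})
print parselet.parse_fromstring(snippet)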

As Francis put it in his tweet about parslepy, the package works well for templated HTML content: content that has been generated by some sort of template giving structure and hierarchy to elements in a web page (sections, headers, lists etc.)

ScraperWiki + parslepy

If you open ScraperWiki's "Code in your browser" tool, you have something like this:

#!/usr/bin/env python

import scraperwiki
import requests

html = requests.get("http://www.flourish.org/blog")
print html.content

# Saving data:
# unique_keys = [ 'id' ]
# data = { 'id':12, 'name':'violet', 'age':7 }
# scraperwiki.sql.save(unique_keys, data)

Let's change that to import parslepy and fetch http://www.dominorecordco.com/

#!/usr/bin/env python

import scraperwiki
import requests
import parslepy

html = requests.get("http://www.dominorecordco.com")
print html.content

Let's throw in our scraping rules and feed them into parslepy to parse the web page. parslepy's central class is parslepy.Parselet, which you instantiate with your scraping rules as a dict (more options in the docs).

#!/usr/bin/env python
import scraperwiki
import requests
import parslepy

html = requests.get("http://www.dominorecordco.com")

rules = {
    "page_title": "head title",
    "logo_url": "#logolink img @src",
}
parselet = parslepy.Parselet(rules)
print parselet.parse_fromstring(html.content)

And, in the console for now, you end up with

{'logo_url': '/resources/sitewide/logo_with_transparency.gif', 'page_title': u'Domino | Home'}

Scraping Domino Records' releases page

Now, let's get started with the releases page: http://www.dominorecordco.com/uk/releases/. When you look at the source code, the interesting bits seem to be in <div id="content" class="articles">, more accurately in <div class="leftDoubleCol">, where there is a bunch of <div id="article_7406" class="greyborderbottom pad20bottom margin20bottom"> elements corresponding to the individual releases.

Within each of the article div tags, i.e. for each release,

  • the first div contains an img, and even an a link to a bigger image file
  • the first span tag contains the release name
  • in the h4 tag, there's a <span class="upperh4"> with the artist name and a link to the artist's page; it also contains the catalog number and release date
  • and finally, there is a link to a page with more information; the link text is "View full information?"

If you translate that to parslepy rules, you get

{
    "release_image_url":           ".leftDoubleCol div[id^=article_] div a img @src",
    "release_bigimage_url":        ".leftDoubleCol div[id^=article_] div a @href",
    "release_name":                ".leftDoubleCol div[id^=article_] span",
    "artist_name":                 ".leftDoubleCol div[id^=article_] h4 span.upperh4 a",
    "artist_page_url":             ".leftDoubleCol div[id^=article_] h4 span.upperh4 a @href",
    "release_artist_catalog_date": ".leftDoubleCol div[id^=article_] h4 span.upperh4",
    "release_more_info_link":      ".leftDoubleCol div[id^=article_] a:contains('View full information?') @href",
}

And when you run this in ScraperWiki:

#!/usr/bin/env python

import scraperwiki
import requests
import parslepy
import pprint

html = requests.get("http://www.dominorecordco.com/uk/releases/")

rules = {
    "release_image_url":           ".leftDoubleCol div[id^=article_] div a img @src",
    "release_bigimage_url":        ".leftDoubleCol div[id^=article_] div a @href",
    "release_name":                ".leftDoubleCol div[id^=article_] span",
    "artist_name":                 ".leftDoubleCol div[id^=article_] h4 span.upperh4 a",
    "artist_page_url":             ".leftDoubleCol div[id^=article_] h4 span.upperh4 a @href",
    "release_artist_catalog_date": ".leftDoubleCol div[id^=article_] h4 span.upperh4",
    "release_more_info_link":      ".leftDoubleCol div[id^=article_] a:contains('View full information?') @href",
}
parselet = parslepy.Parselet(rules)
pprint.pprint(parselet.parse_fromstring(html.content))

You get this:

{'artist_name': u'Quasi',
 'artist_page_url': '/artists/quasi/',
 'release_artist_catalog_date': u'Quasi | WIG312 | Released: 30/09/13',
 'release_bigimage_url': '/images/artists/quasi/quasi_molecity.jpg',
 'release_image_url': '/images/artists/quasi/150_150/quasi_molecity.jpg',
 'release_more_info_link': '/uk/albums/09-07-13/mole-city/',
 'release_name': u'MOLE CITY'}

Nesting to make things more readable

But there's a lot of duplicate .leftDoubleCol div[id^=article_] in all these rules. Can't we clean this up a little?

Sure we can, with a little bit of rule nesting: the common selector part is attached, within parentheses, to a new higher-level "nesting" key; it defines the selector "scope" for the now lower-level rules beneath it. Let's call this nesting key release:

rules = {
    "release(.leftDoubleCol div[id^=article_])": {
        "release_image_url":           "div a img @src",
        "release_bigimage_url":        "div a @href",
        "release_name":                "span",
        "artist_name":                 "h4 span.upperh4 a",
        "artist_page_url":             "h4 span.upperh4 a @href",
        "release_artist_catalog_date": "h4 span.upperh4",
        "release_more_info_link":      ".//a[contains(., 'View full information?')]/@href",
    }
}

And now you get this, with the same fields, but inside a "release" key in the output:

{'release': {'artist_name': u'Quasi',
             'artist_page_url': '/artists/quasi/',
             'release_artist_catalog_date': u'Quasi | WIG312 | Released: 30/09/13',
             'release_bigimage_url': '/images/artists/quasi/quasi_molecity.jpg',
             'release_image_url': '/images/artists/quasi/150_150/quasi_molecity.jpg',
             'release_more_info_link': '/uk/albums/09-07-13/mole-city/',
             'release_name': u'MOLE CITY'}}

You'll have noticed that I changed the last rule to use an XPath expression, because a:contains(...) throws (non-fatal?) exceptions like Exception UnicodeDecodeError: UnicodeDecodeError('utf8', '\xc0n\xb4\x01', 0, 1, 'invalid start byte') in 'lxml.etree._xpath_function_call' ignored (I don't know why).

But... wait a minute, weren't we supposed to get ALL releases and not just the first one?

Loops within a selector scope

Indeed, and to achieve that, you just need to add extra square brackets around your nested rules. Basically we want to go through all the .leftDoubleCol div[id^=article_] elements and extract data with the previous 7 rules for each of these segments of the page (and let's rename the nesting key to the plural releases):

rules = {
    "releases(.leftDoubleCol div[id^=article_])": [
        {
            "release_image_url":           "div a img @src",
            "release_bigimage_url":        "div a @href",
            "release_name":                "span",
            "artist_name":                 "h4 span.upperh4 a",
            "artist_page_url":             "h4 span.upperh4 a @href",
            "release_artist_catalog_date": "h4 span.upperh4",
            "release_more_info_link":      ".//a[contains(., 'View full information?')]/@href",
        }
    ]
}

And now you get this as output (I cut out some of it for readability):

{'releases': [{'artist_name': u'Quasi',
               'artist_page_url': '/artists/quasi/',
               'release_artist_catalog_date': u'Quasi | WIG312 | Released: 30/09/13',
               'release_bigimage_url': '/images/artists/quasi/quasi_molecity.jpg',
               'release_image_url': '/images/artists/quasi/150_150/quasi_molecity.jpg',
               'release_more_info_link': '/uk/albums/09-07-13/mole-city/',
               'release_name': u'MOLE CITY'},
              {'artist_name': u'Arctic Monkeys',
               'artist_page_url': '/artists/arctic-monkeys/',
               'release_artist_catalog_date': u'Arctic Monkeys | WIG317 | Released: 09/09/13',
               'release_bigimage_url': '/images/artists/arctic_monkeys/am_am_final_packshot.jpg',
               'release_image_url': '/images/artists/arctic_monkeys/150_150/am_am_final_packshot.jpg',
               'release_more_info_link': '/uk/albums/21-06-13/a-m/',
               'release_name': u"'AM'"},
              ...
              {'artist_name': u'Various Artists',
               'artist_page_url': '/artists/various-artists/',
               'release_artist_catalog_date': u'Various Artists | RUG508 | Released: 03/12/12',
               'release_bigimage_url': '/images/artists/various_artists/Domino_Remixes.jpg',
               'release_image_url': '/images/artists/various_artists/70_70/Domino_Remixes.jpg',
               'release_more_info_link': '/uk/singles/02-11-12/motion-sickness-remixes-vol-2/',
               'release_name': u'MOTION SICKNESS REMIXES VOL 2.'}]}

Now we can even save that into ScraperWiki's database (using scraperwiki.sql.save(), see https://github.com/scraperwiki/scraperwiki-python#saving-data), looping over the values under the releases key in the extracted dict and using the release_artist_catalog_date value as the unique key:

#!/usr/bin/env python
import scraperwiki
import requests
import parslepy

html = requests.get("http://www.dominorecordco.com/uk/releases/")
rules = {
    "releases(.leftDoubleCol div[id^=article_])": [
        {
            "release_image_url":           "div a img @src",
            "release_bigimage_url":        "div a @href",
            "release_name":                "span",
            "artist_name":                 "h4 span.upperh4 a",
            "artist_page_url":             "h4 span.upperh4 a @href",
            "release_artist_catalog_date": "h4 span.upperh4",
            "release_more_info_link":      ".//a[contains(., 'View full information?')]/@href",
        }
    ]
}
parselet = parslepy.Parselet(rules)
extracted = parselet.parse_fromstring(html.content)
for release in extracted.get("releases"):
    scraperwiki.sql.save(unique_keys=['release_artist_catalog_date'], data=release)

What about the other pages?

Following next page links

At the bottom of the page, there are links to older releases, page 1 to page 10, and a link called "Next". Let's point to that with a bit of CSS selector (it's the last link inside a div with id=paginate):

{ "next_page_url": "#paginate a:last-of-type @href"}

You'll get something like

{'next_page_url': '/uk/releases/?p=2',
 'releases': [{'artist_name': u'Quasi',
...

Now, let's write a little loop to fetch each page one at a time

#!/usr/bin/env python

import scraperwiki
import requests
import parslepy
import pprint

rules = {
    "releases(.leftDoubleCol div[id^=article_])": [
        {
            "release_image_url":           "div a img @src",
            "release_bigimage_url":        "div a @href",
            "release_name":                "span",
            "artist_name":                 "h4 span.upperh4 a",
            "artist_page_url":             "h4 span.upperh4 a @href",
            "release_artist_catalog_date": "h4 span.upperh4",
            "release_more_info_link":      ".//a[contains(., 'View full information?')]/@href",
        }
    ],
    "next_page_url": "#paginate a:last-of-type @href",
}
parselet = parslepy.Parselet(rules)

# first URL to start the crawling on
next_url = "http://www.dominorecordco.com/uk/releases/"
while next_url:
    print "fetching", next_url

    # get page content
    html = requests.get(next_url)

    # process with Parslepy
    extracted = parselet.parse_fromstring(html.content)
    pprint.pprint(extracted)

    # do we have a next page to scrape?    
    if "next_page_url" in extracted:
        next_url = extracted["next_page_url"]
        print "next URL to fetch", next_url
    else:
        # this will get us out of the loop
        next_url = None

And you get... an Exception!

next URL to fetch /uk/releases/?p=2
fetching /uk/releases/?p=2
Traceback (most recent call last):
  File "./code/scraper", line 30, in <module>
    html = requests.get(next_url)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 55, in get
    return request('get', url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 324, in request
    prep = req.prepare()
  File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 222, in prepare
    p.prepare_url(self.url, self.params)
  File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 291, in prepare_url
    raise MissingSchema("Invalid URL %r: No schema supplied" % url)
requests.exceptions.MissingSchema: Invalid URL u'/uk/releases/?p=2': No schema supplied
Finished run: 2013-07-17 14:46:18+00:00 Exit code: 1

Indeed, that's because /uk/releases/?p=2 is not a full URL; we need to combine it with the current page address. For that you can use the handy urlparse.urljoin() (see http://docs.python.org/2/library/urlparse.html#urlparse.urljoin).
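
For example (a quick sketch of urljoin()'s behavior, not part of the scraper itself):

import urlparse

# combine the current page address with the relative link extracted above
print urlparse.urljoin("http://www.dominorecordco.com/uk/releases/",
                       "/uk/releases/?p=2")
# -> http://www.dominorecordco.com/uk/releases/?p=2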

And now the full scraper

#!/usr/bin/env python
import scraperwiki
import requests
import parslepy
import pprint
import urlparse

rules = {
    "releases(.leftDoubleCol div[id^=article_])": [
        {
            "release_image_url":           "div a img @src",
            "release_bigimage_url":        "div a @href",
            "release_name":                "span",
            "artist_name":                 "h4 span.upperh4 a",
            "artist_page_url":             "h4 span.upperh4 a @href",
            "release_artist_catalog_date": "h4 span.upperh4",
            "release_more_info_link":      ".//a[contains(., 'View full information?')]/@href",
        }
    ],
    "next_page_url": "#paginate a:last-of-type @href",
}
parselet = parslepy.Parselet(rules)

# first URL to start the crawling on
next_url = "http://www.dominorecordco.com/uk/releases/"
while next_url:
    print "fetching", next_url
    current_url = next_url

    # get page content
    html = requests.get(next_url)

    # process with Parslepy
    extracted = parselet.parse_fromstring(html.content)
    # when debugging
    #pprint.pprint(extracted)
    # saving into database
    for release in extracted.get("releases"):
        scraperwiki.sql.save(unique_keys=['release_artist_catalog_date'], data=release)

    # do we have a next page to scrape?    
    if "next_page_url" in extracted:
        next_url = urlparse.urljoin(
            current_url,
            extracted["next_page_url"])
        
        # on the last page, the last link is the same as the current URL
        # this means we're at the end of the releases pages
        if next_url == current_url:
            break
        
        print "next URL to fetch", next_url
    else:
        # this will get us out of the loop
        next_url = None

In the "View in a table" tool, you'll have just over 1000 album releases by Domino Records UK. Of course, there is still room for improvements, for example making all the URLs for images and detail pages complete with http://www.dominorecordco.com but I hope you understand how to use parslepy now.

Comments, suggestions and bug reports all welcome! (https://github.com/redapple/parslepy and https://github.com/redapple/parslepy/issues/new)

Enjoy!