Use parslepy with ScraperWiki
Francis et al. at ScraperWiki very kindly agreed to include parslepy as an available package in the "Code in your browser" tool if you code in Python.
Here I will show you how to use parslepy in your scrapers, and hopefully you'll find it as useful and easy to use as I think it is :) Full documentation is here.
As an example, I'll work with the Domino Records UK website http://www.dominorecordco.com/, especially the album releases pages. At the end of the tutorial we should have a list of Domino Records releases, with artist name, album name, catalog number, cover image and release date.
With parslepy, you must know some CSS or XPath syntax. CSS is probably easier for most of us, but sometimes XPath is the only option. A good introduction to CSS selectors can be found here on W3Schools.
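As a quick illustration (with hypothetical element names, not taken from the Domino site), here is the same target expressed in both syntaxes; the rules later in this tutorial use both:
# the links inside an H4 heading within a DIV with id="content"
# (hypothetical markup, just to compare CSS and XPath syntax)
css_selector = "div#content h4 a"
xpath_selector = "//div[@id='content']//h4//a"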
For each piece of information you're interested in on a web page, you should determine the containing HTML element, its class and/or id, so as to be as unambiguous as possible, and translate that into CSS or XPath. You can use your favorite browser developer tool, such as Firefox's Firebug or Chrome's "Inspect element" (or even the quite amazing SelectorGadget).
Then, you decide on an output name for this piece of data, and you have one of your extraction rules for parslepy:
# the page title as a CSS selector: a TITLE tag inside a HEAD tag
{
"page_title": "head title"
}
If I also want to get the logo on Domino's homepage, I would add this rule:
{
"page_title": "head title",
"logo_url": "#logolink img @src"
}
This second rule is a mix of CSS (#logolink img) plus an attribute (@src), with the attribute syntax borrowed from XPath. Indeed, CSS syntax alone cannot point to an attribute, only to tags.
By default, parslepy will extract the text content of a tag, or the attribute's value if you use the @ syntax at the end of your rule selector.
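Here's a minimal illustration of that default behaviour, on a small made-up HTML snippet rather than the Domino site:
import parslepy
# a tiny, made-up document to show text vs. attribute extraction
snippet = '<html><body><a href="/about">About us</a></body></html>'
rules = {
    "link_text": "a",          # no @: the element's text content
    "link_target": "a @href",  # @href: the attribute's value
}
print parslepy.Parselet(rules).parse_fromstring(snippet)
# prints something like: {'link_text': 'About us', 'link_target': '/about'}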
As Francis put it in his tweet about parslepy, the package works well for templated HTML content: content that has been generated by some sort of template giving structure and hierarchy to elements in a web page (sections, headers, lists, etc.).
If you open ScraperWiki's "Code in your browser" tool, you get something like this:
#!/usr/bin/env python
import scraperwiki
import requests
html = requests.get("http://www.flourish.org/blog")
print html.content
# Saving data:
# unique_keys = [ 'id' ]
# data = { 'id':12, 'name':'violet', 'age':7 }
# scraperwiki.sql.save(unique_keys, data)
Let's change that to import parslepy and fetch http://www.dominorecordco.com/
#!/usr/bin/env python
import scraperwiki
import requests
import parslepy
html = requests.get("http://www.dominorecordco.com")
print html.content
Let's throw in our scraping rules and feed them into parslepy to parse the webpage. parslepy's central class is parslepy.Parselet(), which you instantiate with your scraping rules as a dict (more options in the docs).
#!/usr/bin/env python
import scraperwiki
import requests
import parslepy
html = requests.get("http://www.dominorecordco.com")
rules = {
"page_title": "head title",
"logo_url": "#logolink img @src",
}
parselet = parslepy.Parselet(rules)
print parselet.parse_fromstring(html.content)
And, in the console for now, you end up with
{'logo_url': '/resources/sitewide/logo_with_transparency.gif', 'page_title': u'Domino | Home'}
Now, let's get started with the releases page: http://www.dominorecordco.com/uk/releases/. When you look at the source code, the interesting bits seem to be in <div id="content" class="articles">, more accurately in <div class="leftDoubleCol">, where there is a bunch of <div id="article_7406" class="greyborderbottom pad20bottom margin20bottom"> elements corresponding to the individual releases.
Within each of the article div tags, i.e. for each release:
- the first div contains an img, and even an a link to a bigger image file
- the first span tag contains the release name
- in the h4 tag, there's a <span class="upperh4"> with the artist name and a link to the artist's page, and it also contains the catalog number and release date
- and finally, there's a link to a page with more information; the link text is "View full information?"
If you translate that to parslepy rules, you get
{
"release_image_url": ".leftDoubleCol div[id^=article_] div a img @src",
"release_bigimage_url": ".leftDoubleCol div[id^=article_] div a @href",
"release_name": ".leftDoubleCol div[id^=article_] span",
"artist_name": ".leftDoubleCol div[id^=article_] h4 span.upperh4 a",
"artist_page_url": ".leftDoubleCol div[id^=article_] h4 span.upperh4 a @href",
"release_artist_catalog_date": ".leftDoubleCol div[id^=article_] h4 span.upperh4",
"release_more_info_link": ".leftDoubleCol div[id^=article_] a:contains('View full information?') @href",
}
And when you run this in ScraperWiki:
#!/usr/bin/env python
import scraperwiki
import requests
import parslepy
import pprint
html = requests.get("http://www.dominorecordco.com/uk/releases/")
rules = {
"release_image_url": ".leftDoubleCol div[id^=article_] div a img @src",
"release_bigimage_url": ".leftDoubleCol div[id^=article_] div a @href",
"release_name": ".leftDoubleCol div[id^=article_] span",
"artist_name": ".leftDoubleCol div[id^=article_] h4 span.upperh4 a",
"artist_page_url": ".leftDoubleCol div[id^=article_] h4 span.upperh4 a @href",
"release_artist_catalog_date": ".leftDoubleCol div[id^=article_] h4 span.upperh4",
"release_more_info_link": ".leftDoubleCol div[id^=article_] a:contains('View full information?') @href",
}
parselet = parslepy.Parselet(rules)
pprint.pprint(parselet.parse_fromstring(html.content))
You get this:
{'artist_name': u'Quasi',
'artist_page_url': '/artists/quasi/',
'release_artist_catalog_date': u'Quasi | WIG312 | Released: 30/09/13',
'release_bigimage_url': '/images/artists/quasi/quasi_molecity.jpg',
'release_image_url': '/images/artists/quasi/150_150/quasi_molecity.jpg',
'release_more_info_link': '/uk/albums/09-07-13/mole-city/',
'release_name': u'MOLE CITY'}
But there's a lot of duplicated .leftDoubleCol div[id^=article_] in all these rules. Can't we clean this up a little?
Sure we can, with a little bit of rule nesting: the common selector part is attached to a new, higher-level "nesting" key, within brackets; it becomes the selector "scope" for the now lower-level subsequent rules. Let's call this nesting key release:
rules = {
"release(.leftDoubleCol div[id^=article_])": {
"release_image_url": "div a img @src",
"release_bigimage_url": "div a @href",
"release_name": "span",
"artist_name": "h4 span.upperh4 a",
"artist_page_url": "h4 span.upperh4 a @href",
"release_artist_catalog_date": "h4 span.upperh4",
"release_more_info_link": ".//a[contains(., 'View full information?')]/@href",
}
}
And now you get this, with the same fields, but inside a "release" key in the output:
{'release': {'artist_name': u'Quasi',
'artist_page_url': '/artists/quasi/',
'release_artist_catalog_date': u'Quasi | WIG312 | Released: 30/09/13',
'release_bigimage_url': '/images/artists/quasi/quasi_molecity.jpg',
'release_image_url': '/images/artists/quasi/150_150/quasi_molecity.jpg',
'release_more_info_link': '/uk/albums/09-07-13/mole-city/',
'release_name': u'MOLE CITY'}}
You'll have noticed that I changed the last rule to use an XPath expression, because a:contains(...) throws (non-fatal?) exceptions like this (I don't know why):
Exception UnicodeDecodeError: UnicodeDecodeError('utf8', '\xc0n\xb4\x01', 0, 1, 'invalid start byte') in 'lxml.etree._xpath_function_call' ignored
But... wait a minute, weren't we supposed to get ALL releases, not just the first one?
Indeed, and to achieve that, you just need to add extra square brackets around your nested rules. Basically we want to go through all the .leftDoubleCol div[id^=article_] elements and extract data with the previous 7 rules for each of these segments of the page (and let's rename the nesting key to the plural releases):
rules = {
"releases(.leftDoubleCol div[id^=article_])": [
{
"release_image_url": "div a img @src",
"release_bigimage_url": "div a @href",
"release_name": "span",
"artist_name": "h4 span.upperh4 a",
"artist_page_url": "h4 span.upperh4 a @href",
"release_artist_catalog_date": "h4 span.upperh4",
"release_more_info_link": ".//a[contains(., 'View full information?')]/@href",
}
]
}
And now you get this as output (I cut out some of it for readability):
{'releases': [{'artist_name': u'Quasi',
'artist_page_url': '/artists/quasi/',
'release_artist_catalog_date': u'Quasi | WIG312 | Released: 30/09/13',
'release_bigimage_url': '/images/artists/quasi/quasi_molecity.jpg',
'release_image_url': '/images/artists/quasi/150_150/quasi_molecity.jpg',
'release_more_info_link': '/uk/albums/09-07-13/mole-city/',
'release_name': u'MOLE CITY'},
{'artist_name': u'Arctic Monkeys',
'artist_page_url': '/artists/arctic-monkeys/',
'release_artist_catalog_date': u'Arctic Monkeys | WIG317 | Released: 09/09/13',
'release_bigimage_url': '/images/artists/arctic_monkeys/am_am_final_packshot.jpg',
'release_image_url': '/images/artists/arctic_monkeys/150_150/am_am_final_packshot.jpg',
'release_more_info_link': '/uk/albums/21-06-13/a-m/',
'release_name': u"'AM'"},
...
{'artist_name': u'Various Artists',
'artist_page_url': '/artists/various-artists/',
'release_artist_catalog_date': u'Various Artists | RUG508 | Released: 03/12/12',
'release_bigimage_url': '/images/artists/various_artists/Domino_Remixes.jpg',
'release_image_url': '/images/artists/various_artists/70_70/Domino_Remixes.jpg',
'release_more_info_link': '/uk/singles/02-11-12/motion-sickness-remixes-vol-2/',
'release_name': u'MOTION SICKNESS REMIXES VOL 2.'}]}
Now we can even save that into ScraperWiki's database (using scraperwiki.sql.save(), see https://github.com/scraperwiki/scraperwiki-python#saving-data), looping over the values under the releases key in the extracted dict and using the release_artist_catalog_date value as the unique key:
#!/usr/bin/env python
import scraperwiki
import requests
import parslepy
html = requests.get("http://www.dominorecordco.com/uk/releases/")
rules = {
"releases(.leftDoubleCol div[id^=article_])": [
{
"release_image_url": "div a img @src",
"release_bigimage_url": "div a @href",
"release_name": "span",
"artist_name": "h4 span.upperh4 a",
"artist_page_url": "h4 span.upperh4 a @href",
"release_artist_catalog_date": "h4 span.upperh4",
"release_more_info_link": ".//a[contains(., 'View full information?')]/@href",
}
]
}
parselet = parslepy.Parselet(rules)
extracted = parselet.parse_fromstring(html.content)
for release in extracted.get("releases"):
    scraperwiki.sql.save(unique_keys=['release_artist_catalog_date'], data=release)
What about the other pages?
At the bottom of the page, there are links to older releases, page 1 to page 10, and a link called "Next". Let's point at that with a bit of CSS selector (it's the last link inside a div with id=paginate):
{ "next_page_url": "#paginate a:last-of-type @href"}
You'll get something like
{'next_page_url': '/uk/releases/?p=2',
'releases': [{'artist_name': u'Quasi',
...
Now, let's write a little loop to fetch each page one at a time
#!/usr/bin/env python
import scraperwiki
import requests
import parslepy
import pprint
rules = {
"releases(.leftDoubleCol div[id^=article_])": [
{
"release_image_url": "div a img @src",
"release_bigimage_url": "div a @href",
"release_name": "span",
"artist_name": "h4 span.upperh4 a",
"artist_page_url": "h4 span.upperh4 a @href",
"release_artist_catalog_date": "h4 span.upperh4",
"release_more_info_link": ".//a[contains(., 'View full information?')]/@href",
}
],
"next_page_url": "#paginate a:last-of-type @href",
}
parselet = parslepy.Parselet(rules)
# first URL to start the crawling on
next_url = "http://www.dominorecordco.com/uk/releases/"
while next_url:
    print "fetching", next_url
    # get page content
    html = requests.get(next_url)
    # process with Parslepy
    extracted = parselet.parse_fromstring(html.content)
    pprint.pprint(extracted)
    # do we have a next page to scrape?
    if "next_page_url" in extracted:
        next_url = extracted["next_page_url"]
        print "next URL to fetch", next_url
    else:
        # this will get us out of the loop
        next_url = None
And you get... an Exception!
next URL to fetch /uk/releases/?p=2
fetching /uk/releases/?p=2
Traceback (most recent call last):
File "./code/scraper", line 30, in <module>
html = requests.get(next_url)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 324, in request
prep = req.prepare()
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 222, in prepare
p.prepare_url(self.url, self.params)
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 291, in prepare_url
raise MissingSchema("Invalid URL %r: No schema supplied" % url)
requests.exceptions.MissingSchema: Invalid URL u'/uk/releases/?p=2': No schema supplied
Finished run: 2013-07-17 14:46:18+00:00 Exit code: 1
Indeed, that's because /uk/releases/?p=2 is not a full URL; we need to combine it with the current page address. For that you can use the handy urlparse.urljoin() (see http://docs.python.org/2/library/urlparse.html#urlparse.urljoin).
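For example, urljoin() resolves the relative link against the URL of the page it was found on:
import urlparse
print urlparse.urljoin("http://www.dominorecordco.com/uk/releases/", "/uk/releases/?p=2")
# http://www.dominorecordco.com/uk/releases/?p=2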
And now the full scraper
#!/usr/bin/env python
import scraperwiki
import requests
import parslepy
import pprint
import urlparse
rules = {
"releases(.leftDoubleCol div[id^=article_])": [
{
"release_image_url": "div a img @src",
"release_bigimage_url": "div a @href",
"release_name": "span",
"artist_name": "h4 span.upperh4 a",
"artist_page_url": "h4 span.upperh4 a @href",
"release_artist_catalog_date": "h4 span.upperh4",
"release_more_info_link": ".//a[contains(., 'View full information?')]/@href",
}
],
"next_page_url": "#paginate a:last-of-type @href",
}
parselet = parslepy.Parselet(rules)
# first URL to start the crawling on
next_url = "http://www.dominorecordco.com/uk/releases/"
while next_url:
    print "fetching", next_url
    current_url = next_url
    # get page content
    html = requests.get(next_url)
    # process with Parslepy
    extracted = parselet.parse_fromstring(html.content)
    # when debugging
    #pprint.pprint(extracted)
    # saving into database
    for release in extracted.get("releases"):
        scraperwiki.sql.save(unique_keys=['release_artist_catalog_date'], data=release)
    # do we have a next page to scrape?
    if "next_page_url" in extracted:
        next_url = urlparse.urljoin(
            current_url,
            extracted["next_page_url"])
        # on the last page, the last link is the same as the current URL
        # this means we're at the end of the releases pages
        if next_url == current_url:
            break
        print "next URL to fetch", next_url
    else:
        # this will get us out of the loop
        next_url = None
In the "View in a table" tool, you'll have just over 1000 album releases by Domino Records UK.
Of course, there is still room for improvement, for example making all the image and detail-page URLs absolute by combining them with http://www.dominorecordco.com (see the sketch below), but I hope you now understand how to use parslepy.
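A minimal sketch of that last improvement, reusing urlparse.urljoin() and the field names from the rules above (just one possible way to do it):
# make the extracted relative URLs absolute before saving them
for release in extracted.get("releases"):
    for key in ("release_image_url", "release_bigimage_url",
                "artist_page_url", "release_more_info_link"):
        if release.get(key):
            release[key] = urlparse.urljoin(current_url, release[key])
    scraperwiki.sql.save(unique_keys=['release_artist_catalog_date'], data=release)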
Comments, suggestions and bug reports all welcome! (https://github.com/redapple/parslepy and https://github.com/redapple/parslepy/issues/new)
Enjoy!