Scraping parlamento.pt

Because scraping is the boring part of exploring data, I tried to find the easiest, least time-consuming, yet easy to maintain set of tools to do it.
Since the URLs are (somewhat) well known, I was able to skip the crawler and URL discovery part of this project: I just curl(ed) the page that I wanted and then parsed it to get the information I needed.

Fortunately the parlamento.pt site is built in such a way that I don’t have to do the usual regexp or string manipulation magic to get what I want, as all the elements of the page have a unique id that I can reference. I just use a simple XPath expression and I have what I need.
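As a rough sketch of that fetch-and-parse step (not the exact code used for this project), the snippet below looks up elements by their id with an XPath expression. The ids `lblNome` and `lblPartido`, the sample markup, and the helper names are all hypothetical. It uses only the Python standard library, whose XML parser needs well-formed markup; a real page would probably call for an HTML-tolerant parser instead.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def fetch_page(url):
    """Download a page and return its body as text (the "curl" step)."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def text_by_id(markup, element_id):
    """Return the text of the element with the given id, or None if absent."""
    root = ET.fromstring(markup)
    # XPath: any descendant element whose id attribute equals element_id.
    node = root.find(f".//*[@id='{element_id}']")
    if node is None:
        return None
    return "".join(node.itertext()).strip()


# Hypothetical, well-formed stand-in for a downloaded page.
SAMPLE = """<html><body>
<span id="lblNome">Example Name</span>
<span id="lblPartido">Example Party</span>
</body></html>"""

# Map the field names we want to the (assumed) element ids on the page.
FIELDS = {"name": "lblNome", "party": "lblPartido"}

record = {field: text_by_id(SAMPLE, element_id)
          for field, element_id in FIELDS.items()}
print(record)
```

With a live page, `fetch_page(url)` would supply the markup in place of `SAMPLE`, provided the page parses as well-formed XML.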

At this point I had the information I needed. All I had to do was save it somewhere. My early experiments used the traditional relational database approach (in this case a MySQL database) that I work with every day. This is not only tedious but also a very closed approach, because all I get is another information silo (a.k.a. a table)… if I want to make it available to users I either have to create database users, build some database interface or API for others to use, or go down the data dump road.

Nevertheless, it stayed that way for a couple of months until I tried CouchDB.
This happened during the hackathon. I went to Lisbon to get to know other folks who seemed interested in this open data thing, and these guys (bpedro, ruidlopes, jneves and others, sorry I didn’t get your names) were really into CouchDB, so I decided to give it a try.

And I think it’s the best solution so far. What I really like is being able to create “fields” as you go, the revisions, and especially the JSON output that you get for free. I think it’s really appropriate for what I’m doing… (to be continued)
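To illustrate the “fields as you go” point, here is a minimal sketch of how two scraped records with different fields could be prepared for CouchDB's HTTP API, which stores a JSON document with a PUT to `/database/document-id`. The server address, the database name `parlamento`, the document ids, and the field names are assumptions for illustration only. The requests are built but not sent, so the snippet runs without a CouchDB server.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_put_request(server, database, doc_id, document):
    """Build (but do not send) the PUT request that would store one document."""
    url = "{}/{}/{}".format(
        server.rstrip("/"), quote(database, safe=""), quote(doc_id, safe="")
    )
    body = json.dumps(document, ensure_ascii=False).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Two hypothetical records: the second carries an extra field, and no schema
# change is needed to store it next to the first.
first = {"name": "Example Name", "party": "Example Party"}
second = {
    "name": "Another Name",
    "party": "Example Party",
    "constituency": "Example District",
}

SERVER = "http://localhost:5984"  # assumed local CouchDB address
requests_to_send = [
    build_put_request(SERVER, "parlamento", "deputy-1", first),
    build_put_request(SERVER, "parlamento", "deputy-2", second),
]

for request in requests_to_send:
    print(request.get_method(), request.full_url)
```

Against a running server, each request could be sent with `urllib.request.urlopen(request)`. A later GET on the same URL returns the stored document as JSON, including the `_id` and `_rev` fields that CouchDB adds.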

Published 30 December 2010 in Ler/ Ver/ Ouvir/ Passear by Vitor Silva

Tags: public data, hacklaviva, open data, scraping, transparency

Comments

3 comments on “Scraping parlamento.pt”

Luis Confraria

30 December 2010

Hi there,
I’ve done a little experiment with parlamento.pt a few months ago: http://agorasou.eu/parlamento/
Used YQL, and wrote some custom tables for it: https://github.com/luisbug/Parlamento/tree/master/yqltables
Maybe it can be helpful in some way.

max ogden

31 December 2010

Cool stuff! I’d also encourage checking out http://scraperwiki.com

vitorsilva

31 December 2010

>>luis
nice!

>>max
thxs, I already knew of scraperwiki but still haven’t had the time to really understand what it is
