Because scraping is the boring part of exploring data, I tried to find the easiest, least time-consuming, yet maintainable set of tools for the job.
Since the URLs are (somewhat) well known, I was able to skip the crawler/URL-discovery part of this project: I just curl(ed) the page I wanted and then parsed it to get the information I needed.
Fortunately the parlamento.pt site is built in such a way that I don’t have to do the usual regexp or string-manipulation magic to get what I want, because every element of the page has a unique id I can reference. A simple XPath expression gives me exactly what I need.
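A minimal sketch of that parse step, assuming a made-up element id ("sessionTitle"; the real ids on parlamento.pt differ) and an inlined stand-in for the page that curl would fetch:

```python
import xml.etree.ElementTree as ET

# Stand-in for the HTML that curl would download; inlined so the
# example is self-contained. The id "sessionTitle" is hypothetical.
page = """
<html>
  <body>
    <div id="header">Assembleia da Republica</div>
    <span id="sessionTitle">Sessao de 2010-05-12</span>
  </body>
</html>
"""

root = ET.fromstring(page)
# One XPath expression keyed on the unique id -- no regexp magic needed.
title = root.find(".//*[@id='sessionTitle']").text
print(title)  # Sessao de 2010-05-12
```

On real pages a full HTML parser (lxml, BeautifulSoup) would replace `ET.fromstring`, since real-world HTML is rarely well-formed XML, but the XPath-by-id idea is the same.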
At this point I had the information I needed. All I had to do was save it somewhere. My early experiments used the traditional relational database approach (in this case a MySQL database) that I work with every day. This is not only tedious but also a very closed approach, because all I get is another silo of information (a.k.a. a table). If I want to make it available to users, I either have to create database users, build some database interface or API for others to use, or go the data-dump route.
Nevertheless, it stayed that way for a couple of months, until I tried CouchDB.
This happened during the hackathon. I went to Lisbon to meet other folks who seemed interested in this open data thing, and these guys (bpedro, ruidlopes, jneves and others; sorry, I didn’t catch your names) were really into CouchDB, so I decided to give it a try.
And I think it’s the best solution so far. What I really like is the create-fields-as-you-go approach, the revisions, and especially the JSON output that you get for free. I think it’s really appropriate for what I’m doing… (to be continued)
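To make the "fields as you go" point concrete, here is a sketch of how a scraped record evolves as a schema-free JSON document. The document shape and field names are illustrative, not the actual data model:

```python
import json

# First version of a scraped record: only the fields I had at the time.
# The _id is chosen by me; CouchDB would add a _rev on each update.
doc = {"_id": "session-2010-05-12", "type": "session"}

# Later the page yields a new piece of data; just add the field,
# no ALTER TABLE, no migration.
doc["deputies_present"] = 218

# This is also what any consumer would get back: JSON for free.
output = json.dumps(doc, sort_keys=True)
print(output)
```

In practice the document is PUT to CouchDB over HTTP (e.g. `curl -X PUT http://localhost:5984/mydb/session-2010-05-12 -d @doc.json`), and the response carries the new `_rev`, which is how the revision tracking works.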