So what am I using to scrape the parlamento.pt site?
Well, first let me tell you that I'm mostly a Microsoft developer. I develop web applications with Visual Basic and (less often) C# in Visual Studio, along with MS SQL Server, but I've also played with PHP and MySQL and done a few projects with them.
Since the Hacklaviva guys are full-blown GPLers I decided to go the FLOSS way and use PHP.
The complete picture of the process looks something like the image below.
But in this post I'm concentrating on the scraping part of it.
It's worth mentioning that my strategy for getting the information from the parlamento.pt site into my database is a one-step access/parse/save.
I don't have one set of scripts that fetches each page and stores it on my server, and then another set that parses the stored copy and saves it to the database.
I have a PHP file (application) that accesses the page (or several instances of that page) I want to extract information from.
To access the pages, I use cURL. The URLs are (somewhat) well known, since it's usually the same file with a different id.
I'm presenting myself as Googlebot, but I try to be gentle on the server by limiting access to one page per second.
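To make that fetch loop concrete, here is a minimal sketch of it, not my actual script. The URL pattern, the function names (`buildUrl`, `fetchPage`, `scrapeIds`) and the 30-second timeout are all made up for illustration; only the cURL calls themselves are the standard PHP ones.

```php
<?php
// Hypothetical URL pattern: the real parlamento.pt paths are not shown in this post.
const URL_PATTERN = 'https://www.example.org/Detail.aspx?ID=%d';

// Same file, different id.
function buildUrl(int $id): string
{
    return sprintf(URL_PATTERN, $id);
}

// Fetch one page with cURL; returns the body, or null on any failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,     // assumed value
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ($body !== false && $status === 200) ? $body : null;
}

// Visit each id in turn, pausing between requests, and hand every page
// that was fetched successfully to $handle (the parse/save step).
// $fetch can be swapped out, which makes the loop testable offline.
function scrapeIds(array $ids, callable $handle, ?callable $fetch = null, int $delaySeconds = 1): int
{
    $fetch = $fetch ?? 'fetchPage';
    $done  = 0;
    foreach (array_values($ids) as $i => $id) {
        if ($i > 0) {
            sleep($delaySeconds);         // be gentle: wait before the next request
        }
        $html = $fetch(buildUrl($id));
        if ($html === null) {
            continue;                     // skip pages that failed to load
        }
        $handle($id, $html);
        $done++;
    }
    return $done;
}
```

Something like `scrapeIds(range(1, 50), function ($id, $html) { /* parse and save */ });` would then walk fifty ids at one page per second, with the parsing and saving happening inside the callback, so nothing is kept on disk in between.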
And to store it in CouchDB, I'm using PHP on Couch.
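The save step could look roughly like the sketch below. The document layout (`source_url`, `scraped_at`, `fields`), the `page-` id prefix and both function names are assumptions for the example, not my real schema. The only PHP on Couch call it relies on is `storeDoc()`, which takes a plain object and throws an exception when the server rejects it.

```php
<?php
// Build the CouchDB document for one scraped page.
// Every field name here is hypothetical; the post does not describe its schema.
function buildDocument(int $id, string $url, array $fields): stdClass
{
    $doc = new stdClass();
    $doc->_id        = 'page-' . $id;   // CouchDB uses _id as the document key
    $doc->source_url = $url;
    $doc->scraped_at = gmdate('c');     // ISO 8601 timestamp, UTC
    $doc->fields     = $fields;         // whatever the parse step extracted
    return $doc;
}

// $client is expected to be a PHP on Couch client, created for example with
//   $client = new couchClient('http://localhost:5984', 'parlamento');
// (the server URL and database name are placeholders).
function saveDocument($client, stdClass $doc): bool
{
    try {
        $client->storeDoc($doc);
        return true;
    } catch (Exception $e) {
        // Log and carry on so that one bad page does not stop the whole run.
        error_log('Could not store ' . $doc->_id . ': ' . $e->getMessage());
        return false;
    }
}
```

One thing to keep in mind: CouchDB refuses to overwrite an existing `_id` unless the document carries the current `_rev`, so re-scraping a page with this sketch would end up in the `catch` branch with a conflict rather than silently replacing the old copy.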