{"id":2661,"date":"2010-12-31T14:05:17","date_gmt":"2010-12-31T14:05:17","guid":{"rendered":"http:\/\/osmeusapontamentos.com\/?p=2661"},"modified":"2010-12-31T14:05:17","modified_gmt":"2010-12-31T14:05:17","slug":"tools-and-libs-for-scraping-parlamento-pt","status":"publish","type":"post","link":"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/","title":{"rendered":"Tools and Libs for Scraping parlamento.pt"},"content":{"rendered":"<p>So what am I using to scrape the <a href=\"http:\/\/www.parlamento.pt\/\">parlamento.pt<\/a> site?<br \/>\nWell, first let me tell you that I&#8217;m mostly a Microsoft developer. I develop web applications with Visual Basic and, to a lesser extent, C# (using Visual Studio), along with MS SQL Server, but I&#8217;ve also played with PHP and MySQL and done a few projects with them.<br \/>\nSince the <a href=\"http:\/\/hacklaviva.net\">Hacklaviva<\/a> guys are full-blown <a href=\"http:\/\/en.wikipedia.org\/wiki\/GNU_General_Public_License\">GPL<\/a>ers, I decided to go the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Free_and_open_source_software\">FLOSS<\/a> way and use PHP.<\/p>\n<p>The complete picture of the process looks something like the image below.<br \/>\n<img decoding=\"async\" src=\"https:\/\/osmeusapontamentos.com\/img\/data_to_data.png\" alt=\"\" width=\"650\" \/><br \/>\nIn this post, though, I&#8217;m concentrating on the scraping part of it.<br \/>\nIt is worth mentioning that the strategy I use to get the information from the parlamento.pt site into my database is a one-step access\/parse\/save.<br \/>\nI don&#8217;t have one set of scripts that accesses each page and stores it on my server and another set that parses it and saves it to the database.<br \/>\nInstead, I have a single PHP file (application) that accesses the page (or several instances of that page) I want to extract information from.<\/p>\n<p>To access the pages I use <a href=\"http:\/\/php.net\/manual\/en\/book.curl.php\">cURL<\/a>. The URLs are (somewhat) well known, as it&#8217;s usually the same file with a 
different id.<br \/>\nI present myself as Googlebot, but I try to be gentle on the server by limiting access to one page per second.<br \/>\n<img decoding=\"async\" src=\"https:\/\/osmeusapontamentos.com\/img\/opendata_curl.png\" alt=\"\" \/><\/p>\n<p>To get the information I want, I use the <a href=\"http:\/\/php.net\/manual\/en\/class.domdocument.php\">DOMDocument<\/a> object, to which I apply an <a href=\"http:\/\/www.php.net\/manual\/en\/class.domxpath.php\">XPath<\/a> expression.<br \/>\n<img decoding=\"async\" src=\"https:\/\/osmeusapontamentos.com\/img\/opendata_domxpath.png\" alt=\"\" \/><\/p>\n<p>And to store it in CouchDB I&#8217;m using <a href=\"https:\/\/github.com\/dready92\/PHP-on-Couch\/tree\/master\">PHP on Couch<\/a>.<br \/>\n<img decoding=\"async\" src=\"https:\/\/osmeusapontamentos.com\/img\/opendata_couchdbstoring.png\" alt=\"\" \/><\/p>\n","protected":false},"excerpt":{"rendered":"<p>So what am I using to scrape the parlamento.pt site? Well, first let me tell you that I&#8217;m mostly a Microsoft developer. I develop web applications with Visual Basic and, to a lesser extent, C# (using Visual Studio), along with MS SQL Server, but I&#8217;ve also played with PHP and MySQL and done a few projects with them. 
Since the Hacklaviva [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[181,290,467,592,644],"class_list":["post-2661","post","type-post","status-publish","format-standard","hentry","category-ler-ver-ouvir-passear","tag-dados-publicos","tag-hacklaviva","tag-open-data","tag-scraping","tag-transparencia"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Tools and Libs for Scraping parlamento.pt - Os Meus Apontamentos<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/\" \/>\n<meta property=\"og:locale\" content=\"pt_PT\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Tools and Libs for Scraping parlamento.pt - Os Meus Apontamentos\" \/>\n<meta property=\"og:description\" content=\"So what am I using to scrape the parlamento.pt site? Well, first let me tell you that I&#8217;m mostly a Microsoft developer. I develop web applications with Visual Basic and, to a lesser extent, C# (using Visual Studio), along with MS SQL Server, but I&#8217;ve also played with PHP and MySQL and done a few projects with them. 
Since the Hacklaviva [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/\" \/>\n<meta property=\"og:site_name\" content=\"Os Meus Apontamentos\" \/>\n<meta property=\"article:published_time\" content=\"2010-12-31T14:05:17+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/osmeusapontamentos.com\/img\/data_to_data.png\" \/>\n<meta name=\"author\" content=\"Vitor Silva\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@vitorsilva\" \/>\n<meta name=\"twitter:site\" content=\"@vitorsilva\" \/>\n<meta name=\"twitter:label1\" content=\"Escrito por\" \/>\n\t<meta name=\"twitter:data1\" content=\"Vitor Silva\" \/>\n\t<meta name=\"twitter:label2\" content=\"Tempo estimado de leitura\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minuto\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/index.php\\\/2010\\\/12\\\/31\\\/tools-and-libs-for-scraping-parlamento-pt\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/index.php\\\/2010\\\/12\\\/31\\\/tools-and-libs-for-scraping-parlamento-pt\\\/\"},\"author\":{\"name\":\"Vitor Silva\",\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/#\\\/schema\\\/person\\\/d508df9c3ffc8b4e64a18dbf0ba18dd8\"},\"headline\":\"Tools and Libs for Scraping 
parlamento.pt\",\"datePublished\":\"2010-12-31T14:05:17+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/index.php\\\/2010\\\/12\\\/31\\\/tools-and-libs-for-scraping-parlamento-pt\\\/\"},\"wordCount\":255,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/#\\\/schema\\\/person\\\/d508df9c3ffc8b4e64a18dbf0ba18dd8\"},\"image\":{\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/index.php\\\/2010\\\/12\\\/31\\\/tools-and-libs-for-scraping-parlamento-pt\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/osmeusapontamentos.com\\\/img\\\/data_to_data.png\",\"keywords\":[\"dados publicos\",\"hacklaviva\",\"open data\",\"scraping\",\"transpar\u00eancia\"],\"articleSection\":[\"Ler\\\/ Ver\\\/ Ouvir\\\/ Passear\"],\"inLanguage\":\"pt-PT\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/osmeusapontamentos.com\\\/index.php\\\/2010\\\/12\\\/31\\\/tools-and-libs-for-scraping-parlamento-pt\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/index.php\\\/2010\\\/12\\\/31\\\/tools-and-libs-for-scraping-parlamento-pt\\\/\",\"url\":\"https:\\\/\\\/osmeusapontamentos.com\\\/index.php\\\/2010\\\/12\\\/31\\\/tools-and-libs-for-scraping-parlamento-pt\\\/\",\"name\":\"Tools and Libs for Scraping parlamento.pt - Os Meus 
Apontamentos\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/index.php\\\/2010\\\/12\\\/31\\\/tools-and-libs-for-scraping-parlamento-pt\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/index.php\\\/2010\\\/12\\\/31\\\/tools-and-libs-for-scraping-parlamento-pt\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/osmeusapontamentos.com\\\/img\\\/data_to_data.png\",\"datePublished\":\"2010-12-31T14:05:17+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/index.php\\\/2010\\\/12\\\/31\\\/tools-and-libs-for-scraping-parlamento-pt\\\/#breadcrumb\"},\"inLanguage\":\"pt-PT\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/osmeusapontamentos.com\\\/index.php\\\/2010\\\/12\\\/31\\\/tools-and-libs-for-scraping-parlamento-pt\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"pt-PT\",\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/index.php\\\/2010\\\/12\\\/31\\\/tools-and-libs-for-scraping-parlamento-pt\\\/#primaryimage\",\"url\":\"https:\\\/\\\/osmeusapontamentos.com\\\/img\\\/data_to_data.png\",\"contentUrl\":\"https:\\\/\\\/osmeusapontamentos.com\\\/img\\\/data_to_data.png\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/index.php\\\/2010\\\/12\\\/31\\\/tools-and-libs-for-scraping-parlamento-pt\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"In\u00edcio\",\"item\":\"https:\\\/\\\/osmeusapontamentos.com\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Tools and Libs for Scraping parlamento.pt\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/#website\",\"url\":\"https:\\\/\\\/osmeusapontamentos.com\\\/\",\"name\":\"Os Meus 
Apontamentos\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/#\\\/schema\\\/person\\\/d508df9c3ffc8b4e64a18dbf0ba18dd8\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/osmeusapontamentos.com\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"pt-PT\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/osmeusapontamentos.com\\\/#\\\/schema\\\/person\\\/d508df9c3ffc8b4e64a18dbf0ba18dd8\",\"name\":\"Vitor Silva\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"pt-PT\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/f81f58ad909e8a213ab0a690f6ed65e5c0e0e2274bf35ac49ff31d7988d483ce?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/f81f58ad909e8a213ab0a690f6ed65e5c0e0e2274bf35ac49ff31d7988d483ce?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/f81f58ad909e8a213ab0a690f6ed65e5c0e0e2274bf35ac49ff31d7988d483ce?s=96&d=mm&r=g\",\"caption\":\"Vitor Silva\"},\"logo\":{\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/f81f58ad909e8a213ab0a690f6ed65e5c0e0e2274bf35ac49ff31d7988d483ce?s=96&d=mm&r=g\"},\"sameAs\":[\"https:\\\/\\\/www.linkedin.com\\\/in\\\/vitormrsilva\",\"https:\\\/\\\/x.com\\\/vitorsilva\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Tools and Libs for Scraping parlamento.pt - Os Meus Apontamentos","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/","og_locale":"pt_PT","og_type":"article","og_title":"Tools and Libs for Scraping parlamento.pt - Os Meus Apontamentos","og_description":"So what am I using to scrape the parlamento.pt site? Well, first let me tell you that I&#8217;m mostly a Microsoft developer. I develop web applications with Visual Basic and, to a lesser extent, C# (using Visual Studio), along with MS SQL Server, but I&#8217;ve also played with PHP and MySQL and done a few projects with them. Since the Hacklaviva [&hellip;]","og_url":"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/","og_site_name":"Os Meus Apontamentos","article_published_time":"2010-12-31T14:05:17+00:00","og_image":[{"url":"https:\/\/osmeusapontamentos.com\/img\/data_to_data.png","type":"","width":"","height":""}],"author":"Vitor Silva","twitter_card":"summary_large_image","twitter_creator":"@vitorsilva","twitter_site":"@vitorsilva","twitter_misc":{"Escrito por":"Vitor Silva","Tempo estimado de leitura":"1 minuto"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/#article","isPartOf":{"@id":"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/"},"author":{"name":"Vitor Silva","@id":"https:\/\/osmeusapontamentos.com\/#\/schema\/person\/d508df9c3ffc8b4e64a18dbf0ba18dd8"},"headline":"Tools and Libs for Scraping 
parlamento.pt","datePublished":"2010-12-31T14:05:17+00:00","mainEntityOfPage":{"@id":"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/"},"wordCount":255,"commentCount":0,"publisher":{"@id":"https:\/\/osmeusapontamentos.com\/#\/schema\/person\/d508df9c3ffc8b4e64a18dbf0ba18dd8"},"image":{"@id":"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/#primaryimage"},"thumbnailUrl":"https:\/\/osmeusapontamentos.com\/img\/data_to_data.png","keywords":["dados publicos","hacklaviva","open data","scraping","transpar\u00eancia"],"articleSection":["Ler\/ Ver\/ Ouvir\/ Passear"],"inLanguage":"pt-PT","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/","url":"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/","name":"Tools and Libs for Scraping parlamento.pt - Os Meus 
Apontamentos","isPartOf":{"@id":"https:\/\/osmeusapontamentos.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/#primaryimage"},"image":{"@id":"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/#primaryimage"},"thumbnailUrl":"https:\/\/osmeusapontamentos.com\/img\/data_to_data.png","datePublished":"2010-12-31T14:05:17+00:00","breadcrumb":{"@id":"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/#breadcrumb"},"inLanguage":"pt-PT","potentialAction":[{"@type":"ReadAction","target":["https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/"]}]},{"@type":"ImageObject","inLanguage":"pt-PT","@id":"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/#primaryimage","url":"https:\/\/osmeusapontamentos.com\/img\/data_to_data.png","contentUrl":"https:\/\/osmeusapontamentos.com\/img\/data_to_data.png"},{"@type":"BreadcrumbList","@id":"https:\/\/osmeusapontamentos.com\/index.php\/2010\/12\/31\/tools-and-libs-for-scraping-parlamento-pt\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"In\u00edcio","item":"https:\/\/osmeusapontamentos.com\/"},{"@type":"ListItem","position":2,"name":"Tools and Libs for Scraping parlamento.pt"}]},{"@type":"WebSite","@id":"https:\/\/osmeusapontamentos.com\/#website","url":"https:\/\/osmeusapontamentos.com\/","name":"Os Meus 
Apontamentos","description":"","publisher":{"@id":"https:\/\/osmeusapontamentos.com\/#\/schema\/person\/d508df9c3ffc8b4e64a18dbf0ba18dd8"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/osmeusapontamentos.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"pt-PT"},{"@type":["Person","Organization"],"@id":"https:\/\/osmeusapontamentos.com\/#\/schema\/person\/d508df9c3ffc8b4e64a18dbf0ba18dd8","name":"Vitor Silva","image":{"@type":"ImageObject","inLanguage":"pt-PT","@id":"https:\/\/secure.gravatar.com\/avatar\/f81f58ad909e8a213ab0a690f6ed65e5c0e0e2274bf35ac49ff31d7988d483ce?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/f81f58ad909e8a213ab0a690f6ed65e5c0e0e2274bf35ac49ff31d7988d483ce?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f81f58ad909e8a213ab0a690f6ed65e5c0e0e2274bf35ac49ff31d7988d483ce?s=96&d=mm&r=g","caption":"Vitor 
Silva"},"logo":{"@id":"https:\/\/secure.gravatar.com\/avatar\/f81f58ad909e8a213ab0a690f6ed65e5c0e0e2274bf35ac49ff31d7988d483ce?s=96&d=mm&r=g"},"sameAs":["https:\/\/www.linkedin.com\/in\/vitormrsilva","https:\/\/x.com\/vitorsilva"]}]}},"_links":{"self":[{"href":"https:\/\/osmeusapontamentos.com\/index.php\/wp-json\/wp\/v2\/posts\/2661","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/osmeusapontamentos.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/osmeusapontamentos.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/osmeusapontamentos.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/osmeusapontamentos.com\/index.php\/wp-json\/wp\/v2\/comments?post=2661"}],"version-history":[{"count":0,"href":"https:\/\/osmeusapontamentos.com\/index.php\/wp-json\/wp\/v2\/posts\/2661\/revisions"}],"wp:attachment":[{"href":"https:\/\/osmeusapontamentos.com\/index.php\/wp-json\/wp\/v2\/media?parent=2661"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/osmeusapontamentos.com\/index.php\/wp-json\/wp\/v2\/categories?post=2661"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/osmeusapontamentos.com\/index.php\/wp-json\/wp\/v2\/tags?post=2661"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}