How Web Scraping Is Revealing Lobbying and Corruption in Peru

How Web Scraping Is Revealing Lobbying and Corruption in Peru (blog.scrapinghub.com)

398 points by bezzi 10 years ago | 72 comments

kilotaras 10 years ago |

I'm from Ukraine, and the biggest success in battling corruption comes from a system for government tenders called Prozorro[1] ("transparently").

It started as a volunteer project, and some projections put savings at around 10% of the total budget once it becomes mandatory in April.

[1] https://github.com/openprocurement/

carlosp420 10 years ago |

Hi there, I am the author of the blog post. I will be happy to answer any question.

gearhart 10 years ago | |

This is great work. Forgive me if I'm missing it, but since the blog post implies you're aggregating and cleaning the data from several lists, is there any way to see the latest additions (RSS etc?) rather than directly searching for individuals?

It would make it more useful for flagging up potential stories, as well as researching stories journalists are already writing.

disclosure: I work for a company that provides real-time data to journalists for story discovery, and I know we'd certainly be interested

carlosp420 10 years ago | | |

I never thought of that, but having an RSS feed is certainly a great idea. I have not done it because the journalists have not requested it. So far they have been asking me for more spiders so that Manolo would include visit records from other Peruvian institutions.

nsoldiac 10 years ago | |

Carlos, great work, congratulations!! I've been studying topics related to technology vs. corruption for a while from here in Berkeley. I have interesting testimonies from contacts who have lived through the post-technology change in government. Peru has a lot of potential in this area. If you need help at any time, I'm happy to support you!

carlosp420 10 years ago | | |

Thank you very much! In Peru there are already several groups of journalists who have teamed up with programmers to do interesting data journalism projects. There's Ojo Publico, Convoca, and IDL reporteros. But even so we can't keep up, there is so much to do!

RodericDay 10 years ago | | |

Likewise. I also live abroad (Canada), but I'm fully willing to support any initiative of this kind.

juandbarraza24 10 years ago | |

Good work! It would be interesting to cross-match the visits with other sources of information (newspapers, WikiLeaks, etc.) over a timeline to reconstruct the whole sequence of events around someone. This would make it possible to identify patterns and their modus operandi.

nsoldiac 10 years ago | |

It would be interesting to see the volume of visits by government office year over year. I have a feeling that periods around elections might look very different. Also would be interested to see distribution color-coded by industry. Mining and contracting should pop up for certain time periods and government agencies.

carlosp420 10 years ago | | |

Yes. So far we have a very simple API: http://manolo.rocks/docs/ With this API, it is possible to download all the structured data kept in Manolo and do such interesting analyses.

Or maybe that can be implemented in Manolo's GUI. It should not be difficult as it is based on Django.
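As a rough illustration of the year-over-year analysis suggested above, here is a minimal Python sketch. It assumes the visit records have already been downloaded and loaded as a list of dictionaries; the field names `office` and `date` and the sample values are hypothetical and are not taken from Manolo's actual API.

```python
from collections import Counter
from datetime import date


def visits_per_office_year(records):
    """Count visits per (office, year).

    `records` is an iterable of dicts with hypothetical keys:
    "office" (name of the government office) and "date" (ISO "YYYY-MM-DD").
    """
    counts = Counter()
    for rec in records:
        year = date.fromisoformat(rec["date"]).year
        counts[(rec["office"], year)] += 1
    return counts


# Made-up sample data, only to show the shape of the input.
sample_records = [
    {"office": "Office A", "date": "2014-03-02"},
    {"office": "Office A", "date": "2014-11-20"},
    {"office": "Office A", "date": "2015-01-15"},
    {"office": "Office B", "date": "2015-06-30"},
]

for (office, year), n in sorted(visits_per_office_year(sample_records).items()):
    print(office, year, n)
```

Comparing the counts for election years against the surrounding years would be one way to check the hunch about election periods.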

sergiotapia 10 years ago | |

Is there any source of information like Peru's for Bolivia? I imagine there's a lot to discover about corruption and influence peddling in Bolivia.

ecthiender 10 years ago |

It's very interesting how helpful tools like these can be for journalists and for transparency in government functions generally.

Probably world changing, when considering that even semi-technical folks can cook up tools to dig into things like this.

I know this tool was built by a developer, but Scrapinghub has a web UI for making scrapers.

unsettledtck 10 years ago | |

Full disclosure, I work for Scrapinghub and the web UI you speak of is Portia - our open source visual web scraper. It's for those who range from non-technical to technical but want a quick way to scrape data. I think it's extremely important to develop tools to democratize the acquisition of data regardless of technical background and skill. Glad you find the article and tool interesting!

ecthiender 10 years ago | | |

Yes, totally agree with you on the great potential of tools for easy data acquisition.

I have personally used Scrapy in the past, and I find it to be a great tool.

Congratulations on your work!

benologist 10 years ago | |

A similar thing happened in Costa Rica -

    “You can’t visit 160,000 people,” she notes. “But 
    you can easily interrogate 160,000 records.”

http://foreignpolicy.com/2015/05/27/the-data-sleuths-of-san-...

xiphias 10 years ago |

Can you draw a covisit graph of people, i.e. who visited the building at the same times as somebody else? The strength of the connections could be visitedboth^2 / ((visitedwithouttheother1 + 1) * (visitedwithouttheother2 + 1))
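A minimal Python sketch of this scoring, under one assumption that the comment leaves open: two people count as visiting "at the same time" when they appear in the same building on the same day. The input format and the names in the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def covisit_strengths(visits):
    """Score each pair of people by how often they visit together.

    `visits` is an iterable of (person, building, day) tuples.
    Assumption: "together" means same building on the same day.
    strength = both^2 / ((only_a + 1) * (only_b + 1))
    """
    slots = defaultdict(set)
    for person, building, day in visits:
        slots[person].add((building, day))

    edges = {}
    for a, b in combinations(sorted(slots), 2):
        both = len(slots[a] & slots[b])
        if both == 0:
            continue  # no shared visits, so no edge in the graph
        only_a = len(slots[a] - slots[b])
        only_b = len(slots[b] - slots[a])
        edges[(a, b)] = both ** 2 / ((only_a + 1) * (only_b + 1))
    return edges


# Made-up sample data.
sample_visits = [
    ("ana", "X", "2015-01-01"),
    ("ana", "X", "2015-01-02"),
    ("ana", "Y", "2015-01-03"),
    ("luis", "X", "2015-01-01"),
    ("luis", "X", "2015-01-02"),
    ("rosa", "Y", "2015-01-03"),
    ("rosa", "Z", "2015-01-04"),
]

for pair, strength in covisit_strengths(sample_visits).items():
    print(pair, round(strength, 3))
```

The +1 terms keep the denominator non-zero for pairs that only ever visit together, and squaring the shared count favours repeated co-visits over a single coincidence.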

alecco 10 years ago |

In other countries, corrupt politicians found out a simple captcha per n items is good enough to defeat analysis.

smarx007 10 years ago | |

https://anti-captcha.com/ & https://rucaptcha.com/ - I think that can be best summarised as "from Russia with love" :)

danso 10 years ago |

FWIW, if you live in the U.S., then you benefit from having such data in great quantity, though I don't think it's sliced-and-diced to near the potential that it has:

Lobbyists have to follow registration procedures, and their official interactions and contributions are posted to an official database that can be downloaded as bulk XML:

http://www.senate.gov/legislative/lobbyingdisc.htm#lobbyingd...

Could they lie? Sure, but in the basic analysis that I've done, they generally don't feel the need to...or rather, things that I would have thought that lobbyists/causes would hide, they don't. Perhaps the consequences of getting caught (e.g. in an investigation that discovers a coverup) far outweigh the annoyance of filing the proper paperwork...having it recorded in an XML database that few people take the time to parse is probably enough obscurity for most situations.

There's also the White House visitor database, which does have some outright omissions, but still contains valuable information if you know how to filter the columns:

https://www.whitehouse.gov/briefing-room/disclosures/visitor...

But it's also a case (as it is with most data) where having some political knowledge is almost as important as being good at data-wrangling. For example, it's trivial to discover that Rahm Emanuel had few visitors despite his key role, so you'd have to be able to notice that and then take the extra step to find out his workaround:

http://www.nytimes.com/2010/06/25/us/politics/25caribou.html

And then there are the many bespoke systems and logs you can find if you do a little research. The FDA, for example, has a calendar of FDA officials' contacts with outside people...again, it might not contain everything but it's difficult enough to parse that being able to mine it (and having some domain knowledge) will still yield interesting insights: http://www.fda.gov/NewsEvents/MeetingsConferencesWorkshops/P...

There's also OIRA, which I haven't ever looked at but seems to have the same potential of finding underreported links if you have the patience to parse and text mine it: https://www.whitehouse.gov/omb/oira_0910_meetings/

And of course, there's just the good ol' FEC contributions database, which at least shows you individuals (and who they work for): https://github.com/datahoarder/fec_individual_donors

This is not to undermine what's described in the OP...but just to show how lucky you are if you're in the U.S. when it comes to dealing with official records. They don't contain everything perhaps, but there's definitely enough out there (never mind what you can obtain through FOIA by being the first person to ask for things) to explore influence and politics without as many technical hurdles.

prawn 10 years ago |

Peruvians, do you think this would cause a majority of meetings to be held outside of public office buildings or via a secretive messaging system?

dkarp 10 years ago |

This is really impressive, even more so given that it has already led to discoveries being made.

Web scraping is a really powerful tool for increasing transparency on the internet especially with how transient online data is.

My own project, Transparent[1], has similar goals.

[1] https://www.transparentmetric.com/

Angostura 10 years ago |

This is a fascinating project. If successful, I suspect the result will be that lobbying no longer takes place in the government offices ("Shall we meet at that little place down the street?") or will be carried out over the phone.

jorgecurio 10 years ago |

Really interesting use of data extraction....

For developers and managers out there, do you prefer to build your own in-house scrapers or use Scrapy or tools like Mozenda instead? What about import.io and kimono?

I'm asking because a lot of developers seem to be adamantly against using web scraping tools they didn't develop themselves. That seems counterproductive, because you are taking on technical debt for an already solved problem.

So developers, what is the perfect web scraping tool you envision?

And it's always a fine balance between people who want to scrape Linkedin to spam people, others looking to do good with the data they scrape, and website owners who get aggressive and threatening when they realize they are getting scraped.

It seems like web scraping is a really shitty business to be in and nobody really wants to pay for it.