The Panama Papers Search Tool Began as an Academic Skunkworks Project

Datetime: 2016-08-23 02:03:00          Topic: Solr  Coder

A version of this post titled “Panamania” originally appeared in the Cyber Saturday edition of Data Sheet, Fortune’s daily tech newsletter.

The Panama Papers, christened the biggest leak in data journalism history by Edward Snowden, is not so much a leak as a hemorrhage.

Mossack Fonseca, the Panamanian law firm that specializes in creating companies in offshore tax havens, lost 2.6 terabytes of data, equivalent to 11.5 million documents. Poring over that many documents required lots of reporters, lots of eyeballs, and lots of tech.

I caught up with Mar Cabra, head of the data and research unit at the International Consortium of Investigative Journalists, which coordinated the reporting effort, on Friday afternoon to discuss how the global investigation—more than 400 reporters in 80 countries—took place. “This would not have been possible without technology,” she said.

One aspect I found interesting was her team’s use of open source info-retrieval software, in particular Apache Tika, Apache Solr, and Blacklight. These tools allowed reporters to dig into the cache and turn up their findings, which in many cases involved tying global leaders to tax-dodging accounts. Tika extracts document data; Solr indexes it; and Blacklight provides a user interface, the packaging and presentation. Why this specific set of tools? “We chose Solr because project Blacklight existed,” Cabra said, mentioning that her team had adopted the search software by mid-2014 for earlier projects. “It’s an interface that’s intuitive and easy to use.”
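For the programmers among us, the division of labor among the three tools can be pictured with a toy sketch. This is not ICIJ's code and it does not call Tika, Solr, or Blacklight; the names `extract_text`, `TinyIndex`, and `render_results` are hypothetical stand-ins, written in plain Python, for the extract, index, and present roles described above.

```python
import re
from collections import defaultdict


def extract_text(raw: bytes) -> str:
    """Hypothetical stand-in for the extraction role (what Tika does for real files)."""
    return raw.decode("utf-8", errors="ignore")


class TinyIndex:
    """Hypothetical stand-in for the indexing role (what Solr does at scale)."""

    def __init__(self):
        # Maps each lowercase token to the set of document ids containing it.
        self.postings = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[token].add(doc_id)

    def search(self, query: str) -> list:
        # Return ids of documents that contain every term in the query.
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)


def render_results(query: str, hits: list) -> str:
    """Hypothetical stand-in for the presentation role (what Blacklight does in a browser)."""
    lines = [f"{len(hits)} result(s) for '{query}'"]
    lines += [f"  - {doc_id}" for doc_id in hits]
    return "\n".join(lines)


# Made-up sample documents, for illustration only.
index = TinyIndex()
index.add("memo-1", extract_text(b"Shell company registered offshore"))
index.add("memo-2", extract_text(b"Offshore account statement"))
print(render_results("offshore company", index.search("offshore company")))
```

The real tools do far more (parsing PDFs and emails, ranking results, faceted browsing), but the hand-off is the same: text comes out of the files, goes into an index, and a front end queries that index.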

I spoke with Erik Hatcher, one of the original developers of Blacklight, on Friday as well. He said he wrote the precursor code—a Ruby on Rails application that layers on top of Solr’s Java code, for the programmers among us—while working in a research group at the University of Virginia. He created the tool to do analytics and search on a database of 19th century literature and poetry. Then he adapted it to accommodate the entirety of the university’s library records.

Hatcher said he’s proud that the search software—today used everywhere from the Rock and Roll Hall of Fame to inside national security organizations—was used in the Panama Papers data dump. “Oftentimes these tools get the job done, but they’re not really exposed in and of themselves,” he said. “They’re just a means to an end—they don’t get as much press.”

“I’m happy in this case that these technologies are being showcased for the power they offer,” he added.

Cabra said that her team is now considering using a bit of rival search software—Elasticsearch—for an upcoming project. She said the group is interested in assembling a centralized cache of all the leaks the consortium has worked on so far. “We call it a knowledge center,” she told me. “It’s going to be a global repository of everything we have.”

Expect a one-stop shop for all your investigative journalism needs.
