Case Study: Elasticsearch Powers Giant Oak’s Fight Against Organized Crime

Datetime: 2016-08-23 02:09:10          Topic: Elasticsearch

“To do our work, we need lots of data, and we need to get often very messy data into one place, and we use Elasticsearch for that,” said Gary Shiffman, founder and CEO. Shiffman, who has a Ph.D. in economics, also teaches graduate-level courses at Georgetown University on the economies of violent groups such as ISIS.

Giant Oak focuses on data analytics, modeling human behavior, and bad guys, Shiffman explains. The company does research and development funded by DARPA, the U.S. government’s Defense Advanced Research Projects Agency, and also commercializes its tools for clients.

One project, called Unicorn, which provides visualization and summarization of a collection of documents, grew out of efforts to stem trafficking in rhino horns and elephant ivory. Unicorn is open source and available on GitHub.

Rhino horn, which is used as an aphrodisiac, sells for between $25,000 and $30,000 a pound on the black market, making it more valuable than gold, social scientist Sherry Forbes told an audience at the Elastic{ON} user conference earlier this year. The global rhino population has decreased by 90 percent since 1970, she said, meaning these animals could become extinct in our lifetime.

“We collected a large amount of very messy data and used Elasticsearch to house, organize and build analytics on top of it. We were able to help law enforcement authorities identify interesting connections that nobody previously had any idea about,” Shiffman said in an interview.

“We followed one case into abalone trafficking, and that was linked to a designated drug kingpin. We were able to help analysts and law enforcement authorities prove that the people moving these illicit goods, including rhino horns, around the globe were also moving drugs.”

Their research linked rhino poaching to Asian organized crime groups, Forbes said.

Social science drives the company. Giant Oak delivers software but is not involved in any of the resulting investigations or prosecutions, Shiffman makes clear. However, the company says that more than 100 arrests have been made related to its research to identify victims of human trafficking.

It’s doing that by evaluating data points in online ads and discussion boards for sex.

“We’re still doing the scientific research on if you can determine what victims of trafficking look like in ads for sex. There’s a very robust online marketplace, and we spend a lot of time thinking about how to find and rescue people who are trapped in this terrible situation,” he said.

“If you’re in this market, you have to advertise. And you have to put out information about the business.”

It’s part of DARPA’s Memex project to index the web for various specific use cases, including fighting crime, as opposed to the commercial focus of search engines such as Google or Bing, explained Forbes, a Ph.D. macroeconomist at the company.

Making Sense of a Market

The company has scraped 85 million ads for sex from sites such as Craigslist and Backpage.com and is comparing them to results on 2 million user review sites.

“This is the online market for sex,” Forbes explained at the conference. “We’re economists. We’re really good at looking at markets. We know how to make sense of them.”

These ads provide data elements such as locations, phone numbers, prices, text and images.
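As a rough sketch of how documents carrying those elements might be modeled in Elasticsearch, an index mapping could mix exact-match fields with analyzed full-text ones. The field names below are invented for illustration and are not Giant Oak's actual schema:

```python
import json

# Hypothetical index mapping for scraped ad documents. The article
# only says the ads carry locations, phone numbers, prices, text
# and images; everything else here is an assumption.
ad_mapping = {
    "mappings": {
        "properties": {
            "posted_at": {"type": "date"},      # when the ad appeared
            "location": {"type": "keyword"},    # exact-match city/region
            "phone": {"type": "keyword"},       # normalized phone number
            "price": {"type": "float"},         # advertised price
            "body": {"type": "text"},           # unstructured ad text, analyzed
            "image_urls": {"type": "keyword"},  # links to the ad's images
        }
    }
}

print(json.dumps(ad_mapping, indent=2))
```

Keyword fields support grouping and exact filtering (every ad sharing a phone number, say), while the analyzed text field supports full-text search.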

That work involved first coming to understand what is normal for a market, then trying to identify the differences between trafficked and non-trafficked providers.

Reviews that indicate the provider was less than happy to be there could be an indicator of trafficking, Forbes explained, though that’s only one component.

She pointed to two tools built on Elasticsearch being developed as part of Memex that Giant Oak really likes: Evidently LE, from Lattice, which incorporates machine learning, data management and natural language processing in the fight against human trafficking; and Dig (Domain-specific Insight Graphs), from the Information Sciences Institute at USC, which is used to automatically cross-reference huge global databases to track criminals.

The Memex program is agnostic toward technology – it doesn’t promote one over the other – but a lot of teams are using Elasticsearch, she said in an interview.

“There seems to be a developer consensus – not on everything – but on homing in on Elasticsearch for the CDR,” she said, referring to the crawl data repository, which hosts code and schema information related to the project.

The Elastic suite tools Logstash and Kibana, though meant to process log files, also work well on event data such as messages with time stamps, she said of one project with federal law enforcement.

“It organizes that quite well. We were able to get a lot out of that using Kibana to help visualize some of the data we were looking at. You can get histograms and timelines through Kibana with no additional code.”
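The timelines Kibana draws correspond to a `date_histogram` aggregation in Elasticsearch. A minimal sketch of such a query body follows; the `posted_at` field and `events` index name are assumptions for illustration, and older Elasticsearch versions use `interval` rather than `calendar_interval`:

```python
import json

# Sketch of the aggregation behind a Kibana-style timeline:
# bucket time-stamped events by day.
timeline_query = {
    "size": 0,  # return only the buckets, not the matching documents
    "aggs": {
        "events_per_day": {
            "date_histogram": {
                "field": "posted_at",
                "calendar_interval": "day",
            }
        }
    },
}

# With the official Python client this would be submitted as, e.g.:
#   es.search(index="events", body=timeline_query)
print(json.dumps(timeline_query, indent=2))
```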

Ease of Use

The company builds different tools for the various use cases, but on a shared core of underlying technology.

“The reason we like Elasticsearch is that it’s really easy to use. It’s really easy to ingest data and do simple queries on it,” Forbes said.

“In the past, if you’re an economist or political scientist, and you have an idea, you could rough out essentially what you’d like to do, but then you’d have to hand off to an engineer to actually go out and build it. The nice thing about the Elastic suite is that you can do all that development work yourself.”

“If you have an idea for a proof of concept, you can get something off the ground really quickly without any additional lines of code,” she said.

“It’s really good on unstructured text. If you’ve looked at these ads, they’re the definition of unstructured. There’s not a lot to go on. So if you’re wanting to do something like price analytics, it’s really difficult without something like Elasticsearch. It’s a really good way to take unstructured text, index it, then search it.”
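The index-then-search workflow she describes can be sketched with a simple full-text `match` query. The `body` field, the `ads` index name and the example phrase are invented for illustration, not drawn from Giant Oak's data:

```python
import json

# Full-text search over unstructured ad text. The "match" query
# analyzes the search phrase and scores documents by relevance,
# which is what makes unstructured text tractable here.
match_query = {
    "query": {
        "match": {
            "body": "visiting downtown this week"  # hypothetical phrase
        }
    }
}

# With the official Python client, indexing a document and then
# searching would look roughly like:
#   es.index(index="ads", document={"body": "...ad text...", "price": 250.0})
#   es.search(index="ads", body=match_query)
print(json.dumps(match_query, indent=2))
```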

Its data sets range from small, for one-off proofs of concept, up to fairly large. Elasticsearch scales easily without speed being an issue, she said.

In addition to Elasticsearch, Giant Oak sometimes uses Hadoop and HDFS, sometimes SQL and Spark, Shiffman said. It builds its analytics in Python, and on GitHub offers geodict, a simple Python library/tool for pulling location information from unstructured text.

“Our employees tend to be very applied-math-type folks, so Python’s a great language for them. The math people don’t have to hand their work off; they’re actually doing the coding themselves,” Shiffman said.

He has said that if data is the new oil, analytics is the new refinery.

“In the last several years, we focused on making data available,” he said. “If data is oil by analogy, we’re at the point now where we need to take crude oil and turn it into something people can use, like kerosene.”

Feature Image via Pixabay.




