MicrobeSifter: mapping complex microbe genomics samples using NLP

Datetime:2016-08-23 02:16:07         Topic: Natural Language Processing  DataBase          Share        Original >>
Here to See The Original Article!!!

June 7, 2016

Nicholas Baro

Scientific Project Manager

Seven Bridges

Insight Fellow, 2015

Harvard University

Nicholas is an Insight alumnus from the first Health Data Science session in Boston and is now a Scientific Project Manager at Seven Bridges. While at Insight, he partnered with One Codex to develop a topic modeling approach to estimate relative species abundance in complex genomic samples. This content originally appeared on his personal blog.


While the cost of DNA sequencing continues to plummet, the data being generated is increasingly complex. A single environmental sample containing hundreds or thousands of different organisms can be sequenced to determine the relative abundance of each organism present. The ability to reproducibly deliver results that are both accurate and sensitive is paramount.

One Codex is a biotech company founded in 2014 in San Francisco with the goal of bringing robust, modern software engineering to microbial genomics. Their customers are public health professionals, research scientists, and industrial engineers who need a powerful and dependable platform for characterizing bacteria, viruses, and fungi from next-generation sequencing (NGS) data. In collaboration with One Codex I developed a model that uses taxonomy classifications generated by their existing pipeline to estimate the relative abundance of organisms from complex mixtures.

The problem

NGS technologies rely upon shearing or breaking down genomic DNA into small fragments. Specific patterns in the fragments of DNA are then used to map the reads back to a reference genome or bin fragments into different taxonomy classes. However, because some genes are conserved, or because two species are more related, there is not enough information contained in the fragment to conclusively map the fragment at the species-specific level.

Take a look at the cartoon below, some reads have information that are conclusively determinant (i.e. they have a pattern that uniquely maps to that organism at the species-level alone), while other fragments may be ambiguous. Instead of being binned at the species level, they may cluster at a hierarchy above (i.e. class, order, family).

Let try to put this in context...

Microbes are everywhere and they are an essential part of our human existence. They contribute to just about every single environment they reside in and are responsible for converting key elements of life into biologically accessible forms for us to enjoy.

Metagenomics, the study of genetic material from these microbial communities, has emerged as a powerful genetic tool that allows investigators to survey the constituents of a microbial community without the need to culture member organisms in the laboratory. To make this distinction clear, let's consider what a microbiologist would have done 10 years ago and compare it to what one might do today.

Let's give the same problem to two investigators and see how they would each go about the problem. Professor Coultare, a man of patience, a true reductionist by training, and Professor Progressive, an investigator who values his time.

In the cartoon above, Professor Coultare must identify what organisms make up his unknown sample. He makes some assumptions about what he thinks might be in the mix, and attempts to isolate and cultivate individual organisms by growing them on special media. He then must wait 2-5 days to get results, that tell him nothing about the relative abundance of each species present, but merely confirms it was there.

Meanwhile, Professor Progressive has already extracted the genomic DNA from his entire sample and sheared each one of the organisms genomes into small fragments of DNA. These tiny fragments have specific patterns that can then be used to classify and map the fragments back to a database of all known organisms.

The data science pipeline


I received 500 FASTQ-formatted sequencing files consisting of ~1M+ reads. In addition, I received access to the One Codex platform and an API key, which allowed me to upload the FASTQ files to the platform programmatically to retrieve both read-level TSV classification results and JSON analysis summaries. The JSON file included a list with support for distinct taxonomic groupings (e.g., Enterobacteriacaea, Escherichia, E. coli). I also received truth tables with the relative abundance of the species in the FASTQ training samples and a JSON representation of the taxonomy used in current One Codex results.


I created scripts to upload all FASTQ files to One Codex platform and stored results to my local environment in compressed format. For each input FASTQ I received two outputs, one aligned to the fully One Codex database and another to the RefSeq database. Data were organized into a relational database using SQL, so that I could easily reference files for analysis.


I also received a taxonomy .json file that contained all known taxonomy classifications. Because I was only interested in taxonomy classifications at the species-level, I had to recursively 'roll-up' classifications below the species level. For example, if there are patterns of DNA that specifically match a particular strain of pathogenic E. coli O157:H7, I would need to take that taxonomy classification and represent or count it towards E. coli at the species level.

Rolling-up known taxonomy to species-level


Latent Dirichlet Allocation , a process exploited in natural language processing, is a generative model that allows for topics to be automatically discovered given a set of documents containing words. It allows for a set of observations (words) to be explained by unobserved groups (topics). Before we get into more detail, I will start with a simple example.

Imagine you have the following sentences:

Now we simply substitute classifications for words, reads for documents, and species for topics, and voila, we have adopted the LDA model for discovering species!!

LDA represents reads as a mixture of species and tries to assign probabilities to each taxonomy classification. This assumption can be made based on the way the reads were generated. There was some finite number of genomes that were sheared and you are trying to piece fragments of genomes back together to discover the species that contributed in the first place. Once you've identified the species, you can go back and calculate their distribution across all the reads. To extend the analogy, you could also imagine you have three different books. Now imagine the only thing I give you are books with each one of the pages removed and scrambled. You are trying to figure out the topic of each book by reading the words that appear on each page.

Wait, how exactly does this work?

Suppose you have a bunch of reads (documents), each consisting of taxonomy classifications (words). Suppose you have some information about how many different organisms there might be based on the number of reads that are mapping at the species level alone. Using LDA you can learn the species-level representation of each read. The first step involves going through each read and randomly assigning each taxonomy classification to one of the species-level topics. Then you go through each read and look at the individual classifications and ask two questions:

  1. How prevalent is the classification across species?
    • Probability (species | read)
    • The proportion of taxonomy classifications that are currently assigned at the species level x
  2. How prevalent are species across the classifications?
    • Probability (taxonomy classification | species)
    • The proportion of assignments to species over all reads that come from that particular taxonomy classifications

As each taxonomy classification is seen, it is assigned to a new species based on the distributions of classification across species and the species across all taxonomy classifications.

The priors for each species distribution are randomly and uniformly assigned. As you can see below, at the beginning of the sampling, all species have an equal chance. As new taxonomy classifications are seen, the model updates its prior beliefs. This iterative process is known as the Gibbs Sampling Monte Carlo process. Eventually the distribution of topics converges, and we can use this information to infer species distribution.

This generates a list of species taxonomy classifications ranked by probability.

For example, in the figure below, we were expecting three species-level taxonomy classifications. The output assigns a probability to each taxonomy classification that was used to identify the unknown species. The most probable taxonomy classification allows us to identify the species. We can then go back and find the distribution of species across reads to identify relative abundance of each organism.


  • It is challenging to map genomic fragments from a mixed microbial sample to species-specific level due to lack of complete reference genomes.
  • The problem of assigning genomic fragments to species is similar to assigning words to topics in NLP.
  • I exploited the LDA approach to assign a probability to each taxonomy classification; it helps identify unknown species and estimate relative abundance of each organism in a complex genomic sample.


Put your ads here, just $200 per month.