Luis Pedro Coelho
Initial note: For the full story, see the GUNC preprint. Luis helped a bit developing the ideas behind the current iteration of the CSS and wrote this blogpost, but most of the work was done by Askarbek, Anthony, and Sebastian.
Let’s say you have a metagenome-assembled genome (MAG) and want to check if it’s good. What should you do? Use checkM! Let’s say you already used CheckM and want even more assurances. In particular, you want check for possible contamination to ensure (to the extent that it’s technically feasible) that your MAG contains only sequence from a single real genome; what can you do?
You could use taxonomic classification of the genes in the genome. In the ideal case, this would solve your problems directly: if all the genes are classified as coming from the same source, then this is your genome; otherwise, you have contamination. Unfortunately, this ideal case only works if there are no errors in the prediction. In practice, taxonomic predictions are imperfect, so what can you do?
If you have genes from two taxa (A & B) this can be because (1) your so-called genome is actually a mixture of A and B or (2) the predictor is not perfect and gets confused between the two? How can you tell? One intuition is to look at the pattern of errors. A MAG is, effectively, a bin (i.e., a group) of contigs and each contig will have a number of genes: if A and B are randomly spread out, then this does not provide any evidence for contamination:
However, if A and B follow contig boundaries, then this is likely contamination:
This assumes that most errors in MAGs are binning errors (i.e., the individual contigs are good, but they are grouped incorrectly), which seems to be empirically true.
If all you care about is getting the intuition behind a tool, you can stop here, but now I want to show you how we went from this intuition into the Clade Separation Score implemented in the tool.
The first idea was to think in terms of information: how much information about the taxonomic is provided by the contig information? This naturally lead to the first version of our metric as a normalized form of conditional entropy:
Here H(T|C) is the entropy of the taxonomic assignments, given the contig assignments; which is normalized by H(T) the overall taxonomic entropy. If there is no relationship, then H(T|C) = H(T), so C₁ = 0 (we subtract from 1 to make larger numbers denote more evidence of contatimation). On the other hand, if the taxonomic assignments perfectly follow the contigs, then H(T|C) = 0 and C₁ = 1.
Great! So, is this what we used in the paper? No.
Note that this is not yet something that can be applied to data as we cannot actually know the true value of H(T|C) (or even H(T), although that is less problematic). We can estimate it from data, though. The simplest way is to use the plugin estimators:
In the above, N is the total number of genes and is the number of genes in the contig c. The problem is that these entropy estimators are biased and heavily biased when is generally small (Basharin, 1959).
Maybe a thought experiment would illustrate the problem: imagine you have a genome whose taxonomic assignments are exactly evenly split between two taxa (A and B) and without any relationship between the contig structure and the taxonomic assignments. Imagine, however, that the genome is heavily fragmented (i.e., it consists of many short contigs rather than a long chromossome-length sequence). In this case, even though H(T|C) = H(T), the estimated versions would be biased: Ĥ(T|C) « Ĥ(T). In particular, any contigs with a single gene would be contributing zero to the sum above. Contigs with 2 genes would also show up as pure A or pure B contigs half the time (randomly). Only for much larger contigs do the estimates start to be similar to the true underlying value.
There has been some work to tackle similar issues when evaluating clustering results (Nguyen et al., 2009; other too), but our problem is slightly different: the fragmentation of the genome is something that we cannot do anything about, but should not contribute any evidence of contamination.
Our first idea was to use a more robust estimator than the plugin estimator above. This did seem to improve the results, but there was another idea that was both conceptually very simple and easy to implement: we kept the plugin estimator for Ĥ(T|C) but also compute the same formula for a random assignment of genes to contigs that preserves the distribution of contig sizes. Conceptually, we perfom a permutation test: we repeatedly reassign the taxonomic genes labels across all the genes in the genome and compute the average value of the plugin estimator. Because of the bias discussed above, for fragmented genomes, this value is less than Ĥ(T), even though H(T|R) = H(T) (if the assignment is random, it cannot provide any information about the taxonomy). This average we call Ĥ(T|R), leading to
In practice, we do not need to actually perform the permutation test and can save ourselves the computational expense of multiple randomizations to get a mean value. In fact, another advantage of this approach is that it’s very easy to compute the closed form expression for the expected value and this is what we use in the released version of the tool‡.
As mentioned, there is much more material in the preprint, including extensive validation of this concept in both simulated and real data. All in all, we are confident that a significant fraction (~5% in isolate genome databases, and 15-30% in MAG databases) of extant published genomes contain some contamination and more stringent quality-control would further improve these databases.
The tool itself can be found at gunc.embl.de.
Basharin GP. On a statistical estimate for the entropy of a sequence of independent random variables. Theory of Probability & Its Applications. 1959;4(3):333-6. — You know you are doing real math when the papers you are reading have been translated from Russian.
Nguyen XV, Epps J, Bailey J. Information theoretic measures for clusterings comparison: is a correction for chance necessary?. ICML 2009
‡ Technically, we needed to assume contigs are being sampled with replacement whereas doing permutations is sampling without replacement, but this does not meaningfully affect the outcome.