SemiBin: Semi-supervised learning for better binning

Shaojun Pan and Luis Pedro Coelho

Update (June 2021): This blogpost originally used the old name of the tool (S3N2Bin), which is now called SemiBin. The blogpost and the links have been updated to use the new name.

tl;dr SemiBin is a new tool for binning (inferring MAGs, metagenome-assembled genomes) that exploits deep semi-supervised learning. It can recover more high-quality genomes than other tools.

The background: what is contig binning?

The goal is to derive metagenome-assembled genomes (MAGs). The first step (after QC) is to assemble contigs. Since from a metagenome one can rarely obtain genomes as a single (circular) contig, the next step is to bin the contigs: i.e., to group the contigs that are inferred to belong to the same genome together into the same bin.
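As a minimal sketch of what the output of binning looks like as data (not how SemiBin itself is implemented), a binner can be thought of as assigning each contig a bin label, with each bin then being the set of contigs that share a label. The contig names, bin labels, and the helper `group_contigs_into_bins` are all hypothetical.

```python
from collections import defaultdict


def group_contigs_into_bins(assignments):
    """Invert a contig -> bin mapping into bin -> sorted list of contigs."""
    bins = defaultdict(list)
    for contig, bin_label in assignments.items():
        bins[bin_label].append(contig)
    return {label: sorted(contigs) for label, contigs in bins.items()}


# Hypothetical contig names and bin labels, for illustration only.
assignments = {
    "contig_1": "bin_A",
    "contig_2": "bin_B",
    "contig_3": "bin_A",
    "contig_4": "bin_B",
    "contig_5": "bin_A",
}

bins = group_contigs_into_bins(assignments)
for label, contigs in sorted(bins.items()):
    print(label, contigs)
```

Each resulting bin is a candidate MAG: a set of contigs inferred to come from the same genome.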

The intuition: what is semi-supervised?

For binning, one can use reference-based methods, whereby one uses existing genomes (assumed to be high-quality) to guide the binning. This can work well for discovering new strains of known species, but cannot discover new species. In contrast, reference-independent methods do not rely on prior knowledge and, thus, can discover new species. Due to this flexibility, reference-independent methods are the most widely used methods nowadays.

In fact, some of the most exciting MAG-driven discoveries have been of previously unexplored clades.

However, we felt that completely disregarding reference information was throwing the baby out with the bathwater. We therefore built a model that combines the best of both worlds: it can (1) use reference information to improve binning, while still (2) recovering genomes from unknown species. This is the meaning behind semi-supervised.

The results: How well does it work? (very well)

We tested on two real datasets: one from the human gut (Wirbel et al., 2019) and another from an animal gut, namely dog gut (Coelho et al., 2018). (Tests on an environmental dataset are ongoing.)

The main metric is the number of high-quality MAGs (>90% completeness and <5% contamination, as measured by CheckM, and further filtered by GUNC). We compared against Maxbin2 and Metabat2, as well as the newer Vamb, which shares the most ideas with our method.
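To make the high-quality criterion concrete, here is a minimal sketch of the filter, assuming the per-bin completeness, contamination, and GUNC pass/fail status have already been collected into simple records. The `BinQuality` class, its field names, and the example values are hypothetical; they do not mirror the actual output formats of CheckM or GUNC.

```python
from dataclasses import dataclass


@dataclass
class BinQuality:
    """Hypothetical per-bin quality record (not a CheckM/GUNC file format)."""
    name: str
    completeness: float   # percent, 0-100
    contamination: float  # percent
    passes_gunc: bool     # True if the bin survives the GUNC filter


def is_high_quality(b, min_completeness=90.0, max_contamination=5.0):
    """Apply the thresholds from the text: >90% complete, <5% contaminated, passes GUNC."""
    return (
        b.completeness > min_completeness
        and b.contamination < max_contamination
        and b.passes_gunc
    )


# Made-up example values, for illustration only.
example_bins = [
    BinQuality("bin_1", completeness=97.2, contamination=1.1, passes_gunc=True),
    BinQuality("bin_2", completeness=92.0, contamination=6.3, passes_gunc=True),
    BinQuality("bin_3", completeness=95.5, contamination=0.8, passes_gunc=False),
    BinQuality("bin_4", completeness=71.0, contamination=0.5, passes_gunc=True),
]

high_quality = [b.name for b in example_bins if is_high_quality(b)]
print(len(high_quality), high_quality)
```

In this made-up example, only `bin_1` counts: `bin_2` is too contaminated, `bin_3` fails the GUNC filter, and `bin_4` is too incomplete.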

Testset, binning mode             Maxbin2  Vamb  Metabat2  SemiBin(CAT)  SemiBin(mmseqs)
Human gut, single-sample binning  487      998   1060      1320          1497
Human gut, multi-sample binning   -        1318  -         1364          1549
Dog gut, single-sample binning    593      1220  1404      1834          2415
Dog gut, multi-sample binning     -        3106  -         2934          3448

The multi-sample binning mode, introduced by Vamb, builds sample-specific bins while integrating data from across multiple samples. SemiBin also supports multi-sample binning and, in its mmseqs variant, produces the best overall results.
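To put the counts in the table above in relative terms, the following sketch computes, for each test set and binning mode, how many more high-quality MAGs SemiBin(mmseqs) recovers than the best of the other tools. The numbers are copied from the table; the function name and the choice of "best other tool" as the point of comparison are ours, for illustration.

```python
# High-quality MAG counts copied from the table; None marks "-" (no result).
results = {
    "Human gut, single-sample": {
        "Maxbin2": 487, "Vamb": 998, "Metabat2": 1060,
        "SemiBin(CAT)": 1320, "SemiBin(mmseqs)": 1497,
    },
    "Human gut, multi-sample": {
        "Maxbin2": None, "Vamb": 1318, "Metabat2": None,
        "SemiBin(CAT)": 1364, "SemiBin(mmseqs)": 1549,
    },
    "Dog gut, single-sample": {
        "Maxbin2": 593, "Vamb": 1220, "Metabat2": 1404,
        "SemiBin(CAT)": 1834, "SemiBin(mmseqs)": 2415,
    },
    "Dog gut, multi-sample": {
        "Maxbin2": None, "Vamb": 3106, "Metabat2": None,
        "SemiBin(CAT)": 2934, "SemiBin(mmseqs)": 3448,
    },
}

BASELINES = ("Maxbin2", "Vamb", "Metabat2")


def gain_over_best_baseline(row, tool="SemiBin(mmseqs)"):
    """Percent increase of `tool` over the best-performing baseline in this row."""
    best = max(row[b] for b in BASELINES if row[b] is not None)
    return 100.0 * (row[tool] - best) / best


gains = {name: gain_over_best_baseline(row) for name, row in results.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}%")
```

The gains are largest in single-sample mode (roughly 41% for human gut and 72% for dog gut) and smaller but still positive in multi-sample mode (roughly 18% and 11%).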

The tool

A test version is available on GitHub: BigDataBiology/SemiBin. It may be a bit rough around the edges (for example, not a lot of effort has gone into providing high-quality error messages for users) and it may still change as we develop it, but the core functionality, getting great binning results, should be there.

Please use GitHub issues for bug reports and the Discussions for more open-ended discussions and questions.