Shaojun Pan and Luis Pedro Coelho
Update (June 2021): This blog post originally used the old name of the tool (S3N2Bin), which is now called SemiBin. The post and the links have been updated to use the new name.
tl;dr SemiBin is a new tool for binning (inferring MAGs, metagenome-assembled genomes) that exploits deep semi-supervised learning. It can recover more high-quality genomes than other tools.
The background: what is contig binning?
The goal is to derive metagenome-assembled genomes (MAGs). The first step (after QC) is to assemble contigs. Since from a metagenome one can rarely obtain genomes as a single (circular) contig, the next step is to bin the contigs: i.e., to group the contigs that are inferred to belong to the same genome together into the same bin.
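To make the idea concrete, here is a minimal sketch of what a binning result looks like as data: every contig gets a bin label, and a bin is simply the set of contigs sharing that label. The contig names, bin labels, and the `group_contigs` helper are hypothetical and only for illustration; this is not SemiBin's output format.

```python
from collections import defaultdict

# Hypothetical binner output: contig name -> bin label.
# Real binners infer these labels from sequence composition and abundance.
contig_to_bin = {
    "contig_1": "bin_A",
    "contig_2": "bin_B",
    "contig_3": "bin_A",
    "contig_4": "bin_B",
    "contig_5": "bin_A",
}


def group_contigs(assignments):
    """Group contigs by bin label; each group is one putative genome."""
    bins = defaultdict(list)
    for contig, label in assignments.items():
        bins[label].append(contig)
    return dict(bins)


contig_bins = group_contigs(contig_to_bin)
for label, contigs in sorted(contig_bins.items()):
    print(label, contigs)
```

Each resulting group is a candidate MAG, which is then assessed for quality.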
The intuition: what is semi-supervised?
For binning, one can use reference-based methods, whereby one uses existing genomes (assumed to be high-quality) to guide the binning. This can work well for discovering new strains of known species, but cannot discover new species. In contrast, reference-independent methods do not rely on prior knowledge and, thus, can discover new species. Due to this flexibility, reference-independent methods are the most widely used methods nowadays.
In fact, some of the most exciting MAG-driven discoveries have been of previously unexplored clades.
However, we felt that to completely disregard reference information was to throw the baby out with the bathwater, so we built a model which combines the best of both worlds: it can (1) use reference information to improve binning, while still (2) recovering genomes from unknown species. This is the meaning behind semi-supervised.
The results: How well does it work? (very well)
The main measure of success is the number of high-quality MAGs (>90% completeness and <5% contamination, as measured by CheckM and further filtered by GUNC). We compared against Maxbin2 and Metabat2, as well as the newer Vamb, which shares the most ideas with our method.
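As a rough illustration of the high-quality criterion, the thresholds can be written as a simple filter. This is only a sketch: the per-bin records and the `is_high_quality` helper are made up, the values are not real CheckM output, and the additional GUNC filtering step is not modelled here.

```python
def is_high_quality(completeness, contamination):
    """Apply the thresholds from the text: >90% complete and <5% contaminated."""
    return completeness > 90.0 and contamination < 5.0


# Hypothetical per-bin quality estimates (percentages), for illustration only.
bin_reports = [
    {"name": "bin_1", "completeness": 97.3, "contamination": 1.2},
    {"name": "bin_2", "completeness": 88.0, "contamination": 0.5},
    {"name": "bin_3", "completeness": 95.1, "contamination": 7.9},
    {"name": "bin_4", "completeness": 92.4, "contamination": 4.1},
]

high_quality = [
    report["name"]
    for report in bin_reports
    if is_high_quality(report["completeness"], report["contamination"])
]
print(len(high_quality), "high-quality bins:", high_quality)
```

Both conditions must hold at once: `bin_2` is clean but too incomplete, while `bin_3` is complete but too contaminated.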
| Test set, binning mode | Maxbin2 | Vamb | Metabat2 | SemiBin(CAT) | SemiBin(mmseqs) |
|---|---|---|---|---|---|
| Human gut, single-sample binning | 487 | 998 | 1060 | 1320 | 1497 |
| Human gut, multi-sample binning | - | 1318 | - | 1364 | 1549 |
| Dog gut, single-sample binning | 593 | 1220 | 1404 | 1834 | 2415 |
| Dog gut, multi-sample binning | - | 3106 | - | 2934 | 3448 |
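To read the table as relative gains, one can compare SemiBin(mmseqs) against the best of the other tools in each row. The short sketch below does that arithmetic using only the counts from the table; the variable names and the `relative_gain` helper are our own for this example, and tools with no result in a row ("-") are simply left out.

```python
# Number of high-quality MAGs per tool, copied from the table above.
results = {
    "Human gut, single-sample": {"Maxbin2": 487, "Vamb": 998, "Metabat2": 1060, "SemiBin(mmseqs)": 1497},
    "Human gut, multi-sample": {"Vamb": 1318, "SemiBin(mmseqs)": 1549},
    "Dog gut, single-sample": {"Maxbin2": 593, "Vamb": 1220, "Metabat2": 1404, "SemiBin(mmseqs)": 2415},
    "Dog gut, multi-sample": {"Vamb": 3106, "SemiBin(mmseqs)": 3448},
}


def relative_gain(counts, tool="SemiBin(mmseqs)"):
    """Percentage gain of `tool` over the best-performing other tool."""
    best_other = max(n for name, n in counts.items() if name != tool)
    return 100.0 * (counts[tool] - best_other) / best_other


gains = {setting: relative_gain(counts) for setting, counts in results.items()}
for setting, gain in gains.items():
    print(f"{setting}: +{gain:.1f}% over the best alternative")
```

The comparison is deliberately against the strongest alternative in each row rather than an average, so the gains are conservative.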
The multi-sample binning mode, introduced by Vamb, builds sample-specific bins while integrating data from across multiple samples. SemiBin also supports multi-sample binning, and this mode produces the best overall results.
A test version is available on GitHub: BigDataBiology/SemiBin. It may be a bit rough around the edges (for example, not a lot of effort has gone into providing high-quality error messages for users) and it may still change as we develop it, but the core functionality, getting great binning results, should be there.