by Shaojun Pan, Luis Pedro Coelho.
The goal is to derive metagenome-assembled genomes (MAGs). The first step (after QC) is to assemble contigs. Since from a metagenome one can rarely obtain genomes as a single (circular) contig, the next step is to bin the contigs: i.e., to group the contigs that are inferred to belong to the same genome together into the same bin.
For binning, one can use reference-based methods, which rely on existing genomes (assumed to be high-quality) to guide the binning. This can work well for discovering new strains of known species, but cannot discover new species. In contrast, reference-independent methods do not rely on prior knowledge and, thus, can discover new species. Due to this flexibility, reference-independent methods are the most widely used methods nowadays.
In fact, some of the most exciting MAG-driven discoveries have been of previously unexplored clades.
However, we felt that completely disregarding reference information was throwing the baby out with the bathwater, so we built a model that combines the best of both worlds: it can (1) use reference information to improve binning, while still (2) recovering genomes from unknown species. This is what "semi-supervised" refers to.
We tested on two real datasets: one from the human gut (Wirbel et al., 2019) and one from the dog gut (Coelho et al., 2018). Tests on an environmental dataset are ongoing.
The main metric is the number of high-quality MAGs (>90% completeness and <5% contamination, as measured by CheckM, and further filtered by GUNC). We compared against Maxbin2 and Metabat2, as well as the newer Vamb, which shares the most ideas with our method.
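To make the high-quality criterion concrete, here is a minimal Python sketch of how bins could be counted against those thresholds. It does not parse real CheckM or GUNC output: the `BinQuality` record, the `passes_chimerism_filter` flag (standing in for the GUNC filter), and the example values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class BinQuality:
    """Hypothetical per-bin quality record (not a real CheckM/GUNC format)."""
    name: str
    completeness: float            # percent, e.g. as estimated by CheckM
    contamination: float           # percent, e.g. as estimated by CheckM
    passes_chimerism_filter: bool  # assumed stand-in for the GUNC filter


def is_high_quality(b, min_completeness=90.0, max_contamination=5.0):
    """Apply the thresholds from the text: >90% complete, <5% contaminated."""
    return (
        b.completeness > min_completeness
        and b.contamination < max_contamination
        and b.passes_chimerism_filter
    )


def count_high_quality(bins):
    """Number of bins that satisfy all three conditions."""
    return sum(1 for b in bins if is_high_quality(b))


# Invented example values, only to exercise each condition
example_bins = [
    BinQuality("bin.1", 97.3, 1.2, True),   # passes everything
    BinQuality("bin.2", 92.0, 6.5, True),   # too contaminated
    BinQuality("bin.3", 88.9, 0.4, True),   # too incomplete
    BinQuality("bin.4", 95.1, 2.0, False),  # rejected by the extra filter
]
n_high_quality = count_high_quality(example_bins)
print(n_high_quality)
```

Note that both thresholds are strict inequalities, so a bin at exactly 90% completeness would not be counted.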
Testset, binning mode | Maxbin2 | Vamb | Metabat2 | SemiBin(CAT) | SemiBin(mmseqs) |
---|---|---|---|---|---|
Human gut, single-sample binning | 487 | 998 | 1060 | 1320 | 1497 |
Human gut, multi-sample binning | - | 1318 | - | 1364 | 1549 |
Dog gut, single-sample binning | 593 | 1220 | 1404 | 1834 | 2415 |
Dog gut, multi-sample binning | - | 3106 | - | 2934 | 3448 |
The multi-sample binning mode, introduced by Vamb, builds sample-specific bins while integrating data from across multiple samples. SemiBin also supports multi-sample binning, and in this mode it produces the best overall results on both datasets.
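As a quick check on the table above, the following Python sketch finds the best-performing method in each row and computes the relative gain of SemiBin(mmseqs) over Vamb. The counts are copied from the table; the helper names `best_method` and `relative_gain` are our own, and methods without a multi-sample result (shown as `-`) are simply left out of those rows.

```python
# High-quality MAG counts, copied from the table above
counts = {
    ("Human gut", "single-sample"): {
        "Maxbin2": 487, "Vamb": 998, "Metabat2": 1060,
        "SemiBin(CAT)": 1320, "SemiBin(mmseqs)": 1497,
    },
    ("Human gut", "multi-sample"): {
        "Vamb": 1318, "SemiBin(CAT)": 1364, "SemiBin(mmseqs)": 1549,
    },
    ("Dog gut", "single-sample"): {
        "Maxbin2": 593, "Vamb": 1220, "Metabat2": 1404,
        "SemiBin(CAT)": 1834, "SemiBin(mmseqs)": 2415,
    },
    ("Dog gut", "multi-sample"): {
        "Vamb": 3106, "SemiBin(CAT)": 2934, "SemiBin(mmseqs)": 3448,
    },
}


def best_method(row):
    """Name of the method with the most high-quality MAGs in one table row."""
    return max(row, key=row.get)


def relative_gain(row, method, baseline):
    """Percentage increase of `method` over `baseline` within one row."""
    return 100.0 * (row[method] - row[baseline]) / row[baseline]


for (dataset, mode), row in counts.items():
    gain = relative_gain(row, "SemiBin(mmseqs)", "Vamb")
    print(f"{dataset}, {mode}: best = {best_method(row)}, "
          f"gain over Vamb = {gain:.1f}%")
```

Vamb is used as the baseline here only because it is the one other method with results in every row.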
A test version is available on GitHub: BigDataBiology/SemiBin. It may be a bit rough around the edges (for example, not a lot of effort has gone into providing high-quality error messages for users) and it may still change as we develop it, but the core functionality, getting great binning results, should be there.
Please use GitHub issues for bug reports and the Discussions for more open-ended discussions/questions.
Copyright (c) 2018–2024. Luis Pedro Coelho and other group members. All rights reserved.