Machine-learning based prospection of antimicrobial peptides (AMPs) from metagenomes using Macrel

Célio Dias Santos-Júnior, Shaojun Pan, Xing-Ming Zhao, and
Luis Pedro Coelho

luispedro@big-data-biology.org
@luispedrocoelho
@BigDataBiology

Slides available at http://big-data-biology.org/presentations/2020-07-15_macrel/

Antimicrobial peptides: why do we care?

  • Short (< 100 amino acids) peptides
  • Inhibit bacteria/fungi/...
  • Clinical & industrial applications
  • Large-scale metagenomic catalogs generally ignore shorter genes

(From Mookherjee et al., Nat. Rev. Drug Discov., 2020)

Standard gene prospecting approaches do not work for smORFs

The "standard" pipeline:

  1. Pre-process samples
  2. Assembly to obtain contigs
  3. Gene prediction: too many false positives (see Shaojun Pan's talk!)
  4. Homology-based functional prediction: too many false negatives

Standard AMP prediction tools cannot be used directly

  • Training sets are balanced (or close to balanced) in AMPs/non-AMPs
  • Close homologs used in training and testing
  • Technical issues for large-scale usage: only available as a webserver
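
Why balanced training sets are a problem, as a rough sketch: with sensitivity and specificity held fixed, precision falls sharply as AMPs become rarer among the candidates. The 90%/95% values and the AMP fractions below are illustrative assumptions only, not measurements of any tool.

```python
def precision(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Expected precision (positive predictive value) at a given AMP prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: 90% sensitivity, 95% specificity (illustrative only)
for prevalence in (0.5, 0.1, 0.01):
    value = precision(0.90, 0.95, prevalence)
    print(f"AMP fraction {prevalence:>5.0%}: precision = {value:.3f}")
```

A tool that looks accurate on a balanced benchmark can therefore return mostly false positives when true AMPs are a small minority of the input.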

Macrel: an end-to-end pipeline for metagenomes

Macrel sacrifices recall for higher precision

We split the dataset so that no close homologs (>80% identity) are shared between the training and testing sets.
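
One common way to get such a split, sketched under assumptions: cluster the sequences at the identity threshold first, then assign whole clusters to either training or testing. The cluster assignments, the function name `split_by_cluster`, and the 20% test fraction are hypothetical. This is not necessarily the exact procedure used for Macrel.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test so that no cluster spans both."""
    members = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        members[cluster_id].append(seq_id)

    # Shuffle cluster ids reproducibly, then reserve a fraction for testing
    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)
    n_test = max(1, round(test_fraction * len(cluster_ids)))
    test_clusters = set(cluster_ids[:n_test])

    train, test = [], []
    for cluster_id, seq_ids in members.items():
        (test if cluster_id in test_clusters else train).extend(seq_ids)
    return train, test


# Hypothetical cluster assignments (e.g. from clustering at 80% identity)
cluster_of = {
    "pep1": "c1", "pep2": "c1",
    "pep3": "c2",
    "pep4": "c3", "pep5": "c3",
    "pep6": "c4",
}
train, test = split_by_cluster(cluster_of)
print("train:", train)
print("test:", test)
```

Splitting at the cluster level, rather than per sequence, is what keeps near-identical peptides from appearing on both sides.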

In simulated metagenomes, we recover the peptides that we embedded there

  • Simulated metagenomes with abundances from real data (484 genomes) & varying sequencing depths
  • >80% of peptides recovered (after singleton elimination) were present in the original data.
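
A minimal sketch of singleton elimination, assuming a "singleton" means a peptide predicted in only one sample (this reading is an assumption, as the slide does not define the term). The sample names, sequences, and the function name `drop_singletons` are hypothetical.

```python
from collections import Counter


def drop_singletons(predictions_by_sample):
    """Keep only peptides predicted in at least two samples."""
    counts = Counter()
    for peptides in predictions_by_sample.values():
        counts.update(set(peptides))  # count each peptide once per sample
    return {peptide for peptide, n in counts.items() if n >= 2}


# Hypothetical predictions: sample name -> predicted peptide sequences
predictions_by_sample = {
    "sample_A": ["KKLLKKLL", "GIGAVLKV"],
    "sample_B": ["KKLLKKLL"],
    "sample_C": ["KKLLKKLL", "FLPIIAKL"],
}
kept = drop_singletons(predictions_by_sample)
print(sorted(kept))
```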

In real metagenomes, we recover peptides that are expressed

Impossible to be certain, but promising evidence

  • 92.8% co-predicted by another tool
  • 53.8% had detectable transcription

Our methods are available as a webserver and on the command line

As webserver: http://big-data-biology.org/software/macrel/

As a command-line/Python tool:


# Download data examples
macrel get-examples


# Run macrel on peptides
macrel peptides \
    --fasta example_seqs/expep.faa.gz \
    --output out_peptides \
    --threads 4

# Run macrel on contigs (gene prediction followed by peptide prediction)
macrel contigs \
    --fasta example_seqs/excontigs.fna.gz \
    --output out_contigs

Summary & ongoing work

Acknowledgements

  • Célio Dias Santos-Júnior
  • Shaojun Pan (see his presentation coming up right after this one!)
  • Xing-Ming Zhao

Thank You