Motivation
One of the key tools for studying microbial diversity of a site is the targeted amplicon sequencing
of an informative marker, such as the ribosomal small subunit (16S ribosomal DNA). The accuracy of
targeted amplicon sequencing strongly depends on the choice of primer pairs. There is thus a need
for automated methods to design and update bacterial 16S primers, able to take into account new
sequencing technologies with different read lengths and the growing amount of available information
about different bacterial species.
Results
We developed an algorithm for optimal multi-objective primer design, which: 1) maximizes the balance
between efficiency and specificity in the targeted amplicon amplification; 2) maximizes the coverage,
in terms of the fraction of all bacterial 16S sequences matched by at least one forward and one
reverse primer; 3) minimizes the variability, in terms of differences in the number of combinations
of primers from the forward and reverse set matching each bacterial 16S. As an example of application,
we present the design of optimal primers for amplicons in the range 700 - 800 bp, suitable for
sequencing technologies like Roche’s 454. Results show that our strategy is able to find better primer
pairs than the ones available in the literature according to all three optimization criteria. We
validated experimentally three of these primer pairs on multiple bacterial species, belonging to
different genera and phyla, confirming the predicted efficiency and wide coverage.
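The coverage and variability criteria described above can be illustrated with a small Python sketch. It is not the mopo16S implementation: the function names and the toy hit counts are hypothetical, and the use of the coefficient of variation as the variability measure is an assumption made here for illustration only.

```python
from statistics import mean, pstdev
from typing import List


def coverage(fwd_hits: List[int], rev_hits: List[int]) -> float:
    """Fraction of sequences matched by at least one forward AND one reverse primer.

    fwd_hits[i] / rev_hits[i] = number of forward / reverse primers matching sequence i.
    """
    covered = sum(1 for f, r in zip(fwd_hits, rev_hits) if f >= 1 and r >= 1)
    return covered / len(fwd_hits)


def combination_counts(fwd_hits: List[int], rev_hits: List[int]) -> List[int]:
    """Number of forward x reverse primer combinations matching each sequence."""
    return [f * r for f, r in zip(fwd_hits, rev_hits)]


def variability(counts: List[int]) -> float:
    """ASSUMPTION: spread of the combination counts as a coefficient of variation."""
    return pstdev(counts) / mean(counts)


# Toy data for four 16S sequences (hypothetical values).
fwd_hits = [1, 2, 0, 1]
rev_hits = [1, 1, 2, 0]

cov = coverage(fwd_hits, rev_hits)
counts = combination_counts(fwd_hits, rev_hits)
var = variability(counts)
```

In this toy case only the first two sequences are matched by both a forward and a reverse primer, so half of the reference set is covered.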
Availability
Source code of the developed algorithm is released under the GNU General Public License
and is available at http://www.dei.unipd.it/~baruzzog/mopo16S.html.
mopo16S (Multi-Objective Primer Optimisation for 16S experiments) is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License.
If you use the software for your research, please cite the original mopo16S paper using the citation below.
C++ source code of the mopo16S algorithm can be downloaded here.
Download mopo16S here.
NOTE: mopo16S can exploit multithreading parallelism through the OpenMP library to considerably reduce the execution time. If your compiler is not compatible with OpenMP (check here), please download the no-multithreading version of mopo16S here.
Decompress the archive.
mopo16S requires the SeqAn C++ library (version ≥ 2.0.0). Set the correct include directory of your SeqAn installation in the Makefile.rules file and then compile with
make
NOTE: if you are using SeqAn version ≥ 2.2.0, you should update the C++ version in the release/Makefile file: change the flag from -std=c++11 to -std=c++14.
OPTIONAL: to build the html documentation (requires the Doxygen software):
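To run, mopo16S requires two input files: a reference set of sequences for which the primers are designed, and an initial set of (possibly degenerate) primer pairs from which to start the optimisation.
Both files should be in .fasta format, with the latter saved
alternating forward and corresponding reverse primers.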
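As a minimal sketch of the alternating layout expected for the primer-pair file, the Python snippet below parses FASTA text and groups consecutive records into (forward, reverse) pairs. The function names, the record headers and the two example sequences are illustrative and not taken from mopo16S or its data files.

```python
from typing import List, Tuple


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into a list of (header, sequence) records."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def pair_primers(records: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group alternating records into (forward, reverse) primer pairs."""
    if len(records) % 2 != 0:
        raise ValueError("expected an even number of records (forward/reverse alternating)")
    return [(records[i][1], records[i + 1][1]) for i in range(0, len(records), 2)]


# Hypothetical file content: one forward primer followed by its reverse primer.
example = """>pair1_forward
AGAGTTTGATCMTGGCTCAG
>pair1_reverse
GWATTACCGCGGCKGCTG
"""

pairs = pair_primers(read_fasta(example))
```

An odd number of records raises an error, which is a cheap way to catch a file in which a forward primer has lost its reverse partner.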
To generate the two files, rep_set.fasta and good_pairs.fasta, starting from the same repositories used in the mopo16S paper, first set the required Greengenes cluster similarity level for the reference set in the shell script download_data.sh, then run:
cd data
./download_data.sh
Please note that the higher the similarity level, the larger the representative set will be (and thus the file to be downloaded).
Then, edit the R script process_data.R to set the required Greengenes cluster similarity level
for the reference set and the minimum and maximum desired amplicon length, minimum and
maximum primer length and maximum total degeneracy for the candidate primer pairs.
Then, run the R script with
Rscript process_data.R
to generate the files rep_set.fasta and good_pairs.fasta (requires R and the additional R packages Biostrings and xlsx).
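To make the notion of primer degeneracy concrete, the sketch below counts how many distinct non-degenerate sequences a primer written with IUPAC nucleotide codes represents. The function name and example primers are hypothetical; how process_data.R combines the degeneracies of the primers in a pair into a "total" is not shown here and is not assumed.

```python
# Number of bases represented by each IUPAC nucleotide code.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def degeneracy(primer: str) -> int:
    """Number of distinct non-degenerate sequences encoded by a primer."""
    total = 1
    for base in primer.upper():
        total *= IUPAC_DEGENERACY[base]
    return total


# Illustrative primers: one with a single M, one with a W and a K.
d_forward = degeneracy("AGAGTTTGATCMTGGCTCAG")
d_reverse = degeneracy("GWATTACCGCGGCKGCTG")
```

A primer with no degenerate codes has degeneracy 1, and every two-base code doubles the count.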
Usage: mopo16S [OPTIONS] reference_set_file initial_primer_pairs_file

reference_set_file is a .fasta file containing the reference set of sequences for which the primers are designed.
initial_primer_pairs_file is a .fasta file containing a set of (possibly degenerate) primer pairs from which to start the optimisation, saved alternating forward and corresponding reverse primers.

Common options:
  -s, --seed=LONG               Seed of the random number generator (default 0)
  -r, --restarts=INT            Number of restarts of the multi-objective optimisation algorithm (default 20)
  -R, --runs=INT                Number of runs of the multi-objective optimisation algorithm (default 20)
  -o, --outFileName=FNAME       Root name of the output files (default "out")
  -I, --outInitFileName=FNAME   Root name of the files where the initial good primer pairs should be saved (default "init")
  -G, --threads=INT             Number of threads for parallel execution (default 1)
  -V, --verbose=INT             Verbosity level (default 0). If 0, no extra output would be created. If not 0, for each run would be created 3 files: 1) primers scores file 2) primers sequences file 3) optimization steps performed at each restart
  -h, --help                    Print this help and exit

Coverage-related options:
  -M, --maxMismatches=INT       Maximum number of mismatches between the non-3'-end of the primer and a 16S sequence to consider the latter covered by the primer, in case also the 3'-end perfectly matches (default 2)
  -S, --maxALenSpanC=INT        Maximum amplicon length span considered when computing coverage (half above, half below median) (default 200)

Efficiency-related options:
  -l, --minPrimerLen=INT        Minimum primer length (default 17)
  -L, --maxPrimerLen=INT        Maximum primer length (default 21)
  -m, --minTm=INT               Minimum primer melting temperature (default 52)
  -c, --minGCCont=DOUBLE        Minimum primer GC content (default 0.5)
  -C, --maxGCCont=DOUBLE        Maximum primer GC content (default 0.7)
  -D, --maxDimers=INT           Maximum number of self-dimers, i.e. of dimers between all possible gap-less alignments of the primer with its reverse complement (default 8)
  -p, --maxHomopLen=INT         Maximum homopolymer length (default 4)
  -d, --maxDeltaTm=INT          Maximum span of melting temperatures for the primer sets (default 3)
  -e, --maxALenSpanE=INT        Maximum span (maxALenSpanE) between median and
  -q, --maxALenSpanEQ=DOUBLE    given quantile (maxALenSpanEQ) of amplicon length (default 50 and 0.01, respectively)

Fuzzy tolerance intervals for efficiency-related options:
  -t, --minTmInterv=INT         Fuzzy tolerance interval for minimum melting temperature (default 2)
  -g, --minGCContInt=DOUBLE     Fuzzy tolerance interval for minimum GC content (default 0.1)
  -i, --maxDimersInt=INT        Fuzzy tolerance interval for maximum number of self dimers (default 3)
  -T, --deltaTmInt=INT          Fuzzy tolerance interval for span of melting temperatures of the primer set (default 2)
  -P, --maxHLenInt=INT          Fuzzy tolerance interval for maximum homopolymer length (default 2)
  -E, --maxALenSpanEI=INT       Fuzzy tolerance interval for maximum span between median and given quantile amplicon length (default 50)

Mandatory arguments to long options are also mandatory for any corresponding short options.
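Some of the efficiency-related quantities listed above are easy to compute by hand. The Python sketch below evaluates GC content and the longest homopolymer run of a non-degenerate primer, and shows one possible reading of a fuzzy tolerance interval as a linear ramp below the threshold. The function names are hypothetical, and the ramp is an assumption for illustration: the exact scoring used inside mopo16S is not described in this document.

```python
def gc_content(primer: str) -> float:
    """Fraction of G and C bases in a non-degenerate primer."""
    primer = primer.upper()
    return sum(1 for base in primer if base in "GC") / len(primer)


def max_homopolymer_len(primer: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    longest, run = 0, 0
    previous = None
    for base in primer.upper():
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


def fuzzy_min_score(value: float, threshold: float, interval: float) -> float:
    """ASSUMPTION: 1 at or above the threshold, 0 at or below
    (threshold - interval), and linear in between."""
    if value >= threshold:
        return 1.0
    if value <= threshold - interval:
        return 0.0
    return (value - (threshold - interval)) / interval


# Defaults taken from the help text: minGCCont 0.5, minGCContInt 0.1, maxHomopLen 4.
gc = gc_content("GCGCATAT")
homopolymer = max_homopolymer_len("AAAGGGGT")
score = fuzzy_min_score(0.45, threshold=0.5, interval=0.1)
```

Under this reading, a primer with GC content 0.45 sits halfway through the default tolerance interval and would be partially penalised rather than rejected outright.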
The output consists of two pairs of files:
init.primers and init.scores
out.primers and out.scores
The first pair contains the initial set of good primer pairs, the second the final Pareto front of optimal primer pairs.
Files *.primers contain one line for each primer pair: each line contains a tab-delimited list of all forward primers, followed by "x" and then by the list of all reverse primers.
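A line of a *.primers file could be split into its forward and reverse sets as in the sketch below. It assumes that the "x" separator appears as its own tab-delimited field; the function name and the example line are hypothetical.

```python
from typing import List, Tuple


def parse_primers_line(line: str) -> Tuple[List[str], List[str]]:
    """Split one *.primers line into (forward primers, reverse primers).

    ASSUMPTION: the "x" separator is a field of its own between the two lists.
    """
    fields = [field for field in line.rstrip("\n").split("\t") if field]
    separator = fields.index("x")  # raises ValueError if "x" is missing
    return fields[:separator], fields[separator + 1:]


# Hypothetical line with two forward primers and one reverse primer.
example_line = "AGAGTTTGATCMTGGCTCAG\tAGAGTTTGATCCTGGCTCAG\tx\tGWATTACCGCGGCKGCTG\n"
forward, reverse = parse_primers_line(example_line)
```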
Files *.scores contain one line for each line in the corresponding *.primers. Each line contains the values of the Efficiency, Coverage and Matching-bias score of the primer pair.
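The sketch below reads score lines and keeps only the non-dominated ones, which is what a Pareto front means in this context. It assumes whitespace-separated values in the order Efficiency, Coverage, Matching-bias, and it assumes that efficiency and coverage are to be maximised while matching-bias is to be minimised, following the three criteria stated in the Results. The function names and the example values are hypothetical.

```python
from typing import List, Tuple

Score = Tuple[float, float, float]  # (efficiency, coverage, matching_bias)


def parse_scores(text: str) -> List[Score]:
    """Parse *.scores content: one whitespace-separated triple per line (assumed)."""
    scores = []
    for line in text.splitlines():
        if line.strip():
            efficiency, coverage, bias = (float(v) for v in line.split())
            scores.append((efficiency, coverage, bias))
    return scores


def dominates(a: Score, b: Score) -> bool:
    """True if a is no worse than b on every criterion and better on at least one.

    ASSUMPTION: higher efficiency and coverage are better, lower matching-bias is better.
    """
    no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and better


def pareto_front(scores: List[Score]) -> List[int]:
    """Indices of the scores that no other score dominates."""
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(other, s) for j, other in enumerate(scores) if j != i)
    ]


# Hypothetical file content with three primer pairs.
example = "0.9 0.8 0.2\n0.8 0.7 0.3\n0.7 0.9 0.1\n"
scores = parse_scores(example)
front = pareto_front(scores)
```

The indices returned by pareto_front refer to line positions, so they can be used to pick the matching lines from the corresponding *.primers file.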
Sambo F, Finotello F, Lavezzo E, Baruzzo G, Masi G, Peta E, Falda M, Toppo S, Barzon L, Di Camillo B. Optimizing PCR primers targeting the bacterial 16S ribosomal RNA gene. BMC Bioinformatics 2018, 19.1: 343.