Online Material for the article:

Optimizing PCR primers targeting the bacterial 16S ribosomal RNA gene

Authors: Francesco Sambo, Francesca Finotello, Enrico Lavezzo, Giacomo Baruzzo, Giulia Masi, Elektra Peta, Marco Falda, Stefano Toppo, Luisa Barzon and Barbara Di Camillo

Publication: Sambo et al., BMC Bioinformatics, 2018, 19.1: 343. https://doi.org/10.1186/s12859-018-2360-6

Last updated: Apr 2018

Table of Contents

1. Abstract
2. Source Code
3. Installation instructions
4. Prepare the input files
5. Software Usage
6. Citation

1. Abstract

Motivation
One of the key tools for studying microbial diversity of a site is the targeted amplicon sequencing of an informative marker, such as the ribosomal small subunit (16S ribosomal DNA). The accuracy of targeted amplicon sequencing strongly depends on the choice of primer pairs. There is thus the need for automated methods to design and update bacterial 16S primers, able to take into account new sequencing technologies with different read length and growing available information about different bacterial species.

Results
We developed an algorithm for optimal multi-objective primer design, which: 1) maximizes the balance between efficiency and specificity in the targeted amplicon amplification; 2) maximizes the coverage, in terms of the fraction of all bacterial 16S sequences matched by at least one forward and one reverse primer; 3) minimizes the variability, in terms of differences in the number of combinations of primers from the forward and reverse set matching each bacterial 16S. As an example of application, we present the design of optimal primers for amplicons in the range 700 - 800 bp, suitable for sequencing technologies like Roche’s 454. Results show that our strategy is able to find better primer pairs than the ones available in the literature according to all three optimization criteria. We validated experimentally three of these primer pairs on multiple bacterial species, belonging to different genera and phyla, confirming the predicted efficiency and wide coverage.

Availability
Source code of the developed algorithm is released under the GNU General Public Licence and is available at http://www.dei.unipd.it/~baruzzog/mopo16S.html.

2. Source Code

mopo16S (Multi-Objective Primer Optimisation for 16S experiments) is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License.

If you use the software for your research, please refer to the original mopo16S paper with the citation below.

C++ source code of the mopo16S algorithm can be downloaded here.

3. Installation instructions

Download mopo16S here.

NOTE: mopo16S can exploit multithreading through the OpenMP library to considerably reduce execution time. If your compiler does not support OpenMP (check here), please download the no-multithreading version of mopo16S here.

Decompress the archive.

mopo16S requires the SeqAn C++ library (version ≥ 2.0.0). Set the include directory of your SeqAn installation in the Makefile.rules file and then compile with

make

NOTE: if you are using SeqAn version ≥ 2.2.0, update the C++ standard in the release/Makefile file: change the flag from -std=c++11 to -std=c++14.

OPTIONAL: to build the html documentation (requires the Doxygen software):

make doc

4. Prepare the input files

To run, mopo16S requires:

a reference set of 16S sequences for which to design the optimal set of primers and
a set of (possibly degenerate) candidate primer pairs from which to begin the search for the optimal primer set.

Both files should be in .fasta format, with the latter saved alternating forward and corresponding reverse primers.
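As an illustration of the alternating layout, the following minimal Python sketch reads a primer-pair file and groups its records two by two. It is not part of mopo16S; the function names, record headers and example sequences are hypothetical and only serve to show the expected ordering (forward, reverse, forward, reverse, ...).

```python
from typing import Iterator, List, Tuple


def read_fasta(lines: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) records from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def read_primer_pairs(lines: List[str]) -> List[Tuple[str, str]]:
    """Group consecutive records as (forward, reverse) pairs."""
    seqs = [seq for _, seq in read_fasta(lines)]
    if len(seqs) % 2 != 0:
        raise ValueError("odd number of records: a forward primer lacks its reverse")
    return list(zip(seqs[0::2], seqs[1::2]))


# Illustrative content only (hypothetical headers and example primers).
example = """>pair1_fwd
AGAGTTTGATCMTGGCTCAG
>pair1_rev
GWATTACCGCGGCKGCTG
>pair2_fwd
CCTACGGGNGGCWGCAG
>pair2_rev
GACTACHVGGGTATCTAATCC
""".splitlines()

pairs = read_primer_pairs(example)
for forward, reverse in pairs:
    print(forward, reverse)
```

To check a real file, pass `open("good_pairs.fasta").read().splitlines()` instead of the inline example.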

To generate the two files, rep_set.fasta and good_pairs.fasta, starting from the same repositories used in the mopo16S paper, first set the required Greengenes cluster similarity level for the reference set in the shell script download_data.sh, then run:

cd data
./download_data.sh

Please note that the higher the similarity level, the larger will be the representative set (and thus the file to be downloaded).

Then, edit the R script process_data.R to set the required Greengenes cluster similarity level for the reference set, the minimum and maximum desired amplicon length, the minimum and maximum primer length, and the maximum total degeneracy for the candidate primer pairs.
Finally, run the R script with

Rscript process_data.R

to generate the files rep_set.fasta and good_pairs.fasta (requires R and the additional R packages Biostrings and xlsx).
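For readers unfamiliar with primer degeneracy, the sketch below counts how many concrete sequences a degenerate primer expands to, using the standard IUPAC nucleotide codes. It is an illustration in Python, not the code used by process_data.R; in particular, how the R script combines the degeneracies of the primers in a pair into a "total degeneracy" is not shown here, so the sketch works on a single primer only.

```python
from math import prod

# Number of bases represented by each IUPAC nucleotide code.
IUPAC_SIZE = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def degeneracy(primer: str) -> int:
    """Return the number of non-degenerate sequences encoded by a primer."""
    return prod(IUPAC_SIZE[base] for base in primer.upper())


for primer in ("AGAGTTTGATCMTGGCTCAG", "GACTACHVGGGTATCTAATCC", "ACGTACGT"):
    print(primer, degeneracy(primer))
```

A primer with one two-fold code (such as M) has degeneracy 2, while a primer with two three-fold codes (such as H and V) has degeneracy 9.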

5. Software Usage

     Usage: mopo16S [OPTIONS] reference_set_file initial_primer_pairs_file

     reference_set_file is a .fasta file containing the reference set of
     sequences for which the primers are designed.

     initial_primer_pairs_file is a .fasta file containing a set of (possibly
     degenerate) primer pairs from which to start the optimisation, saved
     alternating forward and corresponding reverse primers.

     Common options:

       -s, --seed=LONG             Seed of the random number generator (default 0)

       -r, --restarts=INT          Number of restarts of the multi-objective
                                   optimisation algorithm (default 20)

       -R, --runs=INT              Number of runs of the multi-objective
                                   optimisation algorithm (default 20)

       -o, --outFileName=FNAME     Root name of the output files (default "out")

       -I, --outInitFileName=FNAME Root name of the files where the initial good
                                   primer pairs should be saved (default "init")

       -G, --threads=INT           Number of threads for parallel execution (default 1)

       -V, --verbose=INT           Verbosity level (default 0). If 0, no extra
                                   output is created. If not 0, 3 files are
                                   created for each run:
                                   1) primers scores file
                                   2) primers sequences file
                                   3) optimization steps performed at each restart

       -h, --help                  Print this help and exit

     Coverage-related options:

       -M, --maxMismatches=INT     Maximum number of mismatches between the
                                   non-3'-end of the primer and a 16S sequence to
                                   consider the latter covered by the primer, in
                                   case also the 3'-end perfectly matches
                                   (default 2)

       -S, --maxALenSpanC=INT      Maximum amplicon length span considered when
                                   computing coverage (half above, half below 
                                   median) (default 200)

     Efficiency-related options:

       -l, --minPrimerLen=INT      Minimum primer length (default 17)

       -L, --maxPrimerLen=INT      Maximum primer length (default 21)

       -m, --minTm=INT             Minimum primer melting temperature (default 52)

       -c, --minGCCont=DOUBLE      Minimum primer GC content (default 0.5)

       -C, --maxGCCont=DOUBLE      Maximum primer GC content (default 0.7)

       -D, --maxDimers=INT         Maximum number of self-dimers, ie of dimers
                                   between all possible gap-less alignments of the
                                   primer with its reverse complement (default 8)

       -p, --maxHomopLen=INT       Maximum homopolymer length (default 4)

       -d, --maxDeltaTm=INT        Maximum span of melting temperatures for the
                                   primer sets (default 3)

       -e, --maxALenSpanE=INT      Maximum span (maxALenSpanE) between median and
       -q, --maxALenSpanEQ=DOUBLE  given quantile (maxALenSpanEQ) of amplicon
                                   length (default 50 and 0.01, respectively)

     Fuzzy tolerance intervals for efficiency-related options:

       -t, --minTmInterv=INT       Fuzzy tolerance interval for minimum melting
                                   temperature (default 2)

       -g, --minGCContInt=DOUBLE   Fuzzy tolerance interval for minimum GC
                                   content (default 0.1)

       -i, --maxDimersInt=INT      Fuzzy tolerance interval for maximum number of
                                   self dimers (default 3)

       -T, --deltaTmInt=INT        Fuzzy tolerance interval for span of melting
                                   temperatures of the primer set (default 2)

       -P, --maxHLenInt=INT        Fuzzy tolerance interval for maximum
                                   homopolymer length (default 2)

       -E, --maxALenSpanEI=INT     Fuzzy tolerance interval for maximum span
                                   between median and given quantile amplicon
                                   length (default 50)

     Mandatory arguments to long options are also mandatory for any corresponding
     short options.
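As a usage illustration, the following Python sketch assembles a mopo16S command line from the long options listed above and prints it. The option names are taken from the help text; the location of the executable (`./mopo16S`) and the chosen option values are assumptions, so adjust them to your build and data.

```python
import shlex
from typing import List


def build_command(reference_set: str, primer_pairs: str,
                  binary: str = "./mopo16S", **options) -> List[str]:
    """Build the argument list: binary, --name=value options, then the two input files."""
    cmd = [binary]
    for name, value in options.items():
        cmd.append(f"--{name}={value}")
    cmd += [reference_set, primer_pairs]
    return cmd


cmd = build_command(
    "rep_set.fasta", "good_pairs.fasta",
    threads=4,          # -G: parallel execution on 4 threads
    seed=1,             # -s: seed of the random number generator
    outFileName="out",  # -o: root name of the output files
    maxMismatches=2,    # -M: mismatches tolerated outside the 3'-end
)
print(shlex.join(cmd))

# To actually launch the optimisation (assumes the binary has been compiled):
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell quoting issues when file names contain spaces.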

The output consists of two pairs of files:

init.primers and init.scores
out.primers and out.scores

The first pair contains the initial set of good primer pairs; the second pair contains the final Pareto front of optimal primer pairs.
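For readers unfamiliar with the term, a Pareto front is the subset of solutions that no other solution beats on every objective at once. The Python sketch below shows a generic dominance filter over score triples. It is not the mopo16S optimisation algorithm; the optimisation direction of each score (first two maximised, third minimised) is an assumption based on the abstract, and the numbers are made up for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point, maximize: Sequence[bool]) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    strictly_better = False
    for x, y, is_max in zip(a, b, maximize):
        if not is_max:
            x, y = -x, -y  # turn a minimised objective into a maximised one
        if x < y:
            return False
        if x > y:
            strictly_better = True
    return strictly_better


def pareto_front(points: List[Point], maximize: Sequence[bool]) -> List[Point]:
    """Keep the points that are not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p, maximize) for q in points)]


# Made-up (efficiency, coverage, matching-bias) triples.
scores = [(0.9, 0.8, 0.1), (0.8, 0.9, 0.2), (0.7, 0.7, 0.3), (0.9, 0.8, 0.05)]
front = pareto_front(scores, maximize=(True, True, False))
print(front)
```

In this example the first and third triples are dropped because another triple is at least as good on all three scores and strictly better on one.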

Files *.primers contain one line for each primer pair: each line contains a tab delimited list of all forward primers, followed by "x" and then by the list of all reverse primers.

Files *.scores contain one line for each line in the corresponding *.primers file. Each line contains the values of the Efficiency, Coverage and Matching-bias scores of the primer pair.
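The two file formats can be read with a few lines of Python, sketched below. The sketch assumes that the "x" separator appears as its own whitespace-delimited field and that the three scores on each line are whitespace-separated and listed in the order Efficiency, Coverage, Matching-bias; the type and function names, sample primers and sample values are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Scores(NamedTuple):
    efficiency: float
    coverage: float
    matching_bias: float


def parse_primers_line(line: str) -> Tuple[List[str], List[str]]:
    """Split one *.primers line into (forward primers, reverse primers)."""
    fields = line.split()
    sep = fields.index("x")  # raises ValueError if the separator is missing
    return fields[:sep], fields[sep + 1:]


def parse_scores_line(line: str) -> Scores:
    """Read the three scores of one *.scores line (assumed order, see above)."""
    efficiency, coverage, matching_bias = (float(v) for v in line.split())
    return Scores(efficiency, coverage, matching_bias)


# Made-up sample lines, for illustration only.
primers_line = "AGAGTTTGATCMTGGCTCAG\tAGAGTTTGATCCTGGCTCAG\tx\tGWATTACCGCGGCKGCTG\n"
scores_line = "0.85\t0.92\t0.10\n"

forward, reverse = parse_primers_line(primers_line)
scores = parse_scores_line(scores_line)
print(forward, reverse, scores)
```

To process real output, read out.primers and out.scores line by line in parallel (for example with `zip`), since the n-th line of one file corresponds to the n-th line of the other.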

6. Citation

Sambo F, Finotello F, Lavezzo E, Baruzzo G, Masi G, Peta E, Falda M, Toppo S, Barzon L, Di Camillo B. Optimizing PCR primers targeting the bacterial 16S ribosomal RNA gene. BMC Bioinformatics 2018, 19.1: 343.