Friday, November 15, 2013

New SPAdes Assembler Used on MiSeq Reads for 6 Burkholderia Genomes

Introduction

This post introduces a practical assembly strategy using the new SPAdes assembler with Illumina MiSeq reads.  Using the sequence analysis tools Sickle, FLASH, and SPAdes, along with in-house Perl scripts, we assemble 6 Burkholderia genomes to draft status from both paired-end (PE) and mate-pair (MP) reads.  This exercise demonstrates the practicality of the Illumina MiSeq for small-scale assembly projects.

The SPAdes Assembler

Many popular de novo assemblers, including SPAdes, rely on a data structure called a de Bruijn graph.  SPAdes uses a multisized de Bruijn graph to balance the trade-offs between small and large k-mer sizes.  A smaller k causes more repeat regions to collapse into the same node, tangling the graph, while larger k values can fragment the graph in low-coverage regions.  Multisized de Bruijn graphs allow the size of k to vary based on regional coverage depth.  SPAdes also has a distinctive method for handling PE reads.  Typically, PE information is incorporated after the initial assembly by mapping PE reads back to assembled contigs; when corresponding pairs map to the ends of different contigs, those contigs can be merged into a single scaffold.  SPAdes improves on this by incorporating PE distance information directly into the de Bruijn graph.  Because of these features, SPAdes can be run with any number of single-end (SE), PE, or MP read files.

Assembling Burkholderia Genomes with SPAdes

Burkholderia is a genus of bacteria whose strains have a variety of environmental effects, ranging from plant growth promotion to human and plant pathogenesis.  Because our lab is particularly interested in phenotypes of association between bacteria and plants, we sequenced 6 strains of Burkholderia isolated from the endophyte compartment (i.e. inner root) of A. thaliana using two runs on our Illumina MiSeq machine.  The first run used a standard PE library, and the second used an MP library.


The pipeline used for assembly is outlined and described below.  



FLASH
FLASH is a software package that merges overlapping PE reads into a single read.  In PE sequencing there is a distribution of DNA fragment lengths from which ends are sequenced (Supp Fig 1).  As read lengths continue to grow, more PE reads can potentially overlap.  The main advantage of merging overlapping reads is the error correction step.  Furthermore, longer reads tend to yield better assemblies because of their ability to span the repetitive regions that fragment assemblies.  While longer merged reads may span repetitive regions, it is still possible that spanned repeats are incorrectly assembled.  For a draft assembly this is a minor point.

After experimenting with various combinations of parameters I decided on the following FLASH parameters:

-m 25 -r 250 -f 500 -s 100
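
The full command might then look like the following (a sketch; the FASTQ file names are placeholders for your own forward and reverse PE reads):

flash -m 25 -r 250 -f 500 -s 100 fwd_PE_reads.fastq rev_PE_reads.fastq

With the default output prefix, FLASH writes the merged reads to out.extendedFrags.fastq and the unmerged pairs to out.notCombined_1.fastq and out.notCombined_2.fastq, which continue through the rest of the pipeline.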

Adapter Trimming
A preprocessing step is required to trim MP reads at the Illumina junction adapter.  For details on how to do this and why it is necessary, see this post.
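
As a minimal sketch of this step, assuming the standard Nextera-style junction adapter sequence (CTGTCTCTTATACACATCT -- confirm the sequence for your library prep kit) and hypothetical file names, cutadapt can cut reads at the junction wherever it occurs:

cutadapt -b CTGTCTCTTATACACATCT -B CTGTCTCTTATACACATCT -o fwd_MP_trimmed.fastq -p rev_MP_trimmed.fastq fwd_MP_reads.fastq rev_MP_reads.fastq

The -b/-B options allow the adapter to appear anywhere in read 1/read 2; the adapter and the sequence beyond it are removed.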

Sickle
Graph assemblers can be sensitive to erroneous k-mers produced by sequencing errors, which convolute the graph with false nodes and connections.  Rigorous quality filtering prior to assembly reduces the number of false k-mers, thereby improving assembly.  Sickle is a quality filtering package that relies on a sliding window algorithm over the read quality scores.  If the average quality of a window drops below a given minimum, the remaining bases are trimmed off.  Any trimmed or untrimmed reads shorter than a given length are removed.  Sickle also produces a file containing high quality reads whose corresponding pairs failed the quality check.  These reads can be used in downstream analysis as single-end reads.

The parameter settings that I used when running Sickle are:

-t sanger -l 127 -q 20

The '-t' parameter specifies the quality value type.  The '-l' parameter specifies the minimum read length, and '-q' is the minimum average window quality score.  At this stage it is important to consider the '-l' parameter if running SPAdes is your next step.  The SPAdes parameter '-k' takes a list of k-mer sizes from which to build a multisized graph.  For long Illumina reads the SPAdes manual advises '-k' be set to '21,33,55,77,99,127'.  Because the longest k-mer is 127bp, reads in your dataset should be at least that long.  To allow for shorter reads, simply lower the '-l' parameter when running Sickle and ensure the largest '-k' value is less than or equal to that value.  Because many of the MP reads are already short, I ran Sickle with the -l parameter set to 75 and the max SPAdes -k parameter set to 75.
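
For the PE library, a full Sickle command might look like this (a sketch; the file names are placeholders chosen to match the SPAdes command below):

sickle pe -t sanger -q 20 -l 127 \
    -f fwd_PE_reads.fastq -r rev_PE_reads.fastq \
    -o fwd_PE_sickled_reads.fastq -p rev_PE_sickled_reads.fastq \
    -s PE_sickled_single_reads.fastq

The -o and -p files receive the trimmed pairs, and the -s file collects the surviving singletons whose mates failed the quality check.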

I should note that SPAdes has a built-in error correction algorithm, BayesHammer.  BayesHammer uses k-mer frequencies to identify potentially false k-mers.  However, I recommend running Sickle first for 2 reasons:
  1. Sickle is a much faster correction algorithm than BayesHammer.  Hence, limiting the number of false k-mers that BayesHammer must identify speeds up SPAdes.
  2. Sickle uses quality values.  While quality values are not perfect, they do correlate with the correctness of bases.  If reads have many low-quality k-mers, it is possible by chance that some erroneous k-mers will be identical, causing BayesHammer to classify them as true k-mers.  Conversely, a single error inside an otherwise high quality k-mer will likely not be trimmed by Sickle but will be identified by BayesHammer.  This intuition suggests that a combination of Sickle and BayesHammer is the best regime for quality filtering.

SPAdes
I ran the SPAdes assembler twice for each genome--first with both PE and MP reads and then with only the PE reads.  Comparing the two runs shows the improvement gained by including MP reads.  The following command was used to run SPAdes with both PE and MP reads:

spades.py -k 21,33,55,75 -t 5 --careful \
    -s merged_PE_reads.fastq \
    -1 fwd_PE_sickled_reads.fastq -2 rev_PE_sickled_reads.fastq \
    -s PE_sickled_single_reads.fastq \
    --mp1-1 fwd_MP_sickled_reads.fastq --mp1-2 rev_MP_sickled_reads.fastq \
    --mp1-s singles_MP_sickled_reads.fastq --mp1-rf \
    -o spades_out

Using only PE reads (i.e. the merged reads, forward high quality reads, reverse high quality reads, and single high quality reads), SPAdes was run with the following command:

spades.py -k 21,33,55,77,99,127 -t 5 --careful \
    -s out.extendedFrags.fastq \
    -1 out.notCombined_1_qc.fastq -2 out.notCombined_2_qc.fastq \
    -s out.notCombined_singles_qc.fastq \
    -o spades_out

QUAST
When the assembly is complete, QUAST can be used to compare and view assembly results.  For examples see the Results section of this post.
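
A typical invocation compares the contigs from the two SPAdes runs (a sketch; the directory names are placeholders, and in practice each SPAdes run would need its own -o directory):

quast.py spades_PE_MP/contigs.fasta spades_PE_only/contigs.fasta -o quast_out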


Results

Assembly
All 6 genomes assembled very well when both PE and MP reads were used.


Does merging PE reads help?
To illustrate how merging reads prior to assembly can improve the results, I assembled one Burkholderia genome using three different input sets:
  1. Merged and unmerged reads
  2. Raw PE reads with no merging
  3. Only reads that merge

Under the assumption that improved assembly statistics (e.g. number of contigs, N50, etc.) indicate a better assembly, a combination of merged and unmerged reads (option 1) yields the best assembly.

Does adding MP reads help?
When assembling with only PE reads the resulting assemblies are worse than when MP reads are included.


To test whether the assembly improvements were primarily due to increased coverage or to the pairing information in MP reads, I reassembled the PE+MP reads for B1/BMP1 treating the MP reads as SE reads.  Despite the substantial increase in coverage contributed by the MP reads (52x to 103x), only minor assembly improvements are observed when they are treated as SE reads.  This suggests that the distance information provided by MP reads can have a greater effect on improving assemblies than the additional coverage.


Conclusions

Using a combination of FLASH, Sickle, SPAdes, and in-house Perl scripts, we assemble 6 Burkholderia genomes to draft status.  This exercise demonstrates the practicality of the Illumina MiSeq for small-scale sequencing projects consisting of a handful of bacterial genomes.  A single PE MiSeq run may be sufficient to assemble a small number of bacterial genomes; however, the additional coverage and pairing information of MP reads can improve PE assemblies.  Merging overlapping PE reads and quality trimming before assembly can increase N50 and other assembly statistics.


Supplemental Figures
Figure 1 -- PE insert size distribution for one sample (B1)

Saturday, November 2, 2013

Difference Between Paired-End and Mate-Pair Reads

In DNA sequencing lingo the terms "paired-end" (PE) and "mate-pair" (MP) are frequently used interchangeably.  While the principles underlying PE and MP reads are strongly similar, there are inherent differences that are crucial to understand.

The similarities between PE and MP reads include:
  • Reads come in pairs
  • Pairs come from the ends of the same DNA fragment

The differences between PE and MP reads include:
  • Library preparation protocols -- In short, PE protocols attach an adapter, SP1, to the forward end and another adapter, SP2, to the reverse end.  The first sequencing step targets SP1 to generate the forward read.  The second sequencing step targets SP2 to generate the reverse read.  For MP protocols, longer DNA fragments are circularized using biotinylated adapters.  During the circularization process the two fragment ends are joined with the biotinylated adapter between them.  The circularized DNA is sheared, and the fragments containing the biotinylated adapter (i.e. the original fragment ends) are pulled down.  These fragments can then be sequenced using the same SP1-SP2 adapter protocols used in PE sequencing.
  • Insert size -- The insert size refers to the distance between the pairs.  PE reads generally have a smaller insert size (< 1 kb) than MP reads (2-5 kb).  The difference in insert size stems from the difference in protocols.  Depending on the length of your reads, it is possible for PE reads to have overlapping ends.
  • Read orientation -- PE reads come in forward-reverse (FR) orientation, where read 1 is the forward read and read 2 is the reverse read.  Because of the circularization step, MP reads come in reverse-forward (RF) orientation, where read 1 is the reverse read and read 2 is the forward read.  These differences are especially important to understand for assembly algorithms and projects.
  • Read trimming -- Theoretically, PE reads require no trimming before sequence analysis.  However, in practice it is recommended that low quality portions of the read be trimmed using tools like Sickle.  In contrast, MP reads require trimming because biotinylated adapters are often present in the middle of one or both MP reads.  Adapter trimming software generally removes the adapter and any sequence beyond it.  Software options for adapter trimming include cutadapt, Trimmomatic, and FastqMcf.  For more reading on adapter trimming see this post and this post.

For more details on the sequencing protocols see the Illumina documentation for PE and MP sequencing.


Tuesday, October 22, 2013

Read simulators review with an emphasis on metagenomics

Why Read Simulation?

Simulations are an important aspect of bioinformatics that can be used for testing and benchmarking algorithms, optimizing parameters, and optimizing study design.  The following are examples of how read simulations have been successfully utilized in each of these areas.

Testing Algorithms

After writing a new algorithm, how do we ensure that it works?  The most effective way is to run the algorithm on a set of data where the answers are known.  This is a perfect application for simulated data.  Katharina Hoff used DNA sequence simulations to test the effects of sequencing error from several sequencing technologies on specialized metagenomic gene prediction algorithms.  She concludes that these algorithms' gene prediction accuracy is poor in the presence of sequencing errors.  Furthermore, gene prediction algorithms not specialized for metagenomic data perform as well as or better than their specialized counterparts.  This suggests that metagenomic gene predictors could be improved by being more robust to sequencing error.

Benchmarking Algorithms

Benchmarking is used to compare accuracy, precision, technical requirements, and other attributes of algorithms.  Simulated reads provide a common benchmark for comparing assembly algorithms in the Assemblathon competition [Earl, et al.], and have also been used to benchmark read mapping algorithms such as Bowtie, Soap, and Pass [Horner].

Optimizing Parameters

Researchers often ignore the detailed parameters of algorithms and programs.  This can frequently cause problems by violating assumptions made in the software development process.  Furthermore, the effects of parameters are often poorly understood yet can significantly affect results and conclusions.  To better understand the effects of the parameters used by the algorithm FLASH, Magoc et al. used simulated reads to build ROC curves illustrating the trade-offs between correctly and incorrectly merged paired-end reads under different values of the mismatch and minimum overlap parameters (Figures 5 and 6).

Optimizing Study Design

Prior to sequencing, it is common to ask questions like, "how much coverage do we need," "what read length should we sequence at," "what sequencing platform is best for our project," "should we use paired-end (PE) or single-end (SE) reads," etc.  These and other similar questions can be answered at least in part by sequence simulation.  For example, the 1000 Genomes Project Consortium used ART to test the effects of read length and PE insert size on a read's ability to map to the human genome.  They conclude that longer reads substantially increase mappability, especially for SE reads.  Furthermore, increasing insert size also marginally improves mappability.

Furthermore, Mende et al. use sequence simulations to test several aspects of study design for whole metagenome sequencing.  They conclude that quality control measures such as quality filtering and quality trimming have a substantial impact on assembly by improving accuracy and extending contig lengths.  They also evaluate assembly quality on Sanger, pyrosequencing, and Illumina platforms with communities of low (10 genomes), medium (100 genomes), and high (400 genomes) complexity.  For the low complexity community all platforms were more or less equal in terms of assembly and accurately represented the functional aspects of the community.  For the medium complexity community Illumina produced the best assembly and most accurately represented the functional elements.  With the high complexity community none of the platforms were particularly good, however, because of the longer reads, Sanger was still able to represent much of the functional composition.  


Things to look for in a simulator

  • Good/appropriate error model
  • Models read lengths
  • Models coverage bias
  • Includes quality values
  • Single-end and Paired-end reads
  • Multiple sequencing platform capabilities (e.g. Illumina, 454, etc.)
  • Easy to install
  • Easy to use
  • Good documentation

What are the simulators out there for...

Genomic DNA sequences



  • wgsim relies on a simple uniform substitution error model.  The uniform error model does not reflect real error patterns as accurately as the models and algorithms described below.  Another major weakness of wgsim is its inability to generate INDEL errors.  Wgsim seems like the most basic read simulator (see the example command after this list).
  • simNGS models the processes in an Illumina sequencing run such as cluster generation and cluster intensity.  The model must be input as the "runfile" where each line contains parameters for each cycle.  The number of cycles contained in the "runfile" corresponds to the number of bases simNGS can model.
  • MAQ is primarily a read mapping and assembly algorithm (hence the large number of citations).  It also includes a uniform reference sequence mutation algorithm.  To simulate reads it uses a first-order Markov chain to model quality at each cycle.  Using that model it generates quality values and then bases based on those quality values.  Documentation is lacking, so I'm not entirely sure all the values in the table are correct.
  • GemSIM takes as input a SAM file and FASTQ file for empirical error model generation.  This is advantageous because it can easily be extended to new sequencing technologies or upgrades as they are released.  It can also simulate metagenome reads based on given abundance proportions.  
  • ART uses quality score data from real Illumina reads to model substitution rates.  Quality scores are generally very informative for Illumina reads, but they are not perfect.  The authors map real Illumina reads to a reference to model INDEL rates; I would prefer they did that for substitutions as well.  Currently the longest Illumina reads the model can generate are 75bp, which is becoming less useful as read lengths continue to grow.  I am less familiar with 454 and Sanger reads so I won't comment on those models.
  • Mason was built using models empirically derived from fly and yeast reads.  According to the technical report, Mason can model 100bp Illumina reads from the GAII; however, it appears that it may be possible to build a more up-to-date model.  Mason uses previously published models for both 454 and Sanger reads.  Mason was developed in C++ and is easily extendable, making it useful for incorporating into your personal code.
  • FlowSim is specifically for 454 pyrosequences.  FlowSim builds an empirical model derived from quality trimmed E. coli K-12 and D. labrax reads.  Because the 454 platform interprets reads first as flowgrams, FlowSim models these flowgrams, specifically the homopolymer distributions.  During the simulation process, flowgrams are generated and subsequently interpreted into base calls and corresponding quality scores.  The number of cycles roughly determines the number of bases in each read.
  • SimSeq has limited documentation!  One of the useful attributes of SimSeq is its model for mate-pair chimeras.
  • pIRS builds an error model based on SOAP2 mapping results or a given SAM file.  It uses a combination of two matrices to generate bases and quality values.  These matrices are based on empirical training data and a first-order Markov chain (similar to MAQ) of quality scores.  This seems like a very good model for both bases and quality values.  One of the defining features of pIRS is its simulation of coverage bias across a genome based on %GC content.  It also includes an option for mutating the given genome to fabricate reference sequence heterogeneity.
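
To give a flavor of the simplest of these, here is a hypothetical wgsim run generating 100,000 pairs of 100bp PE reads (the file names and rates are illustrative, not a recommendation):

wgsim -N 100000 -1 100 -2 100 -e 0.01 reference.fasta sim_reads_1.fastq sim_reads_2.fastq

Here -N is the number of read pairs, -1/-2 are the read lengths, and -e is the per-base sequencing error rate.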

Metagenomic Sequences



  • NeSSM can simulate coverage bias but relies on the mapping distributions of real metagenome sequences against the reference database.  This is a less useful approach because it cannot model coverage bias for previously unsequenced metagenomes.  A very positive feature of NeSSM is its GPU capability.  GPU processing is much faster, which facilitates simulating large scale sequencing runs from platforms like the Illumina HiSeq.  The download link in the paper was not working, so I was unable to download and test this software.
  • Grinder -- Can also simulate PCR-dependent amplicon sequences for any amplicon based on given primers or a database of reference amplicons.  For amplicon simulations it additionally produces chimeric reads and gene copy number biases.  Grinder allows users to input abundance profiles and to model alpha and beta diversity.  Quality scores in Grinder are assigned a single "high" quality value for all correct bases and a single "low" quality value for bases designated as errors.  Grinder includes a graphical user interface (GUI) and command line interface (CLI).  Grinder can also be run on the web through the Galaxy interface.
  • MetaSim -- Stores user defined genomes in a database.  MetaSim can simulate reads from any of the genomes in the database using various empirically defined models or a user defined custom model.  Abundance measures for each genome allow users to simulate communities of variable abundance.  To simulate heterogeneity in the community, MetaSim can mutate the selected reference genomes based on a phylogenetic tree input describing the degree of mutation.  MetaSim also includes a GUI; more advanced options can be accessed via the CLI.  One of the shortcomings of MetaSim is the lack of simulated quality values--an increasingly essential feature as more and more algorithms rely on these values.
  • Bear is still under development.  
  • GemSIM see above.

As a disclaimer I should say that I have only tried a few of these simulators.  There may be special features that I have missed.

Discussion

Simulations are an important and underutilized method for testing algorithms and planning experiments.  It can be tempting to haphazardly start sequencing or push your data through a collaborator's favorite algorithm using the default parameters.  These temptations can lead to failed experiments, unusual results, or a misunderstanding of data.  A more systematic approach is to build informative simulations to anticipate and address problematic algorithm assumptions and experimental design flaws.  This approach requires a large up-front cost, however it can save time and money in the long run.

As a cautionary note, simulations are not biology.  Sequence simulations are an imperfect image of a very complicated world.  Conclusions drawn from simulation experiments must be taken with a grain of salt.  In some cases it is vital to design biological experiments to validate such conclusions.  The utility of sequence simulations is to perform a cheap, first-pass test of algorithms and experimental designs.  

While sequence simulators are not expected to perfectly represent real sequences, the closer they can get to that point the better.  Most simulators carefully model base quality, but only a couple model the biases imposed by sequencing platforms and sample preparation protocols, such as GC content bias.  This is one area where sequence simulators have room to improve.  However, building such models is challenging due to the large number of variables that contribute to such biases.  For example, there are hundreds of available sample prep protocols for hundreds of biological techniques.  Building a universal model to account for biases from each of these protocols is an enormous task, especially in a world where protocols can have a life expectancy of only a few months.  Furthermore, unique biases can be introduced by the DNA sample itself.  Sequencing human DNA will produce different biases, with different levels of effect, than sequencing a single bacterial strain.

In conclusion, sequence simulators can be extremely useful for testing and benchmarking algorithms, optimizing parameters, and designing experiments.  There are a number of simulators available, each with its own strengths and weaknesses.  Careful consideration should go into picking a simulator and interpreting simulation experiments.

Other interesting reads

  • Lessons learned from simulated metagenomics datasets
  • Zagordi, et al. Deep Sequencing of a Genetically Heterogeneous Sample: Local Haplotype Reconstruction and Read Error Correction.  2012.
  • Pignatelli, et al. Evaluating the fidelity of de novo short read metagenomic assembly using simulated data. 2011.
  • Morgan, et al. Metagenomic Sequencing of an In Vitro-Simulated Microbial Community. 2010.
  • Mavromatis, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. 2007.

UPDATE - Oct 8, 2014
  • The new BEAR simulator seems like a good option as well.

Wednesday, October 9, 2013

Merging Overlapping Paired-End Reads

Paired-End Sequencing Overview

Amplicon sequencing is a technique in which PCR primers are used to amplify a small portion of a genome.  For example, if a gene of interest is preceded by a unique DNA sequence, S1, and followed by another unique DNA sequence, S2, PCR amplification using S1 and S2 as primers will exponentially copy the DNA segment flanked by S1 and S2 (see this video).  The resulting DNA can be sequenced using a paired-end (PE) sequencing protocol where DNA molecules in a sample are sequenced from both ends.  In some cases amplicons (i.e. the portion of DNA flanked by the two primers) are short enough such that the forward read overlaps the reverse read.  It can be advantageous to merge these reads into a single sequence for the following reason:

  • Sequenced reads generally have a higher density of errors near the end of the read.  The extra information gained by overlapping the tail end of the forward read with the tail end of the reverse read allows for correction of most of these errors.

Merging PE Reads with FLASH

While there are several bioinformatics tools and algorithms appropriate for merging overlapping PE reads, I prefer to use the read-merging tool FLASH.  FLASH is freely available under the GPLv3 license and can be downloaded via SourceForge.

Reasons to use FLASH include:
  • Easy to download and install
  • Fast!
  • Uses quality scores to correct possible sequencing errors in the overlapping region.  This is by far the most important reason to use FLASH.  Correcting potential sequencing errors can lead to much less noise and more definitive results.  
Reasons you might want to find another tool:
  • FLASH is not able to look for gapped alignments between read pairs.  Ungapped alignments are appropriate when using Illumina reads, however they can be problematic for other sequencing platforms such as 454.

FLASH is easy to use if you are familiar with the Unix command line interface.  First, navigate to where you installed FLASH and type the command:

./flash -h

This will ensure that FLASH has been installed correctly and will print the FLASH help page to your terminal screen.  To run FLASH you will need two FASTQ formatted files containing the forward and reverse reads to merge.  These files should have the same number of reads and should be ordered such that the first read in the forward file is paired with the first read in the reverse file.

Similar to other bioinformatics applications, FLASH has several parameters that can be adjusted to maximize the algorithm's effectiveness.  The FLASH help page (./flash -h) gives a full explanation of each parameter.  To run FLASH using all default parameters type:

./flash <forward reads file path> <reverse reads file path>

The following is a list of useful parameters that I often use:

-m, --min-overlap=NUM:  The minimum required overlap between the fwd and rev read.  Beware: if this number is too small, spurious overlaps will be formed.  I recommend an overlap of at least 10bp.  See the FLASH paper for more details.

-M, --max-overlap:  The expected overlap length of the fwd and rev reads.  This parameter can also be calculated using the -r, -f, and -s parameters.

-x, --max-mismatch-density=NUM:  The maximum allowed ratio of mismatched bases to overlap length.  This parameter controls the stringency with which pairs are considered mergeable.  A high ratio will merge more pairs, but some of these may be false merges.  Alternatively, a low ratio will merge fewer pairs but makes false merges less likely.

-r, --read-len=LEN:  The average length of the raw reads.  This number is used in calculations for the -M parameter (--max-overlap).

-f, --fragment-len=LEN:  The expected amplicon length.

-s, --fragment-len-stddev=LEN:  The expected standard deviation of amplicon lengths.

-d, --output-directory=DIR:  The output directory.

-o, --output-prefix=PREFIX:  The output prefix.  For example if "-o my_output" is used all output files will have the prefix "my_output."  This parameter can be useful for organizing different FLASH runs.

Refer to the FLASH help page for explanations about other parameters.
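
Putting several of these parameters together, a run on hypothetical amplicon data might look like the following (the values and file names are illustrative; choose them based on your read length and expected amplicon length distribution):

./flash -m 10 -x 0.25 -r 250 -f 300 -s 30 -d flash_out -o my_output fwd_reads.fastq rev_reads.fastq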


Quality of Merged PE Reads

During the merging of overlapping PE reads, discrepancies between bases in the overlapping region can be corrected by choosing the base with the highest quality score as the consensus base.  To illustrate the quality improvements from merging PE reads, I use a subset of recently published data where we cloned a known 16S gene into a plasmid and sequenced it using our typical 16S metagenome profiling protocol for 2x250 bp reads on the Illumina MiSeq.  The primed V4 amplicon is roughly 253bp.  After removing extraneous bases added by our protocol (e.g. molecule tags, primers), the PE reads are expected to have an overlapping region of ~183bp (the expected overlap is roughly the sum of the two trimmed read lengths minus the amplicon length).  Because the overlapping region is large, the lower quality tail of each read overlaps a high quality region of the corresponding read.



The distribution of reads based on the errors per base (EPB) in each read is as follows.


The green line (merged reads) is shifted toward the y-axis, indicating that there are more reads with lower EPB compared to forward reads and especially reverse reads.  Notice that the y-axis is on a log scale, so while the differences between groups seem small, they are actually quite substantial.  The mean EPB for all merged reads is 0.001903164, while the means for forward and reverse reads are 0.003959693 and 0.00590493, respectively.

One caveat to this investigation is the quality filtering done prior to analysis.  Naturally, low quality reads are less likely to merge, causing merged reads to have artificially higher quality.  To compensate for this effect we rigorously filtered raw reads before calculating EPB.  Consequently, the divergence between merged reads and unfiltered fwd/rev raw reads would be even more pronounced than is reflected in the figure, strengthening the conclusion that merged reads have higher quality than fwd/rev raw reads.

Yet another caveat to consider when designing an overlapping PE sequencing experiment is the expected overlap length.  In this study we use a small amplicon, which allows the error prone tails to overlap regions of higher quality in the corresponding read.  With some amplicons this will not be possible; however, even when low quality regions overlap each other, corrections can still be made, though with lower confidence.  Also, the V4 amplicon has very little variability in terms of length.  Again, this is not the case with all amplicons and can affect merging success.  For example, the ITS region is another amplicon used in metagenome profiling, but its length variability is substantially higher than that of the 16S V4 amplicon.  Therefore, reads that fail to overlap in an ITS experiment cannot be assumed to be low quality sequence and should be incorporated into downstream analysis as non-overlapping PE reads.

In conclusion, overlapping PE reads can easily be merged using tools like FLASH.  Furthermore, merging these reads yields higher quality merged sequences because sequencing errors in the overlapped portion can be corrected based on read quality scores.  Higher quality sequences reduce noise introduced by sequencing error and consequently lead to more robust downstream analysis and conclusions. 

Friday, September 20, 2013

Installing Perl Modules

One of the great advantages of Perl is the substantial amount of code developed by the Perl community.  However, installing and using external code can be a minor annoyance for experienced Perl programmers and a major headache for Perl beginners.  For personal future reference and to help ease the pain of this learning curve, I am writing this post to outline the main installation strategies for Perl modules and give some tips for getting things working smoothly.

There are two main approaches to installing Perl modules--CPAN and manual installation via ExtUtils::MakeMaker or Module::Build.  CPAN is a fantastic resource.  My preference for installing Perl modules is to use the cpanm script (which automates the CPAN download, build, and install process) coupled with local::lib.  I also recently learned that cpanm can take a URL argument pointing to a git repository or SourceForge developer release.  It will also automatically install any dependencies declared in the build script.  The command would look something like this:

cpanm http://sourceforge.net/projects/bioutilsperllib/files/latest/download?source=directory
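
For modules indexed on CPAN, you can simply install by name, and cpanm will fetch, build, test, and install the module along with its dependencies:

cpanm Module::Build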

local::lib is an important part of the installation process.  Many instructional websites illustrate the installation process with sudo commands.  If you are a system administrator, or you know that several users will need the Perl module you are installing, then it is best to use sudo to install into the public set of Perl modules.  However, for personal use it is best to install into a local directory.  The Perl local::lib module takes care of this very nicely.  Note that there are other ways to trick CPAN into installing a Perl module in a local directory, but these are not nearly as clean as using local::lib.  local::lib is surprisingly easy to set up by simply following the instructions on the CPAN documentation page here.  Once local::lib is set up, cpanm will recognize these settings and automatically install into your local Perl module set.
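
As a sketch of the setup (~/perl5 is the local::lib default location), the bootstrap and shell configuration look roughly like this:

cpanm --local-lib=~/perl5 local::lib
eval $(perl -I$HOME/perl5/lib/perl5 -Mlocal::lib)

Adding the eval line to your ~/.bashrc (or equivalent) makes the settings permanent; afterwards, cpanm installs will land in ~/perl5.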

The second option for installing Perl modules is a manual installation via Module::Build or ExtUtils::MakeMaker.  If you are a serious Perl programmer I highly recommend becoming very familiar with one of these two installation modules.  I personally like Module::Build, but I admit that I have had limited experience with ExtUtils::MakeMaker.  To simply use them for installing Perl modules, only a basic understanding of a handful of commands is required.

First, the module package must be downloaded to the machine where you plan to install it.  I recommend having a directory where you keep all the modules you download; the directory I have designated for this is $HOME/build.  Once you have downloaded the module to your build directory, unpack it with the unzip command (or a comparable command).  This will create a new directory; navigate into it.  If you see a file called Build.PL, the module was built using Module::Build.  Use the following commands to build, test, and install the module:

perl Build.PL
./Build
./Build test
./Build install

If the module fails to install because required dependencies are missing you can try the command:

./Build installdeps

If you do not see the Build.PL file then you should see a file named Makefile.PL.  This indicates that the module was built using ExtUtils::MakeMaker.  To install such a module use the following commands:

perl Makefile.PL
make
make test
make install

For details on the wide range of capabilities for both these modules carefully read the documentation pages (Module::Build, ExtUtils::MakeMaker).

As a side note the following command is useful for determining if a module is already installed and where it is located:

perldoc -l <module name>
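
For example, checking for Module::Build:

perldoc -l Module::Build

If the module is installed, this prints the path to its .pm file; otherwise perldoc reports that no documentation was found.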


Tuesday, April 23, 2013

BioUtils vs. BioPerl

Perl is a popular language in bioinformatics.  Its popularity stems primarily from its ease of learning and its plethora of open source libraries.  BioPerl is one example of such a library.  As a bioinformatician, I appreciate BioPerl because it quickly and easily allows me to perform common, menial tasks such as parsing FASTA files and manipulating basic DNA sequence objects.  However, BioPerl's runtime and memory usage scale poorly and are becoming limiting factors in large-scale sequencing projects.  In order to accommodate many diverse data types, the object hierarchy and object attributes of BioPerl have become somewhat convoluted, thereby increasing memory and runtime requirements.

To address the memory and runtime limitations of BioPerl, I developed BioUtils, a collection of simple Perl modules for FASTA/Q formatted DNA sequences.  BioUtils can:
  • Store FASTA/Q sequences (including quality values)
  • Read/write FASTA/Q files
  • Perform simple quality control on FASTA/Q sequences
  • Provide summary info for a set of FASTA/Q sequences
  • Build a consensus sequence from 2 or more FASTQ sequences

Because of its simplicity, BioUtils requires much less memory and drastically reduces the runtime of the above operations compared to BioPerl.  The following figure displays the CPU time for reading and writing FASTQ sequences from files with increasing numbers of sequences.


Currently, a full Illumina MiSeq run can produce ~6 million paired-end reads (i.e. ~12 million raw reads).  Extrapolating from the figure above, the projected runtime of BioPerl on a complete MiSeq dataset would be nearly two and a half hours, compared to only 5 minutes using BioUtils.

Because BioUtils is implemented using inside-out objects described in Perl Best Practices (Damian Conway), it is difficult to measure memory usage.  However, a peek at the source code will confirm that BioUtils stores less metadata and thus requires less memory than BioPerl.

More information and a BioUtils download can be found on sourceforge: https://sourceforge.net/projects/bioutilsperllib/?source=directory


Becoming Googlable

This week has been a very "progressive" week in my life with regards to my involvement in social media.  I have never subscribed to social media because I have seen how its addictive nature leads to many lost hours with little or no benefit (which I still believe is true for sites like Facebook and Twitter).  However, I have begun to see the advantages of being "googlable" and have consequently decided to take the plunge into social media by having a personal website and blog.  A growing number of scientific researchers have personal websites to advertise their research and blogs to quickly communicate novel discoveries and ideas.  I hope to follow that model by using my personal website to deploy some of my useful bioinformatics tools and by blogging about interesting scientific ideas.  Hopefully this experimentation with social media will be productive both for myself and for others.