When dealing with metabarcoding data, we typically cluster sequences by similarity into operational taxonomic units (OTUs), as sequences are affected by sequencing errors. However, just clustering sequence data will not be efficient without pre filtering data e.g. remove low quality reads and singletons. Here we are discussing one example bioinformatics pipeline I am using to process metabarcoding data of a size sorting experiment (Figure 1).
B) Pre-Processing of sequence data
After an initial quality check on the raw sequencing data and demultiplexing of pooled samples, you want to apply some read filtering of some sort. High throughput sequencers still have an error rate of up to 1-2%, like the Illumina system. This does not necessary mean all reads will have sequencing errors, but many will with some even showing more then 3% false base calls. Ideally you want to detect all reads affected by sequencing errors, e.g. by removing reads with low phred scores. However, phred scores are just an estimated error probability, the chance that the base written at each position in the sequence is false. So even with very restrictive phred score filtering we will still have sequences wich are affected by sequencing errors, even though the phred scores look good. Thus additional filtering and sequence processing strategies are needed.
Mean phred scores and max expected errors
However, lets first talk about phred score based quality filtering. Forward and reverse Illumina PE reads are typically merged (usearch –fastq_mergepairs) and then every read below an average phred score is discarded. Filtering reads on average phred scores is just wrong. Every body does it, and I have been guilty of this as well (Elbrecht & Leese 2015), but please don’t! As discussed by Robert Edgar, phred scores and actual error probabilities don’t scale in a linear but exponential fashion (Edgar & Flyvbjerg 2015, see also the table in the phred error score wikipedia article). This means that when discarding sequences based on their average phred scores you will introduce quite some bias in your dataset, discarding sequences in a highly inconsistent fashion (see the USEARCH manual for further discussion on this). You can work around this problem, by simply summing up the error probabilities across the whole sequence and translate this into the number of expected errors. So an expected error of 1.9 means that ~ 2 bases in your sequence are called wrong. Typically all sequences above one error are discarded.
One should point out that the expected error value is independent of read length, so one could argue that is makes sense to increase the tolerated expected error score for longer reads. I typically keep the more conservative max 1 expected error (maxee=1) threshold for long reads, but if you louse a lot of reads there is nothing wrong with trying a maxee of 2 for an e.g. 400 bp sequence.
Remove Primer Sequences
After PE merging you want to make sure to remove primer sequences from both sides of each sequence, and discard the read if the primer sequence is not reliably detected. I am using Cutadapt to do this (Martin 2011). It should also be noted that our illumina fusion primers are using in line tags for sample tagging(e.g. Elbrecht & Leese 2016, Figure S3). When demultiplexing reads, these are removed so the primer sequence can be easily trimed away the remaining sequence of the primer it self.
Reverse Complement and Trimming
If you use our parallel sequencing approach (Elbrecht & Leese 2015, see also Figure S2) some of your samples will be sequenced and merged in the reverse complement direction. It makes sense to flip these reads around, so all samples have the same sequence direction (usearch -fastx_revcomp). This enables you to cluster sequences to OTUs with only one strand direction, which reduces the number of (false) OTUs slightly, but more importantly you can also use one strand when mapping your reads against the OTUs and you don’t have to account for sequencing direction when comparing your OTUs against reference databases. Just makes your life a lot easier in downstream analysis to have all sequences oriented the same direction.
Additionally in this project we have sequenced the samples from this project with 4 freshwater invertebrate primer sets, targeting a 217 to 421 bp region of the COI barcoding gene (Elbrecht & Leese 2016). Thus the complete dataset currently contains sequences of different length (see Figure 2), and we can’t cluster the sequences into OUT as terminal gaps will be scored as mismatches with usearch –cluster_otus. While it is technically possible to cluster sequences of mixed length using the UCLUST algorithm, it’s easier to trimm the reads to the same 217 bp region, so the reads can be clustered with the UPARSE algorithm (which includes chimera detection and removal). You can use usearch –fastx_truncate to trimm away excess bases.
We should point out that we are truncating bases here which were previously used to calculate the max expected error values. Thus we technically don’t have the same error rate cutoff for all samples due to different amplicon length. The same thing does hold true to a small extend to the primer sequences, which were still included when the PE reads were merged (and quality filtered in the same step). Thus also the primer sequences did contribute to the expected error score, even though they are later removed from the actual sequence.
Therefore, ideally you would merge all reads without discarding any, then remove primers and do your trimming, and finally filter all sequences (which have now the same length) for low quality reads using maxee. However, under the assumption that sequencing errors are randomly distributed about all taxa in your sample and you have enough sequences to work, we can get away with already applying the max expected errors step in the merge PE reads step!
Min & Max Sequence length
When doing metabarcoding using the COI marker, all reads should have the same length, as COI is a coding gene. You might get 3, 6 oder even 9 base pairs deleted or inserted if the gene codes for additional or less amino acids in the protein, but typically you wold not expect any sequences shorter than that. That being sad, primers might bind internal of the actual marker due to high degeneracy or low annealing temperatures and errors is sequencing can happen as well. This will lead to some shorter or longer sequences appearing, which can also translate into some OTUs which are shorter or longer than expected (Figure 3). The problem can be, that these short OTUs might be identical to the actual full length OTU. When mapping reads with sequencing errors towards the end of the sequence against both OTUs, the short OTU might be favoured, as it represents a 100% match. This will not be the case very often, but it does not hurt to remove the OTUs that deviate e.g. +/- 10 bp from the expected length, to remove noise in the dataset. You can remove of target sequences with Usearch already in the –fastq_mergepairs step or afterwards with –sortbylength (in this case we are not interested in the sorting, but in removing reads that are to long or short).
Dereplication (including singletons!)
Your reads are now quality filtered, and to reduce necessary computing power needed for downstream analysis are dereplicated (Usearch –derep_fulllength). When dereplicating, identical sequences are condensed into one and the number of sequences saved in a size annotation in the sequence name (activate with the -sizeout option, e.g. “size=20;” means the specific sequence was 20 times in the dataset).
At this stage it is important to retain singletons; sequences which appear only one single time in the dereplicated file (size=1;). Singletons are often sequences which are highly affected by sequencing errors or are chimeric. Typically singletons are removed from the whole dataset before clustering to remove noise and the number of false OTUs (see next section). However, on the processed raw data, we want to keep singletons as we map all dereplicated reads against the OTUs to generate an OTU hit table. Permanently removing singletons from the whole dataset is possible, but you would throw away a substantial proportion of your sequence data which is not necessary.
C) Sequence Clustering
Next, you want to cluster your sequences, which is a key step to remove noise introduced by sequencing errors. You probably have 100.000+ unique sequences in your dataset, but maybe only a thousand actual taxa. Comparing all sequences against NCBI and BOLD is tedious and takes a long time. Thus, we want to cluster the sequences by similarity (typically 3%) into operational taxonomic units, which then can be compared against reference databases. There are plenty of clustering algorithms out there, but I highly to use usearch –cluster_otus!
When clustering sequences you have to consider two main points:
- Combine all sequences from the samples you want to analyse together into one single file, which is then used for clustering. Other wise you will not be able to properly compare samples with each other, they all need the same OTU basis for later analysis.
- Dereplicate the file containing all sequences from the samples you are interested in and remove singletons (or more). As we discussed above, removing singletons is necessary, as other wise the number of OTUs will be highly inflated by seququences with sequencing errors.
I also want to point out that the removal of singletons is highly dependent on sequencing depth. If you only have e.g. one million sequences left after pre processing because you work on a small data set, old sequencing data or have applied restrictive error filtering it might make sense to not discard singletons to not lose to much data (just give it I try). However, in most cases you really have to discard singletons! I would actually argue that in cases where you have 100+ million sequences it makes sense to even exclude more than singletons. With enough sequencing depth also rare read errors will happen twice or more by chance, thus the threshold of excluding sequences has to be increased as well, other wise you will end up including plenty of sequencing noise again. I would recommend e.g. to exclude all sequences with a size of 1 or 2 before clustering. The exact numbers of cause depends on the noise in your sequence data and the settings of the previous quality filtering steps applied in preprocessing. Try different settings and see what gives you an realistic number of OTUs. Notice that you will filter out a lot of false OTUs in the next steps as well, so if you expect 200 taxa in your sample, having 1000 – 2000 OTUs at this stage is still OK.
D) Data filtering!
Currently probably ~80% of your OTUs are still false, based ambiguous sequences or undetected chimeras. To remove these sequences you ideally have two independent PCR replicates sequenced for each sample, so you only trust OTUs which are present with a certain abundance in both samples. In this case however, we don’t have PCR replicates, so restrictive filtering based on the number of sequences assigned to each OTU is all we can do ; )
We can map all samples against our generated OTUs using usearch –usearch_global. This will give us an OTU table with number of reads assigned to each OTU for all sample. I like to discard OTUs which have no high abundance hits for at least one sample. In previous projects on mock communities I discarded all OTUs which don’t have over 0.003% reads for at least one sample (Elbrecht & Leese 2015, 2016, Elbrecht et al. 2016). If you have high sequencing depth or nosy data you can also increase this threshold to e.g. 0.01%, which will give you more reliable OTUs (only notice that you might lose certain low abundance taxa!).
Setting such a threshold only makes sense if your sequencing depth is high enough. If you only have ~10.000 sequences per sample, discarding OTUs below 0.003% abundance will do nothing to your dataset, as 0.003% of 10.000 sequences corresponds to 0.3 of a sequence.
with a minimum of 0.003%, you want at least 33.333 sequences per sample, ideally 10x that so your data is less affected by stochastic effects and variation. With 333.333 sequences in a per sample you would discard OTUs which are below 10 reads for all samples (333.333*0.003%=10).
One additional advantage of this threshold cut of is, that rarefaction of your sequencing data is usually not necessary, as long as each sample has at least 33.333 sequences. Because we are calculating with relative abundance and apply a 0.003% threshold, it does not matter if the sample had 3 million or one million reads. This even has the advantage that stochastic effects are reduced, as samples have not bee reduced to the lowest sequence number. If a sample has below 33.333 sequences, I would rather considering removing it from the dataset all together instead of reducing the whole dataset to that read number.
Notice that you should map your samples again against the OTU subset which remains after removing low abundance OTUs as some reads which clustered previously against low abundance OTUs might noch match the common OTUs.
As you can see there are many steps in wich you can and should filter your metabarcoding sequence data, not only removing singletons and apply clustering. At the end of the day you have to play around with the settings to find which suit your specific dataset.
Currently people have different metabarcoding pipelines, and as long as you are not doing very exotic things your data will probably be still reliable. Thus I usually don’t give people a hard time about the fine details any nuances (except if they make very critical mistakes). But I hope this article did give you some new impulses, which might improve your pipeline. As long as you apply the same pipeline across all samples, the biases potentially introduced at some points don’t have to critically impacts as they are similar for all samples.
As always, let me know hat you think, tweet me feedback / opinions @luckylionde
PS: Forget estimation of diversity based on OTU numbers, unless you have sequenced PCR replicates and have applied very clean bioinformatics, your OTU numbers will be quite inflated / noisy. But thats an story for another time = )
Edgar & Flyvbjerg 2015. Error filtering, pair assembly and error correction for next-generation sequencing reads. Bioinformatics
Elbrecht & Leese (2016) Development and validation of DNA metabarcoding COI primers for aquatic invertebrates using the R package “PrimerMiner”. PeerJ Preprints
Martin (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet journal