Next-generation sequencing machines now produce huge amounts of sequence data. Although they seem relatively cheap in terms of bp per dollar and come with very quick sample preparation protocols, analysis of the data takes weeks, months, or even years.
The molecular methods involved in sample preparation and sequencing are much the same across sequencers from all current manufacturers. That means similar issues may be observed in the data and, in general, issues commonly known to molecular biologists from their molecular cloning experience appear in sequencing data as well. Many PCR-based artefacts are not “visible” on an agarose gel stained with ethidium bromide, but once you sequence the sample molecules, lots of unwanted items show up.
For example, I once saw data from a sequencing experiment that used about 18 pairs of PCR primers. Regrettably, I found over 200 different types of PCR products in the data (instead of just 18). Not surprisingly, primers had annealed to unwanted regions of the templates during PCR, priming oligos had been ligated together, etc. One should not be surprised that sequencing data from PCR-amplified samples are littered with artefacts, and should accept that the data need a lot of cleanup.
A tool for multiple analyses of “Next Gen” sequencing datasets
SFF Inspector is a tool to look up adapters/artefacts in data from Roche / 454 Life Sciences and IonTorrent instruments (to some extent also Illumina), remove them, and analyze properties of the reads before and after trimming. The tool can automatically find adapters and Multiplex IDentifiers (MIDs, alias barcodes) in the sequences and annotate them. Specialized code was developed to cope with custom MID tags and with flawed experimental designs involving multiple levels of adapters wrapped around the sample insert. In some cases the output explains what went wrong during sample preparation or sequencing and likely caused seemingly short reads. Results of the analysis can be directly compared with the laboratory quality control steps performed during sample preparation, and therefore give confidence about what happened to the sample on its way to and through the sequencer.
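To give a feel for what barcode lookup involves, here is a minimal Python sketch. It is not SFF Inspector's code: the barcode names and sequences are made up for illustration, the function names are hypothetical, and matching is a simple mismatch count at the 5' end of the read. Since errors in 454 and IonTorrent reads are dominated by insertions and deletions in homopolymers, a real tool has to tolerate indels as well, which this sketch does not.

```python
from typing import Optional, Tuple

# Hypothetical barcodes for illustration only; these are NOT vendor MID sequences.
BARCODES = {
    "MID_A": "ACGTACGTAC",
    "MID_B": "TGCATGCATG",
}


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_barcode(read: str, max_mismatches: int = 1) -> Optional[Tuple[str, int]]:
    """Return (barcode name, mismatch count) for the best 5'-end match, or None."""
    best = None
    for name, tag in BARCODES.items():
        prefix = read[:len(tag)]
        if len(prefix) < len(tag):
            continue  # read is shorter than the barcode
        d = hamming(prefix, tag)
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (name, d)
    return best


def trim_barcode(read: str, max_mismatches: int = 1) -> Tuple[Optional[str], str]:
    """Return (barcode name or None, read with the matched barcode removed)."""
    hit = find_barcode(read, max_mismatches)
    if hit is None:
        return None, read
    return hit[0], read[len(BARCODES[hit[0]]):]


if __name__ == "__main__":
    print(trim_barcode("TGCATGCATGAAAA"))
    print(find_barcode("ACGTACGAACGGG"))
```

Reads for which no barcode is found are returned unchanged, so they can be inspected separately instead of being silently assigned to a sample.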
Analysis of many datasets available through the NCBI SRA database led me to devise many analytical parameters and to draw about two dozen figures for each dataset. The differences between individual sequencing runs, samples, … sequencing providers … are now easily available for your interpretation and troubleshooting. Now you can tweak your sample preparation procedure, do quality control of all PCR steps, evaluate sequencing performance … Thanks to the fixes/polishing of the data, your future de novo assembly attempts will go more smoothly. Furthermore, based on the analysis one can select or discard certain types of reads, artefacts, adapters, and MIDs by their length or several other properties. Many unexpected contaminants show up during this thorough analysis, so it may be desirable to discard the affected reads instead of just trimming them.
The tool was developed during continuous analysis and re-analysis of over 1800 publicly available datasets and can cope with all issues found so far. The development also gave a good overview of the Roche/454 and IonTorrent sequencing technologies over the past years. Let me say that none of the adapter-trimming tools available elsewhere on the internet can do anything comparable to what SFF Inspector does. They lack too many of our features, some logic is clearly missing, and their query sequences are incomplete.
Why use our processing pipeline?
1. Uses a unique, unpublished database of adapters collected from public sequencing datasets. These cover several dozen experimental setups. Please refer to the Supported protocols page for more information about the plethora of adapter queries and their combinations.
2. Automated detection of laboratory protocol(s) on the fly and automatic selection of proper sets of queries.
3. Support for Roche GSMIDs, TiMIDs, RLMIDs and any custom sample barcodes (e.g. barcodes from GATC Biotech, Matz’s lab).
4. Support for multiple layers of adapters surrounding the sample DNA sequence, even interspersed with sample barcodes. Yay.
5. Support for multiple layers of sample barcodes. Sigh.
6. Precise removal of various artefacts distilled from more than 1800 datasets worldwide. Trimming respects the underlying technology and therefore aligns matches to “sequencing flows” to trim “over-called” sequences.
7. Built-in ability to report multiple errors, even in disjoint locations of a read sequence.
8. Support for chimeric reads, mRNA polyA-tails and their reverse-complements interrupted by sequencing errors.
9. Sophisticated filtering rules and code to adjust the matches and report just the most meaningful one (not always the statistically best one!).
10. Produces fancy PNG figures and CSV tables, corrected SFF files, and adapter-trimmed FASTQ and FASTA/QUAL files; tunes quality-based trim points and corrects mRNA polyA-tail sequences.
11. Actually, it does not require SFF files as input and works fine with FASTA/FASTQ files too. However, the SFF format is preferred.
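To illustrate one item from the list above, polyA tails interrupted by sequencing errors and their reverse complements, here is a rough Python sketch. It is not the algorithm SFF Inspector uses. It walks from the 3' end, rewards each A, penalizes anything else, and gives up once the running score falls too far below its best value. The function names and the parameters `min_len`, `mismatch_penalty` and `max_drop` are assumptions chosen for this example.

```python
def revcomp(seq: str) -> str:
    """Reverse-complement a DNA sequence (A, C, G, T, N; case preserved)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def polya_tail_start(seq: str, min_len: int = 8,
                     mismatch_penalty: int = 2, max_drop: int = 4) -> int:
    """Return the index where a 3' polyA tail starts, or len(seq) if none is found."""
    score = 0
    best_score = 0
    best_start = len(seq)
    for i in range(len(seq) - 1, -1, -1):
        if seq[i] in "Aa":
            score += 1
            if score > best_score:
                best_score = score
                best_start = i  # tail extends at least this far to the left
        else:
            score -= mismatch_penalty
            if best_score - score > max_drop:
                break  # too many non-A bases; stop extending the tail
    if len(seq) - best_start < min_len:
        return len(seq)  # candidate tail is too short to be trusted
    return best_start


def trim_polya(seq: str, **kwargs) -> str:
    """Remove a 3' polyA tail, tolerating a few interrupting bases."""
    return seq[:polya_tail_start(seq, **kwargs)]


def trim_polyt_head(seq: str, **kwargs) -> str:
    """Remove a 5' polyT head, i.e. a polyA tail seen on the opposite strand."""
    return revcomp(trim_polya(revcomp(seq), **kwargs))


if __name__ == "__main__":
    # A tail of 13 bases with one G in the middle is still recognized.
    print(trim_polya("GATTACAGGC" + "AAAAAAGAAAAAA"))
    print(trim_polyt_head("TTTTTTCTTTTTT" + "GCCTGTAATC"))
```

Handling the reverse complement by flipping the read, trimming, and flipping back keeps a single scoring routine for both orientations.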
We offer data cleanup, error-correction, normalization, assembly, reference mapping, variant calling, and annotation.