Primarily, we offer cleanup of raw sequence data, evaluation of sequencing run quality, and troubleshooting based on interpretation of the adapters, sample tags, artifacts and sample insert sizes we uncover. We definitely do not offer “yet another adapter removal tool”. Although there are many tools around, all of them perform suboptimally, not only in terms of algorithms but also in terms of queries. Nobody has so far put enough effort into analysing the existing issues, so the existing tools do not look for most of the queries we collected ourselves. Typically, users are left to find out on their own what they should look for and trim away. Maybe they ask in some internet forum for query sequences, but everybody is clueless. Anybody who sank into the data for long enough would find a plethora of issues and would eventually realize that the existing software tools are not made to handle them well. So, current solutions are fundamentally wrong.
To date, our work on 2227 datasets has shown that one or two adapter query sequences do not suffice at all. Not even a hundred, sadly. Any adapter-removing software should be able to interpret the type of experimental design on the fly, adjust to it by itself, and generate the query sequences on its own.
While most of our efforts went into Roche 454 datasets, here are some general comments about our procedure. We analyze reads from every sequencing region independently and collect almost 50 parameters. From these we draw charts for each region, split reads by sample MIDs, etc. Actually, we generate plenty of charts (see example images). The key knowledge is proper detection and exact localization of adapters and artifacts. It does not matter much whether the artifacts are related to the sequencing technology or to the preceding sample preparation protocol.
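As a rough illustration of the per-region, per-MID bookkeeping described above, here is a minimal Python sketch. It is not our production code: the function names, the MID names and the tag sequences are hypothetical placeholders, only two of the many possible parameters are computed, and the exact-prefix matching is a deliberate simplification (real reads carry sequencing errors, so real tag detection needs error-tolerant alignment).

```python
from collections import defaultdict


def split_by_mid(reads, mids):
    """Group reads by (region, MID name).

    reads: iterable of (region, sequence) pairs.
    mids:  dict mapping a MID name to its tag sequence (placeholders here).
    Reads whose start matches no tag exactly go to "unassigned".
    """
    groups = defaultdict(list)
    for region, seq in reads:
        name = next(
            (n for n, tag in mids.items() if seq.startswith(tag)),
            "unassigned",
        )
        groups[(region, name)].append(seq)
    return groups


def summarize(groups):
    """Compute read count and mean read length for every group."""
    return {
        key: {
            "reads": len(seqs),
            "mean_length": sum(len(s) for s in seqs) / len(seqs),
        }
        for key, seqs in groups.items()
    }


# Hypothetical tags and toy reads, for illustration only.
example_mids = {"MID_A": "AAAACCCCGG", "MID_B": "TTTTGGGGCC"}
example_reads = [
    (1, "AAAACCCCGGACGT"),
    (1, "AAAACCCCGGAC"),
    (2, "TTTTGGGGCCAAA"),
    (2, "GGGG"),
]
example_stats = summarize(split_by_mid(example_reads, example_mids))
```

Each key of `example_stats` is a (region, MID) pair, so regions never get mixed; a chart per region can then be drawn from whatever parameters are collected this way.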
The hunt for adapters / sample tags / artifacts involves hundreds of distinct searches through every raw sequencing read in the dataset and laborious comparison of their matches. Because different types of sequencing errors pop up in some alignments, we do not rely on a single tool for the searches. Instead, we use two tools independently, then go through their results and select the one proper match to record as the answer. Still, this is not enough, and in some cases we record additional backup matches as well. What a mess, right? Well, this is the reality.
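The idea of reconciling two independent searches can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Hit` fields, the tool labels and, above all, the ranking rule (prefer hits confirmed by the other tool, then fewer mismatches, then longer matches) are hypothetical and far simpler than the selection logic we actually apply. The function expects the hits reported by both tools for one query on one read.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Hit:
    tool: str        # label of the search tool that reported the hit
    start: int       # 0-based start of the match within the read
    end: int         # end of the match (exclusive)
    mismatches: int  # mismatches plus indels in the alignment


def overlap(a: Hit, b: Hit) -> int:
    """Number of read positions covered by both hits."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def reconcile(hits_a: List[Hit], hits_b: List[Hit]) -> Tuple[Optional[Hit], List[Hit]]:
    """Return (primary hit, backup hits) for one query on one read."""
    candidates = list(hits_a) + list(hits_b)
    if not candidates:
        return None, []

    def rank(h: Hit):
        # A hit is "confirmed" if the other tool reports an overlapping hit.
        confirmed = any(
            o.tool != h.tool and overlap(h, o) > 0 for o in candidates
        )
        # Tuples sort ascending: confirmed first, then fewer mismatches,
        # then longer matches.
        return (not confirmed, h.mismatches, -(h.end - h.start))

    ordered = sorted(candidates, key=rank)
    return ordered[0], ordered[1:]
```

Keeping the non-selected hits as backups, rather than discarding them, mirrors the practice of recording additional matches for later inspection.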
The software was developed by a molecular biologist by education and at heart, with a molecular-biological interpretation of the problem in mind. The analyses so far have distinguished dozens of different experimental setups; each is associated with its own issues and may share some common issues with others. A brief listing of the available methods (differing by design) is on the Supported protocols page. Of course, in reality one can see even more variants. From the long listing of methods and from the text on this page it should be clear that there is really not a single query sequence, or two, or three … 😉, to be used by Trimmomatic, Flexbar, or whatever adapter removal tool you know. Even if you fed them the right set of queries, they would fail to find some matches, because the alignment-producing code cannot be universal. A lot of extra work is done in addition, e.g. to rescue some discarded candidate hits, or to select as the best match a hit which is sunk somewhere in the middle of the crowd, with a sub-optimal score, e-value and other parameters.
So far we have put the most effort into shotgun sequencing data. We have also processed amplicon datasets but, honestly, most amplicon users do not care about stuff which just does not align against their reference sequence. This is not to say there are no artifacts in amplicon sequencing. Not at all, there are many! It seems that losses of 30 – 60 % of sequencing capacity are quite normal and “accepted”. However, for shotgun sequencing and paired-end sequencing (which uses the amplicon setup in the case of Roche 454 protocols) one cannot neglect these issues. And that is where we can help as well.
We list here the protocols used for sample preparation or follow-up sequencing. Protocols differ in their complexity and, in turn, in the number of artifacts they inherently generate and in their sequencing overhead (the ratio of usable nucleotides to sequenced nucleotides). Admittedly, general calculations of sequencing cost in nucleotides per dollar fail, in the sense that it does not matter how many nucleotides were output by the sequencer's software, regardless of whether we speak about so-called ‘high-qual’ or ‘high+low qual’ basecalls. On the contrary, what counts is how many of them remain after proper cleanup of the data from sequencing adapters, sample tags and artifacts. Notably, there are datasets around from which almost nothing is left after cleanup.
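To make the overhead argument concrete, the ratio and the resulting cost per usable nucleotide can be computed as in the short Python sketch below. The function names are ours for illustration, and the numbers in the example are made up, not taken from any real run.

```python
def usable_fraction(usable_nt: int, sequenced_nt: int) -> float:
    """Ratio of usable nucleotides to sequenced nucleotides."""
    if sequenced_nt <= 0:
        raise ValueError("sequenced_nt must be positive")
    if not 0 <= usable_nt <= sequenced_nt:
        raise ValueError("usable_nt must lie between 0 and sequenced_nt")
    return usable_nt / sequenced_nt


def cost_per_usable_nt(run_cost: float, usable_nt: int) -> float:
    """Cost of one nucleotide that survives cleanup.

    Returns infinity when nothing is left after cleanup.
    """
    if usable_nt <= 0:
        return float("inf")
    return run_cost / usable_nt


# Made-up example: a run that output 100 nt, of which 40 nt survive cleanup.
example_fraction = usable_fraction(40, 100)   # 0.4
naive_cost = cost_per_usable_nt(1000.0, 100)  # cost per nt reported by the sequencer
real_cost = cost_per_usable_nt(1000.0, 40)    # cost per nt left after cleanup
```

In this made-up case the cost per nucleotide that is actually usable is 2.5 times the naive figure based on the sequencer's raw output, which is the gap that nucleotides-per-dollar calculations hide.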