Example results

To show the advantages of properly cleaned and trimmed sequencing reads, we performed several assemblies and compare them here with previously published results. Unfortunately, the numbers alone do not reveal that the third-party assemblies contain assembly chimeras. There are, however, two types of problem that unremoved contaminants cause in assemblies.

  1. An unremoved adaptor/MID/artefact causes a false contig join, typically when the required overlap was set too low and the contaminating sequence was longer than that threshold. Here the unremoved contaminants promote contig joining, and the bad assembly appears to have fewer contigs than a stricter one. A naive user is happy with this assembly because it seems to have the lowest contig count.
  2. Unremoved contaminants prevent contig joining. If the assembly was strict enough, it results in many contigs that do not assemble together, because the adaptor/MID/artefact sequences are much shorter than the minimal overlap required for a join. The unremoved sequences flank either side of a contig and prevent it from joining its neighbours.
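
The two failure modes above can be reproduced with a toy model. The following is a minimal sketch, not how Newbler works internally: it assumes exact suffix-prefix matching instead of alignment with an identity threshold, and the adaptor, the MID, the read sequences and the helper names (random_dna, suffix_prefix_overlap, strip_flanks) are all made up for illustration.

```python
import random

def random_dna(rng, length):
    """Return a random DNA string; stands in for real genomic sequence."""
    return "".join(rng.choices("ACGT", k=length))

def suffix_prefix_overlap(left, right, min_overlap):
    """Length of the longest exact suffix of `left` that is also a prefix of
    `right`; 0 when no such overlap reaches `min_overlap`."""
    longest_possible = min(len(left), len(right))
    for length in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0

def strip_flanks(read, contaminant):
    """Remove one exact copy of `contaminant` from each end of `read`."""
    if read.startswith(contaminant):
        read = read[len(contaminant):]
    if read.endswith(contaminant):
        read = read[:-len(contaminant)]
    return read

rng = random.Random(0)
adaptor = random_dna(rng, 44)  # hypothetical artefact, slightly longer than 40 nt
mid_tag = random_dna(rng, 10)  # hypothetical MID, far shorter than 80 nt

# Issue 1: two reads from unrelated loci share nothing but the leftover adaptor.
unrelated_left = random_dna(rng, 200) + adaptor
unrelated_right = adaptor + random_dna(rng, 200)
false_join_lenient = suffix_prefix_overlap(unrelated_left, unrelated_right, 40)
false_join_strict = suffix_prefix_overlap(unrelated_left, unrelated_right, 80)
false_join_cleaned = suffix_prefix_overlap(
    strip_flanks(unrelated_left, adaptor),
    strip_flanks(unrelated_right, adaptor),
    40,
)

# Issue 2: two reads truly overlap by 90 nt, but a MID is left on the joining end.
shared = random_dna(rng, 90)
neighbour_left = random_dna(rng, 150) + shared + mid_tag
neighbour_right = shared + random_dna(rng, 150)
blocked_join = suffix_prefix_overlap(neighbour_left, neighbour_right, 80)
restored_join = suffix_prefix_overlap(
    strip_flanks(neighbour_left, mid_tag), neighbour_right, 80
)

print(false_join_lenient, false_join_strict, false_join_cleaned)
print(blocked_join, restored_join)
```

In this toy setting the 40 nt threshold accepts the 44 nt adaptor as an overlap between unrelated reads, while the 80 nt threshold and the cleaned reads do not. For the true neighbours, the 10 nt MID hides the 90 nt overlap until it is stripped.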

In real-world assemblies, both phenomena compete with each other in every assembly project. Interestingly, even a stricter assembly gives better results when it is combined with proper trimming. The improvements are sometimes not dramatic, but that only underscores that the simple statistics usually presented (number of contigs, scaffolds, their average lengths, etc.) do not describe the key properties of assemblies, and definitely not their quality. The positive message is that when one removes the contaminating sub-sequences (which cause false joins or prevent joins altogether, see above), the assembler can still find new or different paths for merging the reads. Overall, cleaned datasets assemble better and faster and take less memory, even under stricter criteria.
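
A small numeric illustration of why such summary statistics say little about correctness: in the sketch below, two contigs of a hypothetical four-contig assembly are falsely joined, and every headline number improves. The contig lengths are invented, and the helpers n50 and summarize are illustrative, not part of any assembler.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L together hold
    at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one contig length")

def summarize(lengths):
    """Collect the statistics usually quoted for an assembly."""
    return {
        "contigs": len(lengths),
        "mean": sum(lengths) / len(lengths),
        "largest": max(lengths),
        "n50": n50(lengths),
    }

# Hypothetical contig lengths in bases.
correct_assembly = [400_000, 300_000, 200_000, 100_000]
# Same data, but the two longest contigs were falsely joined into a chimera.
chimeric_assembly = [700_000, 200_000, 100_000]

correct_stats = summarize(correct_assembly)
chimeric_stats = summarize(chimeric_assembly)
print(correct_stats)
print(chimeric_stats)
```

The chimeric assembly has fewer contigs, a larger mean, a larger longest contig and a larger N50, although it is the wrong one. None of these numbers can distinguish a true join from a false one.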

Microbacterium (5.0Mbp genome size, FLX+ whole genome shotgun and FLX+ 3.3kbp paired-end reads)

The Microbacterium project shows that one is asking for misassemblies when using the default assembly settings (Newbler 2.8) with an overlap requirement of only 40 nt: the usual artefacts are close in length to this threshold, or even a bit longer, so they easily overlap and merge. Increasing the overlap length threshold yields a slightly better assembly even with the dataset still uncleaned. Certainly fewer contigs were falsely joined, but the result is still suboptimal. If one keeps the stricter assembly settings and takes a cleaned version of the dataset, the results are in some ways better (one scaffold fewer) and in some ways not (the longest scaffold contig broke into two). It is important to realize that the numbers hardly describe the quality of the assembly. Finally, an automatically tuned cleaned dataset with adjusted trim points results in a fairly nice assembly with a single scaffold and a considerably lower count of large contigs. Compared with the merely cleaned assembly, it has a much higher largest scaffold contig size, average scaffold contig size and average large contig size (the last column of the table).

                         Mis-assembled (defaults)  Uncleaned (stricter assembly)  Cleaned & autotuned
all contigs (>500b)      208                       204                            97
Large contigs (>1kb)     107                       105                            59
Large avgContigSize      35231                     35875                          63764
scaffold contigs         88                        85                             16
avgScaffoldContigSize    42617                     44071                          237784
Scaffolds (>2kb)         4                         3                              1
largest scaffold contig  341414                    340802                         956203
largest scaffold size    2266091                   3118929                        3810055
scaffold contig bases    3750306                   3746084                        3804556
numWithPairedRead        81454                     81454                          106192
numberWithBothMapped     77702                     77904                          102053
numAlignedReads          935427                    936174                         968529
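
To put the Microbacterium numbers in relative terms, the short sketch below computes the ratio between the cleaned and autotuned assembly and the default-settings assembly for a few of the metrics. The values are copied from the table; the dictionary layout and the fold_change helper are only an illustrative choice.

```python
# metric: (defaults, uncleaned with stricter settings, cleaned & autotuned)
metrics = {
    "Large contigs (>1kb)": (107, 105, 59),
    "Large avgContigSize": (35231, 35875, 63764),
    "scaffold contigs": (88, 85, 16),
    "avgScaffoldContigSize": (42617, 44071, 237784),
    "largest scaffold contig": (341414, 340802, 956203),
}

def fold_change(baseline, value):
    """Ratio of `value` to `baseline`; above 1 means the metric grew."""
    return value / baseline

# Compare the last column (cleaned & autotuned) with the first (defaults).
fold_vs_defaults = {
    name: fold_change(values[0], values[2]) for name, values in metrics.items()
}

for name, ratio in fold_vs_defaults.items():
    print(f"{name:25s} {ratio:5.2f}x")
```

The average scaffold contig size grows roughly 5.6-fold and the largest scaffold contig roughly 2.8-fold, while the number of scaffold contigs drops to under a fifth of the default-settings value.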

Acinetobacter sp. SH024 genome, ASM16363v1

SRP000509; published assembly: Acinetobacter sp. SH024 ASM16363v1, http://www.ncbi.nlm.nih.gov/assembly/GCA_000163635.1#/st

                          ASM16363v1  Cleaned
all contigs (>500b)       89          83
Large contigs (>1kb)      -           69
Large avgContigSize       -           56734
scaffolds                 26          13
avgScaffoldsize           -           302821
N50ScaffoldSize           870269      2314520
largestScaffoldSize       -           2314520
largestContigSize         -           289152
N50ContigSize             86799       109286
Total assembled sequence  3970841     3918191

Oncorhynchus mykiss (a rainbow trout, fish) transcriptome, SRP005674

                      author’s assembly  Cleaned  Cleaned + Optimized
all contigs (>500b)   55793              33034    32692
Large contigs (>1kb)  -                  9321     9452
Large avgContigSize   -                  778      777
Isogroups             -                  21317    21104
Isotigs               -                  24634    24506
avgIsotigSize         -                  585      632
largestContigSize     -                  27021    27021

Corvus corone transcriptome, SRP000770

Newbler 2.0.00.20 (40nt overlap 90identity) Newbler 3.0 (80nt overlap 96identity, nourt, cleaned)
numberOfIsogroups 6026
numberOfIsotigs 6142
avgIsotigSize 652
N50IsotigSize 756
largestIsotigSize 4625
LargeContigs 2948
LargeAvgContigSize 939
LargeN50ContigSize 990
numberOfAllContigs 19552 6257
inputFileNumReads 387025
numAlignedReads 227874
numAlignedBases 48814890
numberAssembled 207172
numberPartial 20699
numberSingleton 111471
numberRepeat 16
numberOutlier 1620
numberTooShort 10713

Vulpes vulpes transcriptome (SRP005414, both tame + aggressive fox samples assembled together)

Newbler 2.3 (40nt_overlap_90identity)
Newbler 2.3 (40nt_overlap_90identity, cleaned)
Newbler 3.0 (40nt_overlap_90identity_nourt, cleaned)
Newbler 3.0 (80nt_overlap_90identity_nourt, cleaned)
Kukekova et al., 2011
numberOfIsogroups 59731 53079 40246 38574
numberOfIsotigs 87400 88529 75591 73693
avgIsotigSize 1820 2018 2201 2230
N50IsotigSize 3293 3563 3654 3628
largestIsotigSize 17286 17378 15988
LargeContigs 58596 47819 47200
LargeAvgContigSize 1185 1358 1352
LargeN50ContigSize 1419 1642 1624
allContigMetrics 98296 88124 85527
inputFileNumReads 5945235 5937702 85.41% 5945235 5945235
numAlignedReads 5071311 85.80% 5093099 85.83% 4859742 81.90%
numAlignedBases 1845165508 1840500903 85.83% 1827008477 85.20%
numberAssembled 4551142 76.60% 4407089 4497111 75.79% 4325940 72.90%
numberPartial 571726 9.60% 639262 533416 8.99% 533317 8.99%
numberSingleton 562591 9.50% 582266 604808 10.19% 874432 14.74%
numberRepeat 14192 0.20% 28261 87355 1.47% 14207 0.24%
numberOutlier 181469 3.10% 210630 144330 2.43% 119124 2.01%
numberTooShort 63630 1.10% 70194 66655 1.12% 66655 1.12%

Ziziphus celata (a plant) transcriptome

author’s assembly our assembly
total reads 655337 655337
aligned reads 443006
assembled reads 474025 388927
unassembled reads 181312
contigs 84645
avg contig length 408
all contigs (>500nt) 24685
large contigs (>1000nt) 10072
avg large contig length 745
largest contig 5256
isotigs 22551
avg isotig length 615
largest isotig 14001
isogroups 19945

Phalaenopsis aphrodite (a plant) transcriptome

SRP005898, Plant & Cell Physiology
author’s assembly our assembly
total reads 3302528 3302765
aligned reads 2944414
assembled reads 2676907
singletons 85144 284568
Contigs (>200nt, >500nt, respectively) 34563 39867
avg contig length (>200nt) 1194
all contigs (>500nt) 39867
large contigs (>1000nt) 20946
avg large contig length 1239
largest contig 11757
isotigs 30861
avg isotig length 1483
largest isotig 11757
isogroups 20578