Kevin's GATTACA World: Illumina produces 3k of 8500 bp reads on HiSeq using Moleculo Technology

Wednesday 24 July 2013

In Jan 2013, Keith blogged about how super-long-read sequencing methods would be a threat to Illumina. Today, Illumina can openly acknowledge the shortcomings of its short reads for various applications, such as:

assembly of complex genomes (polyploid, containing excessive long repeat regions, etc.),
accurate transcript assembly,
metagenomics of complex communities,
and phasing of long haplotype blocks.

The reason? This latest data set released on BaseSpace:

Read length distribution of synthetic long reads for a D. melanogaster library

The data set, available as a single project in BaseSpace, can be accessed here.

image source: http://blog.basespace.illumina.com/2013/07/22/first-data-set-from-fasttrack-long-reads-early-access-service/

With the integration of Moleculo, they have managed to generate ~30 Gb of raw sequence data. They have refrained from talking about the 'key analysis metrics' that are available in the PDF report. Perhaps it's much easier to let the blogosphere and data scientists dissect the new data themselves.

I am wondering when the 454 versus Illumina Long Reads side-by-side comparison will pop up.

UPDATE:

I can't find the 'key analysis metrics' in the PDF report files. Perhaps they are still being uploaded? *shrugs*
Please update me if you see them; otherwise I will just have to run something on the data myself.

These are the files that I have now:

total 512M
259M Jul 18 01:01 mol-32-2832.fastq.gz
44K Jul 24 2013 FastTrackLongReads_dmelanogaster_281c.pdf
149K Jul 24 2013 mol-32-281c-scaffolds.txt
44K Jul 24 2013 FastTrackLongReads_dmelanogaster_2832.pdf
151K Jul 24 2013 mol-32-2832-scaffolds.txt
253M Jul 24 2013 mol-32-281c.fastq.gz

md5sums
6845fc3a4da9f93efc3a52f288e2d7a0 FastTrackLongReads_dmelanogaster_281c.pdf
02f5de4f7e15bbcd96ada6e78f659fdb FastTrackLongReads_dmelanogaster_2832.pdf
586599bb7fca3c20ba82a82921e8ba3f mol-32-281c-scaffolds.txt
b25010e9e5e13dc7befc43b5dff8c3d6 mol-32-281c.fastq.gz
6822cfbd3eb2a535a38a5022c1d3c336 mol-32-2832-scaffolds.txt
873f09080cdf59ed37b3676cddcbe26f mol-32-2832.fastq.gz
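If you have downloaded the same files, a quick way to compare them against the checksums above is a small script like the one below. This is only a minimal sketch: the helper names `md5_of` and `verify` are made up for illustration, and it assumes the files sit in the current directory under the same names as in the listing.

```python
import hashlib
from pathlib import Path

# Checksums copied from the md5sums listing in this post.
EXPECTED = {
    "FastTrackLongReads_dmelanogaster_281c.pdf": "6845fc3a4da9f93efc3a52f288e2d7a0",
    "FastTrackLongReads_dmelanogaster_2832.pdf": "02f5de4f7e15bbcd96ada6e78f659fdb",
    "mol-32-281c-scaffolds.txt": "586599bb7fca3c20ba82a82921e8ba3f",
    "mol-32-281c.fastq.gz": "b25010e9e5e13dc7befc43b5dff8c3d6",
    "mol-32-2832-scaffolds.txt": "6822cfbd3eb2a535a38a5022c1d3c336",
    "mol-32-2832.fastq.gz": "873f09080cdf59ed37b3676cddcbe26f",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory=".", expected=EXPECTED):
    """Map each expected file name to 'OK', 'MISMATCH' or 'MISSING'."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.exists():
            results[name] = "MISSING"
        elif md5_of(path) == want:
            results[name] = "OK"
        else:
            results[name] = "MISMATCH"
    return results


if __name__ == "__main__":
    # Hypothetical usage: run from the directory holding the downloads.
    for name, status in verify().items():
        print(f"{status:9s}{name}")
```

A MISMATCH on one of the fastq.gz files would most likely mean an incomplete download rather than a changed data set, but that is a guess.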

I have run FastQC (v0.10.1) on both samples; the images below are from 281c.
You can download the full HTML reports here:
https://www.dropbox.com/sh/5unu3zba9u21ywj/JT4HdkzfOP/mol-32-281c_fastqc.zip
https://www.dropbox.com/s/mpxa5wx51iqmiz3/mol-32-2832_fastqc.zip
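For a quick look at the read-length distribution without opening the FastQC report, something along these lines should do. It is a rough sketch, not what FastQC does internally: the function names are invented here, it assumes plain four-line FASTQ records (no wrapped sequences), and the 8500 bp default threshold is simply borrowed from the title of this post.

```python
import gzip
import sys


def read_lengths(fastq_gz_path):
    """Yield the sequence length of every record in a gzipped FASTQ file.

    Assumes the simple four-lines-per-record layout.
    """
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record is the sequence
                yield len(line.rstrip("\r\n"))


def summarise(lengths, threshold=8500):
    """Return read count, total bases, longest read, N50 and reads >= threshold."""
    lengths = sorted(lengths)
    total = sum(lengths)
    # N50: length of the read at which the running sum, taken from the
    # longest read downwards, first reaches half of all bases.
    running = 0
    n50 = 0
    for length in reversed(lengths):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "reads": len(lengths),
        "bases": total,
        "longest": lengths[-1] if lengths else 0,
        "n50": n50,
        "at_least_threshold": sum(1 for n in lengths if n >= threshold),
    }


if __name__ == "__main__":
    # Hypothetical usage: python read_lengths.py mol-32-281c.fastq.gz
    for key, value in summarise(read_lengths(sys.argv[1])).items():
        print(f"{key}\t{value}")
```

Since each synthetic long read is itself an assembly, these numbers describe the assembled fragments rather than the raw HiSeq reads behind them.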

Reading about the Moleculo sample prep method, it seems like it's just a rather ingenious way to stitch barcoded short reads together into a single long contig. If that is the case, I am not sure the base quality scores here are still meaningful, since each read is really a mini-assembly. I presume this also removes any quantitative value from the number of reads, so accurate quantification of long RNA molecules or splice variants isn't possible. Nevertheless, it's an interesting development on the Illumina platform. Looking forward to seeing more news about it.

Other links

Illumina Long-Read Sequencing Service
Moleculo technology: synthetic long reads for genome phasing, de novo sequencing
CoreGenomics: Genome partitioning: my moleculo-esque idea
Moleculo and Haplotype Phasing - The Next Generation Technologist
Abstract: Production Of Long (1.5kb – 15.0kb), Accurate, DNA Sequencing Reads Using An Illumina HiSeq2000 To Support De Novo Assembly Of The Blue Catfish Genome (Plant and Animal Genome XXI Conference)
http://www.moleculo.com/ (no info on this page though)
Illumina Announces Phasing Analysis Service for Human Whole-Genome Sequencing - MarketWatch

Illumina Announces Moleculo Long Read Technology and Phasing As Service
First publication using Long Read Seq (LRseq): The genome sequence of the colonial chordate, Botryllus schlosseri | eLife. Contains a diagram explaining the LRseq protocol. This experiment yielded ~1000 6.3kb fragments.

Patent information on the Long Read technology
https://docs.google.com/viewer?url=patentimages.storage.googleapis.com/pdfs/US20130079231.pdf

1 comment:

Anonymous, 7 August 2013 at 20:56
I'm intrigued by the long tail (head, rather) of short synthetic reads. Can you tell from the files, or from the read names, which synthetic reads came from the same pool? Then after mapping, one can find out whether these are the result of fragmented assemblies of each amplified long molecule. Even without the pooling info, mapping may shed some light on this...