Early Access

It is possible to download the data directly from the Illumina runfolder, though the layout can be slightly confusing.

HiSeq Directory Structure

The contents of a HiSeq runfolder are quite different from those of the previous version.

The result files are no longer stored in the <runfolder>/Data/ hierarchy. The raw FASTQ files are now in <runfolder>/Unaligned/Project_<Library_ID>_<Index Number>/Sample_<Library_ID> [ 1 ], and the aligned files will end up in a <runfolder>/Aligned/Project_<Library_ID>_<Index_Number>/Sample_<Library_ID> directory.

For example, library 12410, which was given a barcode index of 5, would be found at:


[ 1 ] One additional complication with finding your data in Unaligned is that it is merely the default location. Sometimes we will need to reprocess the intensities, and there will be additional Unaligned/Aligned directories. Check the timestamps and try the later one, or ask Igor for more information.
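The "check the timestamp and try the later one" advice can be sketched in Python. This is only an illustration under stated assumptions: it assumes any reprocessed directories have names that begin with "Unaligned", and it uses a wildcard for the index part of the Project_ directory name because the exact formatting of that component is not spelled out here. The function name find_sample_dirs is hypothetical.

```python
import glob
import os


def find_sample_dirs(runfolder, library_id):
    """Return matching Sample_<library_id> directories, newest first.

    Assumption: reprocessed output lives in directories whose names
    start with "Unaligned" (for example a suffixed copy).
    """
    pattern = os.path.join(
        runfolder,
        "Unaligned*",                        # default and any reprocessed copies
        "Project_{0}_*".format(library_id),  # index part left as a wildcard
        "Sample_{0}".format(library_id),
    )
    candidates = [p for p in glob.glob(pattern) if os.path.isdir(p)]

    def unaligned_mtime(path):
        # Two levels up from Sample_<id> is the Unaligned* directory itself.
        top = os.path.dirname(os.path.dirname(path))
        return os.path.getmtime(top)

    return sorted(candidates, key=unaligned_mtime, reverse=True)
```

The first entry of the returned list is the candidate from the most recently modified Unaligned directory. If the list has more than one entry, it is still worth asking which one is the intended result.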

HiSeq Fastqs

The HiSeq pipeline splits the reads into multiple FASTQ files to make it easier for Illumina's pipeline to parallelize. Additionally, Illumina has finally decided to conform to the FASTQ standard and now uses Phred33 quality scores instead of their previous Phred64 encoding.
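To make the encoding change concrete, here is a minimal Python sketch. In both encodings each quality character stores the score plus an ASCII offset (33 or 64), so converting between them is a fixed shift of 31. The function names are illustrative and are not part of any pipeline tool.

```python
def phred64_to_phred33(quality):
    """Re-encode a Phred64 quality string as Phred33."""
    # Remove the old offset of 64, then apply the new offset of 33.
    return "".join(chr(ord(c) - 64 + 33) for c in quality)


def decode_phred33(quality):
    """Return the integer quality scores of a Phred33 quality string."""
    return [ord(c) - 33 for c in quality]
```

For example, the Phred64 string "hhB" becomes "II#" in Phred33, which decodes to the scores 40, 40 and 2.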

If you want to map the FASTQ files with TopHat or Bowtie, you can process the reads directly by providing a comma-separated list of filenames. Both programs can also directly process .gz (gzip) or .bz2 (bzip2) compressed files.

For example:

$ bowtie hg19 -1 12345_ATCACG_L007_R1_001.fastq.gz,12345_ATCACG_L007_R1_002.fastq.gz -2 12345_ATCACG_L007_R2_001.fastq.gz,12345_ATCACG_L007_R2_002.fastq.gz
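Typing the comma-separated lists by hand gets tedious when there are many chunks. The following Python sketch groups the filenames by read and joins them in chunk order. It assumes the filename layout seen in the example above (a prefix ending in _L<lane>, then _R<read>_<chunk>.fastq.gz); the regular expression and the function name are illustrative only.

```python
import re
from collections import defaultdict

# Layout inferred from the example filenames:
# <prefix ending in _L<lane>>_R<read>_<chunk>.fastq or .fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<prefix>.+_L\d+)_R(?P<read>[12])_(?P<chunk>\d+)\.fastq(\.gz)?$"
)


def paired_file_lists(filenames):
    """Return (read1_list, read2_list) as comma-separated strings."""
    by_read = defaultdict(list)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # skip anything that does not look like a FASTQ chunk
        key = (match.group("prefix"), int(match.group("chunk")))
        by_read[match.group("read")].append((key, name))

    def joined(read):
        # Sort by prefix and chunk number so both reads line up chunk by chunk.
        return ",".join(name for _, name in sorted(by_read[read]))

    return joined("1"), joined("2")
```

The two returned strings can be passed after -1 and -2 respectively, as in the bowtie command above.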

Contents of GA, GAII, GAIIx Flowcell/Cycle Directory

The older GA/GAII series had the result files deep within the runfolder tree:

The QSeq files, which can be converted to FASTQ, are in <runfolder>/Data/Intensities/Bustard_<date>/s_<lane>_<read>_<tile>_qseq.txt

The aligned files are in <runfolder>/Data/Intensities/Bustard_<date>/GERALD_<date>/<Result Files>.
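As a rough illustration of the QSeq to FASTQ conversion mentioned above, here is a small Python sketch. It assumes the usual QSeq layout of 11 tab-separated columns (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter flag), with "." marking an unknown base and Phred64 qualities. The function name and the header format it emits are choices made for this example, not the pipeline's own converter.

```python
def qseq_to_fastq(line):
    """Convert one QSeq line to (fastq_record, passed_filter).

    Assumes 11 tab-separated columns and Phred64 qualities.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError("expected 11 columns, found {0}".format(len(fields)))
    machine, run, lane, tile, x, y, index, read, seq, qual, passed = fields

    seq = seq.replace(".", "N")                       # QSeq writes unknown bases as "."
    qual = "".join(chr(ord(c) - 31) for c in qual)    # Phred64 to Phred33

    header = "@{0}_{1}:{2}:{3}:{4}:{5}#{6}/{7}".format(
        machine, run, lane, tile, x, y, index, read
    )
    record = "{0}\n{1}\n+\n{2}\n".format(header, seq, qual)
    return record, passed == "1"
```

Returning the filter flag separately leaves it to the caller to decide whether reads that failed the chastity filter should be written out.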

Eland Result Files

Depending on the age of the flowcell, the aligned files will be one of the following:

Web Access

https://jumpgate.caltech.edu/library/ - has data organized by library id

Unfortunately, downloading large files over HTTP was somewhat unreliable, leading us to...

Long Term File Storage

The data is also accessible from /woldlab/loxcyc/data00/solexa-sequence on the wold-lab cluster. Everything but the most recent run or two is in the flowcell directory; the directory of results organized by library lags much longer. (I'm trying to shorten the time it takes me to push the results to the flowcell directory.)

Look in:

The flowcell/cycle directory is the destination of the primary result files; the library directory is filled from the elements in the flowcell directory tree. (The srf directory is going to slowly go away.)

See Result Files for descriptions of the various aligned files.

As of December 2009, the IVC.html and PNGs are being copied to the cycle directory. (Not that there's an easy way to view the plots right now.)

There is a "run-date.xml", a somewhat complicated XML file that holds all of the parameter information about a run that I could find. (The htsworkflow code has tools to access the XML file if needed.)

Also as of December 2009, I'm switching from storing srf files to storing the qseq files. Though the srf files store slightly more information, the qseqs should be easier to work with (and I can compress them much better). A qseq tar file contains all of the qseqs for a particular "read end".

For example:


lane is a lower-case L followed by the number 1.
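If you want to read the qseqs without unpacking a tar file to disk first, something like the following Python sketch should work. It makes no assumption about how the archive is compressed, since the "r:*" mode lets the tarfile module detect that on its own. The function name is illustrative, and the sketch assumes the members are plain ASCII text files.

```python
import tarfile


def iter_qseq_lines(tar_path):
    """Yield (member_name, line) for every line of every file in the archive."""
    # "r:*" opens the archive for reading with transparent compression detection.
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            handle = archive.extractfile(member)
            for raw in handle:
                yield member.name, raw.decode("ascii").rstrip("\n")
```

Because this is a generator, only one line is held in memory at a time, which matters for archives that hold all of the tiles of a lane.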

In addition, the srf utility doesn't handle multiplexed runs properly. (Multiplexing is treated as an additional read, so a paired-end run with multiplexing actually produces 3 read files.)

StoredSequencerOutput (last edited 2011-12-29 21:11:13 by diane)