The data can be downloaded directly from the Illumina runfolder, though the layout can be slightly confusing.
HiSeq Directory Structure
The contents of a HiSeq runfolder are quite different from those of the previous version.
The result files are no longer stored in the <runfolder>/Data/ hierarchy. The raw FASTQ files are now in <runfolder>/Unaligned/Project_<Library_ID>_<Index_Number>/Sample_<Library_ID> [ 1 ], and the aligned files end up in <runfolder>/Aligned/Project_<Library_ID>_<Index_Number>/Sample_<Library_ID>.
For example, library 12410, which was given a barcode index of 5, would be found at:
[ 1 ] One additional complication with finding your data in Unaligned is that it is merely the default location. Sometimes we need to reprocess the intensities, and there will then be additional Unaligned/Aligned directories. Check the timestamps and try the later one, or ask Igor for more information.
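The "check the timestamp and try the later one" advice can be sketched in a few lines of Python. This is only a rough illustration: the function name newest_unaligned is made up here, and it assumes that reprocessed directories sit directly in the runfolder with names starting with "Unaligned", which may not match how a given run was actually laid out.

```python
from pathlib import Path


def newest_unaligned(runfolder):
    """Return the most recently modified Unaligned* directory, or None.

    Assumption: reprocessed output lives next to the default directory
    under a name that starts with "Unaligned".
    """
    candidates = [p for p in Path(runfolder).glob("Unaligned*") if p.is_dir()]
    if not candidates:
        return None
    # A later reprocessing run should carry a later modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Modification times can change when files are copied around, so treat the answer as a hint and ask if it looks wrong.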
The HiSeq pipeline splits the reads into multiple FASTQ files to make the processing easier to parallelize. Additionally, Illumina has finally decided to conform to the FASTQ standard and now uses Phred33 quality scores instead of its previous Phred64 encoding.
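If you are unsure which encoding a given FASTQ file uses, the lowest quality character in the first few records is a reasonable hint. The sketch below is a heuristic, not part of any Illumina tool: guess_phred_offset is a hypothetical helper, and it assumes plain four-line FASTQ records with no line wrapping.

```python
import gzip


def guess_phred_offset(fastq_path, max_records=1000):
    """Guess the quality offset (33 or 64) from the first few records."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    lowest = None
    with opener(fastq_path, "rt") as stream:
        for line_number, line in enumerate(stream):
            if line_number >= 4 * max_records:
                break
            if line_number % 4 == 3:  # fourth line of each record holds qualities
                quality = line.rstrip("\n")
                if quality:
                    low = min(map(ord, quality))
                    lowest = low if lowest is None else min(lowest, low)
    if lowest is None:
        return None
    # The 64-based encodings never use characters below ';' (ASCII 59),
    # so anything lower implies an offset of 33.
    return 33 if lowest < 59 else 64
```

A file containing only very high quality reads could be misreported as 64, so check more records if the answer looks suspicious.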
If you want to map the FASTQ files with TopHat or Bowtie, you can process the reads directly by providing a comma-separated list of filenames. Both programs can also directly process .gz (gzip) or .bz2 (bzip2) compressed files.
$ bowtie hg19 -1 12345_ATCACG_L007_R1_001.fastq.gz,12345_ATCACG_L007_R1_002.fastq.gz -2 12345_ATCACG_L007_R2_001.fastq.gz,12345_ATCACG_L007_R2_002.fastq.gz
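Typing out every chunk by hand gets tedious when a lane has been split into many files. The comma-separated lists can be built with a small Python sketch like the one below. The function name read_file_lists is made up for illustration, and it assumes the chunks follow the *_R1_NNN.fastq.gz / *_R2_NNN.fastq.gz naming seen in the command above.

```python
from pathlib import Path


def read_file_lists(sample_dir):
    """Return (read1, read2) comma-separated lists of FASTQ chunk paths."""
    sample_dir = Path(sample_dir)
    # Sorting keeps the zero-padded chunks (_001, _002, ...) in the same
    # order for both reads, so the mates stay paired up.
    read1 = sorted(str(p) for p in sample_dir.glob("*_R1_*.fastq.gz"))
    read2 = sorted(str(p) for p in sample_dir.glob("*_R2_*.fastq.gz"))
    return ",".join(read1), ",".join(read2)
```

The two returned strings can then be passed as the -1 and -2 arguments. If a directory holds more than one lane or sample, narrow the glob patterns first.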
Contents of GA, GAII, GAIIx Flowcell/Cycle Directory
The older GA/GAII series had the result files deep within the runfolder tree:
The QSeq files, which can be converted to FASTQ, are in <runfolder>/Data/Intensities/Bustard_<date>/s_<lane>_<read>_<tile>_qseq.txt
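As a rough idea of what the QSeq to FASTQ conversion involves, here is a minimal Python sketch for a single record. It is not the conversion code we actually use. It assumes the usual 11 tab-separated qseq columns (machine, run, lane, tile, x, y, index, read, sequence, quality, filter flag) with 64-based quality characters, and the read name layout and the function name qseq_to_fastq are illustrative choices.

```python
def qseq_to_fastq(qseq_line, keep_failed=True):
    """Convert one tab-separated qseq record to a four-line FASTQ record."""
    fields = qseq_line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError("expected 11 qseq fields, got %d" % len(fields))
    (machine, run, lane, tile, x, y, index, read,
     sequence, quality, passed_filter) = fields
    if passed_filter != "1" and not keep_failed:
        return None  # read failed the quality filter and was not wanted
    name = "%s_%s:%s:%s:%s:%s#%s/%s" % (
        machine, run, lane, tile, x, y, index, read)
    # qseq marks uncalled bases with '.', FASTQ conventionally uses 'N'.
    sequence = sequence.replace(".", "N")
    # Shift the quality characters from offset 64 down to offset 33.
    quality = "".join(chr(ord(c) - 31) for c in quality)
    return "@%s\n%s\n+\n%s\n" % (name, sequence, quality)
```

Drop the quality shift if the downstream tool expects the older Phred64 encoding.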
Eland Result Files
Depending on the age of the flowcell, the aligned files will be one of the following:
https://jumpgate.caltech.edu/library/ - has data organized by library id
Unfortunately, downloading large files over HTTP was somewhat unreliable, leading us to...
Long Term File Storage
The data is also accessible from /woldlab/loxcyc/data00/solexa-sequence on the wold-lab cluster. Everything but the most recent run or two is in the flowcell directory; the directory of results organized by library lags much longer. (I'm trying to shorten the time it takes me to push the results to the flowcell directory.)
- flowcell/<flowcell id>/<cycle id> usually contains the raw data in eland, srf, or fastq formats.
- srf/ for older srf files (can be converted to fastqs)
- library/<library id> for files by library id
The flowcell/cycle directory is the destination of the primary result files; the library directory is filled from the elements in the flowcell directory tree. (The srf directory is going to slowly go away.)
See Result Files for descriptions of the various aligned files.
As of December 2009, the IVC.html and PNGs are being copied to the cycle directory. (Not that there's an easy way to view the plots right now.)
There is a "run-date.xml", a somewhat complicated XML file that holds all of the parameter information about a run that I could find. (The htsworkflow code has tools to access the XML file if needed.)
Also as of December 2009, I'm switching from storing srf files to storing the qseq files. Though the srf files store slightly more information, the qseqs should be easier to work with (and I can compress them much better). A qseq tar file contains all of the qseqs for a particular "read end".
- site: woldlab
- run date: 091029
- machine name: HWI-EAS229
- run number: 0002
- flowcell id: 42TKGAAXX
- lane: l1
- read: r1
The lane value l1 is a lower-case L followed by the number 1.
In addition, the srf utility doesn't handle multiplexed runs properly. (Multiplexing is treated as an additional read, so a paired-end run with multiplexing actually produces three read files.)
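If you need to pull those components back out of an archive name in a script, something like the following Python sketch may help. Note the assumptions: it supposes the components are joined with underscores in the order listed above and followed by some extension, which is a guess about the naming made for illustration. The example name in the usage note and the function name parse_qseq_archive_name are likewise hypothetical.

```python
from collections import namedtuple

QseqArchive = namedtuple(
    "QseqArchive", "site run_date machine run_number flowcell lane read")


def parse_qseq_archive_name(filename):
    """Split a hypothetical underscore-joined archive name into its parts."""
    stem = filename.split(".", 1)[0]  # drop any extension
    parts = stem.split("_")
    if len(parts) != 7:
        raise ValueError(
            "expected 7 underscore-separated fields: %r" % filename)
    site, run_date, machine, run_number, flowcell, lane, read = parts
    # lane is a lower-case L plus a number, read is a lower-case R plus a number.
    if not (lane.startswith("l") and lane[1:].isdigit()):
        raise ValueError("unrecognized lane field: %r" % lane)
    if not (read.startswith("r") and read[1:].isdigit()):
        raise ValueError("unrecognized read field: %r" % read)
    return QseqArchive(site, run_date, machine, run_number, flowcell,
                       int(lane[1:]), int(read[1:]))
```

Under those assumptions, a name such as woldlab_091029_HWI-EAS229_0002_42TKGAAXX_l1_r1.tar.bz2 would parse to lane 1, read 1 of flowcell 42TKGAAXX.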