Current Data Processing

A rough overview of our current processing pipeline; disk-usage figures are for an estimated 2x100 run.

Jumpgate

Jumpgate is the initial "spool" machine that holds the runfolder as it is produced by the current Illumina sequencing software and copied over with robocopy.

Rsync Images

At the moment, we copy the runfolder images directly from the spool to an 8-drive disk enclosure holding 1 TB drives. For 38 bp runs the whole run fits on a single disk; for longer runs we split the runfolder and images across several drives. We are currently thinking of switching to the WiebeTech RTX 800, as it should allow faster swapping of hard disks. (It's unclear how long we will keep doing this step, as it's much easier to copy the images to the analysis RAID array and delete them there after the image save time.) (Also, when necessary to fit within the 925 GB available on a disk, I sometimes skip copying the parts of the Data directory that are constructed by the RTA or IPAR software when making this "images" archive.)
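
As a minimal sketch of this copy, with hypothetical paths (the excluded Data/Intensities subdirectory is just a stand-in for whatever RTA/IPAR-derived directories actually need skipping):

```shell
# Hypothetical stand-ins for the spool runfolder and one enclosure disk.
SRC=/tmp/spool/090813_RUN0001
DST=/tmp/enclosure/disk1
mkdir -p "$SRC/Images/L001" "$SRC/Data/Intensities" "$DST"
touch "$SRC/Images/L001/s_1_1_a.tif" "$SRC/Data/Intensities/derived.cif"
# Copy the runfolder; skip software-derived Data files when space is tight.
rsync -a --exclude 'Data/Intensities/' "$SRC" "$DST"
```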

Copier bot

We have a software bot called copier that automatically copies runfolders from jumpgate to our analysis server several buildings away.


Run Pipeline

We then run the analysis pipeline, currently GAPipeline 1.4.0.

Collect Reads

Once the pipeline runs, a script called "runfolder" extracts the eland_extended files, captures the run parameters, and archives them to our on-line archive on loxcyc. Another tool, "srf", constructs a set of SRF files containing all reads, not just the ones that pass filter; these are also added to the on-line archive. We are currently using the illumina2srf tool provided in the GAPipeline-1.4.0.tar.gz release.
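
Gathering the eland_extended files might look like the following sketch; the GERALD directory layout and the archive path are hypothetical stand-ins for the real pipeline output and the loxcyc archive:

```shell
# Hypothetical pipeline output directory and archive destination.
GERALD=/tmp/run/Data/C1-76_Firecrest1.4.0/Bustard1.4.0/GERALD
ARCHIVE=/tmp/archive/090813_RUN0001
mkdir -p "$GERALD" "$ARCHIVE"
touch "$GERALD/s_1_eland_extended.txt" "$GERALD/s_2_eland_extended.txt"
# Pull the eland_extended result files into the on-line archive.
find "$GERALD" -name 's_*_eland_extended.txt' -exec cp {} "$ARCHIVE"/ \;
```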

Delete Images

After a month or two have passed, we delete the images from a runfolder.
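
A sketch of the deletion, with a hypothetical runfolder path; the real step waits a month or two first (which could be gated with find's -mtime), while this sketch deletes unconditionally:

```shell
RUN=/tmp/archive/090813_RUN0001          # hypothetical archived runfolder
mkdir -p "$RUN/Images/L001"
touch "$RUN/Images/L001/s_1_1_a.tif" "$RUN/RunInfo.xml"
# Remove the image tiles; everything else in the runfolder stays.
find "$RUN" -name '*.tif' -delete
```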

Delete Unnecessary Files

At the end of a runfolder's easily accessible life, we delete some of the unnecessary files. The "unnecessary files" are: "RunLog*.xml", "pipeline*.txt" (produced by our pipeline-running script), "*.log", "Calibration*", and "ReadPrep*". We also go into Data/C*Firecrest (if it exists) and do a "make clean_intermediate".

For IPAR 1.3 or RTA intensities we can also delete the Bustard and GERALD directories, as we can recover those files by re-running the pipeline. (In my tests I couldn't get Firecrest to be perfectly deterministic, whereas with IPAR runs I could get the same results.)
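
The unnecessary-files cleanup can be sketched as follows, with a hypothetical runfolder path; the Firecrest clean only happens when a Firecrest analysis directory is actually present:

```shell
RUN=/tmp/runfolder                      # hypothetical runfolder path
mkdir -p "$RUN"
touch "$RUN/RunLog_1.xml" "$RUN/pipeline_out.txt" "$RUN/s_1.log" \
      "$RUN/Calibration_A" "$RUN/ReadPrep1" "$RUN/RunInfo.xml"
# Remove the unnecessary files listed above; everything else stays.
( cd "$RUN" && rm -rf RunLog*.xml pipeline*.txt *.log Calibration* ReadPrep* )
# Clean Firecrest intermediates if a Firecrest analysis directory exists.
for d in "$RUN"/Data/C*Firecrest; do
    if [ -d "$d" ]; then make -C "$d" clean_intermediate; fi
done
```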

Compress Runfolder

Once the runfolder has been tarred and compressed, it is copied to the hot-swappable drive array; we then remove the disk, index it with our inventory system, and put it on our shelf. As a result, our on-line storage needs grow by an overall average of only 5.2 GiB per flowcell.
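
The tar-and-compress step itself is straightforward; the staging path and runfolder name here are hypothetical:

```shell
DONE=/tmp/done                           # hypothetical staging area
RUN=090813_RUN0001                       # hypothetical runfolder name
mkdir -p "$DONE/$RUN"
touch "$DONE/$RUN/RunInfo.xml"
# Tar and gzip the trimmed runfolder for the hot-swappable drive array.
tar -czf "$DONE/$RUN.tar.gz" -C "$DONE" "$RUN"
```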

I haven't compressed a 2x100 run yet, so I'm not sure what the final file size will be. It tends to be about half of the original runfolder size.

DataHandling (last edited 2009-08-13 17:30:49 by BeaverNet-165)