qseq format

  1. Machine name: (hopefully) unique
  2. Run number: (hopefully) unique
  3. Lane number: [1..8]
  4. Tile number: positive integer
  5. X: x coordinate of the spot (can be negative)
  6. Y: y coordinate of the spot (can be negative)
  7. Index: positive integer; should be greater than 0 (the files I have seen set it to 0)
  8. Read number: 1 for single read, 1 or 2 for paired end
  9. Sequence: the base calls ('.' marks a base that could not be called)
  10. Quality: the calibrated quality string
  11. Filter: did the read pass QC? 0 = no, 1 = yes
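
As a minimal sketch of the field layout above, the following splits one tab-separated qseq line into named fields. The QseqRecord type, its field names, the parse_qseq_line helper, and the example line (including the machine name) are all made up for illustration; they are not part of any pipeline.

```python
from collections import namedtuple

# Hypothetical record type; the field names are chosen for illustration only.
QseqRecord = namedtuple(
    "QseqRecord",
    "machine run lane tile x y index_value read sequence quality passed_filter",
)


def parse_qseq_line(line):
    """Split one tab-separated qseq line into a QseqRecord."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 11:
        raise ValueError("expected 11 fields, found %d" % len(fields))
    machine, run, lane, tile, x, y, index_value, read, sequence, quality, flt = fields
    return QseqRecord(
        machine=machine,
        run=run,
        lane=int(lane),
        tile=int(tile),
        x=int(x),            # coordinates may be negative
        y=int(y),
        index_value=index_value,
        read=int(read),
        sequence=sequence,
        quality=quality,
        passed_filter=(flt == "1"),  # 0 = failed QC, 1 = passed QC
    )


# A made-up line, not taken from a real run.
example = "\t".join([
    "HWUSI-EAS000", "1", "3", "42", "1021", "-5", "0", "1",
    "ACGT.", "hhhgB", "1",
])
record = parse_qseq_line(example)
print(record.lane, record.y, record.sequence, record.passed_filter)
```

The index field is kept as a string here because the list above only says what it should be, not what every file actually contains.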

The quality score in the qseq file (as of pipeline 1.3) is the Phred score encoded as an ASCII character by adding 64 to the Phred value.
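
To make the encoding concrete, here is a small sketch that decodes a Phred+64 quality string into numeric scores and re-encodes the scores with an offset of 33. The function names and the sample string are assumptions for illustration.

```python
def phred64_to_scores(quality):
    """Decode a Phred+64 quality string into a list of integer Phred scores."""
    return [ord(c) - 64 for c in quality]


def scores_to_phred33(scores):
    """Encode integer Phred scores as a Phred+33 quality string."""
    return "".join(chr(score + 33) for score in scores)


# 'h' is ASCII 104, so its score is 104 - 64 = 40; 'B' is ASCII 66, score 2.
scores = phred64_to_scores("hhB")
print(scores)                     # [40, 40, 2]
print(scores_to_phred33(scores))  # II#
```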

This is a function to convert qseq to FASTQ with Phred33 quality scores reasonably quickly in Python. It requires numpy. Some software, such as bowtie, will accept either Phred33 or Phred64 formatted reads.

destination is a writable Python stream. qseqs is a list of open streams, one for each qseq file. trim, if given, is a slice that is applied to both the sequence and the quality string.

import os

import numpy

def qseq2fastq(destination, qseqs, trim=None, pf=False):
    for qstream in qseqs:
        for line in qstream:
            # parse line
            record = line.strip().split('\t')
            machine_name = record[0]
            run_number = record[1]
            lane_number = record[2]
            tile = record[3]
            x = record[4]
            y = record[5]
            index = record[6]
            read = record[7]
            sequence = record[8].replace('.','N')
            # Illumina scores are Phred + 64
            # Fastq scores are Phred + 33
            # The following code views the string as unsigned 8-bit ints and
            # subtracts 31 (64 - 33) to convert between the two score formats.
            # The numpy solution is twice as fast as some of my other
            # ideas for the conversion.
            quality = numpy.frombuffer(record[9].encode('ascii'),
                                       dtype=numpy.uint8)
            quality = (quality - 31).tobytes().decode('ascii')

            # trim is a slice applied to both sequence and quality;
            # None means keep the whole read.
            if trim is None:
                trim = slice(None)

            # Assumption: when pf is true, append the filter flag (field 11)
            # to the read name; otherwise append nothing.
            if pf:
                pass_qc_msg = ' pf=%s' % record[10]
            else:
                pass_qc_msg = ''

            destination.write('@%s_%s:%s:%s:%s:%s/%s%s%s' % (
                machine_name,
                run_number,
                lane_number,
                tile,
                x,
                y,
                read,
                pass_qc_msg,
                os.linesep))
            destination.write(sequence[trim])
            destination.write(os.linesep)
            destination.write('+')
            destination.write(os.linesep)
            destination.write(quality[trim])
            destination.write(os.linesep)

The numpy trick is a bit confusing but noticeably faster. It's roughly equivalent to doing:

quality = []
for c in record[9]:
  # convert character to an integer, 
  # subtract 31, 
  # convert back to a character
  quality.append( chr(ord(c)-31) )
quality = "".join(quality)
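
If numpy is not available, a different technique that avoids the per-character loop is str.translate with a precomputed table. This is only a sketch of that alternative, not the method used above, and I have not timed it against the numpy version. The table and function names are made up for illustration.

```python
# Map each Phred+64 character code (64..126) to the code 31 lower,
# which is the same score in Phred+33. Codes below 64 are not in the
# table, so translate() leaves those characters unchanged.
PHRED64_TO_PHRED33 = {code: code - 31 for code in range(64, 127)}


def convert_quality(quality):
    """Convert a Phred+64 quality string to Phred+33 in one call."""
    return quality.translate(PHRED64_TO_PHRED33)


def convert_quality_loop(quality):
    """The character-by-character version, for comparison."""
    return "".join(chr(ord(c) - 31) for c in quality)


sample = "hhhgB@"  # made-up quality string
print(convert_quality(sample))
print(convert_quality(sample) == convert_quality_loop(sample))
```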

QSeq (last edited 2010-11-04 23:22:30 by diane)