Small RNA Transfrags from CSHL
Small RNA reads from Cold Spring Harbor Lab (CSHL) were assembled 
into transfrags by merging overlapping reads.  In order to minimize 
ambiguity from reads that have the potential to map to multiple genomic 
loci, only the uniquely mapping reads were used to generate transfrags.  
The BED6+ format files are based on, but not generated directly by, the 
"intervals-to-contigs" Galaxy tool written by Assaf Gordon (gordon@cshl.edu) in the Hannon lab at CSHL. Below is a description of the columns in this format, and how each column is calculated.
Output Columns
(Bed-style transfrag information)
- chromosome (column 1)
- transfrag's start coordinate (column 2)
- transfrag's end coordinate (column 3)
- transfrag's name (column 4).  The numeral in the name indicates the transfrag's rank by abundance within this dataset.
- Score, 0 to 1000 (column 5).  Scores are calculated as: 1000*[# reads in transfrag]/[# reads in most abundant transfrag in this dataset]
- Strand: orientation, + or - (column 6)
 
(Additional Sequences Information)
- transfrag's length: number of covered bases = end - start (column 7)
- number of unique sequences in this transfrag (column 8)
- total reads count in this transfrag (column 9)
- minimum sequence-count value (column 10)
- maximum sequence-count value (column 11)
- average sequence-count value (column 12)
- first-quartile sequence-count value (column 13)
- median sequence-count value (column 14)
- third-quartile sequence-count value (column 15)
 
(Additional Reads Information)
- minimum reads-count value (column 16)
- maximum reads-count value (column 17)
- average reads-count value (column 18)
- first-quartile reads-count value (column 19)
- median reads-count value (column 20)
- third-quartile reads-count value (column 21)
 
(Additional Intervals Information)
- number of significant regions in this transfrag (column 22).  Sequence-count and reads-count are constant within a region and change at its boundaries.
- starting coordinates of the significant regions in this transfrag, comma-separated; see the example below (column 23)
- length, in bases, of each significant region (column 24)
- sequence-count for each significant region (column 25)
- reads-count for each significant region (column 26)
- integrated reads-count sum: inner product of columns 24 and 26 (column 27)
 
Concrete Example
Assume the following intervals over an imaginary chromosome chr1 (columns: chromosome, start, end, reads count):
chr1        100     132     4
chr1        110     142     3
chr1        130     160     7
chr1        170     201     3
chr1        190     225     1
Plotting these intervals:
 
These intervals cover two contiguous regions (marked in red): 100-160 and 170-225.
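The merging step can be sketched in a few lines of Python. This is an illustration only, not the code used to produce the files: the function name is hypothetical, all intervals are assumed to lie on one chromosome and strand, two intervals are assumed to overlap when one starts at or before the current end, and each input interval is assumed to be one unique sequence whose fourth column is its reads count.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end, reads) intervals into transfrags.

    Returns one (start, end, n_sequences, n_reads) tuple per transfrag.
    ASSUMPTION: every interval is on the same chromosome and strand, and
    each interval represents one unique sequence.
    """
    merged = []
    for start, end, reads in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Interval overlaps the transfrag being built: extend it.
            current = merged[-1]
            current[1] = max(current[1], end)
            current[2] += 1
            current[3] += reads
        else:
            # Gap before this interval: start a new transfrag.
            merged.append([start, end, 1, reads])
    return [tuple(t) for t in merged]


example = [(100, 132, 4), (110, 142, 3), (130, 160, 7),
           (170, 201, 3), (190, 225, 1)]
transfrags = merge_intervals(example)
print(transfrags)
```

For the five intervals above this yields the two regions 100-160 (3 sequences, 14 reads) and 170-225 (2 sequences, 4 reads).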
The output file will contain two lines (one for each transfrag):
chr1      100     160     transfrag-1     1000     +     60      3       14      1       3       1.6     1       2       2       4       14      7.35    7       7       7       5       100,110,130,133,143     10,20,3,10,17   1,2,3,2,1       4,7,14,10,7     441
chr1      170     225     transfrag-2     286      +     55      2       4       1       2       1.21818 1       1       1       1       4       2.38182 1       3       3       3       170,190,202             20,12,23        1,2,1           3,4,1   131
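A line in this format can be split into named fields as sketched below. The field names are hypothetical labels chosen for this illustration (the files themselves carry no header), and the sketch assumes whitespace-separated columns with comma-separated lists in columns 23 to 26.

```python
# Hypothetical names for the 27 columns, in file order.
FIELDS = [
    "chrom", "start", "end", "name", "score", "strand",
    "length", "n_sequences", "n_reads",
    "seq_min", "seq_max", "seq_mean", "seq_q1", "seq_median", "seq_q3",
    "reads_min", "reads_max", "reads_mean", "reads_q1", "reads_median",
    "reads_q3",
    "n_regions", "region_starts", "region_lengths",
    "region_seq_counts", "region_read_counts", "integrated_reads",
]
TEXT_FIELDS = {"chrom", "name", "strand"}
LIST_FIELDS = {"region_starts", "region_lengths",
               "region_seq_counts", "region_read_counts"}
# Columns 10-21 are read as floats because the averages are fractional.
FLOAT_FIELDS = set(FIELDS[9:21])


def parse_transfrag_line(line):
    """Convert one transfrag line into a dict keyed by FIELDS."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    record = {}
    for field, value in zip(FIELDS, values):
        if field in TEXT_FIELDS:
            record[field] = value
        elif field in LIST_FIELDS:
            record[field] = [int(v) for v in value.split(",")]
        elif field in FLOAT_FIELDS:
            record[field] = float(value)
        else:
            record[field] = int(value)
    return record


line = ("chr1 100 160 transfrag-1 1000 + 60 3 14 1 3 1.6 1 2 2 "
        "4 14 7.35 7 7 7 5 100,110,130,133,143 10,20,3,10,17 "
        "1,2,3,2,1 4,7,14,10,7 441")
record = parse_transfrag_line(line)
print(record["name"], record["region_lengths"], record["integrated_reads"])
```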
The rest of the explanation will focus on the first transfrag only:
(transfrag information):
- transfrag is on chromosome chr1 (column 1)
- transfrag starts at coordinate 100 (column 2)
- transfrag ends at coordinate 160 (column 3)
- transfrag's name is transfrag-1 (column 4).  It is the most abundant transfrag in the sample, hence the rank of 1.
- transfrag has a score of 1000 (column 5): it is the most abundant transfrag in this dataset.  The second transfrag has a score of 1000*(4/14) = 286 (rounded).
- transfrag has sense orientation (column 6), assumed because no orientation information was found.
- transfrag covers 60 bases (column 7)
- transfrag has 3 sequences (column 8)
- transfrag has 14 reads (column 9)
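The naming and scoring rule (columns 4 and 5) can be sketched as follows. The function name is hypothetical, and two details are assumptions rather than documented behavior: scores are rounded to the nearest integer (which matches 286 in the example), and ties in abundance keep their input order.

```python
def name_and_score(read_totals):
    """Return a (name, score) pair for each transfrag, in input order.

    read_totals: total reads count (column 9) of every transfrag in the
    dataset. Rank 1 goes to the most abundant transfrag.
    ASSUMPTION: scores are rounded; ties are ranked in input order.
    """
    top = max(read_totals)
    # Indices of the transfrags, most abundant first (stable for ties).
    by_abundance = sorted(range(len(read_totals)),
                          key=lambda i: -read_totals[i])
    result = [None] * len(read_totals)
    for rank, index in enumerate(by_abundance, start=1):
        score = round(1000 * read_totals[index] / top)
        result[index] = (f"transfrag-{rank}", score)
    return result


named = name_and_score([14, 4])
print(named)
```

With the read totals 14 and 4 from the example, this gives transfrag-1 with score 1000 and transfrag-2 with score 286.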
(sequence-count information):
- minimum sequence-count is 1 (only one interval covers coordinates 100 to 110) (column 10)
- maximum sequence-count is 3 (three intervals cover coordinates 130 to 132) (column 11)
- average sequence-count value is 1.6 (60 bases are covered, with coverage sum = 10x1 + 20x2 + 3x3 + 10x2 + 17x1 = 96; 96/60 = 1.6) (column 12)
- first-quartile sequence-count value is 1 (27 bases are covered with value 1, 30 bases with value 2, and 3 bases with value 3) (column 13)
- median sequence-count value is 2 (column 14)
- third-quartile sequence-count value is 2 (column 15)
(reads-count information):
- minimum reads-count is 4 (coordinates 100 to 110 are covered by the lowest number of reads = 4) (column 16)
- maximum reads-count is 14 (three intervals, whose reads-count sum is 14, cover coordinates 130 to 132) (column 17)
- average reads-count value is 7.35 (60 bases are covered, with coverage sum = 10x4 + 20x7 + 3x14 + 10x10 + 17x7 = 441; 441/60 = 7.35) (column 18)
- first-quartile reads-count value is 7 (10 bases covered with 4, 37 covered with 7, 10 covered with 10, 3 covered with 14) (column 19)
- median reads-count value is 7 (column 20)
- third-quartile reads-count value is 7 (column 21)
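The per-base statistics in columns 10 to 21 can be reproduced from the region lengths and per-region values, as in the sketch below. The function name is hypothetical, and the quartile convention is an assumption: the source does not state how quartiles are picked, so this sketch takes the sorted per-base values at positions n/4, n/2 and 3n/4 (integer division), which happens to match both example lines.

```python
def per_base_stats(lengths, values):
    """Summarize a per-base value over one transfrag.

    lengths: number of bases in each significant region (column 24).
    values:  sequence-count (column 25) or reads-count (column 26)
             of each region.
    ASSUMPTION: quartiles are read off the sorted per-base values at
    index n//4, n//2 and 3*n//4.
    """
    # One entry per covered base, sorted ascending.
    per_base = sorted(v for n, v in zip(lengths, values) for _ in range(n))
    total = len(per_base)
    return {
        "min": per_base[0],
        "max": per_base[-1],
        "mean": sum(per_base) / total,
        "q1": per_base[total // 4],
        "median": per_base[total // 2],
        "q3": per_base[3 * total // 4],
    }


lengths = [10, 20, 3, 10, 17]
seq_stats = per_base_stats(lengths, [1, 2, 3, 2, 1])
reads_stats = per_base_stats(lengths, [4, 7, 14, 10, 7])
print(seq_stats)
print(reads_stats)
```

Expanding every base into a list keeps the sketch short; for very long transfrags a cumulative-length lookup would avoid the extra memory.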
(significant regions information):
- This transfrag has five significant regions (column 22).
- Significant coordinates are the ones at which the sequence-count and reads-count change. In this example, the coordinates are 100,110,130,133,143. Look at the plot to better understand how these coordinates are determined. (column 23)
- Each significant region covers a varying number of bases. Example: the region which starts at 100 covers 10 bases, the region which starts at 110 covers 20 bases, the region which starts at 130 covers 3 bases, etc. (column 24)
- Sequence-count value for each significant region. Example: the region which starts at 100 has sequence-count=1. (column 25)
- Reads-count value for each significant region. Example: the region which starts at 100 has reads-count=4. (column 26)
- Integrated reads-count sum (inner product of column 24 and column 26) = 10*4 + 20*7 + 3*14 + 10*10 + 17*7 = 441 (column 27)
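Columns 23 to 27 can be derived from the intervals of one transfrag with a simple sweep over the positions where coverage changes. The sketch below is an illustration under a loudly stated assumption: the example's numbers are reproduced only if each interval's end coordinate is treated as inclusive (coverage drops at end + 1) while the transfrag's own end is kept as-is. That rule is inferred from the example, not documented, and the function name is hypothetical.

```python
def significant_regions(intervals):
    """Split one transfrag into regions of constant coverage.

    intervals: (start, end, reads) tuples that all belong to ONE transfrag.
    Returns (starts, lengths, seq_counts, read_counts, integrated_reads).
    ASSUMPTION: an interval stops covering bases at end + 1, capped at
    the transfrag's end (rule inferred from the worked example).
    """
    transfrag_end = max(end for _, end, _ in intervals)
    # position -> [change in sequence-count, change in reads-count]
    deltas = {}
    for start, end, reads in intervals:
        stop = min(end + 1, transfrag_end)
        for position, sign in ((start, 1), (stop, -1)):
            change = deltas.setdefault(position, [0, 0])
            change[0] += sign
            change[1] += sign * reads

    positions = sorted(deltas)
    starts, lengths, seq_counts, read_counts = [], [], [], []
    n_seq = n_reads = 0
    # Each pair of consecutive change points bounds one region.
    for position, next_position in zip(positions, positions[1:]):
        n_seq += deltas[position][0]
        n_reads += deltas[position][1]
        starts.append(position)
        lengths.append(next_position - position)
        seq_counts.append(n_seq)
        read_counts.append(n_reads)

    # Column 27: inner product of region lengths and reads-counts.
    integrated = sum(n * r for n, r in zip(lengths, read_counts))
    return starts, lengths, seq_counts, read_counts, integrated


first = significant_regions([(100, 132, 4), (110, 142, 3), (130, 160, 7)])
print(first)
```

Applied to the three intervals of the first transfrag, the sweep returns the starts 100,110,130,133,143, the lengths 10,20,3,10,17, and the integrated sum 441.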