Bioinformatics Analysis FAQ - Highly expressed genes in differential gene expression testing

This question arises because CuffDiff will skip testing gene loci that have more than X reads mapped across all samples.  By default, CuffDiff sets X to 1,000,000.  However, there are deeper statistical considerations behind this cutoff.

In the Module 2 RNA-seq lecture, we mentioned that the "per million mapped reads" part of that normalization can get skewed when one or more of the samples express a subset of genes at extremely high levels.  (Recall the example of globins in blood serum.)  Effectively, when you compare two samples (or groups of samples) in the traditional manner, you are making the explicit assumption that the total number of transcripts is the same between your two conditions.  That assumption can be grossly wrong in some systems.  In those cases, you need to perform quartile normalization (not to be confused with the quantile normalization used in microarrays!).  Quartile normalization temporarily ignores the upper quartile (top 25%) of genes that are most highly expressed, and counts only the reads mapped to the lower 75% of genes.  This gives a much better "per million mapped reads" estimate for the well-behaved genes to use in the denominator of that part of the normalization.  That way, if the two conditions are wildly out of sync in their most highly expressed genes, it won't adversely affect the normalization.

The nice things about quartile normalization are:

(1) In most cases, the non-expressed, lowly-expressed, and modestly-expressed portions of the system are actually pretty well-behaved and consistent across a wide range of samples.  They can be used as a very accurate baseline for the "depth of sequencing", regardless of how asymmetric the underlying differences in expression are at the high end.  For example, if 5 of 6 samples are well-behaved, they won't really be much affected, but the poorly-behaved sample will be properly weighted relative to the others.  This is illustrated with a concrete example below.

(2) Although it figures out the denominator (i.e., the normalization factor describing the "per million mapped reads" part of the abundance) from the lower 75%, it applies that normalization factor to everything, so you don't have to apply it separately to the lower and upper sets (see the sketch right after this list).
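To make that procedure concrete, here is a minimal Python/NumPy sketch of the idea as described above (written for this FAQ; it is not CuffDiff's actual implementation, and the function names are my own).  Each sample's denominator is the number of reads mapped to its lower 75% of expressed genes, and that single factor is then applied to every gene in the sample:

    import numpy as np

    def upper_quartile_denominators(counts):
        """One 'per million mapped reads' denominator per sample, computed from
        the lower 75% of expressed genes only (quartile-normalization sketch).
        counts: genes x samples array of raw read counts."""
        counts = np.asarray(counts, dtype=float)
        denominators = []
        for sample in counts.T:                          # loop over samples (columns)
            expressed = sample[sample > 0]               # ignore non-expressed genes
            cutoff = np.percentile(expressed, 75)        # upper-quartile boundary
            denominators.append(expressed[expressed <= cutoff].sum())
        return np.array(denominators)

    def normalize(counts, denominators):
        """Apply each sample's factor to ALL genes, not just the lower 75%."""
        return np.asarray(counts, dtype=float) / denominators * 1_000_000

The key design point is in normalize(): the denominator is estimated from the well-behaved lower 75%, but the resulting scaling is applied uniformly to every gene, highly expressed ones included.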

If you think about it, what you really want to do when comparing samples is make sure that if you arbitrarily sequence one sample to twice the depth of another, you correct for that over-sampling in the end.  Empirically, it has been shown for many systems that the expression level of the moderately-expressed genes has a pretty constant distribution from cell type to cell type, but the most highly expressed genes can in some cases be orders of magnitude higher than in others.  [NOTE: this may not hold for dying cells, which may have a systematic lowering of ALL transcripts, so hopefully your cells that ultimately die are not doing this yet!]  In these normal systems, if a couple of genes make up 70% of all the reads we recover, we're drowning out all the rest of the signal.  Effectively, we've wasted most of our sequencing power on those couple of genes, and our actual sequencing depth for the rest of the system in that condition was far lower.

This is an idealized example for illustrative purposes:

Sample_ID  gene    raw_1Kb_read_count
1          Act7          1000
1          GapDH         2000
1          Sec4           500
...
1          Bglob1    10000000
           TOTAL WITH Bglob1:  12,000,000
           TOTAL WITHOUT:       2,000,000

2          Act7           500
2          GapDH         1000
2          Sec4           250
...
2          Bglob1          10
           TOTAL WITH Bglob1:   1,000,010
           TOTAL WITHOUT:       1,000,000

Sample 1 is dominated by Bglob1 (e.g., blood). Sample 2 has effectively little or no expression of that gene (e.g., skin). We see right away that all genes except Bglob1 in sample 2 have half as many reads as were mapped in sample 1, and intuitively we can assume that our effective sampling of sample 2 was half as deep as sample 1. I am only showing 3 other genes here for simplicity, but let's assume that nearly all of the other genes showed this general reduction between samples, with some here and there actually differentially expressed. Let's also assume that there really isn't a biological across-the-board depression of transcription going on.

The normalization for the number of mapped reads is straightforward. For sample 1, Act7 is 1 Kb long and has 1000 reads: (1000 reads / 12,000,000 total mapped reads) * 1,000,000 = 83.3 FPKM. For sample 2, Act7 has 500 reads, so the basic normalization would yield (500 reads / 1,000,010) * 1,000,000 = 499.995 FPKM.  The 83.3 FPKM was hugely deflated by the skewing caused by the very high expression of Bglob1 (it is not that the 499.995 FPKM in sample 2 was inflated).

Quartile normalization would remove the top quarter of genes (let's say that everything removed other than Bglob1 was inconsequential relative to the 10 million reads mapped to Bglob1 in the first sample). So now sample 1 has 2 million mapped reads that count, and sample 2 has 1 million that count. To calculate abundances, sample 1 has Act7 at (1000 reads / 2,000,000) * 1,000,000 = 500 FPKM, and sample 2 has Act7 at (500 reads / 1,000,000) * 1,000,000 = 500 FPKM. So now it follows our intuition.

Notice that Bglob1 still follows our intuition when normalized with the new denominator: Bglob1 in sample 1 = (10,000,000 / 2,000,000) * 1,000,000 = 5,000,000 FPKM (i.e., a very high number, with "per million mapped reads" loosely taken to mean "per million reads mapped to the bulk of the ordinary genes"). Bglob1 in sample 2 would be (10 / 1,000,000) * 1,000,000 = 10 FPKM (i.e., considerably smaller than Act7, as expected).
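If you want to check the arithmetic yourself, here is a short Python sketch that reproduces the worked example (the gene names, counts, and totals are the idealized numbers from the table above, not real data, and per_million is just a helper name I made up):

    # Idealized per-gene read counts from the table above (all genes are 1 Kb).
    sample1 = {"Act7": 1_000, "GapDH": 2_000, "Sec4": 500, "Bglob1": 10_000_000}
    sample2 = {"Act7": 500, "GapDH": 1_000, "Sec4": 250, "Bglob1": 10}

    # Denominators taken from the table's totals (the "..." genes are included).
    naive_denom = {"s1": 12_000_000, "s2": 1_000_010}      # all mapped reads
    quartile_denom = {"s1": 2_000_000, "s2": 1_000_000}     # lower-75% reads only

    def per_million(reads, denominator):
        """Reads per million mapped reads (equal to FPKM here, since genes are 1 Kb)."""
        return reads / denominator * 1_000_000

    for gene in ("Act7", "Bglob1"):
        print(gene,
              "naive:", round(per_million(sample1[gene], naive_denom["s1"]), 1),
              "vs", round(per_million(sample2[gene], naive_denom["s2"]), 1),
              "| quartile:", round(per_million(sample1[gene], quartile_denom["s1"]), 1),
              "vs", round(per_million(sample2[gene], quartile_denom["s2"]), 1))

    # Act7   naive: 83.3 vs 500.0       | quartile: 500.0 vs 500.0
    # Bglob1 naive: 833333.3 vs 10.0    | quartile: 5000000.0 vs 10.0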

As you can see, with quartile normalization the 499.995 FPKM value was hardly affected (it went to 500), whereas the 83.3 FPKM value was moved back up to where it should have been, had it not been drowned out by all the reads from Bglob1.