Chapter 3 Transcript expression

In the previous chapters, I showed importing transcript-level data with tximeta, summarizing the data to gene-level counts, and modeling the gene-level counts with DESeq2. Now I return to transcript-level data and demonstrate how we can perform statistical testing on transcripts, that is, across all the isoforms of all the genes. There are two important aspects to consider when performing transcript-level analysis:

Uncertainty - Because the isoforms of a gene often have a considerable amount of sequence similarity resulting from shared exons, and because short read RNA-seq protocols involve generating fragments that do not span the entire transcript, there can be considerable uncertainty in assigning a given fragment to a particular transcript (see next chapter for a discussion of long read protocols). The uncertainty is not constant across the transcripts, and depends on many factors, some inherent to the gene model, such as the size of the alternative exons, and some inherent to the experiment, such as the sequencing depth, fragment length, read length, and technical biases producing non-uniform coverage.
Isoform switching - We may perform testing for differential expression of each transcript (I will perform this analysis in this chapter), or we may also consider testing whether the usage of the isoforms within a gene changes across condition. The latter question is often termed differential transcript usage (DTU), and can be related to differential transcript expression (DTE), but they are not identical questions. For example, if all of the isoforms of a gene increase in their expression across condition with equal fold change, this is an example of differential gene expression (DGE) and DTE, but not DTU, as the proportions of the individual isoforms did not change.

Regarding isoform switching, one reference which explores Bioconductor packages that can be used to detect DTU is the rnaseqDTU Bioconductor workflow (2). This workflow demonstrates optimal filtering techniques (30), how the methods DEXSeq (31) and DRIMSeq (32) can be applied to estimated transcript counts, and how stageR (33) can be utilized to detect which genes and which isoforms contain evidence of DTU while controlling overall error rates.

I will first introduce the experimental data, and then discuss various approaches used to analyze transcript-level data. I will load some processed RNA-seq data from an experiment by (34), a subset of which is available in the macrophage Biconductor package. The experiment involved measuring transcription in macrophage cell lines from a number of human donors, both untreated, as well as treated with IFNg, Salmonella, and IFNg combined with Salmonella. Here I will focus on the samples that were untreated and treated with IFNg. As each cell line was from a human donor, I will also control for a baseline donor effect when comparing across treatment.

The macrophage dataset has paired samples from 6 of the donors (all female), and has been quantified using Salmon. One unique aspect of Salmon is that it allows for GC bias correction at the fragment level during quantification, which is critical for reliable identification of the correct expressed isoform in experiments that have non-uniform coverage along the transcripts (6, 35).

Here I will perform differential transcript expression (DTE) analysis. A key aspect, compared to gene-level analysis, is that there is much more uncertainty in the assignments of fragments to transcripts. A number of statistical methods have been proposed to take this measurement uncertainty into account when performing downstream testing, including BitSeq (36), mmdiff (37), IsoDE (38), and Sleuth (39), the latter which leverages bootstrap quantification estimates from the kallisto (40) quantification method. These methods incorporate measurement uncertainty into parametric models for differential expression where biological variability is also modeled. The exception is IsoDE which compares bootstrap distributions of transcript expression for two samples at a time.

Here, I will use a nonparametric method that takes into account both inferential uncertainty of fragment assignments, as well as biological variability across samples, called Swish (41), which is available in the fishpond Bioconductor package. Swish stands for “SAMseq With Inferential Samples Helps”, as it is based on the existing statistical method for differential gene expression, SAMseq (42). The key idea is to make use of nonparametric testing methods such as the Mann-Whitney Wilcoxon statistic, which operate only on the ranks of the data across samples. The original SAMseq method performed resampling of the counts in order to account for sequencing depth differences. Here, Swish will make use of multiple values in each cell of the count matrix that were computed by the Salmon software, using a technique called Gibbs sampling. For more details on the Gibbs sampling procedure, consult the publication of Salmon (6) and mmseq (43). Finally, the test statistics are averaged over the multiple versions (or "inferential replicates") of the counts matrix, and q-values for false discovery rate estimation are computed via a permutation technique (44).

I begin by locating the files in the macrophage package. As before, this step is not useful for a typical RNA-seq workflow, as the data will not be contained in an R package, but contained in some directory on a server or compute cluster. In lieu of the system.file command below, you should just specify the dir variable to be a path to the files, e.g. /path/to/data/dir.

dir <- system.file("extdata", package="macrophage")
list.files(dir)

## [1] "coldata.csv"                  
## [2] "errs"                         
## [3] "gencode.v29_salmon_0.12.0"    
## [4] "gencode.v29.annotation.gtf.gz"
## [5] "PRJEB18997.txt"               
## [6] "quants"                       
## [7] "supp_table_1.csv"             
## [8] "supp_table_7.csv"

I then read in the sample table, and use dplyr to select certain columns, convert columns into factors, and add a new column pointing to the quantification files.

library(readr)
library(dplyr)
coldata <- read_csv(file.path(dir,"coldata.csv"))

## Parsed with column specification:
## cols(
##   names = col_character(),
##   sample_id = col_character(),
##   line_id = col_character(),
##   replicate = col_double(),
##   condition_name = col_character(),
##   macrophage_harvest = col_character(),
##   salmonella_date = col_character(),
##   ng_ul_mean = col_double(),
##   rna_extraction = col_character(),
##   rna_submit = col_character(),
##   library_pool = col_character(),
##   chemistry = col_character(),
##   rna_auto = col_double()
## )

lvls <- c("naive","IFNg","SL1344","IFNg_SL1344")
coldata <- coldata %>% 
  dplyr::select(names, id=sample_id, line=line_id, 
                condition=condition_name) %>%
  mutate(line=factor(line),
         condition=factor(condition, levels=lvls),
         files=file.path(dir, "quants", names, "quant.sf.gz"))

I will only consider for this demonstration the untreated and IFNg treated samples:

coldata <- coldata %>% filter(condition %in% c("naive","IFNg"))
coldata$condition <- droplevels(coldata$condition)

The coldata sample table now looks like:

head(coldata)

## # A tibble: 6 x 5
##   names     id     line  condition files                         
##   <chr>     <chr>  <fct> <fct>     <chr>                         
## 1 SAMEA103… diku_A diku… naive     /Library/Frameworks/R.framewo…
## 2 SAMEA103… diku_B diku… IFNg      /Library/Frameworks/R.framewo…
## 3 SAMEA103… eiwy_A eiwy… naive     /Library/Frameworks/R.framewo…
## 4 SAMEA103… eiwy_B eiwy… IFNg      /Library/Frameworks/R.framewo…
## 5 SAMEA103… fikt_A fikt… naive     /Library/Frameworks/R.framewo…
## 6 SAMEA103… fikt_B fikt… IFNg      /Library/Frameworks/R.framewo…

Test that all the files exist as I specified:

all(file.exists(coldata$files))

## [1] TRUE

As before, I use tximeta to read in the quantification data. This time I do not set dropInfReps=TRUE, as I will need the inferential replicates created by Salmon to perform DTE with Swish. The inferential replicates allows the analysis to take into account the uncertainty of fragment assignment to transcripts.

library(tximeta)
suppressPackageStartupMessages(library(SummarizedExperiment))
y <- tximeta(coldata)

## importing quantifications

## reading in files with read_tsv

## 1 2 3 4 5 6 7 8 9 10 11 12 
## found matching linked transcriptome:
## [ GENCODE - Homo sapiens - release 29 ]
## loading existing TxDb created: 2020-03-12 01:46:43
## loading existing transcript ranges created: 2020-03-12 01:48:11
## fetching genome info for GENCODE

For speed of the demonstration, I subset to only the transcripts on chromosome 1 (this would not be recommended for a typical analysis).

y <- y[seqnames(y) == "chr1",]

I load the fishpond package, which contains the methods for running Swish. There are three basic steps: scaling of the inferential replicates to make them comparable despite different sequencing depth, filtering out lowly expressed transcripts, and the testing itself. The scaling method by default uses the median ratio method of DESeq (24). The labelKeep function by default will keep those transcripts with 3 or more samples with a count of 10 or higher. For scRNA-seq with UMI de-duplication, it is recommended to lower the minimal count to a lower value such as 3. The minimal number of samples can be increased for experiments with many samples.

library(fishpond)

y <- scaleInfReps(y, quiet=TRUE)
y <- labelKeep(y)
y <- y[mcols(y)$keep,]

Because the method makes use of permutations, it is required to set a seed for computational reproducibility. I specify to test across the condition variable, while controlling for a pairing variable line. The line variable indicates which donor the cell line came from.

set.seed(1)
y <- swish(y, x="condition", pair="line", quiet=TRUE)

## note: less permutations are available than requested

## 64 are available

After running swish, all of the results are stored in the metadata columns (mcols) of the object y. I look to see how many transcripts have a small q-value (analogous to an adjusted p-value, this should provide a set with a nominal FDR control).

names(mcols(y))

##  [1] "tx_id"     "gene_id"   "tx_name"   "log10mean" "keep"     
##  [6] "stat"      "log2FC"    "pvalue"    "locfdr"    "qvalue"

table(mcols(y)$qvalue < .05)

## 
## FALSE  TRUE 
##  5081  1329

One important aspect in testing across many features, in particular where the uncertainty level is so heterogeneous, is to consider if the p-value distribution is roughly uniform, with the exception of the rejected tests. Here Swish provides a roughly uniform distribution, with a spike on the left side representing the rejections of the null hypothesis.

hist(mcols(y)$pvalue, col="grey",
     main="", xlab="p-values")

As with DESeq2 I can make an MA-plot, with the differential transcripts highlighted in blue (here at 5% FDR).

plotMASwish(y, alpha=.05)

I can also examine individual transcripts with evidence of differential expression. As each sample is represented by a distribution of possible estimated counts from Salmon, Swish uses boxplots to represent the differences in expression across samples:

idx <- with(mcols(y), which(pvalue < .05 & log2FC > 4))
plotInfReps(y, idx[1], x="condition", cov="line", xaxis=FALSE)

This chapter gives a basic introduction to DTE using nonparametric testing with the Swish method. For more details on transcript-level analysis, it is recommended to consult the fishpond Bioconductor package vignette, or the rnaseqDTU workflow on Bioconductor (2).

References

2. Love M, Soneson C, Patro R. 2018. Swimming downstream: Statistical analysis of differential transcript usage following salmon quantification. F1000Research. 7(952):

6. Patro R, Duggal G, Love M, Irizarry R, Kingsford C. 2017. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods. 14:417–19

24. Anders S, Huber W. 2010. Differential expression analysis for sequence count data. Genome Biology. 11:R106

30. Soneson C, Matthes KL, Nowicka M, Law CW, Robinson MD. 2016. Isoform prefiltering improves performance of count-based methods for analysis of differential transcript usage. Genome Biology. 17(1):12

31. Anders S, Reyes A, Huber W. 2012. Detecting differential usage of exons from RNA-seq data. Genome Research. 22(10):2008–17

32. Nowicka M, Robinson M. 2016. DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics. F1000Research. 5(1356):

33. Van den Berge K, Soneson C, Robinson MD, Clement L. 2017. stageR: a general stage-wise method for controlling the gene-level false discovery rate in differential expression and differential transcript usage. Genome Biology. 18(1):151

34. Alasoo K, Rodrigues J, Mukhopadhyay S, Knights A, Mann A, et al. 2018. Shared genetic effects on chromatin and gene expression indicate a role for enhancer priming in immune response. Nature Genetics. 50:424–31

35. Love MI, Hogenesch JB, Irizarry RA. 2016. Modeling of rna-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. Nature Biotechnology. 34(12):1287–91

36. Glaus P, Honkela A, Rattray M. 2012. Identifying differentially expressed transcripts from RNA-seq data with biological variation. Bioinformatics. 28(13):

37. Turro E, Astle WJ, Tavaré S. 2013. Flexible analysis of RNA-seq data using mixed effects models. Bioinformatics. 30(2):180–88

38. Al Seesi S, Temate-Tiagueu Y, Zelikovsky A, Măndoiu II. 2014. Bootstrap-based differential gene expression analysis for RNA-Seq data with and without replicates. BMC Genomics. 15(Suppl 8):

39. Pimentel H, Bray NL, Puente S, Melsted P, Pachter L. 2017. Differential analysis of RNA-seq incorporating quantification uncertainty. Nature Methods. 14(7):687–90

40. Bray NL, Pimentel H, Melsted P, Pachter L. 2016. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology. 34(5):525

41. Zhu A, Srivastava A, Ibrahim J, Patro R, Love M. 2019. Nonparametric expression analysis using inferential replicate counts. Nucleic Acids Research

42. Li J, Tibshirani R. 2011. Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-Seq data. Statistical Methods in Medical Research. 22(5):519–36

43. Turro E, Su S-Y, Gonçalves Â, Coin LJ, Richardson S, Lewin A. 2011. Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Genome Biology. 12(2):R13

44. Storey J, Tibshirani R. 2003. Statistical significance for genome-wide experiments. Proceedings of the National Academy of Sciences. 100(16):9440–5