Chapter 1 Quantification

One key initial step in analyzing RNA-seq data is to quantify, or estimate, the number of fragments in the experiment that can be assigned to each feature, whether a gene or a transcript (an isoform of a gene). Alongside quantification, it is strongly recommended to perform quality control (QC) checks on the sequence files. Reports spanning multiple samples can be generated with the MultiQC software (4). Here I will not start with quality control checks, but instead move straight to quantification and import of estimated counts into R, although an example quality control report can be found in the airway2 package on GitHub: MultiQC report.

The data I will examine first in this section is from an experiment of the effect of knocking down OCT4 and BRG1 in mouse embryos (5). In particular, I will examine the treatment effect of knocking down OCT4. The experiment had four groups of samples, each replicated in triplicate. The RNA-seq data from the experiment is available in the oct4 Bioconductor package.

I will use the Salmon software for estimating transcript abundance (6). Briefly, Salmon uses the sequenced reads and the reference transcripts, and constructs a generative model for the observed data, which includes modeling of various technical biases that are commonly observed in RNA-seq. Salmon then outputs the estimated counts for each transcript, and an “effective length” of the transcript, which is shorter than the full length of the transcript if it was biased to having fewer reads due to technical artifact, and longer than the full length of the transcript if it was biased to having more reads. For details about the Salmon method and software, refer to the Salmon website.

All 12 samples from the experiment were quantified using Salmon. The jobs for processing the reads with Salmon were executed via Snakemake (7), a convenient tool for scheduling and executing repetitive rule-based operations on input data. The exact lines of code can be seen in the Snakefile.

Most of the packages shown in this online book live within the Bioconductor project (8). Bioconductor objects are more complex than basic objects in R, for example numeric, character, or other simple objects, in that they often have attached metadata, such as additional information about rows and columns, or other metadata. We will see how to make use of the metadata throughout the various sections by examining, e.g. colData for information about the columns of a matrix, or mcols for metadata columns.

I begin by loading some data in the Bioconductor package oct4. Note that this step is not useful for a typical RNA-seq workflow, as the data will not be contained in an R package, but contained in some directory on a server or compute cluster. So in lieu of the system.file command below, which is used here to locate a file within an R package, you could just specify the dir variable to be a path to the files, e.g. /path/to/data/dir. Note also that this and other packages must be installed via the Bioconductor package installation instructions.

## [1] "coldata.csv"     "list"            "quants"         
## [4] "SraRunTable.txt"

I will use the readr and dplyr packages to read in a CSV file with information about the samples. Because we are typically working with “tall” count matrices, with the rows representing genomic features such as genes or transcripts, and columns representing samples, the sample information is referred to as the column data or coldata. For more information on dplyr, see the excellent dplyr online documentation.

## Parsed with column specification:
## cols(
##   names = col_character(),
##   line = col_character(),
##   condition = col_character()
## )

I next set the levels of the line and condition factor variables so that OCT4 and untrt are the reference levels, and I specify the path to the quantification files I want to read in.

In the last step of the mutate call, note I specify paths to the quantification files quant.sf.gz. Typically, these files are not gzipped, but I have compressed the quant.sf files to reduce the size of the data package. Note also that, although I point only to the quant.sf files, the entire directory containing that file is required for proper import of the quantification data. There are other metadata files that provide important information about the experiment, including uncertainty information. The metadata files enable automatic identification of the feature set, a computational reproducibility feature that will be described later in this section.

All the files exist at the locations I specified, which in this case have the pattern /<DIR>/quants/<NAMES>/quant.sf.gz.

## [1] TRUE

I will now use the Bioconductor package tximeta (9) to read in the quantification data, and create a SummarizedExperiment object. The SummarizedExperiment class is described in (8), in particular diagrammed in this figure. Here, the tximeta function performs a number of operations on behalf of the user, identifying the reference transcripts that were used to quantify the RNA-seq reads, automatically downloading (or loading) the relevant genomic locations of the transcripts, and attaching the relevant genomic context (the genome version and chromosome names and lengths). tximeta can perform these operations by leveraging the hash signature of the sequence of the transcripts, which is stored when processing bulk RNA-seq with Salmon, or when processing single cell RNA-seq with the alevin software (10) (distributed with Salmon).

A small note: I specify dropInfReps=TRUE, because I will be performing gene-level analysis, and I won’t make use of the uncertainty information in this first analysis. I will show in a later section how to make use of the inferential replicates computed by Salmon during quantification.

## importing quantifications
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 11 12 
## found matching transcriptome:
## [ GENCODE - Mus musculus - release M20 ]
## loading existing TxDb created: 2020-03-25 23:26:23
## generating transcript ranges
## fetching genome info for GENCODE

In the above output, we can see that tximeta identified that the mouse reference transcripts, release M20, were used from GENCODE (11). This information would be automatically identified, even if the group that performed the quantification did not document this when uploading processed expression data to a public repository.

We can see the matrices of data that have been compiled, and examine the genomic locations of the features (rows):

## [1] "counts"    "abundance" "length"
## GRanges object with 137271 ranges and 3 metadata columns:
##                        seqnames          ranges strand |
##                           <Rle>       <IRanges>  <Rle> |
##   ENSMUST00000193812.1     chr1 3073253-3074322      + |
##   ENSMUST00000082908.1     chr1 3102016-3102125      + |
##   ENSMUST00000162897.1     chr1 3205901-3216344      - |
##   ENSMUST00000159265.1     chr1 3206523-3215632      - |
##   ENSMUST00000070533.4     chr1 3214482-3671498      - |
##                    ...      ...             ...    ... .
##   ENSMUST00000082419.1     chrM     13552-14070      - |
##   ENSMUST00000082420.1     chrM     14071-14139      - |
##   ENSMUST00000082421.1     chrM     14145-15288      + |
##   ENSMUST00000082422.1     chrM     15289-15355      + |
##   ENSMUST00000082423.1     chrM     15356-15422      - |
##                            tx_id              gene_id
##                        <integer>      <CharacterList>
##   ENSMUST00000193812.1         1 ENSMUSG00000102693.1
##   ENSMUST00000082908.1         2 ENSMUSG00000064842.1
##   ENSMUST00000162897.1      4203 ENSMUSG00000051951.5
##   ENSMUST00000159265.1      4204 ENSMUSG00000051951.5
##   ENSMUST00000070533.4      4205 ENSMUSG00000051951.5
##                    ...       ...                  ...
##   ENSMUST00000082419.1    138833 ENSMUSG00000064368.1
##   ENSMUST00000082420.1    138834 ENSMUSG00000064369.1
##   ENSMUST00000082421.1    138825 ENSMUSG00000064370.1
##   ENSMUST00000082422.1    138826 ENSMUSG00000064371.1
##   ENSMUST00000082423.1    138835 ENSMUSG00000064372.1
##                                     tx_name
##                                 <character>
##   ENSMUST00000193812.1 ENSMUST00000193812.1
##   ENSMUST00000082908.1 ENSMUST00000082908.1
##   ENSMUST00000162897.1 ENSMUST00000162897.1
##   ENSMUST00000159265.1 ENSMUST00000159265.1
##   ENSMUST00000070533.4 ENSMUST00000070533.4
##                    ...                  ...
##   ENSMUST00000082419.1 ENSMUST00000082419.1
##   ENSMUST00000082420.1 ENSMUST00000082420.1
##   ENSMUST00000082421.1 ENSMUST00000082421.1
##   ENSMUST00000082422.1 ENSMUST00000082422.1
##   ENSMUST00000082423.1 ENSMUST00000082423.1
##   -------
##   seqinfo: 22 sequences (1 circular) from mm10 genome

Because the tximeta function has identified the correct reference transcripts that were used for quantification, and stored this as metadata in the SummarizedExperiment object, I can obtain additional information, such as the correspondence of transcripts to genes. I can then perform a summarization task, without having to look up this information manually.

## loading existing TxDb created: 2020-03-25 23:26:23
## obtaining transcript-to-gene mapping from database
## loading existing gene ranges created: 2020-03-25 23:26:26
## summarizing abundance
## summarizing counts
## summarizing length

Gene-level summarization of transcript-level quantification data is described in the paper for tximport (12). The tximeta function internally calls methods from the tximport package during its operation. Note that tximport provides similar functionality to tximeta, but instead of returning a rich Bioconductor object, tximport returns a simple list of matrices.

Below I show some examples of operations that can be easily performed because I have a SummarizedExperiment object with the appropriate gene ranges attached. I can easily subset the genes (rows) based on overlaps with a given genomic range using standard square bracket indexing. The range-based class system and set of methods for operating on genomic ranges is provided by the GenomicRanges package, which has useful help pages and vignettes (13).

## GRanges object with 53697 ranges and 1 metadata column:
##                         seqnames              ranges strand |
##                            <Rle>           <IRanges>  <Rle> |
##    ENSMUSG00000000001.4     chr3 108107280-108146146      - |
##   ENSMUSG00000000003.15     chrX   77837901-77853623      - |
##   ENSMUSG00000000028.15    chr16   18780447-18811987      - |
##   ENSMUSG00000000031.16     chr7 142575529-142578143      - |
##   ENSMUSG00000000037.16     chrX 161117193-161258213      + |
##                     ...      ...                 ...    ... .
##    ENSMUSG00000117651.1    chr17   32731010-32731806      + |
##    ENSMUSG00000117652.1    chr18     6910459-6936621      - |
##    ENSMUSG00000117653.1    chr17   94811611-94812922      - |
##    ENSMUSG00000117654.1    chr17   32863662-32877942      - |
##    ENSMUSG00000117655.1    chr19   57497390-57512784      + |
##                                       gene_id
##                                   <character>
##    ENSMUSG00000000001.4  ENSMUSG00000000001.4
##   ENSMUSG00000000003.15 ENSMUSG00000000003.15
##   ENSMUSG00000000028.15 ENSMUSG00000000028.15
##   ENSMUSG00000000031.16 ENSMUSG00000000031.16
##   ENSMUSG00000000037.16 ENSMUSG00000000037.16
##                     ...                   ...
##    ENSMUSG00000117651.1  ENSMUSG00000117651.1
##    ENSMUSG00000117652.1  ENSMUSG00000117652.1
##    ENSMUSG00000117653.1  ENSMUSG00000117653.1
##    ENSMUSG00000117654.1  ENSMUSG00000117654.1
##    ENSMUSG00000117655.1  ENSMUSG00000117655.1
##   -------
##   seqinfo: 22 sequences (1 circular) from mm10 genome
## class: RangedSummarizedExperiment 
## dim: 18 12 
## metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
## assays(3): counts abundance length
## rownames(18): ENSMUSG00000025916.10 ENSMUSG00000025917.9
##   ... ENSMUSG00000103448.1 ENSMUSG00000103810.1
## rowData names(1): gene_id
## colnames(12): SRX2236945 SRX2236946 ... SRX2236955
##   SRX2236956
## colData names(3): names line condition

I can likewise easily subset the samples by referring to the colData columns, e.g. colData(gse)$line. A shortcut for referring to colData columns is to just use the dollar sign directly on the object, gse$line.

## class: RangedSummarizedExperiment 
## dim: 53697 6 
## metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
## assays(3): counts abundance length
## rownames(53697): ENSMUSG00000000001.4
##   ENSMUSG00000000003.15 ... ENSMUSG00000117654.1
##   ENSMUSG00000117655.1
## rowData names(1): gene_id
## colnames(6): SRX2236945 SRX2236946 ... SRX2236949
##   SRX2236950
## colData names(3): names line condition

Finally, I demonstrate that it is easy to add alternative identifiers, making use of the organism annotation packages in Bioconductor. For example, to add gene symbols, I can use tximeta’s addIds function, and specify the SYMBOL column.

## mapping to new IDs using 'org.Mm.eg.db' data package
## if all matching IDs are desired, and '1:many mappings' are reported,
## set multiVals='list' to obtain all the matching IDs
## it appears the rows are gene IDs, setting 'gene' to TRUE
## 'select()' returned 1:many mapping between keys and
## columns

To check all the available columns for an organism package, use the columns function:

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MGI"          "ONTOLOGY"     "ONTOLOGYALL" 
## [17] "PATH"         "PFAM"         "PMID"         "PROSITE"     
## [21] "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"

References

4. Ewels P, Magnusson M, Lundin S, Käller M. 2016. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 32(19):3047–8

5. King HW, Klose RJ. 2017. The pioneer factor oct4 requires the chromatin remodeller brg1 to support gene regulatory element function in mouse embryonic stem cells. eLife. 6:e22631

6. Patro R, Duggal G, Love M, Irizarry R, Kingsford C. 2017. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods. 14:417–19

7. Köster J, Rahmann S. 2012. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics. 28(19):2520–2

8. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, et al. 2015. Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods. 12(2):115–21

9. Love MI, Soneson C, Hickey PF, Johnson LK, Pierce NT, et al. 2019. Tximeta: reference sequence checksums for provenance identification in RNA-seq. bioRxiv

10. Srivastava A, Malik L, Smith TS, Sudbery I, Patro R. 2019. Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biology. 20(65):

11. Frankish A, GENCODE-consoritum, Flicek P. 2018. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Research

12. Soneson C, Love MI, Robinson M. 2015. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research. 4(1521):

13. Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, et al. 2013. Software for Computing and Annotating Genomic Ranges. PLoS Computational Biology. 9(8):e1003118+