Chapter 1 Quantification
One key initial step in analyzing RNA-seq data is to quantify, or estimate, the number of fragments in the experiment that can be assigned to each feature, whether a gene or a transcript (an isoform of a gene). Alongside quantification, it is strongly recommended to perform quality control (QC) checks on the sequence files. Reports spanning multiple samples can be generated with the MultiQC software (4). Here I will not start with quality control checks, but instead move straight to quantification and import of estimated counts into R, although an example quality control report can be found in the airway2 package on GitHub: MultiQC report.
The data I will examine first in this section is from an experiment of the effect of knocking down OCT4 and BRG1 in mouse embryos (5). In particular, I will examine the treatment effect of knocking down OCT4. The experiment had four groups of samples, each replicated in triplicate. The RNA-seq data from the experiment is available in the oct4 Bioconductor package.
I will use the Salmon software for estimating transcript abundance (6). Briefly, Salmon uses the sequenced reads and the reference transcripts, and constructs a generative model for the observed data, which includes modeling of various technical biases that are commonly observed in RNA-seq. Salmon then outputs the estimated counts for each transcript, and an “effective length” of the transcript, which is shorter than the full length of the transcript if it was biased to having fewer reads due to technical artifact, and longer than the full length of the transcript if it was biased to having more reads. For details about the Salmon method and software, refer to the Salmon website.
All 12 samples from the experiment were quantified using Salmon. The jobs for processing the reads with Salmon were executed via Snakemake (7), a convenient tool for scheduling and executing repetitive rule-based operations on input data. The exact lines of code can be seen in the Snakefile.
Most of the packages shown in this online book live within the Bioconductor
project (8). Bioconductor objects are more complex than
basic objects in R, for example numeric, character, or other
simple objects, in that they often have attached metadata, such as
additional information about rows and columns, or other metadata.
We will see how to make use of the metadata throughout the various
sections by examining, e.g. colData
for information about the columns
of a matrix, or mcols
for metadata columns.
I begin by loading some data in the Bioconductor package oct4.
Note that this step is not useful for a typical RNA-seq workflow,
as the data will not be contained in an R package, but contained
in some directory on a server or compute cluster. So in lieu
of the system.file
command below, which is used here to locate
a file within an R package, you could just specify the dir
variable to be a path to the files, e.g. /path/to/data/dir
.
Note also that this and other packages must be installed via the
Bioconductor package installation instructions.
## [1] "coldata.csv" "list" "quants"
## [4] "SraRunTable.txt"
I will use the readr and dplyr packages to read in a CSV
file with information about the samples. Because we are typically
working with “tall” count matrices, with the rows representing
genomic features such as genes or transcripts, and columns
representing samples, the sample information is referred to
as the column data or coldata
. For more information on
dplyr, see the excellent
dplyr online documentation.
## Parsed with column specification:
## cols(
## names = col_character(),
## line = col_character(),
## condition = col_character()
## )
I next set the levels
of the line
and condition
factor variables
so that OCT4
and untrt
are the reference levels, and I specify
the path to the quantification files I want to read in.
In the last step of the mutate
call, note I specify paths to the
quantification files quant.sf.gz
. Typically, these files are not
gzipped, but I have compressed the quant.sf
files to reduce the
size of the data package. Note also that, although I point only to the
quant.sf
files, the entire directory containing that file
is required for proper import of the quantification data.
There are other metadata files that provide important information
about the experiment, including uncertainty information. The
metadata files enable automatic identification of the feature set,
a computational reproducibility feature that will be described later
in this section.
suppressPackageStartupMessages(library(dplyr))
coldata <- coldata %>%
mutate(line=factor(line, levels=c("OCT4","BRG1")),
condition=factor(condition, levels=c("untrt","trt")),
files=file.path(dir, "quants", names, "quant.sf.gz"))
All the files exist at the locations I specified, which in this case
have the pattern /<DIR>/quants/<NAMES>/quant.sf.gz
.
## [1] TRUE
I will now use the Bioconductor package tximeta (9) to
read in the quantification data, and create a SummarizedExperiment
object. The SummarizedExperiment class is described in (8),
in particular diagrammed in
this figure.
Here, the tximeta
function performs a number of operations on
behalf of the user, identifying the reference transcripts that were used
to quantify the RNA-seq reads, automatically downloading (or loading)
the relevant genomic locations of the transcripts, and attaching the
relevant genomic context (the genome version and chromosome names and
lengths). tximeta can perform these operations by leveraging the
hash signature of the sequence of the transcripts, which is stored
when processing bulk RNA-seq with Salmon, or when processing single
cell RNA-seq with the alevin software (10)
(distributed with Salmon).
A small note: I specify dropInfReps=TRUE
, because I will be
performing gene-level analysis, and I won’t make use of the
uncertainty information in this first analysis. I will show
in a later section how to make use of the inferential replicates
computed by Salmon during quantification.
## importing quantifications
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 11 12
## found matching transcriptome:
## [ GENCODE - Mus musculus - release M20 ]
## loading existing TxDb created: 2020-03-25 23:26:23
## generating transcript ranges
## fetching genome info for GENCODE
In the above output, we can see that tximeta
identified that
the mouse reference transcripts, release M20, were used from
GENCODE (11). This information would be automatically
identified, even if the group that performed the quantification
did not document this when uploading processed expression data to a
public repository.
We can see the matrices of data that have been compiled, and examine the genomic locations of the features (rows):
## [1] "counts" "abundance" "length"
## GRanges object with 137271 ranges and 3 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## ENSMUST00000193812.1 chr1 3073253-3074322 + |
## ENSMUST00000082908.1 chr1 3102016-3102125 + |
## ENSMUST00000162897.1 chr1 3205901-3216344 - |
## ENSMUST00000159265.1 chr1 3206523-3215632 - |
## ENSMUST00000070533.4 chr1 3214482-3671498 - |
## ... ... ... ... .
## ENSMUST00000082419.1 chrM 13552-14070 - |
## ENSMUST00000082420.1 chrM 14071-14139 - |
## ENSMUST00000082421.1 chrM 14145-15288 + |
## ENSMUST00000082422.1 chrM 15289-15355 + |
## ENSMUST00000082423.1 chrM 15356-15422 - |
## tx_id gene_id
## <integer> <CharacterList>
## ENSMUST00000193812.1 1 ENSMUSG00000102693.1
## ENSMUST00000082908.1 2 ENSMUSG00000064842.1
## ENSMUST00000162897.1 4203 ENSMUSG00000051951.5
## ENSMUST00000159265.1 4204 ENSMUSG00000051951.5
## ENSMUST00000070533.4 4205 ENSMUSG00000051951.5
## ... ... ...
## ENSMUST00000082419.1 138833 ENSMUSG00000064368.1
## ENSMUST00000082420.1 138834 ENSMUSG00000064369.1
## ENSMUST00000082421.1 138825 ENSMUSG00000064370.1
## ENSMUST00000082422.1 138826 ENSMUSG00000064371.1
## ENSMUST00000082423.1 138835 ENSMUSG00000064372.1
## tx_name
## <character>
## ENSMUST00000193812.1 ENSMUST00000193812.1
## ENSMUST00000082908.1 ENSMUST00000082908.1
## ENSMUST00000162897.1 ENSMUST00000162897.1
## ENSMUST00000159265.1 ENSMUST00000159265.1
## ENSMUST00000070533.4 ENSMUST00000070533.4
## ... ...
## ENSMUST00000082419.1 ENSMUST00000082419.1
## ENSMUST00000082420.1 ENSMUST00000082420.1
## ENSMUST00000082421.1 ENSMUST00000082421.1
## ENSMUST00000082422.1 ENSMUST00000082422.1
## ENSMUST00000082423.1 ENSMUST00000082423.1
## -------
## seqinfo: 22 sequences (1 circular) from mm10 genome
Because the tximeta
function has identified the correct
reference transcripts that were used for quantification,
and stored this as metadata in the SummarizedExperiment
object, I can obtain additional information, such as the
correspondence of transcripts to genes. I can then perform a
summarization task, without having to look up this information
manually.
## loading existing TxDb created: 2020-03-25 23:26:23
## obtaining transcript-to-gene mapping from database
## loading existing gene ranges created: 2020-03-25 23:26:26
## summarizing abundance
## summarizing counts
## summarizing length
Gene-level summarization of transcript-level quantification data
is described in the paper for tximport (12).
The tximeta
function internally calls methods from the
tximport package during its operation. Note that tximport
provides similar functionality to tximeta, but instead of
returning a rich Bioconductor object, tximport returns
a simple list of matrices.
Below I show some examples of operations that can be easily performed because I have a SummarizedExperiment object with the appropriate gene ranges attached. I can easily subset the genes (rows) based on overlaps with a given genomic range using standard square bracket indexing. The range-based class system and set of methods for operating on genomic ranges is provided by the GenomicRanges package, which has useful help pages and vignettes (13).
## GRanges object with 53697 ranges and 1 metadata column:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## ENSMUSG00000000001.4 chr3 108107280-108146146 - |
## ENSMUSG00000000003.15 chrX 77837901-77853623 - |
## ENSMUSG00000000028.15 chr16 18780447-18811987 - |
## ENSMUSG00000000031.16 chr7 142575529-142578143 - |
## ENSMUSG00000000037.16 chrX 161117193-161258213 + |
## ... ... ... ... .
## ENSMUSG00000117651.1 chr17 32731010-32731806 + |
## ENSMUSG00000117652.1 chr18 6910459-6936621 - |
## ENSMUSG00000117653.1 chr17 94811611-94812922 - |
## ENSMUSG00000117654.1 chr17 32863662-32877942 - |
## ENSMUSG00000117655.1 chr19 57497390-57512784 + |
## gene_id
## <character>
## ENSMUSG00000000001.4 ENSMUSG00000000001.4
## ENSMUSG00000000003.15 ENSMUSG00000000003.15
## ENSMUSG00000000028.15 ENSMUSG00000000028.15
## ENSMUSG00000000031.16 ENSMUSG00000000031.16
## ENSMUSG00000000037.16 ENSMUSG00000000037.16
## ... ...
## ENSMUSG00000117651.1 ENSMUSG00000117651.1
## ENSMUSG00000117652.1 ENSMUSG00000117652.1
## ENSMUSG00000117653.1 ENSMUSG00000117653.1
## ENSMUSG00000117654.1 ENSMUSG00000117654.1
## ENSMUSG00000117655.1 ENSMUSG00000117655.1
## -------
## seqinfo: 22 sequences (1 circular) from mm10 genome
## class: RangedSummarizedExperiment
## dim: 18 12
## metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
## assays(3): counts abundance length
## rownames(18): ENSMUSG00000025916.10 ENSMUSG00000025917.9
## ... ENSMUSG00000103448.1 ENSMUSG00000103810.1
## rowData names(1): gene_id
## colnames(12): SRX2236945 SRX2236946 ... SRX2236955
## SRX2236956
## colData names(3): names line condition
I can likewise easily subset the samples by referring to
the colData
columns, e.g. colData(gse)$line
.
A shortcut for referring to colData
columns is to just use
the dollar sign directly on the object, gse$line
.
## class: RangedSummarizedExperiment
## dim: 53697 6
## metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
## assays(3): counts abundance length
## rownames(53697): ENSMUSG00000000001.4
## ENSMUSG00000000003.15 ... ENSMUSG00000117654.1
## ENSMUSG00000117655.1
## rowData names(1): gene_id
## colnames(6): SRX2236945 SRX2236946 ... SRX2236949
## SRX2236950
## colData names(3): names line condition
Finally, I demonstrate that it is easy to add alternative
identifiers, making use of the organism annotation packages
in Bioconductor. For example, to add gene symbols, I can
use tximeta’s addIds
function, and specify the SYMBOL
column.
## mapping to new IDs using 'org.Mm.eg.db' data package
## if all matching IDs are desired, and '1:many mappings' are reported,
## set multiVals='list' to obtain all the matching IDs
## it appears the rows are gene IDs, setting 'gene' to TRUE
## 'select()' returned 1:many mapping between keys and
## columns
To check all the available columns for an organism package,
use the columns
function:
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
## [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
## [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
## [13] "IPI" "MGI" "ONTOLOGY" "ONTOLOGYALL"
## [17] "PATH" "PFAM" "PMID" "PROSITE"
## [21] "REFSEQ" "SYMBOL" "UNIGENE" "UNIPROT"
References
4. Ewels P, Magnusson M, Lundin S, Käller M. 2016. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 32(19):3047–8
5. King HW, Klose RJ. 2017. The pioneer factor oct4 requires the chromatin remodeller brg1 to support gene regulatory element function in mouse embryonic stem cells. eLife. 6:e22631
6. Patro R, Duggal G, Love M, Irizarry R, Kingsford C. 2017. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods. 14:417–19
7. Köster J, Rahmann S. 2012. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics. 28(19):2520–2
8. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, et al. 2015. Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods. 12(2):115–21
9. Love MI, Soneson C, Hickey PF, Johnson LK, Pierce NT, et al. 2019. Tximeta: reference sequence checksums for provenance identification in RNA-seq. bioRxiv
10. Srivastava A, Malik L, Smith TS, Sudbery I, Patro R. 2019. Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biology. 20(65):
11. Frankish A, GENCODE-consoritum, Flicek P. 2018. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Research
12. Soneson C, Love MI, Robinson M. 2015. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research. 4(1521):
13. Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, et al. 2013. Software for Computing and Annotating Genomic Ranges. PLoS Computational Biology. 9(8):e1003118+