Adaptive priors for estimating effect size with genomic data

Michael Love Dept. of Biostatistics Dept. of Genetics UNC-Chapel Hill

Work in collaboration with

  • Anqi Zhu (UNC-CH)
  • Joseph Ibrahim (UNC-CH)

Aims and activity of the Love lab

  • Software and workflows for genomic data science
  • Statistical method development
  • Collaborations: Genetics, Biology, CS, Statistics

We often begin with plots

Crucial for high dimensional analysis:

  • effect size over mean (“MA” plot)
  • SD over mean
  • systematic variation (boxplots, PCA)

Gene expression

  • Many collaborations around gene expression:
    • neurons (autism, schizophrenia) *
    • adipose (T2D) *
    • macrophage (arthritis) *
    • colon (IBD) *
    • breast tumor
    • airway (HIV)
  • Often, a counting technology
    • Sequencing of cDNA fragments
    • Nanostring

* also interested in chromatin accessibility or conformation

Log fold change

Expression of \(G\) genes, 5 replicates in 2 groups:

\[[X_{g1}, \dots, X_{g5}] \quad \textrm{vs} \quad [Y_{g1}, \dots, Y_{g5}]\]

\[ E(X_{gi}) = \mu_{gX} \]

\[ E(Y_{gi}) = \mu_{gY} \]

\[ \beta_g \equiv \log_2 \left( \frac{\mu_{gY}}{\mu_{gX}} \right) \]

Complexities

  • Counts \(X_{gi}\) and \(Y_{gi}\) are actually estimated [1-3]
  • Technical artifacts: per-sample scaling factor / offset [4-5]
  • Batches and PCR amplification effects
  • Dispersion parameter must be estimated [6-8]
  • Zero component for highly amplified data [9]

References
[1] Issue of DGE (Trapnell et al., 2013)
[2] Bias models (Patro, Duggal, Love, Irizarry, & Kingsford, 2017),
[3] Isoform offset (Soneson, Love, & Robinson, 2015),
[4] TMM (M. D. Robinson & Oshlack, 2010),
[5] Median ratio (Anders & Huber, 2010),
[6] edgeR GLM (McCarthy, Chen, & Smyth, 2012),
[7] DSS (H. Wu, Wang, & Wu, 2012),
[8] DESeq2 (Love, Huber, & Anders, 2014),
[9] ZI weights (Van den Berge et al., 2018)

Effect size in simulated data

  • Simulation of over-dispersed count data
  • 2500 “genes”, 5 vs 5 replicates

Effect size over mean (“MA”)

Estimate over truth

Faceted by mean

Pseudocount 1

Pseudocount 5

Normal prior

Pseudocount 5

Normal prior

What do we see


With prior:

  • Reduced variance
  • Genes can be ranked
  • Moderate bias

Effect size estimation

  • Perhaps we can do better than pseudocount
  • Why bother? Just use p-values

A distribution of effect sizes

  • Highly replicated yeast dataset (Schurch et al., 2016)
  • 42 vs 44 biological replicates!
  • \(\Delta snf2\) mutant
    • Catalytic component of the SWI/SNF complex
    • Chromatin-remodeling
    • Regulation of a large number of genes

LFC in highly replicated yeast

LFC with 2x SE

(1700 black, 5100 red)

Consider utility of p-values

  • Null likely false for (nearly) all genes in this example
  • Ranking by prob under the null seems non-obvious
  • We can estimate the effect size for small or large n

Normal prior in DESeq2

  • MLE coefficients, \(\hat{\beta}_g\)
  • Adapt prior to these
  • Normal has thin tails
  • Match Normal to upper 5% weighted quantile
  • Inverse weight by expected variance of log counts

(Love et al., 2014)

Other priors with heavy tails

  • ashr (Stephens, 2016)
    • \(\pi_0 \delta_0 + \sum_{k=1}^K \pi_k N(\cdot; 0, \sigma_k^2)\)
    • \(\sigma_1, \dots, \sigma_k\) a large, dense, fixed grid
  • apeglm (Zhu, Ibrahim, & Love, 2018)
    • Cauchy prior with scale parameter \(S\)

Method details

  • ashr:
    • set \(\sigma_k\) grid based on min \(\hat{s}_g\) and max \(\hat{\beta}_g^2 - \hat{s}_g^2\)
    • estimate \(\pi_k\) using \(\hat{\beta}_g\) and \(\hat{s}_g\)
    • \(P(\hat{\beta}_g | \beta_g, \hat{s}_g) = N(\hat{\beta}_g; \beta_g, \hat{s}_g^2)\)
    • return the posterior mean, SD
  • apeglm:
    • set the scale \(S\) of Cauchy using \(\hat{\beta}_g\) and \(\hat{s}_g\)
    • use data \(Y_g\), parameters \(\theta_g\), likelihood (NB, ZINB, etc.)
    • return the posterior mode, SD

Connection to bayesglm

  • Fixed 2.5 scale Cauchy prior
  • Posterior mode and SE via an approximate EM within IRLS

(Gelman, Jakulin, Pittau, & Su, 2008)

Connection to bayesglm

  • want something better than the unstable estimates produced by the current default - maximum likelihood
  • [Cauchy] allows for occasional large coefficients while still performing a reasonable amount of shrinkage for coefficients near zero

(Gelman et al., 2008)

apeglm implementation

Approximate Posterior Estimation for GLM

  • Prior fit to \(\hat{\beta}_g\) and \(\hat{s}_g\)
  • C++ version of L-BFGS-B to obtain posterior mode
    • <1 s for 10k genes
  • optim to obtain Hessian for Laplace approximation
    • ~3 s

Solve for \(A\) in equation of the form (Efron & Morris, 1975)

\(A = f(A, \hat{\beta}_g^2, e_g^2)\)

Motivated by the following hierarchical model:

\[ \hat{\beta}_g \sim N(\beta_g, e_g^2) \]

\[ \beta_g \sim N(0, A) \]

Two approximations to note:

  1. We plug in \(\hat{s}_g^2\) for \(e_g^2\)
  2. Estimate scale of a Normal then use a Cauchy

False Sign Rate (FSR)

  • p-value*
  • adjusted p-value
  • q-value
  • local false discovery rate
  • local false sign rate*
  • s-value

Prior scale & False Sign Rate (FSR)

Prior scale & estimation error

Prior scale & estimation error

Return to yeast example

3 vs 3 replicates

Yeast with Normal prior

Yeast with apeglm

Yeast with ashr

What else are we doing

Simplify statistical routines

# import the data
se <- tximeta(samples)
gse <- summarizeToGene(se)

# build dataset
dds <- DESeqDataSet(gse, ~batch + condition)

# variance stabilization
vsd <- vst(dds)

# size factors, dispersion estimation
dds <- DESeq(dds)

# shrinkage estimation
res <- lfcShrink(dds, coef=3)

Paper and software

  • apeglm preprint on bioRxiv (accepted at Bioinformatics)
  • lfcShrink(dds, coef=2, type="apeglm")
  • lfcShrink(dds, coef=2, type="ashr")

Acknowledgments

  • Work by Anqi Zhu and Joseph Ibrahim
  • Comments from Wolfgang Huber and Cecile Le Sueur
  • Funding:
    • MIL - R01 HG009125, P01 CA142538, P30 ES010126
    • JGI and AZ - R01 GM070335 and P01 CA142538

Previous work on LFC shrinkage in RNA-seq

  • ShrinkBayes (Mark A. van de Wiel et al., 2012; Mark A van de Wiel, Neerincx, Buffart, Sie, & Verheul, 2014)
    • shrinkage of many parameters simultaneously
    • non-equal mixture proportions for + and - effects
    • spike + parametric, spike + non-parametric
    • Integrated Nested Laplace Approximation INLA

Connection to dispersion prior

  • DSS - model log dispersion as Normal. Method of moments estimator and simulation to define hyper-parameters
  • DESeq2 - model log dispersion as Normal. Cox-Reid estimator and closed form to define hyper-parameters
  • BASiCS - model log dispersion with Student’s t

\[ \log(\delta_g) \sim f(\mu_g) + \varepsilon, \quad \varepsilon \sim t_\nu(0,\sigma^2) \]

here we opt for a Student-t distribution as it leads to inference that is more robust to the presence of outlier genes

Connection to bayesglm

Cauchy can be modeled as a mixture of Normals:

\[ \beta \sim N(0, \sigma^2) \]

\[ \sigma^2 \sim \textrm{Scale-inv-} \chi^2(\nu, S^2) \]

  • \(\nu=1\) and \(S=2.5\) are fixed
  • Within IRLS, perform an EM step
  • Treat \(\beta\) as missing data, estimate \(\sigma\)

(Gelman et al., 2008)

Problem case for apeglm

Error and concordance for various methods

Extra slides / see paper

Yeast: n=3 and n=5

Simulation: n=5 and n=10

Simulation: n=30 and n=50

Anders, S., & Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11, R106. https://doi.org/10.1186/gb-2010-11-10-r106

Efron, B., & Morris, C. (1975). Data Analysis Using Stein’s Estimator and Its Generalization. Journal of the American Statistical Association, 70(350), 311–319.

Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 1360–1383. http://www.stat.columbia.edu/~gelman/research/published/priors11.pdf

Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550. https://doi.org/10.1186/s13059-014-0550-8

McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40, 4288–4297. https://doi.org/10.1093/nar/gks042

Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods. http://dx.doi.org/10.1038/nmeth.4197

Robinson, M. D., & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(r25). https://doi.org/10.1186/gb-2010-11-3-r25

Schurch, N. J., Schofield, P., Gierliński, M., Cole, C., Sherstnev, A., Singh, V., … Barton, G. J. (2016). How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA, 22(6), 839–851. https://doi.org/10.1261/rna.053959.115

Soneson, C., Love, M. I., & Robinson, M. (2015). Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research, 4(1521). https://doi.org/10.12688/f1000research.7563.1

Stephens, M. (2016). False discovery rates: A new deal. Biostatistics, 18(2). https://doi.org/10.1093/biostatistics/kxw041

Trapnell, C., Hendrickson, D. G., Sauvageau, M., Goff, L., Rinn, J. L., & Pachter, L. (2013). Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology. http://dx.doi.org/10.1038/nbt.2450

Van den Berge, K., Perraudeau, F., Soneson, C., Love, M. I., Risso, D., Vert, J.-P., … Clement, L. (2018). Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell applications. Genome Biology, 19(24). https://doi.org/10.1186/s13059-018-1406-4

Wiel, M. A. van de, Leday, G. G., Pardo, L., Rue, H., Vaart, A. W. van der, & Wieringen, W. N. van. (2012). Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics, 14(1), 113–128. https://doi.org/10.1093/biostatistics/kxs031

Wiel, M. A. van de, Neerincx, M., Buffart, T. E., Sie, D., & Verheul, H. M. (2014). ShrinkBayes: a versatile R-package for analysis of count-based sequencing data in complex study designs. BMC Bioinformatics, 15(116). https://doi.org/10.1186/1471-2105-15-116

Wu, H., Wang, C., & Wu, Z. (2012). A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics, 14(2). https://doi.org/10.1093/biostatistics/kxs033

Zhu, A., Ibrahim, J. G., & Love, M. I. (2018). Heavy-tailed prior distributions for sequence count data: Removing the noise and preserving large differences. bioRxiv. https://doi.org/10.1101/303255