Chapter 4 Limitations and extensions

In this final chapter, I discuss limitations to the methods presented earlier, and extensions for analyzing high dimensional counts in contexts beyond what was previously covered.

  1. Single cell RNA-seq - The DESeq2 framework shown in the gene-level analysis chapter was designed for bulk RNA-seq, in which the Negative Binomial GLM assessing differences across samples was suitable both in terms of distribution and in terms of answering many biological questions of interest. In single cell RNA-seq, there are new considerations and questions of interest. One aspect is that, with UMI barcoding, there is a need for quantification methods that resolve errors and de-duplicate the read data into molecule counts per cell. The alevin method (10), packaged within the Salmon software, can accomplish this UMI de-duplication, and can resolve the increased rate of multi-mapping reads seen in 3’ tagged sequencing, through an approach similar to that taken by Salmon. The quantification from alevin can be easily imported into R/Bioconductor using the tximeta software seen in the quantification chapter.

    After quantification, there are many choices regarding the analysis pipeline, I refer to Bioconductor’s online book for single cell analysis, and recent reviews for systematic comparisons. (45) have recently published an overview and online book for performing analysis of scRNA-seq data using Bioconductor packages. (46) evaluates methods for detecting differences in expression across groups of cells. (47) evaluates methods for dimension reduction, which is often performed in the context of cell clustering and lineage reconstruction. (48) evaluates methods for clustering to recover sub-populations of cells. Finally, I note that the NB methods shown in the gene-level chapter can be combined with other statistical methods to add and model a zero component, in the case that the Negative Binomial is not a suitable distribution (49). The zero component may not be needed for all scRNA-seq datasets however, in particular if UMI de-duplication is possible.
  2. Long reads - The data presented in previous sections involved sequencing relatively short sequences of the cDNA fragments. They sequences are short in the sense that they do not come close to capturing the entire sequence of the transcript for most mammalian transcripts. However, new technologies have emerged in the past decade that allow for high-throughput sequencing of lengths that approach the entire transcript length. This necessitates new methods for alignment (the long sequences nevertheless have a higher error rate than the “short” reads). One of the most popular methods for aligning long reads is minimap2 (50). Following alignment, it is possible to again quantify expression using Salmon and import the data into R/Bioconductor using tximeta. A systematic evaluation of quantification using the Nanopore long read technology has been performed by (51). A pipeline for long read mapping with minimap2 and quantification with Salmon has been recently published with an associated GitHub repository (52). Finally, a review of bioinformatic pipelines for long read data analysis has recently been published by (53).
  3. Genetic variation - An aspect not explored in the previous sections was genetic variation across the samples in the exonic sequence. One analysis of interest is to identify common genetic variants in the exonic sequence, and to quantify, among the samples that are heterozygous for a given exonic SNP, the expression of each allele. Best practices for allelic expression analysis have been presented by (54), and an evaluation of EM-based methods for assessing allelic expression have been proposed and compared by (55). Aside from interest in quantifying allelic expression in the presence of heterozygous exonic positions, (56) have examined the effect of genetic variation on transcript and gene expression quantification.
  4. Microbiome - I have described here various methods for analyzing counts reflecting the abundance of RNA molecules across samples. Another type of high dimensional count dataset with similar but distinct analysis considerations is that produced in a microbiome or metagenomic study, in which the counts reflect the abundance of certain taxa across samples. The count data is arranged in a similar format to gene expression, but with the taxa replacing the transcripts or genes on the rows of the matrix.

    While many have considered using gene expression normalization and testing methods for analyzing this type of data, a number of the assumptions used in gene expression models may be invalid for particular microbiome datasets. In particular, I demonstrated in the first exploration of gene expression counts that there were thousands of features in which the changes from sample to sample were minimal. There was a clear center of the distribution of log fold changes that could be used to estimate the size factors for scaling normalization across samples. In particular microbiome studies, this assumption may not fit, as there may not be a group of taxa that can be assumed roughly equally abundant across all samples in a dataset. In addition, there may be too few taxa, such that the Poisson modeling assumption no longer makes sense, and so a compositional model may better capture the distributional properties (57). A recent benchmarking effort compares compositional methods as well as single cell RNA-seq methods for analyzing microbiome datasets for differences in abundance of taxa (58).

    Alternative pipelines for analyzing microbiome abundance data have been detailed by (59). There may be more interesting and relevant approaches to modeling the counts besides the GLM, and latent variable models are considered and applied to microbiome datasets recently by (60). Finally, statistical considerations of various diversity measures for count-based microbiome studies have been explored recently by (61).

References

10. Srivastava A, Malik L, Smith TS, Sudbery I, Patro R. 2019. Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biology. 20(65):

45. Amezquita RA, Lun ATL, Becht E, Carey VJ, Carpp LN, et al. 2020. Orchestrating single-cell analysis with bioconductor. Nature Methods. 17(2):137–45

46. Soneson C, Robinson MD. 2018. Bias, robustness and scalability in single-cell differential expression analysis. Nature Methods. 15(4):255–61

47. Sun S, Zhu J, Ma Y, Zhou X. 2019. Accuracy, Robustness and Scalability of Dimensionality Reduction Methods for Single Cell RNAseq Analysis. bioRxiv

48. Duo A, Robinson M, Soneson C. 2018. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research. 7(1141):

49. Van den Berge K, Perraudeau F, Soneson C, Love MI, Risso D, et al. 2018. Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell applications. Genome Biology. 19(24):

50. Li H. 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 34(18):3094–3100

51. Soneson C, Yao Y, Bratus-Neuenschwander A, Patrignani A, Robinson MD, Hussain S. 2019. A comprehensive examination of Nanopore native RNA sequencing for characterization of complex transcriptomes. Nature Communications. 10(1):3359

52. Cruz-Garcia L, O’Brien G, Sipos B, Mayes S, Love M, et al. 2019. Generation of a transcriptional radiation exposure signature in human blood using long-read nanopore sequencing. Radiation research

53. Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q. 2020. Opportunities and challenges in long-read sequencing data analysis. Genome Biology. 21(1):30

54. Castel SE, Levy-Moonshine A, Mohammadi P, Banks E, Lappalainen T. 2015. Tools and best practices for data processing in allelic expression analysis. Genome Biology. 16(1):195

55. Raghupathy N, Choi K, Vincent MJ, Beane GL, Sheppard KS, et al. 2018. Hierarchical analysis of RNA-seq reads improves the accuracy of allele-specific expression. Bioinformatics. 34(13):2177–84

56. Srivastava A, Malik L, Sarkar H, Zakeri M, Almodaresi F, et al. 2019. Alignment and mapping methodology influence transcript abundance estimation. bioRxiv

57. Fernandes AD, Reid JN, Macklaim JM, McMurrough TA, Edgell DR, Gloor GB. 2014. Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome. 2(1):15

58. Calgaro M, Romualdi C, Waldron L, Risso D, Vitulo N. 2020. Assessment of single cell rna-seq statistical methods on microbiome data. bioRxiv

59. Callahan B, Sankaran K, Fukuyama J, McMurdie P, Holmes S. 2016. Bioconductor Workflow for Microbiome Data Analysis: from raw reads to community analyses. F1000Research. 5(1492):

60. Sankaran K, Holmes SP. 2018. Latent variable modeling for the microbiome. Biostatistics. 20(4):599–614

61. Willis AD. 2019. Rarefaction, alpha diversity, and statistics. Frontiers in Microbiology. 10:2407