Abstract

This dissertation is motivated by the large influx of sequencing data, in terms of both the amount and the types of data. Current statistical and computational methods are inadequate for handling these data and hence for answering the corresponding scientific questions of interest.

In Chapter 1, we address the need for a data analysis platform that can process large amounts of next-generation sequencing (NGS) based methylation data. Bisulfite sequencing measures DNA methylation at base-pair resolution and has recently been adapted for use in single cells. We present a set of preprocessing pipelines that give users 1) reproducibility, 2) scalability, 3) integration with publicly available data, and 4) access to best-practice methods. The workflows produce output for visualization and further downstream analysis. Optional use of cloud computing resources facilitates the analysis of large datasets and integration with existing methylation data.

In Chapter 2, we focus on sparsity in single-cell DNA methylation data. Single-cell DNA methylation analysis has the potential to produce a high-resolution methylation landscape and to elucidate heterogeneity in methylation. However, it suffers from low coverage because of the low quantity of input DNA. We find that, on average, only about 5–10% of CpGs are observed in typical single-cell libraries. We show how missing methylation status can bias metrics such as mean methylation estimates, as well as clustering analyses. We propose a joint analysis approach that leverages bulk sequencing data to infer bias-corrected single-cell methylation status.
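
As a rough illustration of the missingness problem described above, the sketch below compares a cell's true mean methylation with the mean computed only from observed CpGs. It is not the dissertation's method or data: the cell, its 40 CpGs, the 10% coverage mask, and the helper name `mean_methylation` are all hypothetical, and the mask is deliberately chosen so that coverage is uneven across methylated and unmethylated sites.

```python
def mean_methylation(states, observed=None):
    """Mean of binary methylation calls (1 = methylated, 0 = unmethylated).

    If `observed` is given, only CpGs flagged True contribute to the mean.
    """
    if observed is None:
        observed = [True] * len(states)
    calls = [s for s, seen in zip(states, observed) if seen]
    if not calls:
        raise ValueError("no observed CpGs")
    return sum(calls) / len(calls)


# Hypothetical cell: 40 CpGs, 24 methylated followed by 16 unmethylated.
true_states = [1] * 24 + [0] * 16

# Hypothetical coverage mask (assumption): 4 of 40 CpGs observed (10%),
# one methylated site and three unmethylated sites.
observed = [True] + [False] * 23 + [True] * 3 + [False] * 13

coverage = sum(observed) / len(observed)
true_mean = mean_methylation(true_states)
naive_mean = mean_methylation(true_states, observed)

print(f"coverage={coverage:.2f} true mean={true_mean:.2f} naive mean={naive_mean:.2f}")
```

With this particular mask the naive estimate falls well below the true value. If the observed sites were instead a uniform random sample, the naive mean would be unbiased on average but still highly variable at such low coverage.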

In Chapter 3, we consider sparsity in rare variant data and how it can be used to infer population structure. Population substructure in genetic studies is often assessed by principal component analysis of genetic relatedness matrices (GRMs). With the general availability of whole-genome sequencing (WGS) platforms, rare variant data are now widely available. Because rare variants are genetically “younger” than common variants, they should enable a fine-scale assessment of substructure. Here, using data from the 1000 Genomes Project, we compare the features of Jaccard-based GRMs with those of standard approaches that use the genetic covariance matrix, with respect to their ability to examine and infer fine-scale population substructure.
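
The two kinds of relatedness matrix contrasted above can be sketched on a toy genotype table. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the function names `jaccard_matrix` and `covariance_grm`, the four-individual genotype table, and the convention that two individuals with no variants get similarity 0 are all hypothetical. The covariance version uses the common frequency-standardized form, in which each term is (x_ik - 2p_k)(x_jk - 2p_k) / (2p_k(1 - p_k)), averaged over variants.

```python
def jaccard_matrix(carriers):
    """Pairwise Jaccard similarity between sets of carried variant indices."""
    n = len(carriers)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            union = carriers[i] | carriers[j]
            # Assumed convention: an empty union gives similarity 0.
            value = len(carriers[i] & carriers[j]) / len(union) if union else 0.0
            sim[i][j] = sim[j][i] = value
    return sim


def covariance_grm(genotypes):
    """Frequency-standardized genetic covariance matrix from 0/1/2 allele counts."""
    n, m = len(genotypes), len(genotypes[0])
    freqs = [sum(row[k] for row in genotypes) / (2 * n) for k in range(m)]
    # Monomorphic variants have zero variance and cannot be standardized.
    used = [k for k in range(m) if 0.0 < freqs[k] < 1.0]
    if not used:
        raise ValueError("no polymorphic variants")
    grm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            total = sum(
                (genotypes[i][k] - 2 * freqs[k]) * (genotypes[j][k] - 2 * freqs[k])
                / (2 * freqs[k] * (1 - freqs[k]))
                for k in used
            )
            grm[i][j] = grm[j][i] = total / len(used)
    return grm


# Hypothetical toy data: 4 individuals (rows) by 4 rare variants (columns).
genotypes = [
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
# An individual "carries" a variant if it has at least one copy of the allele.
carriers = [{k for k, g in enumerate(row) if g > 0} for row in genotypes]

jaccard = jaccard_matrix(carriers)
grm = covariance_grm(genotypes)
print(jaccard[0][1], jaccard[0][2])
print(round(grm[0][1], 3), round(grm[0][2], 3))
```

In this toy table, individuals 0 and 1 share a variant and individuals 2 and 3 share another, so both matrices score those pairs higher than the cross pairs. The Jaccard index depends only on shared carrier status, whereas the covariance form also weights each variant by its allele frequency.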

Details

Title
Methods for Analyzing Sparse Genetic and Epigenetic Data: Single Cells to Population Levels
Author
Kangeyan, Divy S.
Publication year
2019
Publisher
ProQuest Dissertations & Theses
ISBN
9798684632785
Source type
Dissertation or Thesis
Language of publication
English
ProQuest document ID
2465783832
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.