Abstract

This dissertation advances the field of modern statistical theory and methodology by focusing on two primary areas: first, the quantification of uncertainty beyond mere estimation in combinatorial inference theory; and second, addressing the complexities and challenges inherent in electronic health records (EHR).

Chapter 1 introduces a novel combinatorial inference framework for general uncertainty quantification in ranking problems. Working under the Bradley-Terry-Luce model, we infer both local and global ranking properties and generalize the method to multiple-testing problems with false discovery rate (FDR) control.
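For orientation, the Bradley-Terry-Luce model in its standard textbook form assigns each item a latent score and models pairwise comparisons through those scores. The symbols below ($\theta_i$ for the score of item $i$, $n$ for the number of items) are notational assumptions made here and may differ from the notation used in the dissertation.

```latex
\[
  \mathbb{P}(\text{item } i \text{ beats item } j)
  = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}},
  \qquad 1 \le i \ne j \le n .
\]
```

Under this form, the ranking of the items is the ordering of the scores $\theta_1, \dots, \theta_n$, so a statement about ranks (for example, that item $i$ is among the top $K$) is a statement about the relative order of the scores.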

Chapter 2 develops a semi-supervised approach that efficiently leverages sizable unlabeled samples with error-prone EHR surrogate outcomes from multiple local sites to improve learning accuracy on the small gold-labeled data set. We apply our method to develop a high-dimensional genetic risk model for type II diabetes using large-scale data sets from the UK and Mass General Brigham biobanks, where only a small fraction of subjects at one site have been labeled via chart review.

Chapter 3 presents a novel inferential framework for general graphical models that selects graph features while controlling the false discovery rate. The proposed method is based on the maximum of the p-values of the single edges that comprise the topological feature of interest, and is thus able to detect weak signals. Moreover, we introduce the K-dimensional persistent Homology Adaptive selectioN (KHAN) algorithm, which selects all homological features within K dimensions with uniform control of the false discovery rate over continuous filtration levels. The KHAN method applies a novel discrete Gram-Schmidt algorithm to select statistically significant generators from the homology group. We apply the structural screening method to identify the important residues of the SARS-CoV-2 spike protein during its binding to the ACE2 receptors. We score the residues for all domains in the spike protein by the p-value-weighted filtration level in the network persistent homology for the closed, partially open, and open states, and identify the residues that are crucial for protein conformational changes and are thus potential targets for inhibition.
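The idea of scoring a topological feature by the maximum of its edge p-values can be illustrated with a minimal sketch. This is not the dissertation's procedure: the function names, the toy p-values, and the use of the standard Benjamini-Hochberg step-up rule as the FDR-controlling step are all assumptions made for illustration, and the KHAN algorithm itself is not reproduced here.

```python
from typing import List, Sequence


def feature_p_value(edge_p_values: Sequence[float]) -> float:
    """Score a feature by the largest p-value among its edges.

    Hypothetical helper: the feature is treated as present only if
    every one of its edges is present, so its evidence is limited by
    its weakest edge.
    """
    if not edge_p_values:
        raise ValueError("a feature must contain at least one edge")
    return max(edge_p_values)


def benjamini_hochberg(p_values: Sequence[float], q: float) -> List[int]:
    """Return the indices selected by the standard BH step-up rule at level q.

    Used here only as a stand-in for an FDR-controlling selection step.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Track the largest rank whose p-value falls under the BH line.
        if p_values[idx] <= q * rank / m:
            k = rank
    return sorted(order[:k])


# Toy example (made-up numbers): three triangles, each with three edge p-values.
features = [
    [0.001, 0.002, 0.004],  # all edges strongly supported
    [0.001, 0.30, 0.002],   # one weak edge dominates the maximum
    [0.01, 0.02, 0.015],    # moderately supported throughout
]
feature_ps = [feature_p_value(edges) for edges in features]
selected = benjamini_hochberg(feature_ps, q=0.05)
print(feature_ps, selected)
```

In this toy input the second triangle is not selected because a single poorly supported edge drives its maximum p-value, which reflects the requirement that every edge of a feature be present.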

Details

Title
Large Scale Inference and Combinatorial Variable Selection for Complex Dataset
Author
Liu, Yue
Publication year
2024
Publisher
ProQuest Dissertations & Theses
ISBN
9798382775579
Source type
Dissertation or Thesis
Language of publication
English
ProQuest document ID
3063265752
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.