Large Scale Inference and Combinatorial Variable Selection for Complex Dataset

Abstract

This dissertation advances the field of modern statistical theory and methodology by focusing on two primary areas: first, the quantification of uncertainty beyond mere estimation in combinatorial inference theory; and second, addressing the complexities and challenges inherent in electronic health records (EHR).

Chapter 1 introduces a novel combinatorial inference framework for general uncertainty quantification in ranking problems. Under the Bradley-Terry-Luce model, we aim to infer both local and global ranking properties, and we generalize the method to multiple testing problems with false discovery rate (FDR) control.
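For reference, the standard form of the Bradley-Terry-Luce model is given here. The notation (latent scores \theta_i for n items) is an assumption chosen for illustration and may differ from the notation used in the dissertation.

```latex
% Standard Bradley-Terry-Luce model (notation assumed for illustration):
% each item i in {1, ..., n} has a latent score \theta_i, and in a pairwise
% comparison item i is preferred to item j with probability
\[
  \mathbb{P}(i \text{ beats } j)
    = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}},
  \qquad 1 \le i \ne j \le n .
\]
% The ranking of the items is the ordering of the scores \theta_1, ..., \theta_n.
```

In this form, a local ranking property concerns the position of a few items (for example, whether \theta_i > \theta_j), while a global property concerns the ordering as a whole.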

Chapter 2 develops a semi-supervised approach that efficiently leverages sizable unlabeled samples with error-prone EHR surrogate outcomes from multiple local sites to improve learning accuracy on the small gold-labeled data set. We apply our method to develop a high-dimensional genetic risk model for type II diabetes using large-scale data sets from the UK and Mass General Brigham biobanks, where only a small fraction of subjects at one site have been labeled via chart review.

Chapter 3 presents a novel inferential framework for general graphical models that selects graph features with false discovery rate control. The proposed method is based on the maximum of the p-values of the single edges that comprise the topological feature of interest, and is thus able to detect weak signals. Moreover, we introduce the K-dimensional persistent Homology Adaptive selectioN (KHAN) algorithm to select all homological features within K dimensions with uniform control of the false discovery rate over continuous filtration levels. The KHAN method applies a novel discrete Gram-Schmidt algorithm to select statistically significant generators from the homology group. We apply the structural screening method to identify the important residues of the SARS-CoV-2 spike protein during its binding to the ACE2 receptor. We score the residues in all domains of the spike protein by the p-value weighted filtration level in the network persistent homology for the closed, partially open, and open states, and identify the residues that are crucial for protein conformational changes and are therefore potential targets for inhibition.
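The idea of scoring a graph feature by the maximum p-value of its edges can be sketched as follows. This is a minimal illustration, not the dissertation's procedure: the Benjamini-Hochberg step-up rule is used here only as a familiar stand-in for FDR-controlled selection, and all function names, edges, and p-values are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def feature_p_value(edge_p: Dict[Edge, float], feature_edges: List[Edge]) -> float:
    """Combine edge-level p-values by their maximum.

    The feature is declared present only if every one of its edges is,
    so the largest edge p-value is the evidence for the whole feature.
    """
    return max(edge_p[e] for e in feature_edges)


def benjamini_hochberg(p_values: List[float], q: float) -> List[int]:
    """Indices rejected by the Benjamini-Hochberg step-up rule at level q.

    Assumption: BH is a stand-in for the FDR procedure in the text.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value falls under its threshold
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            k = rank
    return sorted(order[:k])


# Hypothetical edge-level p-values on a five-node graph.
edge_p = {(0, 1): 0.001, (1, 2): 0.004, (0, 2): 0.002, (2, 3): 0.30, (3, 4): 0.02}

# Hypothetical candidate features: a triangle, a two-edge path, a single edge.
features = [
    [(0, 1), (1, 2), (0, 2)],
    [(1, 2), (2, 3)],
    [(3, 4)],
]

feature_p = [feature_p_value(edge_p, f) for f in features]
selected = benjamini_hochberg(feature_p, q=0.1)
print(feature_p, selected)
```

In this toy input the two-edge path is not selected because one of its edges has a large p-value, even though its other edge is strongly significant.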

Details

Title

Large Scale Inference and Combinatorial Variable Selection for Complex Dataset

Author

Liu, Yue

Publication year

2024

Publisher

ProQuest Dissertations & Theses

ISBN

9798382775579

Source type

Dissertation or Thesis

Language of publication

English

ProQuest document ID

3063265752

Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.