Abstract

With the explosion of information, massive amounts of data are generated daily from many sources. Because infrastructure and human capacity for data integration are limited and processing must be efficient, some data, especially historical data, are stored in aggregated form at different levels of aggregation (e.g., aggregated over different time intervals). For example, epidemiological data may preserve only monthly counts of infected people. Meanwhile, data analysis and machine learning models often require detailed knowledge of the data for accurate analysis and prediction. This information must be obtained from either the original or the aggregated data.

Motivated by this challenge, this thesis aims to facilitate the generation and use of aggregated data in three respects: 1) reconstructing higher-resolution time series from aggregated data with acceptable performance; 2) selecting aggregated data for analysis with minimal loss of performance, e.g., detecting measles outbreaks from monthly counts may perform comparably to using the raw data; 3) generating aggregated data for future studies with less information loss, e.g., aggregating different parts of the data at different resolutions according to their importance.

Most data reconstruction methods use domain knowledge, e.g., smoothness, periodicity, or sparsity, to improve reconstruction accuracy. However, in many applications domain knowledge is limited and may be inaccurate, which degrades the reconstruction. To address this, I present two methods: 1) ARES (Automatic REStoration), which reconstructs data by automatically discovering patterns in the time series using the annihilating filter technique; and 2) TURBOLIFT, which aims to improve the quality of any existing disaggregation method by refining its initial reconstruction.

Although reconstruction provides a detailed view of the data, its performance may vary with the aggregation level, and it incurs extra computational cost. Moreover, in some cases, analyzing coarse data may be sufficient to achieve acceptable accuracy. Therefore, I propose a framework, SMARTPROGNOSIS, that automatically suggests the aggregation levels that maximize performance under specific machine learning models.

Most aggregation methods lose information as the aggregation level increases, which results in lossy aggregated data; e.g., with annual counts, it is hard to capture the detailed trend within the year. To address this drawback, I propose a novel summarization algorithm, I-AGG, that aggregates data while emphasizing the critical information in the original data.

Details

Title
Methods and Techniques for Efficient Processing of Aggregated Data
Author
Yang, Fan
Publication year
2022
Publisher
ProQuest Dissertations & Theses
ISBN
9798351492933
Source type
Dissertation or Thesis
Language of publication
English
ProQuest document ID
2723853931
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.