Abstract

In recent years, the community has witnessed the enormous progress of deep neural network models in matching or even surpassing human performance on a variety of speech and audio tasks, including Automatic Speech Recognition (ASR), Spoken Language Understanding (SLU), and Text-to-Speech (TTS). However, these impressive achievements depend predominantly on training with large datasets defined for a single, rigid task. In such a paradigm, the model is expected to learn universal knowledge from a static body of data in a stationary environment. In contrast, the real world is inherently ever-changing and non-stationary. New data is generated and collected every second in a streaming fashion, and novel classes may emerge over time. Without proper adaptation techniques, knowledge learned in the past can be easily erased as the model learns subsequent tasks, resulting in overall performance degradation. This phenomenon is called catastrophic forgetting, and it limits the practical use and expansion of many deep neural network models.

Continual learning has emerged as a machine learning paradigm that enables artificial intelligence (AI) systems to learn from a continuous stream of data and incrementally improve their performance over time. By adapting to changing environments and user needs, continual learning aims to address catastrophic forgetting, so that the model can gradually extend its knowledge without drastically forgetting what it has learned in the past. This property is crucial in practice, enabling artificial systems to learn from the infinite data streams of a changing world in a lifelong manner.

This thesis focuses on the underexplored question of how continual learning techniques can be made effective for speech and audio tasks, approached from three perspectives: data, model, and metrics. We first introduce the background and formulations of multiple continual learning scenarios, including data-incremental, class-incremental, and task-incremental settings. We then present how different categories of continual learning scenarios and methods can be applied to different modules of the modeling pipeline. Starting from this taxonomy of methods, we propose improvements to continual learning along each of the three perspectives. First, we demonstrate how to address data sampling, selection, and imbalance to aid continual learning on different audio tasks. Second, we show how jointly leveraging model architecture and data with different learning strategies can benefit the continual learning process. Lastly, we propose new continual evaluation metrics that provide a deeper, more comprehensive understanding of general continual learning behavior. We believe that this thesis provides a broad exploration of continual learning scenarios across various speech and audio tasks, and takes an important step towards realizing lifelong learning for speech interfaces.
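The data-centric perspective mentioned above (sampling and selection over a continuous stream) can be illustrated with a reservoir-sampling replay buffer, a common rehearsal mechanism used to mitigate catastrophic forgetting. This is a minimal sketch under assumed names; the class and parameters below are illustrative, not the thesis's actual implementation:

```python
import random

class ReplayBuffer:
    """Fixed-size memory of past examples, filled with reservoir
    sampling so that every item seen so far has an equal probability
    of being retained. Illustrative sketch only; not taken from the
    thesis itself."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Replace a stored item with probability capacity / n_seen.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        # Draw a rehearsal mini-batch of previously seen examples.
        return self.rng.sample(self.items, min(k, len(self.items)))

# Stream two hypothetical "tasks" of utterances through the buffer;
# each training batch would then mix current-task data with replayed items.
buf = ReplayBuffer(capacity=8)
for task in ("task_a", "task_b"):
    for i in range(100):
        buf.add((task, i))

rehearsal = buf.sample(4)
```

Mixing such rehearsal samples into each new task's mini-batches is one standard way to counteract forgetting without storing the full data stream.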

Details

Title
Continual Learning on Speech and Audio: Towards Data, Model and Metrics
Author
Yang, Muqiao
Publication year
2024
Publisher
ProQuest Dissertations & Theses
ISBN
9798382832937
Source type
Dissertation or Thesis
Language of publication
English
ProQuest document ID
3068670211
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.