Abstract

In recent years, the community has witnessed the enormous progress of deep neural network models in matching or even surpassing human performance on a variety of speech and audio tasks, including Automatic Speech Recognition (ASR), Spoken Language Understanding (SLU), and Text-to-Speech (TTS). However, these impressive achievements depend predominantly on training with large datasets defined for a single, rigid task. In such a paradigm, the model is expected to learn universal knowledge from a static body of data in a stationary environment. In contrast, the real world is inherently ever-changing and non-stationary. New data is generated and collected every second in a streaming fashion, and novel classes may emerge over time. Without proper adaptation techniques, knowledge learned in the past can be easily erased as the model learns subsequent tasks, resulting in overall performance degradation. This phenomenon is called catastrophic forgetting, and it limits the practical use and expansion of many deep neural network models.

Continual learning has emerged as a machine learning paradigm that enables artificial intelligence (AI) systems to learn from a continuous stream of data and incrementally improve their performance over time. By adapting to changing environments and user needs, continual learning aims to address catastrophic forgetting, so that the model can gradually extend its knowledge without drastically forgetting what it has learned in the past. This property is crucial in practice, enabling artificial systems to learn from the infinite data streams of a changing world in a lifelong manner.

This thesis focuses on the underexplored question of how continual learning techniques can be made effective for speech and audio tasks, approached from three perspectives: data, model, and metrics. We first introduce the background and formulations of multiple continual learning scenarios, including data-incremental, class-incremental, and task-incremental settings. We then present how different categories of continual learning scenarios and methods can be applied to different modules of the modeling pipeline. Starting from this taxonomy of methods, we propose improvements to continual learning along each of the three perspectives. First, we demonstrate how to address data sampling, selection, and imbalance to aid continual learning on different audio tasks. Second, we show how jointly leveraging model architecture and data with different learning strategies can benefit the continual learning process. Lastly, we propose new continual evaluation metrics that provide a deeper, more comprehensive understanding of general continual learning behavior. We believe that this thesis provides a broad exploration of continual learning scenarios across various speech and audio tasks, and takes an important step towards realizing lifelong learning for speech interfaces.
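The data-centric perspective mentioned above (sampling and selection over a continuous stream) can be illustrated with a reservoir-sampling replay buffer, a common rehearsal mechanism used to mitigate catastrophic forgetting. This is a minimal sketch under assumed names; the class and parameters below are illustrative, not the thesis's actual implementation:

```python
import random

class ReplayBuffer:
    """Fixed-size memory of past examples, filled with reservoir
    sampling so that every item seen so far has an equal probability
    of being retained. Illustrative sketch only; not taken from the
    thesis itself."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Replace a stored item with probability capacity / n_seen.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        # Draw a rehearsal mini-batch of previously seen examples.
        return self.rng.sample(self.items, min(k, len(self.items)))

# Stream two hypothetical "tasks" of utterances through the buffer;
# each training batch would then mix current-task data with replayed items.
buf = ReplayBuffer(capacity=8)
for task in ("task_a", "task_b"):
    for i in range(100):
        buf.add((task, i))

rehearsal = buf.sample(4)
```

Mixing such rehearsal samples into each new task's mini-batches is one standard way to counteract forgetting without storing the full data stream.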

Details

Title
Continual Learning on Speech and Audio: Towards Data, Model and Metrics
Author
Yang, Muqiao
Publication year
2024
Publisher
ProQuest Dissertations & Theses
ISBN
9798382832937
Source type
Dissertation or Thesis
Language of publication
English
ProQuest document ID
3068670211
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.