Automatic Speech Recognition for Non-Native English with Transfer Learning
Alumni Data Science Project
As an extension of the “Trends in Computational Linguistics” course, part of Block 6 in the MDS Computational Linguistics (MDS-CL) program, four alumni from the 2020-2021 cohort (Toshiko Shibano, Xinyi (Jeremy) Zhang, Mia Taige Li, and Haejin Cho) worked with Muhammad Abdul-Mageed, Assistant Professor in MDS-CL, and Peter Sullivan, a former MDS-CL TA, on a project that set out to answer the following question:
How can we improve automatic speech recognition (ASR) models for all English speakers, both native (L1) and non-native (L2)?
As immigrants living in Canada, the alumni noted that L2 speakers were being misunderstood not only by people but also by ASR systems such as Siri and by the automatic captioning in tools like Zoom and YouTube.
The alumni used the Wav2Vec 2.0 model, which is a state-of-the-art pre-trained ASR model, and fine-tuned it on non-native English speech data from different countries.
The dataset that the alumni worked with was L2-ARCTIC, a speech corpus for non-native English speakers.
The recordings come from 24 L2 speakers with a balanced gender and L1 distribution: six L1s (Arabic, Hindi, Korean, Mandarin, Spanish, and Vietnamese), each represented by two female and two male speakers.
To quantify the impact of accents, the alumni prepared two baselines: (1) the Wav2Vec 2.0 model without any fine-tuning, and (2) the Wav2Vec 2.0 model fine-tuned on CMU ARCTIC, a corpus from the same domain as L2-ARCTIC but spoken by native English speakers. This let the alumni compare the accent-adapted models against the native-English models and measure how much of the drop in performance was due to accent.
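The baseline comparison above needs a score to compare. The article does not say which metric the alumni used; word error rate (WER), the usual ASR metric, is assumed here. The sketch below computes WER as word-level edit distance divided by the number of reference words. The function name and the sample transcripts are illustrative and do not come from the project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: WER is used only as a typical ASR metric; the article does
    not name the metric the alumni reported.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (in the previous row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up transcripts: the hypothesis drops one of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Scoring each baseline and each fine-tuned model on the same test utterances with a function like this gives directly comparable numbers; the gap between the native-trained and accent-trained models is then the degradation attributed to accent.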
The alumni expanded their training settings to reflect how the model might be fine-tuned in various real-world scenarios. They also added experiment settings in which the fine-tuned models were evaluated on speakers and accents not seen during training (speaker- and accent-independent settings), so that any performance gain could be attributed to the training method rather than to familiarity with the test speakers.
The alumni showed that a model fine-tuned on non-native English performed considerably better on non-native speech than a model fine-tuned on native English.
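As an illustration of what speaker-independent and accent-independent mean here, the sketch below builds a roster matching the corpus description (six L1s, two female and two male speakers each) and splits it so that the test speakers, or an entire test accent, never appear in training. The speaker IDs and function names are hypothetical placeholders, and the alumni's actual splits are not described in the article.

```python
L1_GROUPS = ["Arabic", "Hindi", "Korean", "Mandarin", "Spanish", "Vietnamese"]


def build_roster():
    """Return 24 speaker records: 6 L1s x (2 female + 2 male).

    The IDs (e.g. "korean_f1") are made-up placeholders, not corpus IDs.
    """
    return [
        {"id": f"{l1.lower()}_{gender}{n}", "l1": l1, "gender": gender}
        for l1 in L1_GROUPS
        for gender in ("f", "m")
        for n in (1, 2)
    ]


def speaker_independent_split(speakers, test_ids):
    """Hold out specific speakers; their accents may still appear in training."""
    test_ids = set(test_ids)
    train = [s for s in speakers if s["id"] not in test_ids]
    test = [s for s in speakers if s["id"] in test_ids]
    return train, test


def accent_independent_split(speakers, held_out_l1):
    """Hold out every speaker of one L1, so the test accent is unseen in training."""
    train = [s for s in speakers if s["l1"] != held_out_l1]
    test = [s for s in speakers if s["l1"] == held_out_l1]
    return train, test


roster = build_roster()
spk_train, spk_test = speaker_independent_split(roster, ["hindi_f1"])
acc_train, acc_test = accent_independent_split(roster, "Korean")
```

In the accent-independent case, any improvement on the held-out L1 cannot come from the model having heard that accent during fine-tuning, which is the kind of guarantee the extra experiment settings were meant to provide.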