Transcription accuracy is a critical challenge that significantly impacts the accessibility of transcription services, especially for non-native English speakers, who face a markedly higher error rate. This paper addresses the root causes of transcription inaccuracies for non-native speakers: by combining linguistic techniques with advanced language models, we aim to improve transcription quality and ensure transcriptions are accurate and accessible for all users. The sections below walk through the methodologies and strategies we employed to tackle this issue.
The Transcription Challenge
Transcription errors are a widespread issue, especially for non-native English speakers, also known as L2 speakers. According to Peter Sullivan, Toshiko Shibano, and Muhammad Abdul-Mageed's research, "Improving Automatic Speech Recognition for Non-Native English with Transfer Learning and Language Model Decoding" (2022), these speakers face a 10% increase in word error rate (WER) compared to native (L1) speakers, and therefore a much higher rate of mistranscriptions.
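For context, WER is the standard metric for speech recognition quality: the number of word-level substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the length of the reference. A minimal implementation, with an invented example sentence, looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and
    # the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("we parked near the pool", "we barked near the bool"))  # 0.4
```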
The root of this problem lies in the phonological differences between English and other languages. For example, many Arabic dialects lack the phoneme /p/ and substitute it with the voiced equivalent /b/. Consequently, words containing /p/ are often misinterpreted, leading to communication inaccuracies. Given that non-native English speakers outnumber native speakers by nearly three to one, this problem significantly impacts a large portion of the global population, making transcription less accessible and effective for many.
Reversing Phonological Errors
Our hypothesis was that many transcription errors for non-native English speakers occur because they replace the phonemes of English words with more familiar phonemes from their native language. We believed that by reversing these replacements after transcription, we could greatly improve accuracy.
However, this method faced challenges: determining the correct phoneme substitution from the context of surrounding words, applying substitutions efficiently, and converting textual words into their phonetic equivalents. To tackle these complexities, our solution combined a phonetic alphabet with advanced language models, as illustrated in the toy example below.
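As a toy illustration of the idea, consider the /p/ → /b/ substitution described above. The sketch below works on ordinary spelling rather than IPA for readability; the real system operates on phonemes, as the following sections describe:

```python
# Toy illustration: an accent maps /p/ to /b/, so the ASR may hear "barking"
# where the speaker said "parking". Reversing the swap after transcription
# recovers candidate words; context then decides between them.
SPEAKER_SWAP = {"p": "b"}                               # what the accent does
REVERSE_SWAP = {v: k for k, v in SPEAKER_SWAP.items()}  # what we undo

def reverse_candidates(word: str) -> set[str]:
    candidates = {word}  # the transcribed word itself is always a candidate
    for heard, intended in REVERSE_SWAP.items():
        if heard in word:
            candidates.add(word.replace(heard, intended))
    return candidates

print(reverse_candidates("barking"))  # {'barking', 'parking'}
```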
Building a Solution
Developing the Text-to-IPA Tool
The first step was to develop a text-to-IPA (International Phonetic Alphabet) translator. After exploring various options, we settled on the CMU-IPA dictionary and wrote a script that converts English words into their phonetic counterparts. This translator was essential for applying phonetic substitutions accurately and forms the foundation of our transcription improvement system.
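A minimal sketch of the translator, assuming a plain-text dictionary with one tab-separated word/IPA pair per line (the filename and exact format here are illustrative):

```python
def load_ipa_dict(path: str) -> dict[str, str]:
    """Load a CMU-IPA style dictionary, assumed to hold 'word<TAB>ipa' lines."""
    ipa = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, _, pron = line.strip().partition("\t")
            if pron:
                ipa[word.lower()] = pron
    return ipa

def word_to_ipa(word: str, ipa_dict: dict[str, str]) -> str:
    # Fall back to the original spelling for out-of-vocabulary words.
    return ipa_dict.get(word.lower(), word)

ipa_dict = load_ipa_dict("cmudict-ipa.txt")  # hypothetical path
print(word_to_ipa("parking", ipa_dict))      # e.g. 'pɑɹkɪŋ'
```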
Creating Phonetic Substitutions
Initially, we created a basic function for applying phonetic substitutions, but it proved limited and cumbersome. We then discovered the Speech Accent Archive, a comprehensive resource detailing phonetic changes in various accents. Using data from this archive, we built a modular system to generate phonetic substitutions for different accents. Because several substitutions can apply to one word in any combination, we used binary counting to enumerate every combination of the applicable substitutions efficiently. And rather than applying every substitution to every word, we only applied substitutions involving phonemes actually present in the word, as in the sketch below.
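A sketch of the binary-counting step: each bit of a counter decides whether one applicable substitution is applied, so k applicable substitutions yield up to 2^k candidate pronunciations. The one-entry substitution table here is a toy; real tables are generated from the Speech Accent Archive data:

```python
def candidate_pronunciations(ipa: str, substitutions: dict[str, str]) -> set[str]:
    # Keep only the substitutions whose source phoneme occurs in this word.
    applicable = [(src, dst) for src, dst in substitutions.items() if src in ipa]
    candidates = set()
    # Binary counting: the k-th bit of `mask` toggles the k-th substitution.
    for mask in range(2 ** len(applicable)):
        variant = ipa
        for bit, (src, dst) in enumerate(applicable):
            if mask & (1 << bit):
                variant = variant.replace(src, dst)
        candidates.add(variant)
    return candidates

# Reversing an Arabic-accent pattern: the speaker's /p/ -> /b/ becomes /b/ -> /p/.
print(candidate_pronunciations("bɑɹk", {"b": "p"}))  # {'bɑɹk', 'pɑɹk'}
```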
Contextual Re-evaluation with AI
To refine the accuracy of transcription corrections, we then integrated the OpenAI API for re-evaluating phonetic options within context. This step allowed the system to determine the most likely spoken utterance by considering the surrounding words, significantly enhancing the reliability of the transcriptions. During testing, the system showed substantial improvement, even with synthetic examples that the language model had not previously encountered.
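A sketch of what this step can look like with the OpenAI Python SDK, once the phonetic candidates have been mapped back to English words; the prompt wording and model name are illustrative, not our exact implementation:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rerank_in_context(original: str, candidates: list[str], sentence: str) -> str:
    """Ask a language model which candidate best fits the surrounding transcript."""
    prompt = (
        f'A speech transcript reads: "{sentence}"\n'
        f'The word "{original}" may be a mistranscription. '
        f"Which of these candidates was most likely said: {', '.join(candidates)}? "
        "Answer with the single best candidate only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(rerank_in_context("bark", ["bark", "park"], "we left the car in the bark"))
# expected output: 'park'
```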
Results and Learnings
The results of the project supported our hypothesis: we saw significant improvements in transcription accuracy for non-native English speech. With the described updates, our system can now accommodate approximately 148 accents, potentially benefiting nearly 1 billion speakers.
Future Improvements
While the current system focuses on phoneme-to-phoneme substitutions, future improvements could incorporate additional context, such as a phoneme's position in the word and its surrounding sounds, to better track cross-word assimilation and other nuanced phonological shifts. The approach could also be expanded to consider all generalizations of a given language rather than those of a single speaker, and it could be applied to speech impediments, making transcription even more accessible.