Improving the accuracy of ASR (Automatic Speech Recognition) systems for OOV (Out-of-Vocabulary) words

Customers interact with a brand through many channels, but voice remains the most preferred one. Recent advances in audio AI and NLP have made automatic speech recognition (ASR) the backbone of any effort to make sense of voice conversations. ASR is fast becoming a de facto requirement across industries and applications, including closed captions for video, contact centers, medical transcription, educational content, workforce meetings, sales conversations and other automated applications, both to gain insights and to meet KPIs.

Today, several transcription vendors can convert audio into text. One of the most important issues with these transcriptions is accuracy, usually measured as word error rate (WER), especially for industry-specific terms such as proper nouns, product names, addresses and alphanumeric strings. Since ASR systems are trained on cross-industry data, they provide reasonable accuracy for general conversation. However, accuracy drops for specialised, industry-specific conversations because those terms are under-represented in the training data.

In this article, I would like to suggest a few techniques that can be adopted to improve ASR accuracy for industry-specific problems and use cases. The right technique depends on how you are using ASR:

i. ASR to identify and validate the words uttered in an audio

Let’s consider a company that has set up an IVR system to gather information from a customer over a voice line, such as a name, date of birth, address or alphanumeric identification number, and to quickly match these inputs against a database.

Here, the best solution is to use acoustic matching techniques such as forced aligners (Gentle, Montreal Forced Aligner, etc.). The audio and the database values are given to the aligner as inputs to check whether the words uttered sound like what is in the database.

In fact, this can be used in addition to direct text matching against the ASR output. A weighted combination of the two validations can give more than 95% accuracy for this problem.
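The weighted combination described above could be sketched as follows in Python. This is only an illustration under stated assumptions: the forced aligner's result is assumed to have already been reduced to a single `acoustic_score` between 0 and 1 (how that score is derived depends on the aligner and is not shown), and the function names, the 0.4 text weight and the 0.8 acceptance threshold are hypothetical values that would need tuning on real calls.

```python
from difflib import SequenceMatcher


def text_match_score(asr_text: str, expected: str) -> float:
    """Similarity in [0, 1] between the ASR output and the database value.

    Both strings are lower-cased and stripped of non-alphanumeric characters,
    so "a b 1 2 3" and "AB123" compare as equal.
    """
    def normalise(s: str) -> str:
        return "".join(ch for ch in s.lower() if ch.isalnum())

    return SequenceMatcher(None, normalise(asr_text), normalise(expected)).ratio()


def validate(asr_text: str, expected: str, acoustic_score: float,
             text_weight: float = 0.4, threshold: float = 0.8):
    """Combine text matching and acoustic matching into one decision.

    acoustic_score is an assumed, pre-computed value in [0, 1] derived from a
    forced aligner; text_weight and threshold are illustrative, not tuned.
    Returns (accepted, combined_score).
    """
    if not 0.0 <= acoustic_score <= 1.0:
        raise ValueError("acoustic_score must be between 0 and 1")
    combined = (text_weight * text_match_score(asr_text, expected)
                + (1.0 - text_weight) * acoustic_score)
    return combined >= threshold, combined


# Example: the caller spelled out an ID and the aligner was fairly confident.
accepted, score = validate("a b 1 2 3", "AB123", acoustic_score=0.9)
print(accepted, round(score, 2))
```

Giving the acoustic score the larger weight reflects the idea that the aligner can still confirm a match when the ASR text is slightly off; whether that weighting is right for a given IVR flow is an empirical question.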

ii. ASR to transcribe long, general conversations

When you are transcribing conversations that are not industry specific (workforce meetings, conferences, sales calls, educational content, self-automation bots, etc.), it is very difficult to improve ASR accuracy, since there is no pattern to exploit for getting the terms right.

In this case, the best solution is to use phonetics to handle OOV (out-of-vocabulary) words. Often the ASR’s output hypothesis comes close to getting these words right but contains characteristic errors, such as substituting similar-sounding dictionary words (“IQ” -> “I” “queue”) or splitting a word in two (“Marsview” -> “mars” “view”). These can be corrected with a phonetics library in Python. Though this is not a foolproof way of solving the problem, it can solve at least 60–70% of the cases just by adding custom vocabulary dictionaries tailored to each industry. You can combine several phonetic algorithms and use a weighted average of their scores to match the transcribed words against the right custom vocabulary words. The threshold used to match and replace custom vocabulary words in the transcription, based on Levenshtein distance (or any other string distance), has to be managed very tightly; a loose threshold introduces more errors into the final transcript.
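As a rough sketch of this phonetic correction step, the Python below uses a small hand-written Soundex encoder in place of a phonetics library, plus a Levenshtein-based similarity, and combines the two with a weighted average. The function names, the 0.7/0.3 weights, the 0.75 threshold and the two-token window are all assumptions made for illustration; a production setup would likely combine several phonetic algorithms and tune the threshold on real transcripts.

```python
# Soundex digit for each consonant; vowels, y, h and w have no digit.
CODES = {ch: digit
         for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                ("l", "4"), ("mn", "5"), ("r", "6"))
         for ch in letters}


def soundex(word: str) -> str:
    """Four-character Soundex code, e.g. 'Robert' gives 'R163'."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded, prev = [], CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":  # h and w do not separate letters with the same code
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code
    return (letters[0].upper() + "".join(encoded) + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def match_score(candidate: str, term: str, phonetic_weight: float = 0.7) -> float:
    """Weighted average of phonetic agreement and string similarity, in [0, 1]."""
    a, b = candidate.lower(), term.lower()
    code = soundex(a)
    phonetic = 1.0 if code and code == soundex(b) else 0.0
    similarity = 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)
    return phonetic_weight * phonetic + (1.0 - phonetic_weight) * similarity


def restore_vocabulary(tokens, vocabulary, threshold=0.75, max_span=2):
    """Replace runs of up to max_span tokens that sound like a vocabulary term."""
    out, i = [], 0
    while i < len(tokens):
        for span in range(min(max_span, len(tokens) - i), 0, -1):
            candidate = "".join(tokens[i:i + span])
            best = max(vocabulary, key=lambda term: match_score(candidate, term))
            if match_score(candidate, best) >= threshold:
                out.append(best)
                i += span
                break
        else:  # no span starting here matched, keep the token unchanged
            out.append(tokens[i])
            i += 1
    return out


transcript = "the mars view team measured i queue".split()
print(restore_vocabulary(transcript, ["Marsview", "IQ"]))
```

With these settings both examples from the text are recovered: “mars” “view” is merged into “Marsview” and “i” “queue” becomes “IQ”. Note that “i queue” only passes because the phonetic weight is high; lowering the threshold further would start replacing ordinary words, which is the failure mode the paragraph above warns about.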

iii. ASR to transcribe conversation for a specific industry

This is the best-case scenario, where you are using transcription for one specific industry only. There are two possible ways to approach this problem. The first is to have a custom ASR model trained specifically for your industry. For example, a vendor can work with customers to understand the industry they are targeting, take their recordings together with the ground truth (if it is available, or annotate the recordings if it is not) and train an ASR model with high domain-specific accuracy.

Or, if you do not have the recordings, the other solution is to build an LM (language model) for your use case from existing transcriptions. Several toolkits, such as NeMo, ESPnet, Kaldi and tensor2tensor, can be used to build your own language model. This requires a lot of data (both transcribed and corrected) to achieve high accuracy.
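To make the idea of a domain language model concrete, here is a deliberately tiny sketch in Python: a bigram model with add-one smoothing, trained on a handful of corrected transcripts and used to choose between competing ASR hypotheses. It is not how NeMo, ESPnet or Kaldi build or apply language models; the `BigramLM` class, the `rescore` helper and the sample sentences are all hypothetical and only show why in-domain text helps.

```python
import math
from collections import Counter


class BigramLM:
    """Toy bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, sentence: str) -> float:
        """Smoothed log-probability of a sentence under the model."""
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def rescore(lm: BigramLM, hypotheses):
    """Return the hypothesis the language model finds most likely."""
    return max(hypotheses, key=lm.log_prob)


# Hypothetical corrected transcripts from a medical domain.
corpus = [
    "the patient was given metformin",
    "metformin dose was increased",
    "the dose was given",
]
lm = BigramLM(corpus)
n_best = ["the patient was given met for men", "the patient was given metformin"]
print(rescore(lm, n_best))
```

A three-sentence corpus is enough to prefer “metformin” over “met for men” here, but a usable model needs far more text, which is the data requirement mentioned above. This toy scorer also favours shorter hypotheses, something real decoders balance against the acoustic score.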

We solve ASR accuracy issues on a case-by-case basis. Sometimes the ASR metric you are relying on can be misleading. If you are using ASR for intent analysis, or only looking for certain entities rather than using the whole transcription, then the overall word error rate is a bad measure of accuracy. The metric has to be defined carefully so that it fits the problem.
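The difference between a general metric and a task-specific one can be illustrated with a short Python sketch. WER is computed in the usual way, as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The `entity_recall` function is a hypothetical example of a metric that only checks whether the entities you care about survived transcription; the sample utterance and entity list are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1] / len(ref)


def entity_recall(entities, hypothesis: str) -> float:
    """Fraction of expected entities that appear as whole words in the hypothesis."""
    if not entities:
        raise ValueError("entities must not be empty")
    padded = " " + " ".join(hypothesis.lower().split()) + " "
    found = sum(1 for entity in entities if " " + entity.lower() + " " in padded)
    return found / len(entities)


reference = "um so i would like to book a demo for tuesday"
hypothesis = "so i'd like to book demo for tuesday"

print(round(word_error_rate(reference, hypothesis), 2))
print(entity_recall(["demo", "tuesday"], hypothesis))
```

In this example roughly a third of the reference words are wrong or missing, yet both entities needed to act on the request are transcribed correctly. Judged by WER alone the transcript looks poor; judged by the task, it is fine.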
