Speech Recognition

Macsen listens to the user and plays music

Speech recognition technologies allow a computer system to recognize words that someone speaks in order to convert the sound into text.

This does not necessarily mean that the speech recognition system will understand the meaning of every word.

Welsh wav2vec2 (With KenLM)

This is the Welsh speech recognition provision that currently gives the best recognition results.

This speech recognition is built on massive multilingual acoustic models designed by Facebook AI and pretrained with self-supervision. Unlike the earlier approach of training speech recognition models on audio paired with transcripts, wav2vec2 models learn patterns (at a low level, similar to phonemes) from speech audio alone, without transcripts. Because untranscribed speech data is abundant, the models can be trained on much larger collections. wav2vec2 XLSR was trained on tens of thousands of hours of speech audio covering 53 different languages, so that it can also take advantage of similarities in pronunciation across languages.

See the Facebook blog for more information.

To adapt the models to recognize Welsh speech, they need to be fine-tuned with conventional training data, that is, audio paired with Welsh transcriptions. Fortunately, this data is available from the Mozilla Common Voice project, and with the help of the KenLM language model the recognition results are exceptional, with a word error rate of 15%.
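Word error rate is the number of word-level substitutions, deletions and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. A minimal sketch of that calculation follows; the function name and the example sentences are illustrative assumptions and are not taken from the project's own evaluation scripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: one substitution and one deletion in five words.
print(word_error_rate("play some welsh music please", "play some welsh musical"))  # 0.4
```

On this measure, a word error rate of 15% means roughly 15 word errors for every 100 words in the reference transcripts.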

The following example shows how well the models perform within our Welsh Online Transcriber open source package and service. The suggestion (‘Awgrym’) from the speech recognizer is almost correct when compared with the corrected text (‘Cywiriad’):

wav2vec2 bilingual (Welsh/English)

We have also developed a bilingual (Welsh/English) speech recognition provision. It also uses Facebook AI models as a base, fine-tuned on the Mozilla Common Voice Welsh and English datasets. Visit Hugging Face for full details or to try the demo.

Mozilla Welsh DeepSpeech

DeepSpeech is a speech recognition offering from Mozilla, makers of the Firefox browser. Although it is currently less accurate than wav2vec2 (see above), DeepSpeech models are much smaller and can recognize speech in real time as you speak. DeepSpeech is therefore suitable for performing speech recognition on home computers and on devices such as mobile phones.

See the DeepSpeech project for more information.
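Recognizing speech live generally means passing audio to the recognizer in small chunks as it arrives, rather than waiting for a complete recording. The sketch below shows only that chunking step. The function name, the 20 ms chunk length and the 16 kHz, 16-bit mono format are assumptions made for illustration, and the call that would hand each chunk to a streaming recognizer is left out.

```python
from typing import Iterator


def pcm_chunks(
    pcm: bytes,
    sample_rate: int = 16000,  # assumed: 16 kHz audio
    chunk_ms: int = 20,        # assumed: 20 ms per chunk
    sample_width: int = 2,     # assumed: 16-bit mono samples
) -> Iterator[bytes]:
    """Yield consecutive fixed-duration chunks of raw PCM audio."""
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * sample_width
    for start in range(0, len(pcm), bytes_per_chunk):
        # The final chunk may be shorter than the others.
        yield pcm[start:start + bytes_per_chunk]


# One second of silence stands in for microphone input.
one_second = bytes(16000 * 2)
for chunk in pcm_chunks(one_second):
    pass  # a streaming recognizer would receive each chunk here
```

Shorter chunks give more frequent partial results at the cost of more calls into the recognizer.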

The Welsh National Language Technologies Portal provides ready-made Welsh models, along with scripts for training them from the Mozilla Common Voice Welsh data. Visit the ‘Releases’ pages on GitHub to read more and to download the models themselves:

Other Speech Recognition Kits

The following packages were used in the past to implement Welsh speech recognition:

Kaldi Cymraeg

Kaldi-ASR has grown in popularity in recent years as an open source speech recognition toolkit. It offers technical improvements and more flexible licensing and commercialization terms than other kits, and it supports training acoustic models with neural networks. As a result, its use by researchers, developers and companies has increased greatly.

The following resource provides an environment for training Welsh language models and acoustic models with Kaldi:

Kaldi Cymraeg is used within our ‘Macsen’ Welsh language voice assistant project:

HTK Cymraeg

HTK (the Hidden Markov Model Toolkit) from Cambridge University has been a foundation for speech recognition research since the 1990s. It has been applied successfully to implement Welsh language speech recognition with the following resources:

Julius Cymraeg

Julius is purely a decoder for large vocabulary continuous speech recognition (LVCSR). It is used to put the HTK acoustic models to use.

Other Resources

Gwaith Adnabod Lleferydd Uwch (GALLU)
Speech Recognition Blog Entries

Welsh National Language Technologies Portal