Speech recognition technology allows a computer system to recognize the words a person speaks and convert the sound into text. This does not mean that the speech recognition system will necessarily be able to identify the meaning of every word.
The following speech recognition resources are now available through the Language Technologies Portal:
Welsh wav2vec2 (with KenLM)
This is the Welsh speech recognition resource that currently gives the best recognition results.
The foundation of this speech recognition is a family of massive multilingual acoustic models designed by Facebook AI and trained by self-supervision. Unlike the previous procedure of training speech recognition models with audio and transcripts, wav2vec2 models learn the patterns of speech (at a low level, similar to phonemes) from audio alone, without transcripts. As there is an abundance of untranscribed speech data available, it is possible to train on much larger collections. wav2vec2 XLSR was trained on tens of thousands of hours of speech audio covering 53 different languages, in order to also take advantage of cross-lingual similarities in pronunciation.
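When a wav2vec2 model is fine-tuned for recognition, it typically emits a character prediction for every short frame of audio, and these are collapsed into text with CTC decoding. The following is a minimal sketch of greedy CTC decoding only (the frame labels and blank symbol below are illustrative, not taken from any particular Welsh model):

```python
BLANK = "_"  # the CTC blank symbol (an assumed label for illustration)

def ctc_greedy_decode(frames):
    """Collapse repeated frame labels, then drop CTC blanks."""
    collapsed = []
    previous = None
    for label in frames:
        if label != previous:
            collapsed.append(label)
        previous = label
    return "".join(l for l in collapsed if l != BLANK)

# Example: per-frame best labels for the Welsh word "bore" (morning)
frames = ["b", "b", BLANK, "o", "o", BLANK, "r", BLANK, "e", "e"]
print(ctc_greedy_decode(frames))  # → bore
```

In practice a language model such as KenLM is combined with these frame probabilities during beam-search decoding, rather than taking the single best label per frame as above.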
See https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/ for more information.
In order to adapt the models for recognizing Welsh speech, they need to be fine-tuned with audio data and corresponding Welsh transcriptions. Fortunately this data is available from the Mozilla Common Voice project, and with the help of a KenLM language model, the recognition results are exceptional, with a word error rate of 15%.
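Word error rate (WER), the figure quoted above, is the word-level edit distance between the recognizer's output and a reference transcript, divided by the number of reference words. A minimal sketch of the standard calculation (the example sentences are illustrative only):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a four-word reference gives a WER of 25%.
print(wer("mae hi yn braf", "mae hi yn bras"))  # → 0.25
```

A WER of 15% therefore means roughly one word in seven is inserted, deleted or substituted relative to the reference transcript.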
Here is an example of the effectiveness of the models within our Welsh Online Transcriber open source package and service. Compare the suggestion (‘Awgrym’) from the speech recognizer, which is almost correct, with the corrected text (‘Cywiriad’):
Mozilla Welsh DeepSpeech
DeepSpeech is a speech recognition offering from Mozilla, makers of the Firefox browser. Although its accuracy is not, at the moment, as good as wav2vec2’s (see above), DeepSpeech models are much smaller and can recognize speech live as you speak. DeepSpeech is therefore suitable for carrying out speech recognition on home computers and on devices such as mobile phones.
Please visit https://github.com/mozilla/deepspeech to learn more about DeepSpeech.
The Welsh National Language Technologies Portal provides ready-made Welsh models, and scripts for training them from the Mozilla Common Voice Welsh data. Visit the ‘Releases’ page of the GitHub repository to read more and to download the models themselves:
We use the Welsh DeepSpeech models within our Macsen app for recognizing questions or simple commands.
Speech Recognition Kits
The following packages were used in the past to implement Welsh speech recognition:
Kaldi Cymraeg
Kaldi-ASR (http://kaldi-asr.org) has grown in popularity in recent years as an open source speech recognition toolkit. It offers technical improvements and more flexible licensing for commercialization than earlier toolkits, and it supports training acoustic models with neural networks. As a result, its use by researchers, developers and companies has increased greatly.
Here is the training environment resource for Welsh language and acoustic models with Kaldi:
Kaldi Cymraeg is used within our ‘Macsen’ Welsh language voice assistant project:
HTK Cymraeg
The HTK (Hidden Markov Model Toolkit) from Cambridge University has been a foundation for speech recognition research since the 1990s. It has been successfully applied to Welsh language speech recognition with the following resources:
Julius Cymraeg
Julius is a decoding engine for large vocabulary continuous speech recognition (LVCSR) only. It is used to put the HTK acoustic models to use.
Other Resources
Further Speech Recognition Development
Gwaith Adnabod Lleferydd Uwch (GALLU) — ‘Advanced Speech Recognition Work’