Local Speech Recognition API Server
This resource, available from https://github.com/techiaith/welsh-whisperx, enables you to run Welsh speech recognition locally on your own hardware — without relying on external cloud services. It can power real-time voice assistants, transcribe meetings and broadcasts, translate Welsh speech into English, and automatically generate subtitles.
The system is built to scale with demand. It uses a task queue architecture (Celery with Redis) that allows you to add CPU or GPU workers as your needs grow — from one CPU handling a few requests to multiple GPUs processing many tasks at the same time. A two-level priority system ensures that time-sensitive requests such as voice input to apps are handled immediately, even when longer transcription tasks are running in the background.
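The two-queue idea can be sketched as a Celery routing table. This is a hypothetical configuration, not the repository's actual code: the task names and queue names below are assumptions chosen to mirror the endpoints described here.

```python
# Hypothetical Celery routing sketch: task and queue names are assumptions,
# not the repository's actual configuration.
task_routes = {
    # Real-time endpoints (/transcribe/, /translate/, /keyboard/) go to a
    # dedicated queue so they are never stuck behind long-form jobs.
    "tasks.transcribe": {"queue": "realtime"},
    "tasks.translate": {"queue": "realtime"},
    # Long-form work runs on its own queue, served by separate workers.
    "tasks.transcribe_long_form": {"queue": "batch"},
    "tasks.translate_long_form": {"queue": "batch"},
}

# Workers would then be started per queue, e.g.:
#   celery -A app worker -Q realtime   # low-latency worker(s)
#   celery -A app worker -Q batch      # background worker(s)
```

Keeping the queues separate means a long broadcast transcription can never block a voice-assistant request, because they are consumed by different workers.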
What the API server can do
- Transcription — convert Welsh speech to written text, with word-level timestamps and confidence scores
- Translation — translate Welsh speech directly into English text
- Speaker diarisation — identify different speakers’ voices in a recording and label who said what
- Align speech and text — align a known text to audio, producing precise word and character-level timestamps
- Subtitle generation — automatically produce SRT and WebVTT subtitle files from speech
- Welsh text normalisation — convert verbatim spoken Welsh (with contractions, dialectal forms and informal speech) into standard written Welsh
Built for two use cases
The API is designed around two distinct modes:
Real-time
The endpoints /transcribe/, /translate/ and /keyboard/ are optimised for voice assistants and interactive applications. Short audio clips are processed with minimal delay and the results are returned directly in the response. These requests are routed to a high-priority queue, so they are never delayed by background work.
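A minimal client for a real-time endpoint might look like the following, using only the Python standard library. The server address, the upload field name and the response shape are assumptions, not the documented API.

```python
import io
import json
import urllib.request
import uuid

BASE_URL = "http://localhost:8000"  # assumed address of the local API server

def encode_multipart(field, filename, data):
    """Minimal multipart/form-data encoder using only the standard library."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    body.write(f"--{boundary}\r\n".encode())
    body.write(
        (f'Content-Disposition: form-data; name="{field}"; '
         f'filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode())
    body.write(data)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return body.getvalue(), f"multipart/form-data; boundary={boundary}"

def transcribe(audio_path, base_url=BASE_URL):
    """POST a short clip to /transcribe/ and return the parsed JSON result."""
    with open(audio_path, "rb") as f:
        body, content_type = encode_multipart("file", audio_path, f.read())
    req = urllib.request.Request(
        f"{base_url}/transcribe/", data=body,
        headers={"Content-Type": content_type}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Because the real-time endpoints return the result directly in the response, no task ID or polling is needed here.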
Batch
The endpoints /transcribe_long_form/ and /translate_long_form/ handle longer recordings such as meetings, interviews or broadcasts. The API receives the audio, returns a task ID, and processes the recording in the background with the full pipeline, including speaker diarisation and audio-text alignment. The results can be retrieved as JSON, SRT, WebVTT, ELAN or plain text.
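The batch flow above amounts to a submit-and-poll loop on the client side. The helper below sketches the polling half; the state names ("SUCCESS", "FAILURE") and the status-dict shape are assumptions modelled on typical Celery-backed APIs, not the documented interface.

```python
import time

def wait_for_result(fetch_status, task_id, interval=5.0, timeout=600.0):
    """Poll a status callable until the background task finishes.

    fetch_status(task_id) is expected to return a dict like
    {"state": "PENDING" | "SUCCESS" | "FAILURE", "result": ...};
    the exact states are an assumption, not the documented API.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(task_id)
        if status.get("state") == "SUCCESS":
            return status["result"]
        if status.get("state") == "FAILURE":
            raise RuntimeError(f"task {task_id} failed")
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish in {timeout}s")
```

Passing the status fetcher in as a callable keeps the loop independent of any particular HTTP client, and makes it easy to test without a running server.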
Align speech and text
The endpoints /align/ and /align_long_form/ accept known Welsh text alongside an audio file and return detailed word- and character-level timestamps with confidence scores. This is useful for synchronising existing transcripts with audio, creating timed subtitles from scripts, or linguistic research where precise timing is required.
Output formats
For batch requests, the API generates and stores several output formats:
- JSON — full structured result with segments, word timestamps, confidence scores and normalised text
- SRT / WebVTT — subtitle files ready for use in media players and video platforms
- ELAN — annotation files for linguistic analysis tools
- Plain text — simple text transcript
- Speakers JSON — speaker diarisation result with speaker labels, text and timing
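As an illustration of how the stored JSON maps onto the SRT subtitle format, a minimal renderer might look like this. The segment schema (a list of dicts with start, end and text keys, times in seconds) is an assumption for the sketch, not the API's documented output.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 3.5 -> '00:00:03,500'."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render [{'start': s, 'end': e, 'text': t}, ...] as an SRT string."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(f"{i}\n{srt_timestamp(seg['start'])} --> "
                    f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(cues)
```

WebVTT differs mainly in the header line and in using a full stop rather than a comma before the milliseconds, so the same segment data serves both formats.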
Welsh text normalisation
Each transcription segment contains both the original verbatim text and a normalised version. The normaliser converts spoken Welsh — including contractions (bo’ fi → bod fi), dialectal forms and code-switching artefacts — into standard written Welsh. Translation output (English) is not normalised.
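For illustration, a single segment might carry the two fields side by side. The field names below are an assumption for the sketch, not the API's documented schema; only the bo’ fi → bod fi mapping comes from the description above.

```python
# Hypothetical transcription segment; field names are illustrative,
# not the API's documented schema.
segment = {
    "start": 12.4,
    "end": 14.1,
    "text": "dw i'n meddwl bo' fi'n hwyr",             # verbatim spoken Welsh
    "normalised_text": "dw i'n meddwl bod fi'n hwyr",  # standard written form
}
```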
Running locally and at scale
The entire system is packaged as Docker containers and can run on any machine with or without an NVIDIA GPU. A single 24GB GPU can run up to four workers simultaneously. To handle higher throughput, add more GPUs or machines — each additional worker increases capacity without any code changes. The queue prioritisation system automatically distributes work across the available workers.
Cloud resources for developers
The speech recognition API server and all the underlying models are also available through the unit’s API Centre, for easier integration into your systems and services: https://api.techiaith.cymru
