Local Speech Recognition API Server
This resource, available from https://github.com/techiaith/welsh-whisperx, enables you to run Welsh speech recognition locally on your own hardware — without relying on external cloud services. It can power real-time voice assistants, transcribe meetings and broadcasts, translate Welsh speech into English, and automatically generate subtitles.
The system is built to scale with demand. It uses a task queue architecture (Celery with Redis) that allows you to add CPU or GPU workers as your needs grow — from one CPU handling a few requests to multiple GPUs processing many tasks at the same time. A two-level priority system ensures that time-sensitive requests such as voice input to apps are handled immediately, even when longer transcription tasks are running in the background.
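The two-queue idea can be sketched as a Celery routing table. This is a hypothetical configuration, not the repository's actual code: the task names and queue names below are assumptions chosen to mirror the endpoints described here.

```python
# Hypothetical Celery routing sketch: task and queue names are assumptions,
# not the repository's actual configuration.
task_routes = {
    # Real-time endpoints (/transcribe/, /translate/, /keyboard/) go to a
    # dedicated queue so they are never stuck behind long-form jobs.
    "tasks.transcribe": {"queue": "realtime"},
    "tasks.translate": {"queue": "realtime"},
    # Long-form work runs on its own queue, served by separate workers.
    "tasks.transcribe_long_form": {"queue": "batch"},
    "tasks.translate_long_form": {"queue": "batch"},
}

# Workers would then be started per queue, e.g.:
#   celery -A app worker -Q realtime   # low-latency worker(s)
#   celery -A app worker -Q batch      # background worker(s)
```

Keeping the queues separate means a long broadcast transcription can never block a voice-assistant request, because they are consumed by different workers.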
What the API server can do
- Transcription — convert Welsh speech to written text, with word-level timestamps and confidence scores
- Translation — translate Welsh speech directly into English text
- Speaker diarisation — identify different speakers’ voices in a recording and label who said what
- Align speech and text — align a known text to audio, producing precise word and character-level timestamps
- Subtitle generation — automatically produce SRT and WebVTT subtitle files from speech
- Welsh text normalisation — convert verbatim spoken Welsh (with contractions, dialectal forms and informal speech) into standard written Welsh
Built for two use cases
The API is designed around two distinct modes:
Real-time
The endpoints /transcribe/, /translate/ and /keyboard/ are optimised for voice assistants and interactive applications. Short audio clips are processed with minimal delay and the results are returned directly in the response. These requests are routed to a high-priority queue, so they are never delayed by background work.
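A minimal client for a real-time endpoint might look like the following, using only the Python standard library. The server address, the upload field name and the response shape are assumptions, not the documented API.

```python
import io
import json
import urllib.request
import uuid

BASE_URL = "http://localhost:8000"  # assumed address of the local API server

def encode_multipart(field, filename, data):
    """Minimal multipart/form-data encoder using only the standard library."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    body.write(f"--{boundary}\r\n".encode())
    body.write(
        (f'Content-Disposition: form-data; name="{field}"; '
         f'filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode())
    body.write(data)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return body.getvalue(), f"multipart/form-data; boundary={boundary}"

def transcribe(audio_path, base_url=BASE_URL):
    """POST a short clip to /transcribe/ and return the parsed JSON result."""
    with open(audio_path, "rb") as f:
        body, content_type = encode_multipart("file", audio_path, f.read())
    req = urllib.request.Request(
        f"{base_url}/transcribe/", data=body,
        headers={"Content-Type": content_type}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Because the real-time endpoints return the result directly in the response, no task ID or polling is needed here.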
Batch
The endpoints /transcribe_long_form/ and /translate_long_form/ handle longer recordings such as meetings, interviews or broadcasts. The API receives the audio, returns a task ID, and processes the recording in the background with the full pipeline, including speaker diarisation and audio-text alignment. The results can be retrieved as JSON, SRT, WebVTT, ELAN or plain text.
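The batch flow above amounts to a submit-and-poll loop on the client side. The helper below sketches the polling half; the state names ("SUCCESS", "FAILURE") and the status-dict shape are assumptions modelled on typical Celery-backed APIs, not the documented interface.

```python
import time

def wait_for_result(fetch_status, task_id, interval=5.0, timeout=600.0):
    """Poll a status callable until the background task finishes.

    fetch_status(task_id) is expected to return a dict like
    {"state": "PENDING" | "SUCCESS" | "FAILURE", "result": ...};
    the exact states are an assumption, not the documented API.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(task_id)
        if status.get("state") == "SUCCESS":
            return status["result"]
        if status.get("state") == "FAILURE":
            raise RuntimeError(f"task {task_id} failed")
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish in {timeout}s")
```

Passing the status fetcher in as a callable keeps the loop independent of any particular HTTP client, and makes it easy to test without a running server.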
Align speech and text
The endpoints /align/ and /align_long_form/ accept known Welsh text alongside an audio file and return detailed word- and character-level timestamps with confidence scores. This is useful for synchronising existing transcripts with audio, creating timed subtitles from scripts, or linguistic research where precise timing is required.
Output formats
For batch requests, the API generates and stores several output formats:
- JSON — full structured result with segments, word timestamps, confidence scores and normalised text
- SRT / WebVTT — subtitle files ready for use in media players and video platforms
- ELAN — annotation files for linguistic analysis tools
- Plain text — simple text transcript
- Speakers JSON — speaker diarisation result with speaker labels, text and timing
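As an illustration of how the stored JSON maps onto the SRT subtitle format, a minimal renderer might look like this. The segment schema (a list of dicts with start, end and text keys, times in seconds) is an assumption for the sketch, not the API's documented output.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 3.5 -> '00:00:03,500'."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render [{'start': s, 'end': e, 'text': t}, ...] as an SRT string."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(f"{i}\n{srt_timestamp(seg['start'])} --> "
                    f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(cues)
```

WebVTT differs mainly in the header line and in using a full stop rather than a comma before the milliseconds, so the same segment data serves both formats.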
Welsh text normalisation
Each transcription segment contains both the original verbatim text and a normalised version. The normaliser converts spoken Welsh — including contractions (bo’ fi → bod fi), dialectal forms and code-switching artefacts — into standard written Welsh. Translation output (English) is not normalised.
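For illustration, a single segment might carry the two fields side by side. The field names below are an assumption for the sketch, not the API's documented schema; only the bo’ fi → bod fi mapping comes from the description above.

```python
# Hypothetical transcription segment; field names are illustrative,
# not the API's documented schema.
segment = {
    "start": 12.4,
    "end": 14.1,
    "text": "dw i'n meddwl bo' fi'n hwyr",             # verbatim spoken Welsh
    "normalised_text": "dw i'n meddwl bod fi'n hwyr",  # standard written form
}
```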
Running locally and at scale
The entire system is packaged as Docker containers and can run on any machine with or without an NVIDIA GPU. A single 24GB GPU can run up to four workers simultaneously. To handle higher throughput, add more GPUs or machines — each additional worker increases capacity without any code changes. The queue prioritisation system automatically distributes work across the available workers.
Cloud resources for developers
The speech recognition API server and all the underlying models are also available through the unit’s API Centre, for easier integration into your systems and services: https://api.techiaith.cymru
