Many Welsh language (and bilingual) corpora, such as the CEG corpus and the Proceedings of the National Assembly for Wales, are available through the Welsh National Corpus Portal. The portal lets users search for examples of how words and terms are used in many contexts.
Below are links for downloading data from some of the corpora on the Welsh National Corpus Portal, as well as links to data that we would not otherwise be able to provide.
Welsh Language Social Web Corpora
The following two data sets comprise Welsh language user-generated texts that we continuously collect from Twitter and Facebook.
The following corpora consist of audio files used to aid the development of various speech technologies:
Corpus of CC0 Sentences
This corpus of Welsh language sentences was collected by members of the Language Technologies Unit, Bangor University, expressly to serve as prompts for developing Welsh speech recognition. The sentences come from various CC0 sources, and the corpus is released under a CC0 licence.
Corpus of POS Tagged CC0 Sentences
This is a corpus of Welsh language sentences annotated with part-of-speech tags and released under CC0 to enable the training of statistical part-of-speech taggers for Welsh.
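As a rough illustration of how a tagged corpus of this kind can be used, the sketch below trains a most-frequent-tag baseline, which is a common starting point before moving to a proper statistical tagger. The file format is an assumption, not something stated here: it supposes one sentence per line with tokens written as `word/TAG`. The tag names and the two sample sentences are toy data made up for the example, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def parse_line(line):
    """Split one 'word/TAG word/TAG ...' line into (word, tag) pairs.

    The 'word/TAG' layout is an assumed format for illustration only.
    """
    pairs = []
    for token in line.split():
        # rpartition splits on the last '/', so words containing '/' survive.
        word, _, tag = token.rpartition("/")
        if word and tag:
            pairs.append((word, tag))
    return pairs


def train_baseline(lines):
    """Map each lower-cased word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for line in lines:
        for word, tag in parse_line(line):
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_sentence(model, words, default="UNK"):
    """Tag a list of words, falling back to `default` for unseen words."""
    return [(w, model.get(w.lower(), default)) for w in words]


# Toy training data with made-up tags, standing in for the real corpus.
SAMPLE = [
    "mae/VERB y/DET gath/NOUN yn/PART cysgu/VERB",
    "mae/VERB y/DET ci/NOUN yn/PART rhedeg/VERB",
]

model = train_baseline(SAMPLE)
print(tag_sentence(model, ["Mae", "y", "ci", "yma"]))
```

With real data, the `SAMPLE` list would be replaced by the lines of the downloaded corpus file, after adjusting `parse_line` to whatever format the corpus actually uses.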