Corpora

Many Welsh language (and bilingual) corpora, such as the CEG corpus and the Proceedings of the National Assembly for Wales, are available through the Welsh National Corpus Portal. The portal lets users search for examples of how words and terms are used in many contexts.

Below are download links for data from some of the corpora in the Welsh National Corpus Portal, as well as links to data that we would not otherwise be able to provide.

 

Welsh Language Social Web Corpora

The following two data sets consist of Welsh language user-generated texts that we continuously collect from Twitter and Facebook.

 

Speech Corpora

The following corpora consist of audio files used to aid the development of various speech technologies:

 

Corpus of CC0 Sentences

This is a corpus of Welsh language sentences, released under a CC0 licence, that members of the Language Technologies Unit, Bangor University, collected expressly to serve as prompts for developing Welsh speech recognition. The sentences come from various CC0 sources.

Corpus of POS Tagged CC0 Sentences

A corpus of Welsh language sentences annotated with part-of-speech tags and released under CC0 to enable the training of statistical part-of-speech taggers for Welsh.

Recording Script for Voice Talents