Corpora

Many Welsh-language (and bilingual) corpora, such as the CEG corpus and the Proceedings of the National Assembly for Wales, are available through the Welsh National Corpus Portal. The portal lets users search for examples of how words and terms are used in many contexts.

Below are download links for data from some of the corpora on the Welsh National Corpus Portal, as well as links to data that we would not otherwise be able to provide.

 

Welsh Language Social Web Corpora

The following two data sets comprise Welsh-language user-generated texts that we collect continuously from Twitter and Facebook.

 

Speech Corpora

The following corpora consist of audio files used to aid the development of various speech technologies: