Data
Welsh place-names
This is a list of place-names in Wales as used in Welsh, together with a list that also includes longitude and latitude information as well as the corresponding English names.
Corpora
Many Welsh-language and bilingual corpora, such as the CEG corpus and the Proceedings of the National Assembly for Wales, are available through the Welsh National Corpus Portal. The portal lets users search for examples of words and terms in use across many contexts.
Below, we have provided links for downloading corpus data from some of the corpora from the Welsh National Corpus Portal as well as links to data that we would not be able to provide otherwise.
Here are corpora in the form of audio files used to aid development of various speech technologies:
A corpus of Welsh language sentences tagged with part-of-speech tags and released under CC0 to enable the training of statistical part-of-speech taggers for Welsh.
Corpus of CC0 Sentences
This is a collection of 14,857 sentences released under a CC0 licence. They were collected by members of the Language Technologies Unit, Bangor University, expressly to serve as prompts for Welsh Speech Recognition. The sentences come from various CC0 sources and include:
- Original sentences
- Sentences from novels, essays and other out-of-copyright material
- Sentences from the Welsh Wicipedia where authors gave us permission to release them under a CC0 licence
- Tweets, emails, and other electronic material gifted to the project to be used as prompts
In a number of cases, the language was adapted and the sentences heavily edited to make them suitable for reading aloud by volunteers.
The corpus was also given to the Mozilla Common Voice project, and these sentences were therefore used to record volunteers.
We wish to thank everyone who helped us collect these sentences, including those who gave us their materials under a CC0 licence, and Mozilla for their help and leadership with the Common Voice project.
Download the ‘Brawddegau Cymraeg’ resource from GitHub.
Lexicons
Welsh Pronunciation Dictionary
The Welsh Pronunciation Dictionary is a dictionary suitable for use with speech technology. It is hosted by Bangor University’s School of Linguistics and Bangor University’s Language Technologies Unit.
Hunspell
Hunspell is an open source spell checker used in a number of software applications.
In October 2020, we revised and significantly updated our Welsh Hunspell dictionary, and we continue to update it regularly. The latest update includes new prime forms (‘actiwari’, ‘biodreulio’ and ‘seiberfwlio’) and 98 additional international place names (including ‘Irac’).
The Hunspell files can be found on our GitHub page. To download them, click on the green ‘Code’ button, then select ‘Download ZIP’.
We offer a separate version of our Hunspell data, specifically designed for Y Trawsgrifiwr, contributing to Bangor University Language Technologies Unit’s transcription bank. This version includes spoken forms that align with the project’s verbatim transcription conventions (see Bangor Transcription Conventions for details). Examples of included spoken forms are ch’mod, rwbath, sicir, and gweud. These Hunspell files are available on our GitHub page.
To install Hunspell within LibreOffice, follow the instructions in the README.
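For readers who want to inspect the dictionary outside an application, a Hunspell `.dic` file is plain text: the first line holds an approximate entry count, and each following line holds a word, optionally followed by `/` and affix flags. The Python sketch below parses content of that shape. The sample entries reuse words mentioned above, but the flag letter `A` is invented for illustration and is not taken from the Welsh files. The parser also assumes there are no escaped slashes or morphological fields.

```python
def parse_dic(text):
    """Return (word, flags) pairs from the text of a Hunspell .dic file.

    Simplifying assumptions: no escaped slashes ("\\/") inside words and
    no morphological fields after the flags.
    """
    entries = []
    # The first line is the approximate entry count, so skip it.
    for line in text.splitlines()[1:]:
        line = line.strip()
        if not line:
            continue
        # Everything before the first "/" is the word, the rest is flags.
        word, _, flags = line.partition("/")
        entries.append((word, flags))
    return entries


# Made-up sample; the flag "A" is hypothetical.
SAMPLE = """3
actiwari/A
biodreulio
Irac
"""

for word, flags in parse_dic(SAMPLE):
    print(word, flags or "(no flags)")
```

To run this against the downloaded files, read the `.dic` file with `open(path, encoding="utf-8").read()` and pass the result to `parse_dic`; the path depends on where the ZIP was extracted.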
Wordlists of the most common wordforms in Welsh, and the most common English words used in Welsh
These wordlists are intended to help improve Welsh speech technology by identifying the most common words likely to be uttered for processing in a Welsh language transcription system.
Our wordlists of the most frequently written Welsh and English words are available on our GitHub site.
A requirement of the Welsh Government-funded “Macsen” Speech Recognition project was that we publish wordlists of the 2,500 most common words written in Welsh and the 500 most common English words used in Welsh. Once our prototype transcription system is ready, these wordlists will be used to test it and to measure its ability to recognise these common words.
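As a rough illustration of how a wordlist could be used to check a transcription system, the Python sketch below computes the share of common-word occurrences in a reference transcript that also turn up in the system's output. This is not the project's evaluation method: the function name, the three-word list and the sentences are all invented for the example, and the comparison ignores word order instead of aligning the two transcripts.

```python
from collections import Counter


def common_word_recall(reference, hypothesis, wordlist):
    """Share of common-word occurrences in `reference` found in `hypothesis`.

    Bag-of-words comparison only: word order and alignment are ignored.
    Returns None when the reference contains no words from the list.
    """
    targets = set(wordlist)
    ref = Counter(w for w in reference.split() if w in targets)
    hyp = Counter(w for w in hypothesis.split() if w in targets)
    total = sum(ref.values())
    if total == 0:
        return None
    # Each reference occurrence counts as matched at most once.
    matched = sum(min(n, hyp[w]) for w, n in ref.items())
    return matched / total


# Hypothetical stand-in for the published list of common wordforms.
common = ["mae", "hi", "yn"]
reference = "mae hi yn bwrw glaw"
hypothesis = "mae hi bwrw glaw"  # the system dropped "yn"
print(common_word_recall(reference, hypothesis, common))
```

A fuller evaluation would normally align the reference and hypothesis word by word, as word error rate calculations do, before scoring individual words.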
Other projects, such as the one that developed the Bangor Siarad Corpus (http://bangortalk.org.uk/speakers.php?c=siarad), have followed lexical principles in assigning words to either Welsh or English categories, i.e. their evaluation of what is a ‘Welsh’ or an ‘English’ word has been based on whether the word is attested in Welsh or English dictionaries. We have added other principles for our wordlists, based on whether a word's pronunciation follows Welsh or English letter-to-sound rules, as we are specifically interested in improving the recognition and transcription of spoken Welsh.
Wordlists are often thought of as lists of canonical forms, or the headwords found in dictionary entries. However, because Welsh has many inflected forms, in contrast to English, which has far fewer, we decided to publish lists of wordforms and their frequency counts rather than canonical forms or lemmas. This helps identify the most common forms, as some wordforms are much more common than their lemma, e.g. ‘mae’ (3rd person singular of the verb ‘to be’) is much more common than its lemma ‘bod’. Sometimes more than one wordform deriving from the same lemma appears among the 2,500 most common wordforms, e.g. both the lemma ‘Mehefin’ (‘June’) and its soft-mutated wordform ‘Fehefin’ are in the Welsh list.
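Counting wordforms rather than lemmas can be sketched in a few lines of Python. The two sentences below are invented for illustration, and the whitespace tokenisation is a simplifying assumption, not the procedure used to build the published lists. Each surface form is counted as it appears, so ‘Mehefin’ and ‘Fehefin’ get separate entries.

```python
from collections import Counter


def wordform_counts(sentences):
    """Count surface wordforms, keeping mutated forms and case distinct."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence.split():
            # Trim surrounding punctuation but keep internal apostrophes.
            token = token.strip(".,;:!?\"()")
            if token:
                counts[token] += 1
    return counts


# Made-up example sentences.
sentences = [
    "Mae hi yma ym mis Mehefin.",
    "Mae hi'n dod ar y cyntaf o Fehefin.",
]
counts = wordform_counts(sentences)
for wordform, n in counts.most_common(3):
    print(wordform, n)
```

Because case is preserved here, a sentence-initial ‘Mae’ and a mid-sentence ‘mae’ would be counted separately; whether to fold case is a design decision this sketch leaves open.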
Welsh makes much use of apostrophes to merge two words and indicate missing vowels. Wordforms that include apostrophes are regarded as legitimate wordforms in the Welsh wordlist, since doing so will help Welsh speech recognition and the creation of a Welsh transcription system. As illustrations, the wordform ‘hi’n’ occurs 1,117719 times in the corpus, and the wordform ‘a’i’ occurs 1,05456 times. To avoid ambiguity, apostrophes are shown as underscores in the wordlist, so ‘hi_n’ and ‘a_i’ are the forms shown in the Welsh wordlist.
To gather objective evidence for the word frequency counts, we needed a Welsh corpus or corpora of sufficient size, ideally a balanced corpus. The largest Welsh corpus available to us was the Cysill Ar-lein Corpus (Prys, Prys and Jones 2016), which has now reached over 200 million words and continues to grow. It was created from texts that users submitted for online spelling and grammar checking between 2009 and the present. It was therefore not intended as a balanced corpus, but it contains a great variety of text types, as Wooldridge’s (2011) analysis of it demonstrates.
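Anyone processing the wordlist will probably need to convert between the underscore notation and ordinary apostrophes. The Python sketch below shows one way to do so; the function names are invented for this example, and it assumes that an underscore in a wordlist entry always stands for an apostrophe.

```python
def to_wordlist_form(wordform):
    """Replace straight and curly apostrophes with underscores."""
    return wordform.replace("\u2019", "_").replace("'", "_")


def from_wordlist_form(entry):
    """Restore apostrophes, assuming '_' is only used for that purpose."""
    return entry.replace("_", "'")


print(to_wordlist_form("hi'n"))   # straight apostrophe
print(to_wordlist_form("a\u2019i"))  # curly apostrophe
print(from_wordlist_form("hi_n"))
```

Handling both the straight and the curly apostrophe matters in practice, because text typed by users tends to contain a mixture of the two.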
Paper
Wooldridge, D. (2011) Gwella Cysill at Ddefnydd Cyfieithwyr: adnabod ymyrraeth gan yr iaith Saesneg mewn testunau Cymraeg. Traethawd MRes, Prifysgol Bangor. Dissertation
Ellis, N. C., O’Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001) Cronfa Electroneg o Gymraeg (CEG) Website
Knight, D. et al (2019) Corpws Cenedlaethol Cymraeg Cyfoes (CorCenCC) Website
Deuchar, M. et al (2009) Corpws Siarad Bangor Website
Chan, D., and Jones, D.B. (2013) Hunspell Cymraeg Website
Delyth Prys
Dewi Bryn Jones
September 2019