Paldaruo Speech Corpus

This is the corpus that was crowdsourced through the Ap Paldaruo app for mobile devices. A big thank you to all our contributors.

The Paldaruo Speech Corpus is a read speech corpus designed for developing automatic speech recognition for Welsh. Speech recognition needs a large amount of data, so the corpus was crowdsourced through Ap Paldaruo, an app designed to collect audio data from speakers. Crowdsourcing refers to obtaining data from a large number of people, usually over the internet. Using an app to crowdsource data from speakers of Welsh means that speaker variation can be maximised, which is important for accurate speech recognition.

PALDARUO SPEECH CORPUS VERSION 5
Version 5 is the current version of the corpus, published on 19th December 2018. The audio data is in WAV format with a 48 kHz sampling rate. Version 5 contains 40 hours of data across 14,215 files. 564 individual speakers have contributed to the published corpus. The data were collected over the period 2014-2018.
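As a rough sense of scale, the published totals imply an average recording length of about ten seconds and about 25 recordings per speaker. The sketch below only redoes that arithmetic; the function and key names are illustrative, the 40-hour figure is presumably rounded, and the real per-speaker numbers will vary.

```python
# Published totals for version 5, taken from the description above.
TOTAL_HOURS = 40
TOTAL_FILES = 14_215
TOTAL_SPEAKERS = 564


def corpus_averages(hours, files, speakers):
    """Return simple averages derived from the corpus totals."""
    total_seconds = hours * 3600
    return {
        "seconds_per_file": total_seconds / files,
        "files_per_speaker": files / speakers,
        "minutes_per_speaker": hours * 60 / speakers,
    }


if __name__ == "__main__":
    averages = corpus_averages(TOTAL_HOURS, TOTAL_FILES, TOTAL_SPEAKERS)
    for name, value in averages.items():
        print(f"{name}: {value:.1f}")
```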

AP PALDARUO
Ap Paldaruo is available on iOS and Android for smartphones and tablets. Within the app, each contributor creates a profile with background information that can be used to develop speech recognition, and by other researchers interested in investigating language variation in Wales. The metadata collected include: age, gender, childhood living location, current living location, and frequency of speaking Welsh. Contributors are also asked whether they have a learner or first language accent, and which region their accent comes from.
The app source code is available here: https://github.com/techiaith/Paldaruo

HOW TO DOWNLOAD THE CORPUS
The corpus can be downloaded from the git repository given below, either as a zip file or from the command line using git lfs (https://git-lfs.github.com/). If using git lfs, use the following command:
git lfs clone --branch v5.0 --depth 1 https://git.techiaith.bangor.ac.uk/Data-Porth-Technolegau-Iaith/Corpws-Paldaruo.git
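For scripted setups, the same download can be wrapped in a few lines of Python. This is a minimal sketch, assuming git and git lfs are installed and on the PATH; the function names and the optional `target_dir` argument are illustrative and not part of the corpus tooling. Note that the option flags are written with two ASCII hyphens (`--branch`, `--depth`).

```python
import subprocess

REPO_URL = (
    "https://git.techiaith.bangor.ac.uk/"
    "Data-Porth-Technolegau-Iaith/Corpws-Paldaruo.git"
)


def build_clone_command(branch="v5.0", depth=1, target_dir=None):
    """Build the argument list for a shallow git lfs clone of one branch/tag."""
    command = [
        "git", "lfs", "clone",
        "--branch", branch,
        "--depth", str(depth),
        REPO_URL,
    ]
    if target_dir is not None:
        # Optional local directory name, as with a plain git clone.
        command.append(target_dir)
    return command


def download_corpus(target_dir=None):
    """Run the clone; raises CalledProcessError if git exits with an error."""
    subprocess.run(build_clone_command(target_dir=target_dir), check=True)
```

Calling `download_corpus("Corpws-Paldaruo")` would start the download into a folder of that name; the full corpus is large, so expect this to take some time.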

DIRECTORY AND FILE STRUCTURE
We have used the corpus with the HTK and Kaldi speech recognition toolkits. The speech, speaker metadata and details of each recording are available when the corpus is downloaded.

The audio/wav directory contains the wav samples for each speaker. Each folder is one individual speaker. The metadata for each speaker is available in the metadata.csv file. Details of all samples in the corpus are available in the samples.txt file.
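The layout described above can be explored with the Python standard library. The sketch below assumes that speaker folders sit directly under audio/wav, that recordings use a lowercase `.wav` extension, and that they are plain PCM WAV files readable by the `wave` module; none of these details is confirmed here, and the function names are illustrative.

```python
import wave
from pathlib import Path


def count_recordings_per_speaker(corpus_root):
    """Map each speaker folder under audio/wav to its number of .wav files."""
    wav_root = Path(corpus_root) / "audio" / "wav"
    counts = {}
    for speaker_dir in sorted(wav_root.iterdir()):
        if speaker_dir.is_dir():
            counts[speaker_dir.name] = sum(1 for _ in speaker_dir.glob("*.wav"))
    return counts


def sample_rate_hz(wav_path):
    """Read the sampling rate from a WAV file header."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getframerate()
```

For a version 5 download, `len(count_recordings_per_speaker(root))` would be expected to match the 564 published speakers, and `sample_rate_hz` should report 48000 for each recording.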

CORPUS SPEAKER DISTRIBUTION
The corpus contains recordings from 564 speakers. Below is their distribution based on the main speaker characteristics:

AGE
Age band   Frequency   Percent
18-30            159      28.2
31-40            173      30.7
41-50            103      18.3
51-60             73      12.9
61-70             41       7.3
71-80             12       2.1
80+                1       0.2
Total            562      99.6

GENDER
Gender   Frequency   Percent
Female         286      50.7
Male           278      49.3
Total          564     100.0

ACCENT LOCATION
Region             Frequency   Percent
Mid-Wales                 81      14.4
South East Wales          82      14.5
South West Wales         108      19.1
North East Wales          53       9.4
North West Wales         240      42.6
Total                    564     100.0

ACCENT TYPE
Accent type      Frequency   Percent
Learner                 79      14.0
First Language         485      86.0
Total                  564     100.0
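
The percentages in these tables are frequencies divided by the 564 speakers in the corpus. The sketch below recomputes the age column as a check; the names are illustrative. The age frequencies sum to 562 rather than 564, which is why that table totals 99.6 percent. A plausible reading is that two speakers have no recorded age band, but that is an inference from the totals, not something the corpus description states.

```python
TOTAL_SPEAKERS = 564

# Frequencies copied from the AGE table.
AGE_COUNTS = {
    "18-30": 159,
    "31-40": 173,
    "41-50": 103,
    "51-60": 73,
    "61-70": 41,
    "71-80": 12,
    "80+": 1,
}


def percentages(counts, total):
    """Express each frequency as a percentage of total, to one decimal place."""
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


if __name__ == "__main__":
    for label, percent in percentages(AGE_COUNTS, TOTAL_SPEAKERS).items():
        print(f"{label:<6}{percent:>6}")
    known = sum(AGE_COUNTS.values())
    print(f"Total {known} of {TOTAL_SPEAKERS} speakers")
```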

CORPUS DETAILS
The prompts read out by contributors contain isolated words and sentences. Isolated words can be found in samples 1-85. The remaining samples are sentences and questions. Examples of the two types can be found below:

Isolated words:
*/sample1 LLEUAD MELYN AELODAU SIARAD FFORDD YMLAEN CEFNOGAETH HELEN
*/sample2 GWRAIG OREN DIWRNOD GWAITH MEWN EISTEDDFOD DISGOWNT IDDO
*/sample3 OHERWYDD ELLIW AWDURDOD BLYNYDDOEDD GWLAD TYWYSOG LLYW UWCH
*/sample4 RHYBUDDIO ELEN UWCHRADDIO HWNNW BEIC CYMRU RHOI AELOD
*/sample5 RHAI STEROID CEFNOGAETH FELEN CAU GAREJ ANGAU YMHLITH

Sentences:
*/c9d8244ce45dfc242c50bf6a5032cdf0 BETH FYDD TYWYDD YFORY
*/adcb079e2a52e1d0b6477ff9e22f2613 FAINT O’R GLOCH YDY HI
*/28c511ad08560ccd329f85476155fff8 FAINT O’R GLOCH YW HI
*/01ab72b92f6829846eb58c2bbb538bca DYDW I DDIM YN BWRIADU BOD YNG NGHAERDYDD DROS Y GWYLIAU
*/0a9463ca8f7e5414f674a35e9a50636a MAE ANGEN I TI OFYN AM BETH HOFFET TI GAEL YN Y BWYTY
*/c59edf7c3bcd0f26134f56c19af0cc30 OEDD RHAID I TI DDWEUD NAD OEDDET TI’N GWYBOD UNRHYW BETH
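
Lines in the form shown above can be split into a sample identifier and its words with a few lines of Python. This sketch assumes each line is `*/<id>` followed by a space and the upper-case prompt text, as in the examples, and that the isolated-word prompts are the ones named `sample1` to `sample85`; the exact layout of samples.txt is not confirmed here, and the function names are illustrative.

```python
def parse_prompt_line(line):
    """Split one '*/<id> <TEXT>' line into (sample_id, list_of_words)."""
    label, _, text = line.strip().partition(" ")
    # Drop the leading '*/' wildcard prefix if it is present.
    sample_id = label[2:] if label.startswith("*/") else label
    return sample_id, text.split()


def is_isolated_word_prompt(sample_id):
    """True for ids of the form 'sample<N>' with N between 1 and 85."""
    if not sample_id.startswith("sample"):
        return False
    number = sample_id[len("sample"):]
    return number.isdigit() and 1 <= int(number) <= 85


if __name__ == "__main__":
    sample_id, words = parse_prompt_line("*/sample2 GWRAIG OREN DIWRNOD GWAITH")
    print(sample_id, len(words), is_isolated_word_prompt(sample_id))
```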

LICENCE
Attribution: Cooper, S., Chan, D., Jones, D. B. (2017) The Paldaruo Speech Corpus, version 4 [http://techiaith.cymru/corpora/paldaruo/]

More information about the Paldaruo crowdsourcing app can be found on the Ap Paldaruo webpage.

In addition, the source code for the iOS app can be found on GitHub: https://github.com/techiaith/Paldaruo

ACKNOWLEDGEMENTS

Any articles or software based on the use of this corpus should cite:

Cooper, S., Chan, D., Jones, D. B. (2018) The Paldaruo Speech Corpus, version 5 [http://techiaith.cymru/corpora/paldaruo/]