Category Archives: Corpora

Mozilla CommonVoice, Paldaruo and Welsh Speech Recognition

Mozilla, the international company from California responsible for the Firefox web browser, has just launched its multilingual CommonVoice project. After starting with English last year, three new languages are now being added: Welsh, German, and French. Welsh made it to the front of the queue thanks to support from the Language Technologies Unit at Canolfan Bedwyr, Bangor University.

More Voices for Common Voice
https://blog.mozilla.org/press-uk/2018/06/07/more-common-voices/#Cymraeg

We are extremely proud of CommonVoice Cymraeg and very keen for hundreds and thousands of Welsh speakers to contribute their voices through the website or the app.

But what about Paldaruo? It is our crowdsourcing app, which since 2014 has collected up to 38 hours of speech data from over 500 individuals, and which has helped bring about open-source Welsh digital personal assistant software such as Macsen. The Unit has used the Paldaruo work to help Mozilla provide CommonVoice for Welsh and for other less-resourced languages.

One of the challenges is finding and providing texts that are easy to read but that contain a wide and balanced range of the language's phonemes. For the launch, there are 1,200 prompts from the Unit within CommonVoice Cymraeg, but more will be needed. As we, and the Welsh-speaking community, contribute more texts and recordings to CommonVoice Cymraeg, we anticipate that the corpus will be a significant boost to the Welsh speech recognition research and development activities of the Unit and others.
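As a rough illustration of the prompt-selection challenge, here is a minimal sketch of a greedy coverage heuristic. It is not the Unit's actual method. The function names are hypothetical, and the set of letters in a sentence is used as a crude stand-in for its phonemes; a real system would use a pronunciation lexicon or grapheme-to-phoneme rules. The sketch only addresses breadth of coverage, not balance of frequencies.

```python
def units(sentence: str) -> set[str]:
    """Crude stand-in for phonemes (an assumption): the letters in the sentence."""
    return {ch for ch in sentence.lower() if ch.isalpha()}


def select_prompts(candidates: list[str], limit: int) -> list[str]:
    """Greedily pick up to `limit` sentences, each adding the most unseen units."""
    covered: set[str] = set()
    chosen: list[str] = []
    remaining = list(candidates)
    while remaining and len(chosen) < limit:
        # Pick the sentence that contributes the most units not yet covered.
        best = max(remaining, key=lambda s: len(units(s) - covered))
        if not units(best) - covered:
            break  # no remaining sentence adds anything new
        chosen.append(best)
        covered |= units(best)
        remaining.remove(best)
    return chosen
```

Calling `select_prompts(sentences, 1200)` on a large pool of candidate sentences would return a subset that favours sentences introducing sounds not yet covered, stopping early once nothing new can be added.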

The hope is that the partnership between Mozilla and Bangor University will grow, and that this activity will also spur other large companies to include Welsh and other less-resourced languages in their international plans.

The website address is https://voice.mozilla.org/cy and the app is available from https://itunes.apple.com/us/app/project-common-voice-by-mozilla/id1240588326

New Welsh Language Social Web Corpora

Here at the Language Technologies Unit we have been collecting Welsh tweets from Twitter and public Facebook posts and comments for the past 6 months.

Today we are pleased to release these two huge corpora to the general public!

From today, a collection of over 2.6 million Welsh tweets and 40,000 Facebook comments and posts is available for download. This collection of over 30 million Welsh words is constantly growing, and more will be made available over time.

Using a Welsh language-detection model produced here at Bangor University (keep your eyes peeled for this!) together with the open-source language-detection project, we have been able to sort through millions of tweets and Facebook posts to find only the Welsh-language texts relevant to us, with a 99% accuracy rate.
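To give a feel for how this kind of filtering can work, here is a minimal sketch of a character n-gram classifier with add-one smoothing. It is a toy and not the Bangor model or the language-detection project's code. The class name, the tiny training sentences, and the `"cy"`/`"en"` labels are all illustrative assumptions.

```python
import math
from collections import Counter


def ngrams(text: str, n: int = 3) -> list[str]:
    """Character n-grams of the lower-cased text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageDetector:
    """Toy naive Bayes classifier over character n-grams (hypothetical)."""

    def __init__(self, n: int = 3) -> None:
        self.n = n
        self.counts: dict[str, Counter] = {}
        self.vocab: set[str] = set()

    def train(self, lang: str, samples: list[str]) -> None:
        counter = self.counts.setdefault(lang, Counter())
        for sample in samples:
            counter.update(ngrams(sample, self.n))
        self.vocab.update(counter)

    def log_score(self, lang: str, text: str) -> float:
        counter = self.counts[lang]
        # Add-one smoothing so unseen n-grams do not give log(0).
        denom = sum(counter.values()) + len(self.vocab)
        return sum(math.log((counter[g] + 1) / denom)
                   for g in ngrams(text, self.n))

    def detect(self, text: str) -> str:
        return max(self.counts, key=lambda lang: self.log_score(lang, text))


detector = NgramLanguageDetector()
detector.train("cy", ["mae hi'n braf heddiw", "rydyn ni'n siarad cymraeg",
                      "diolch yn fawr iawn"])
detector.train("en", ["the weather is nice today", "we are speaking english",
                      "thank you very much"])

posts = ["mae hi'n braf", "thank you", "diolch yn fawr"]
welsh_only = [p for p in posts if detector.detect(p) == "cy"]
```

A production filter would be trained on far more text and would typically also apply a confidence threshold, discarding posts whose best score is not clearly ahead of the runner-up.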

This is an exciting and ground-breaking release, being the first example of electronic, informal Welsh-medium corpora available anywhere.

The corpora are noteworthy as they have been created entirely online (through Twitter and Facebook) and include content by Welsh speakers from across the world.

We envisage these corpora being used for anything from training predictive text systems for phones, to finding new words in the Welsh language and further academic research.

You can find and download all these files from our Corpora webpage.

Before downloading these files, we ask you to read the documentation and terms and conditions of download.