Author Archives: techiaith

New Welsh Language Social Web Corpora

Here at the Language Technologies Unit we have been collecting Welsh tweets from Twitter and public Facebook posts and comments for the past 6 months.

Today we are pleased to release these two huge corpora to the general public!

From today, we have a collection of over 2.6 million Welsh tweets, and 40,000 Facebook comments and posts available for download. This collection of over 30 million Welsh words is constantly increasing, and more will be made available over time.

Through the use of a Welsh language-detection model produced here at Bangor University (keep your eyes peeled for this!) and the open source language-detection project, we have been able to sort through millions of tweets and Facebook posts to find only those Welsh language texts relevant to us, with a 99% accuracy rate.
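
For readers curious about what this kind of filtering looks like in code, here is a minimal sketch. It is not the Bangor University model or the language-detection library: the word lists, the `toy_detect` function and the `min_confidence` threshold are all hypothetical stand-ins, chosen only to show the general shape of "detect the language, keep the text if it is confidently Welsh".

```python
import re

# Tiny, hypothetical function-word lists for illustration only.
# A real detector would rely on a trained statistical model instead.
WELSH_WORDS = {"yn", "yr", "mae", "ac", "ar", "wedi", "bod", "dw", "ni", "ein"}
ENGLISH_WORDS = {"the", "is", "and", "of", "to", "in", "it", "we", "are", "for"}


def toy_detect(text):
    """Return (language_code, confidence) based on function-word counts."""
    # Split into runs of letters, so "hi'n" becomes "hi" and "n".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    cy = sum(t in WELSH_WORDS for t in tokens)
    en = sum(t in ENGLISH_WORDS for t in tokens)
    total = cy + en
    if total == 0:
        return ("unknown", 0.0)
    if cy >= en:
        return ("cy", cy / total)
    return ("en", en / total)


def filter_by_language(texts, detect=toy_detect, target="cy", min_confidence=0.9):
    """Keep only texts the detector assigns to `target` with enough confidence."""
    kept = []
    for text in texts:
        lang, confidence = detect(text)
        if lang == target and confidence >= min_confidence:
            kept.append(text)
    return kept


sample = [
    "Mae hi'n braf yn y gogledd heddiw",
    "The weather is lovely in the north today",
    "Rydyn ni wedi cyrraedd ac mae pawb yn hapus",
]
welsh_only = filter_by_language(sample)
```

Because `filter_by_language` takes the detector as a parameter, the toy function could be swapped for any detector that returns a language code and a confidence score.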

This is an exciting and ground-breaking release: these are the first corpora of informal, electronic Welsh-medium text available anywhere.

The corpora are noteworthy as they have been created entirely online (through Twitter and Facebook) and include content by Welsh speakers from across the world.

We envisage these corpora being used for anything from training predictive text systems for phones, to finding new words in the Welsh language and further academic research.

You can find and download all these files from our Corpora webpage.

Before downloading these files, we ask you to read the documentation and terms and conditions of download.

Language Technologies Portal Blog

During the next few weeks and months (and leading up to our ‘Through Technological Means’ conference) we will be publishing a number of language technology resources through Twitter (@techiaith) and this blog.

We hope to share stories about other developers and coders using these new resources, so please contact us if any of the resources have been useful to your activities or projects.

There’s an exciting collection of new stuff on the way, giving a serious boost to coders and developers of new Welsh software.

We would like to thank the Welsh Government and their Welsh-language Technology and Digital Media Fund for sponsoring this work, which forms part of the Welsh National Language Technologies Portal.

Follow our blog for all our latest news!