Here at the Language Technologies Unit, we have spent the past six months collecting Welsh-language tweets and public Facebook posts and comments.
Today we are pleased to release these two huge corpora to the general public!
From today, a collection of over 2.6 million Welsh tweets and 40,000 Facebook posts and comments is available for download. The collection, which totals over 30 million Welsh words, is growing constantly, and more will be made available over time.
Using a Welsh language-detection model produced here at Bangor University (keep your eyes peeled for this!) together with the open source language-detection project, we have been able to sort through millions of tweets and Facebook posts and keep only the Welsh-language texts relevant to us, with a 99% accuracy rate.
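For readers curious about what this kind of filtering looks like in practice, here is a minimal sketch in Python. It is not our actual pipeline: the `toy_detect` function, the `WELSH_HINTS` word list and the `threshold` value are all hypothetical stand-ins invented for illustration, where a real system would call a trained language-detection model instead.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical stand-in for a trained model: a handful of common Welsh words.
WELSH_HINTS = {"yn", "yr", "mae", "ac", "ar", "wedi", "ddim", "chi", "dw", "bod"}


def toy_detect(text: str) -> Tuple[str, float]:
    """Return ("cy", score), where score is the share of tokens that are
    Welsh hint words, or ("und", 0.0) if no hint word is found."""
    tokens = text.lower().split()
    if not tokens:
        return ("und", 0.0)
    score = sum(token in WELSH_HINTS for token in tokens) / len(tokens)
    return ("cy", score) if score > 0 else ("und", 0.0)


def filter_welsh(
    texts: Iterable[str],
    detect: Callable[[str], Tuple[str, float]] = toy_detect,
    threshold: float = 0.3,  # assumed cut-off, chosen only for this example
) -> Iterator[str]:
    """Yield only the texts that the detector labels as Welsh ("cy")
    with a score at or above the threshold."""
    for text in texts:
        lang, score = detect(text)
        if lang == "cy" and score >= threshold:
            yield text


posts = ["Mae hi yn braf heddiw", "The weather is nice today", ""]
welsh_posts = list(filter_welsh(posts))
```

Because the detector is passed in as a function, the toy word list could be swapped for a proper model without changing the filtering loop.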
This is an exciting and ground-breaking release: these are the first corpora of informal, electronic Welsh-medium text available anywhere.
The corpora are noteworthy as they have been created entirely online (through Twitter and Facebook) and include content by Welsh speakers from across the world.
We envisage these corpora being used for anything from training predictive text systems for phones to finding new words in the Welsh language, as well as for further academic research.
You can find and download all these files from our Corpora webpage.
Before downloading these files, we ask you to read the documentation and terms and conditions of download.