This is a corpus of over 7 million Welsh-language tweets. Our servers collect tweets from Twitter continuously, so the corpus is always growing.
There is great demand for large corpora of informal Welsh. We hope that this corpus will perform several roles, including:
- Training predictive text systems for phones
- Finding new words in Welsh
- Research material for university departments, including linguistic, sociological and medical studies
- Educational and demonstration data for children and young people in coding clubs
- Valuable information for the market, for example tracking and analysing user emotions (sentiment analysis)
Terms and conditions of downloading
Before downloading any files, we recommend that you read the Twitter Developer Agreement, paying particularly close attention to Be a Good Partner to Twitter (Part b), which explains the terms that apply to downloading these files.
Contents of the files
Each downloadable file is in CSV format and contains the following information:
- 50,000 tweets
- About each tweet:
  - the message in the tweet
  - the Twitter user’s ID
  - the date on which the message was created
  - the number of retweets
  - the number of favourites
  - a number (0 or 1) which notes whether the tweet was a retweet
We have chosen to conceal user details in the tweets that we release. If you would like access to this information, or to any other details, please contact the Language Technologies Unit directly.
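As a rough illustration of working with one of these files, the sketch below reads the six fields listed above into simple records using Python's standard `csv` module. The column order, the absence of a header row, the date format and the two sample rows are all assumptions made for this example; check them against a real downloaded file before relying on them.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Tweet:
    # Field names are hypothetical; the order is assumed to follow the list above.
    text: str
    user_id: str
    created_at: str  # kept as a string because the date format is not documented
    retweets: int
    favourites: int
    is_retweet: bool


def read_tweets(handle):
    """Yield Tweet records from an open CSV file, skipping rows that do not parse."""
    for row in csv.reader(handle):
        if len(row) != 6:
            continue  # not the six expected fields
        text, user_id, created_at, retweets, favourites, flag = row
        try:
            yield Tweet(text, user_id, created_at,
                        int(retweets), int(favourites), flag == "1")
        except ValueError:
            continue  # e.g. a header row or a non-numeric count


# Invented sample data standing in for a downloaded file.
sample = io.StringIO(
    '"Bore da, Cymru!",12345,2015-03-01,2,5,0\n'
    '"Diolch yn fawr",67890,2015-03-02,0,1,1\n'
)
tweets = list(read_tweets(sample))
print(len(tweets), tweets[0].text)
```

For a real file, replace the `io.StringIO` sample with `open(path, newline="", encoding="utf-8")`, where the UTF-8 encoding is likewise an assumption.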
Articles or software based on the use of this corpus should cite:
Jones, D. B., Robertson, P., Taborda, A. (2015) Corpus of Welsh Language Tweets [http://techiaith.org/corpora/twitter/?lang=en]
The open source language-detection library, with additional training by Bangor University, was used to analyse millions of tweets and Facebook messages and to find the Welsh-language texts relevant to us. According to our tests, which we ran on some of the tweets shared on this website, language detection accuracy approaches 97% for Welsh-medium texts longer than 30 characters.
For tweets shorter than 30 characters, language recognition is unreliable (with an accuracy rate lower than 50%). We therefore suggest that you remove short tweets if you need high confidence that every text is in Welsh.
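One way to apply this suggestion is to filter messages by length before further processing. The following is a minimal sketch, not part of the corpus tooling: the function name, the sample messages, and the choice to keep tweets of exactly 30 characters are assumptions made for illustration.

```python
MIN_CHARS = 30  # threshold taken from the accuracy figures above


def long_enough(text, min_chars=MIN_CHARS):
    """Return True if the message has at least min_chars characters.

    Surrounding whitespace is ignored. Treating exactly 30 characters
    as acceptable is an assumption; the text above does not say which
    side of the boundary such tweets fall on.
    """
    return len(text.strip()) >= min_chars


# Invented sample messages.
messages = [
    "Bore da",
    "Mae hi'n bwrw glaw yn drwm yng Nghaerdydd heddiw",
    "Diolch",
]
kept = [m for m in messages if long_enough(m)]
print(kept)
```

Counting raw characters also counts any URLs, hashtags and mentions in the message, so a stricter filter might remove those before measuring length.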
Many thanks to Arthur Taborda for his contribution in developing the software for collecting the texts (see https://github.com/arthurtaborda/guaiamumcrawler) during his work placement at the Language Technologies Unit.