Corpus of Welsh Language Tweets

This is a corpus of over 7 million Welsh-language tweets. Our servers collect tweets from Twitter continuously, so the corpus keeps growing.

There is great demand for large corpora of informal Welsh. We hope that this corpus will perform several roles, including:

  • Training predictive text systems for phones
  • Finding new words in Welsh
  • Research material for academic departments in universities, including linguistic, sociological and medical studies
  • Educational and demonstration data for children and young people in coding clubs
  • Valuable information for the market, for example tracking and analysing user emotions (sentiment analysis)

To access the files, follow the link below to a page listing zip files that you can download to your computer and examine. Each file contains a block of 50,000 tweets. In accordance with Twitter’s terms of use, each user may download only one file per day.

Download

http://techiaith.org/corpws/Twitter/

Terms and conditions of downloading

Before you download any files, we recommend that you read the Twitter Developer Agreement, paying particularly close attention to "Be a Good Partner to Twitter" (Part b), which explains the terms of downloading these files.

Contents of the files

Each downloadable file contains the following information in CSV format:

  • 50,000 tweets
  • About each tweet:
    • the message in the tweet
    • the Twitter user’s ID
    • the date on which the message was created
    • the number of retweets
    • the number of favourites
    • a number (0 or 1) which notes whether the tweet was a retweet
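
As a rough illustration of working with these fields, the following Python sketch parses rows into records. The column order, the absence of a header row, and the date format are all assumptions made for this example, not a description of the actual files, and the sample rows are invented; check a downloaded file before relying on any of it.

```python
import csv
import io
from dataclasses import dataclass

# ASSUMPTION: this column order is hypothetical. The real files may use
# a different order, different names, or include a header row.
ASSUMED_COLUMNS = [
    "text", "user_id", "created_at", "retweets", "favourites", "is_retweet",
]


@dataclass
class Tweet:
    text: str
    user_id: str
    created_at: str   # kept as a string; the real date format is not known here
    retweets: int
    favourites: int
    is_retweet: bool


def read_tweets(handle):
    """Parse CSV rows from an open text handle into Tweet records."""
    reader = csv.DictReader(handle, fieldnames=ASSUMED_COLUMNS)
    tweets = []
    for row in reader:
        tweets.append(Tweet(
            text=row["text"],
            user_id=row["user_id"],
            created_at=row["created_at"],
            retweets=int(row["retweets"]),
            favourites=int(row["favourites"]),
            is_retweet=row["is_retweet"] == "1",  # 0 or 1 flag
        ))
    return tweets


# Invented sample rows, standing in for a downloaded file.
sample = io.StringIO(
    '"Bore da, Cymru!",12345,2015-03-01,2,5,0\n'
    '"RT: Diolch yn fawr",67890,2015-03-02,0,1,1\n'
)
tweets = read_tweets(sample)

# Keep only original tweets by dropping rows flagged as retweets.
originals = [t for t in tweets if not t.is_retweet]
```

With a real file, the `io.StringIO` sample would be replaced by something like `open(path, newline="", encoding="utf-8")`, where the encoding is again an assumption.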

We have chosen to conceal user details in the tweets that we release. If you would like access to this information, or to any other details, please contact the Language Technologies Unit directly.

Acknowledgements

Articles or software based on the use of this corpus should cite:

Jones, D. B., Robertson, P., Taborda, A. (2015) Corpus of Welsh Language Tweets [http://techiaith.org/corpora/twitter/?lang=en]

Language Detection

The open source language-detection library, with additional training by Bangor University, was used to analyse millions of tweets and Facebook messages and find the Welsh-language texts relevant to us. According to our tests, which we ran on some of the tweets shared on this website, language detection accuracy approaches 97% for Welsh-language texts longer than 30 characters.

For tweets shorter than 30 characters, language recognition is unreliable (accuracy below 50%). We therefore suggest removing short tweets if you need high language recognition accuracy.
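
The filtering suggested above can be sketched in a few lines of Python. The function name and the sample strings are invented for illustration. Counting characters with `len` after trimming whitespace is an assumption about how length should be measured, and keeping tweets of exactly 30 characters is an arbitrary choice, since the text above does not say which side that boundary falls on.

```python
MIN_CHARS = 30  # threshold taken from the accuracy figures quoted above


def long_enough(text, min_chars=MIN_CHARS):
    """Return True if the tweet has at least min_chars characters.

    Surrounding whitespace is ignored; length is counted in Unicode
    code points, which may differ from how Twitter counts characters.
    """
    return len(text.strip()) >= min_chars


# Invented sample tweets.
sample = [
    "Bore da",
    "Mae'r tywydd yn braf iawn heddiw yng Nghaerdydd",
    "Diolch!",
]

# Drop tweets too short for reliable language detection.
kept = [t for t in sample if long_enough(t)]
```

Raising `min_chars` trades corpus size for fewer misclassified tweets; the right value depends on how much noise your application can tolerate.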

Thanks

Many thanks to Arthur Taborda for developing the software used to collect the texts (see https://github.com/arthurtaborda/guaiamumcrawler) during his work placement at the Language Technologies Unit.