Corpus of Facebook Welsh Language Texts

The Corpus of Facebook Welsh Language Texts contains data from user contributions and posts collected from public groups.

The posts and their corresponding comments were collected from a variety of public pages, and the resource can be accessed through the link below. No private accounts were used, and no user information is shared in this data.

Data for comments and posts are kept separately, which means that comments are not connected to their original Facebook posts. This was done for reasons of anonymity. If you need the link between comments and posts, contact the Language Technologies Unit.

The link below will take you to a folder which contains two subfolders: one for Facebook comments and one for Facebook posts. Each zip file within these folders contains either 10,000 comments or 10,000 posts. These files will not change over time, so they can be used as a stable reference for research purposes.


This corpus has been taken down temporarily. If you wish to be notified by email when it is back up, please email Delyth at

File Contents

Comment Files

CSV files containing:

  • 10,000 comments each
  • For each comment:
    • the comment ID
    • the contents of the comment
    • the time at which the comment was created
    • the number of likes
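
As a rough illustration of working with a comment file, the sketch below parses rows into simple records. It assumes four columns in the order listed above and an optional header row; neither the column names nor the timestamp format are documented here, so `Comment`, `read_comments` and the sample row are hypothetical and would need adjusting to the real files.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Comment:
    comment_id: str
    text: str
    created_time: str  # kept as a string; the timestamp format is not documented
    likes: int


def read_comments(handle, has_header=True):
    """Parse comment rows from an open text-mode CSV handle.

    Assumes four columns in the order: ID, text, creation time, likes.
    """
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row, if there is one
    comments = []
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed rows
        comment_id, text, created_time, likes = row[:4]
        comments.append(Comment(comment_id, text, created_time, int(likes or 0)))
    return comments


# Tiny in-memory sample standing in for a real comment file (invented data).
sample = io.StringIO(
    "comment_id,text,created_time,likes\n"
    '1,"Diolch yn fawr, pawb!",2015-01-01T12:00:00,3\n'
)
comments = read_comments(sample)
```

For a real file, an open handle such as `open(path, newline="", encoding="utf-8")` could be passed instead of the in-memory sample; the UTF-8 encoding is also an assumption.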

Post Files

CSV files containing:

  • 10,000 posts each
  • For each post:
    • the post’s ID
    • the ID of the profile that created the post (this could be a Facebook user, page or group)
    • the content of the post (text only)
    • the number of comments
    • the number of likes
    • the number of times it was shared
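
Because each post carries the ID of the profile that created it, simple per-profile summaries are possible. The sketch below totals likes per profile using `csv.DictReader`. The header names (`profile_id`, `likes`) and the sample rows are hypothetical, since the actual column headers are not specified here.

```python
import csv
import io
from collections import Counter

# Hypothetical header names; replace with those used in the real files.
PROFILE_COL = "profile_id"
LIKES_COL = "likes"


def likes_per_profile(handle):
    """Sum like counts per creating profile from a post CSV handle."""
    totals = Counter()
    for row in csv.DictReader(handle):
        totals[row[PROFILE_COL]] += int(row[LIKES_COL] or 0)
    return totals


# Invented sample rows standing in for a real post file.
sample = io.StringIO(
    "post_id,profile_id,text,comments,likes,shares\n"
    "10,A,Bore da,2,5,1\n"
    "11,A,Nos da,0,3,0\n"
    "12,B,Croeso,1,4,2\n"
)
totals = likes_per_profile(sample)
```

The same pattern would apply to the comment or share counts by changing the column name.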

If you want any other details about the Facebook data here, contact the Language Technologies Unit directly.


Articles or software based on the use of this corpus should cite:

Jones, D. B., Robertson, P., Taborda, A. (2015) Corpus of Facebook Welsh Language Texts