Machine Translation for Welsh – a new strategy

A revolution is afoot in the translation world. Translators are becoming post-editors: the machine produces the first draft, and a human editor then refines it. Only literature and highly sensitive texts will avoid this fate. The trend is driven by the need to translate large volumes of text quickly and at a reasonable price. The Language Technologies Unit has been preparing for this brave new world by developing Welsh<>English machine translation software, which will soon be made available through the Language Technologies Portal. Using this resource, anyone will be able to do the following:

  • own and support their own Welsh<>English translation system
  • use their own corpus to create and adapt specialized translation machines

Although there are machine translation tools already available for Welsh and English through companies like Google and Microsoft, there are inherent disadvantages in their use. By sharing our machine translation resources, freely and fairly, we hope to provide support to the Welsh translation industry, develop a community of machine translation practitioners, and avoid dependency on large external institutions.

We discussed our ideas on machine translation at the TILT conference in Bangor in June of last year. Here are the slides:

 

Quality Issues

Translations generated by automatic means are not yet perfect between any language pair. Laughable or embarrassing mistranslations made by machines have made headline news in the past, and may also lead to miscarriages of justice. This gives a bad reputation to organisations that use machine translation to save costs. However, machine translation accompanied by human post-editing is acceptable, and can be incorporated within the translation memory workflow. It is your responsibility to ensure that this machine translation software is used in the appropriate manner described here, including appropriate and meaningful post-editing, so as to avoid tarnishing the image of the translation industry and the Welsh language.

A comprehensive advice note can be found here:

http://techiaith.bangor.ac.uk/index.php/advice-note/?lang=en

Add Cysill Ar-lein to your website or code

Would you like to add Cysill Ar-lein to your web page, blog or app? You can now do this by using our widget and our new Cysill Ar-lein web service!

http://www.cysgliad.com/cysill/arlein

Cysill Ar-lein is an online Welsh language spelling and grammar checker and is the Language Technologies Unit’s most popular website. During 2014, there was a significant increase in its use, and in the number of texts corrected with it. In fact, there was an increase of 40%, with over a million pieces of Welsh language text checked by the website.

Cysill Ar-lein has a proven ability to give users who are not sure about their Welsh a strong confidence boost, and by allowing users to check their Welsh on many more websites and software packages, we hope that it will become possible to support and increase the confidence of even more users.

In accordance with the aims of the Language Technologies Unit, both the widget and the API service may be used free of charge.

 

Registering for a Cysill Ar-lein API key

By registering on our API service Centre you can get your own Cysill Ar-lein API key, to use in any way you’d like with the widget or the online API service. Go to Registering for an API key for further details.

 

Web Application widget

The Cysill Ar-lein web application widget is a feature that could be particularly useful for websites which allow users to compose text, such as comments.

The widget works via the web, so there’s no need to install any specialised software or even download files to any server or computer before getting started. All that’s required is to add the following few lines of HTML into your website:

<script>
        var CYSILL_API_KEY = "YOUR_API_KEY";
</script>
<script type="text/javascript" language="javascript" 
        src="http://api.techiaith.org/cysill/ui/CysillArlein/CysillArlein.nocache.js">
</script>

N.B. you will need to register for your API key and input it instead of ‘YOUR_API_KEY’

The widget can be placed anywhere within a web page by adding:

<div id='CysillArleinApp'></div>

The widget can be presented in a smaller embedded format next to some text, or on its own on a separate page. The system is flexible, and empowers you to use the spell-checker in the way which works best for you. Below is an example of the widget on Bangor University’s ‘Cymorth Cymraeg’ website.

[Screenshot: Cysill Ar-lein within the Cymorth Cymraeg pages of Bangor University]

Here’s another example of a simple Cysill Ar-lein website. Right-click on the page and choose ‘View Page Source’ to see how simple the code really is.

The following page on GitHub explains in full how to use the widget.

 

Cysill Ar-lein API Service

Another offering from Cysill Ar-lein that we are excited to announce is the ability to embed Cysill Welsh language spelling and grammar checking functionality within your software using a new API service. This is the API service already used by the widget and the official Cysill Ar-lein website.

From today we are opening access to the Cysill Ar-lein API service to anyone who wants to use it in their coding projects or integrate it with their software. This service is free.

N.B. you will need to register for your API key and input it instead of ‘YOUR_API_KEY’
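As a rough illustration of where the API key fits in, the sketch below builds a request URL in Python. It is not taken from the service's documentation: the endpoint address and the parameter names `api_key`, `text` and `lang` are assumptions made for this example, so check them against the official examples before use.

```python
from urllib.parse import urlencode


def build_request_url(endpoint: str, api_key: str, text: str, lang: str = "cy") -> str:
    """Return the endpoint followed by a URL-encoded query string.

    The parameter names api_key, text and lang are assumptions made for
    illustration; the real service may expect different ones.
    """
    query = urlencode({"api_key": api_key, "text": text, "lang": lang})
    return f"{endpoint}?{query}"


# Hypothetical endpoint: replace it with the address given in the documentation.
url = build_request_url(
    "https://example.invalid/cysill",
    "YOUR_API_KEY",
    "Mae hen wlad fy nhadau",
)
print(url)
```

Keeping the URL construction in its own function means the text is always encoded safely, and the resulting address can then be fetched with any HTTP client, for instance `urllib.request.urlopen` from the standard library.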

We’ve loaded some examples onto GitHub showing how the API can be used with programming languages such as Python.

Go to:

https://github.com/PorthTechnolegauIaith/cysill

 

The examples contain code that :

 

Welsh Parts of Speech Tagger

One of the most important components of our Welsh grammar and spell-checking program Cysill is the parts of speech tagger. In fact, a tagger is one of the most crucial components in any application where a computer is expected to analyse and interpret natural language text.

Our tagger can identify Welsh words – even when those words are mutated, or when verbs are inflected – and provide their part of speech. This information encompasses a wide range of Welsh linguistic features which the tagger is able to recognise and tag; including nouns, adjectives, the type of mutation, and so on.

For example, the tagger will return the text “Mae hen wlad fy nhadau” as:

mae/VBF/- hen/ADJP/- wlad/NF/TM fy/PRONOUN/- nhadau/NPL/TT
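Output in this shape is easy to take apart in code. The Python sketch below assumes the result arrives as a single string of space-separated `word/POS/mutation` items exactly as printed above, with `-` standing for no mutation; the real API may wrap its response differently, and the names `TaggedToken` and `parse_tagged` are invented for this example.

```python
from typing import List, NamedTuple, Optional


class TaggedToken(NamedTuple):
    word: str
    pos: str
    mutation: Optional[str]  # None when the tagger prints "-"


def parse_tagged(text: str) -> List[TaggedToken]:
    """Split tagger output of the form word/POS/MUTATION into tokens."""
    tokens = []
    for item in text.split():
        # Split from the right so a word that itself contains "/" stays intact.
        word, pos, mutation = item.rsplit("/", 2)
        tokens.append(TaggedToken(word, pos, None if mutation == "-" else mutation))
    return tokens


tagged = parse_tagged("mae/VBF/- hen/ADJP/- wlad/NF/TM fy/PRONOUN/- nhadau/NPL/TT")

# Collect the words the tagger marked as mutated.
mutated = [token.word for token in tagged if token.mutation is not None]
print(mutated)  # ['wlad', 'nhadau']
```

Here the third field marks "wlad" (a soft mutation of "gwlad") and "nhadau" (a nasal mutation of "tadau") as mutated forms, which is the kind of information a grammar checker needs.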

Our parts of speech tagger is the first capability to be released through our online API services. We’re excited that the tagger will now be available for everybody to use, free of charge and with generous terms of use.

For more information, go to the Parts of Speech Tagger API.

New Welsh Language Social Web Corpora

Here at the Language Technologies Unit we have been collecting Welsh tweets from Twitter and public Facebook posts and comments for the past 6 months.

Today we are pleased to release these two huge corpora to the general public!

From today, we have a collection of over 2.6 million Welsh tweets, and 40,000 Facebook comments and posts available for download. This collection of over 30 million Welsh words is constantly increasing, and more will be made available over time.

Through the use of a Welsh language-detection model produced here at Bangor University (keep your eyes peeled for this!) and the open source language-detection project, we have been able to sort through millions of tweets and Facebook posts to find only those Welsh language texts relevant to us, with a 99% accuracy rate.
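The filtering step can be pictured as follows. This Python sketch is not the Unit's pipeline: the word-list detector is a toy stand-in for a trained language-detection model and will not come close to the accuracy quoted above, and the names `toy_detector`, `keep_welsh` and the threshold value are all invented for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# A detector maps a text to (language code, confidence between 0 and 1).
Detector = Callable[[str], Tuple[str, float]]

# Toy stand-in for a trained model: a few common Welsh function words.
WELSH_HINTS = {"mae", "yn", "yr", "fy", "wedi", "ddim", "mewn", "roedd"}


def toy_detector(text: str) -> Tuple[str, float]:
    """Score a text by the share of its words that are in WELSH_HINTS."""
    words = text.lower().split()
    if not words:
        return ("unknown", 0.0)
    share = sum(word in WELSH_HINTS for word in words) / len(words)
    return ("cy", share) if share > 0 else ("unknown", 0.0)


def keep_welsh(texts: Iterable[str], detect: Detector, threshold: float = 0.2) -> List[str]:
    """Keep only the texts the detector labels as Welsh with enough confidence."""
    kept = []
    for text in texts:
        language, confidence = detect(text)
        if language == "cy" and confidence >= threshold:
            kept.append(text)
    return kept


posts = [
    "Mae hen wlad fy nhadau yn annwyl i mi",
    "Good morning everyone",
    "Dw i wedi blino",
]
welsh_posts = keep_welsh(posts, toy_detector)
print(welsh_posts)
```

Because the detector is passed in as a function, a real model can replace the toy one without changing the filtering code.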

This is an exciting and ground-breaking release: these are the first corpora of informal, electronic Welsh-language text to be made available anywhere.

The corpora are noteworthy as they have been created entirely online (through Twitter and Facebook) and include content by Welsh speakers from across the world.

We envisage these corpora being used for anything from training predictive text systems for phones, to finding new words in the Welsh language and further academic research.

You can find and download all these files from our Corpora webpage.

Before downloading these files, we ask you to read the documentation and terms and conditions of download.

Language Technologies Portal Blog

During the next few weeks and months (and leading up to our ‘Through Technological Means’ conference) we will be publishing a number of language technology resources through Twitter (@techiaith) and this blog.

We hope to share stories about other developers and coders using these new resources, so contact us if any of them have been useful to your activities or projects.

There’s an exciting collection of new stuff on the way, giving a serious boost to coders and developers of new Welsh software.

We would like to thank the Welsh Government and their Welsh-language Technology and Digital Media Fund for sponsoring this work, which forms part of the Welsh National Language Technologies Portal.

Follow our blog for all our latest news!