Category Archives: Machine Translation

New Cloud-based Welsh Machine Translation

Part of our Welsh Language Communications Infrastructure project is to improve our machine translation resources, so that some of the capabilities provided by English-language technologies can be leveraged for Welsh.

As a result, the Moses-SMT machine translation capabilities of the Welsh National Language Technologies Portal are now available from the API Centre, making them easy to integrate into your software, including translation memory systems such as Trados, WordFast and CyfieithuCymru (TranslateWales).

Welsh<>English Moses-SMT joins a wide range of other language technology API services, such as Cysill (Welsh spelling and grammar checker), text-to-speech, part-of-speech tagger, language detection, lemmatizer and Vocab, to enhance the Welsh support of your website, app and software.

As with these other API services, you can get started by obtaining your API key (How to register for an API key) and following the documentation and code examples we’ve prepared on GitHub. Please see: https://github.com/PorthTechnolegauIaith/moses-smt/blob/master/docs/APIArlein.md

Before you go ahead, however, we’d like to emphasize once more the importance of quality control – it is your responsibility to ensure that this machine translation software is used appropriately, including the use of careful post-editing (see Quality Issues).

Demo

We have prepared a demo so that you can evaluate the machine translation engines.

Please see:  http://techiaith.cymru/translation/demo

 

Towards a Welsh ‘Siri’…..

It is increasingly possible for you to speak with devices such as your phone or computer in order to command and control applications and devices as well as to receive intelligent and relevant answers to questions voiced in natural language.

Such capabilities are possible as a consequence of recent advancements in speech recognition, machine translation and natural language processing and understanding. As such they are the prime enablers for a disruptive change and a fundamental shift in how users and consumers engage with their devices and how they more widely use technology.

If looked at in its wider historical context, this is only the next step in the evolution of human computer interaction; from keyboard, to mouse, to touch, to voice and language.

There are four main commercial platforms driving this change, namely Siri, Ok Google, Microsoft Cortana and Amazon Alexa, as well as some lesser known open platforms.

 

 

To date, these provide their powerful capabilities in English and some other major languages, with little evidence that they are likely to extend their choice of languages to the ‘long tail’ of smaller languages, including Welsh, in the near future.

The Language Technologies Unit has therefore been funded by the Welsh Government, through its Welsh Language Technology and Digital Media Fund, and by S4C to deliver the ‘Welsh Language Communications Infrastructure’ project, ensuring that users whose preferred language is Welsh are not left behind by such developments.

Our first deliverable as part of the project is a brief report on how we can achieve this. It concludes that the commercial offerings by the large companies do not provide any technical means at the moment for realising a Welsh language digital assistant. Thus only open alternatives such as finer grained online APIs and various open source software allow us to progress.

It is hoped that the project will lay the foundations for a range of Welsh language technologies to be used in such environments, including improving the work done to date on Welsh language speech recognition, as well as machine translation for leveraging some of the capabilities provided via English-language technologies.

All of the software and resources developed by the project will be available here from the Welsh National Language Technologies Portal. The project will stimulate the development of new Welsh language software and services that could contribute to the mainstreaming of Welsh in the next phase of human-computer interaction.

In the meantime, we need your help! Please contribute your voice to our speech corpus via our Paldaruo app:


iTunes Google Play

Moses SMT update

When we released our machine translation resources earlier this month, we were using the first version of Moses, version 1.0. We have now updated the script to the latest version: Moses 3.0.

Everything is available either from GitHub at http://github.com/PorthTechnolegauIaith/moses-smt or from Docker.com at https://registry.hub.docker.com/u/techiaith/moses-smt/.

Moses 3.0 offers a number of improvements for translators. According to the release notes (which you can read here), these updates include features which speed up the decoding process, free up more memory, and make Moses more effective at matching sentences correctly.

We will also be taking advantage of this update in order to improve the translation engine CofnodYCynulliad (which we’ve previously talked about here) with additional data which we will be collecting from the Welsh Assembly.

Additionally, we plan to create a new domain specific engine for translating software, with the help of data provided by Rhoslyn Prys from meddal.com.

These are good examples of the iterative nature of translation engines, where it’s possible to keep adding data in order to develop and improve them continuously. Keep an eye out for more developments on this soon.

Machine Translation on Mac OS X

Since we’ve already released our machine translation system in Docker, it’s easy enough to get it running on an OS X system!

First, you will need to install one or two pieces of software on your computer. This tutorial uses Homebrew to install the packages.
(You can look again at the original tutorial if you like).

Installing VirtualBox

  • Docker needs VirtualBox on OS X (and Windows) to run its Linux virtual machine. Download VirtualBox from the VirtualBox website.

Installing boot2docker and docker

We will use Homebrew to install these. Open Terminal and type the following commands:

  • ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

    This will install Homebrew on your computer.

  • Next, install boot2docker and docker with the following commands:
    brew install boot2docker
    brew install docker

     

  • Initialise boot2docker (so that it can download the virtual machine) like this:
    boot2docker init

     

Increasing VirtualBox’s disk space

VirtualBox’s virtual disk will be created with a size limit of 20GB. The machine translation system (Moses SMT), including the language model file, needs more disk space than this, so the disk size will obviously need to be increased. This is unfortunately quite a long process, but the good news is that Docker have written a very simple tutorial on how to do it!

We recommend that you increase the disk size to 30GB (although the machine translation system only needs around 21GB).
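Before resizing, it may be worth checking that the host machine itself has room for the larger virtual disk. The following is only a minimal sketch: the function name `has_room`, the use of decimal gigabytes and the choice of the `/` path are assumptions made for illustration, not part of boot2docker or Moses.

```python
import shutil

# Assumption: sizes in this post are decimal gigabytes (10**9 bytes).
GB = 1000 ** 3


def has_room(free_bytes, required_gb):
    """Return True if free_bytes is at least required_gb gigabytes."""
    return free_bytes >= required_gb * GB


if __name__ == "__main__":
    # Check the disk that holds "/" on the host; adjust the path if your
    # VirtualBox disks live on another volume.
    usage = shutil.disk_usage("/")
    print("Free space: %.1f GB" % (usage.free / GB))
    print("Room for the recommended 30GB disk:", has_room(usage.free, 30))
```

With the figures above, the default 20GB disk fails the check against the roughly 21GB that the translation system needs, while the recommended 30GB passes.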

Downloading and installing the translation system

Once you’ve increased the disk size in VirtualBox, you will need to start the boot2docker engine. Go back to Terminal, and write:

boot2docker up

Make a note of what is printed on the screen at the end of this command. This is important because you will need it to communicate with Docker. It should look something like this:

Writing /Users/patrick/.boot2docker/certs/boot2docker-vm/ca.pem
Writing /Users/patrick/.boot2docker/certs/boot2docker-vm/cert.pem
Writing /Users/patrick/.boot2docker/certs/boot2docker-vm/key.pem
    export DOCKER_CERT_PATH=/Users/patrick/.boot2docker/certs/boot2docker-vm
    export DOCKER_TLS_VERIFY=1
    export DOCKER_HOST=tcp://192.168.59.103:2376

The last three lines are particularly important. Copy them, and then paste them into your Terminal window so that you can run the export commands.
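If you would rather not copy the lines by hand, the export lines can also be picked out of the output with a short script. This is a sketch only: `parse_exports` is a hypothetical helper, and it assumes the output has the same `export NAME=value` shape as the example above.

```python
import re

# Matches lines such as "    export DOCKER_TLS_VERIFY=1".
EXPORT_RE = re.compile(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def parse_exports(output):
    """Collect NAME/value pairs from the 'export' lines of the given text."""
    env = {}
    for line in output.splitlines():
        match = EXPORT_RE.match(line)
        if match:
            env[match.group(1)] = match.group(2).strip()
    return env


# Sample taken from the output shown above.
SAMPLE = """\
Writing /Users/patrick/.boot2docker/certs/boot2docker-vm/key.pem
    export DOCKER_CERT_PATH=/Users/patrick/.boot2docker/certs/boot2docker-vm
    export DOCKER_TLS_VERIFY=1
    export DOCKER_HOST=tcp://192.168.59.103:2376
"""

if __name__ == "__main__":
    for name, value in parse_exports(SAMPLE).items():
        print("%s=%s" % (name, value))
```

The resulting dictionary can be merged into `os.environ` before calling Docker from a script, which has the same effect as pasting the export commands into Terminal.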

Docker is ready

Now, after all this work, Docker should be ready!
Download the machine translation image using the following command:

docker pull techiaith/moses-smt

And then start the engine with:

docker run --name moses-smt-cofnodycynulliad-en-cy -p 8008:8008 -p 8080:8080 techiaith/moses-smt start -e CofnodYCynulliad -s en -t cy

Note: this command downloads a translation model which is based on the Proceedings of the National Assembly for Wales corpus. You can change the name ‘CofnodYCynulliad’ after the ‘start’ command to any one of the three below:

  • CofnodYCynulliad (en-cy and cy-en) – two large models which are based on the Proceedings of the National Assembly for Wales. One is specifically for translation from English to Welsh (en-cy), and the other is for translation from Welsh to English (cy-en). Size: ~3.7GB each.
  • CofnodBachYCynulliad – a much smaller model of the proceedings corpus which is based on a sub-set of the data (we recommend this if you just want to experiment quickly). Size: ~65MB
  • Deddfwriaeth – this engine was trained with data from the Legislation corpus. Size: ~900MB

These three language models are also available for download from techiaith.org. See http://techiaith.org/moses/

It’s also important to note that you can use your own language model for this step (if you’ve already trained one)! Remember that the data we provide is a basis only, and it’s fairly simple to train your own language model. See the docs for more information on how to do this here.
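To switch between the engines listed in this section without retyping the whole line, the ‘docker run’ command can be assembled from its parts. This is a rough sketch: `docker_run_args` is a hypothetical helper, the container naming pattern is extrapolated from the single example in this post, and it is an assumption that every engine is offered in both directions.

```python
# Engine names as listed in this post.
ENGINES = ("CofnodYCynulliad", "CofnodBachYCynulliad", "Deddfwriaeth")
LANGUAGES = ("en", "cy")


def docker_run_args(engine, source, target):
    """Build the argument list for the 'docker run' command shown above."""
    if engine not in ENGINES:
        raise ValueError("unknown engine: %s" % engine)
    if source not in LANGUAGES or target not in LANGUAGES or source == target:
        raise ValueError("direction must be en to cy, or cy to en")
    # Assumption: the container name follows the pattern of the one example
    # in this post: moses-smt-<engine in lower case>-<source>-<target>.
    name = "moses-smt-%s-%s-%s" % (engine.lower(), source, target)
    return [
        "docker", "run", "--name", name,
        "-p", "8008:8008", "-p", "8080:8080",
        "techiaith/moses-smt", "start",
        "-e", engine, "-s", source, "-t", target,
    ]


if __name__ == "__main__":
    # Print the command for the small experimental engine.
    print(" ".join(docker_run_args("CofnodBachYCynulliad", "en", "cy")))
```

The list can be passed straight to `subprocess.call` if you prefer to start the engine from a script rather than from Terminal.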

See Moses working

The final ‘docker run’ command creates a server on port 8008 of your local computer. To connect to this port, you will need to open ports in VirtualBox. Open the ‘VirtualBox.app’ program (in your ‘Applications’ folder), click on ‘Settings’, and then on the ‘Network’ tab. There is a button at the bottom of the screen called ‘Port Forwarding’. Add rules as shown in the screenshot below:

[Screenshot: VirtualBox port forwarding rules]

That’s it!

Go to http://127.0.0.1:8008 in your browser and start translating!
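If the page does not load, a quick way to tell whether the port forwarding is working is to test the port directly. The following is a minimal sketch using only Python's standard library; `port_is_open` is a hypothetical helper name, and the host and port defaults are taken from the address above.

```python
import socket


def port_is_open(host="127.0.0.1", port=8008, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Port 8008 is reachable; try the address in your browser.")
    else:
        print("Port 8008 is not reachable; check the port forwarding rules.")
```

A successful connection only shows that something is listening on the port; it does not confirm that the translation engine has finished loading.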

diolch

Creating domain-specific Translation Engines

Many translators assume that their translation infrastructure contains only one translation engine. Some translators, however, use several: domain-specific translation engines.

Domain-specific translation engines are engines created in order to translate for particular topics, styles or registers. For many translators, domain-specific engines offer superior translation compared to normal machine translation systems.

Domain-specific engines are particularly effective in situations where translation memories are already being used successfully to save time and money. If you use domain-specific translation memories, a translation engine can use the same allocation and a post-editing routine to increase the efficacy and productivity of translation beyond that produced by normal translation memory systems.

Today we are releasing resources in the Language Technologies Portal and on GitHub which allow you to create, using Moses-SMT, your own domain-specific translation engines.

Be advised – you will need a computer running Linux, such as Ubuntu, with at least 4GB of RAM and a significant amount of parallel Welsh-English text. Our method produces domain-specific translation engines which don’t need much memory to run, but do take some GBs of space on your hard disk.

To get started, you will need to install Moses-SMT using the instructions on the following page : Installing Moses-SMT on Linux. The installation scripts include adaptations that we’ve made that make it easier for you to train Moses-SMT with your own parallel Welsh-English text.

The page Create Moses-SMT engines provides detailed instructions on how to get started. But in short, if you had a parallel text taken from your own work translating marketing (or ‘marchnata’ in Welsh) documents, you would need to do the following.

First, place the Welsh text in a file named ‘Marchnata.cy’ and the English text in ‘Marchnata.en’, and then keep these files in the sub-folder ‘corpus’ inside the folder of your engine, ‘Marchnata’, like this:

moses@ubuntu:~/moses-smt$ cd ~/moses-models/Marchnata/corpus
moses@ubuntu:~/moses-models/Marchnata/corpus$ ls
Marchnata.cy  Marchnata.en
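Moses expects a parallel corpus to be sentence-aligned, with line N of one file translating line N of the other, so it can be worth confirming that the two files have the same number of lines before training. This is only a sketch: `count_lines` and `is_line_aligned` are hypothetical helpers, and UTF-8 encoding is an assumption about your files.

```python
def count_lines(path):
    """Count the lines in a text file (assumed to be UTF-8 encoded)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def is_line_aligned(cy_path, en_path):
    """Return True if both sides of the corpus have the same line count."""
    return count_lines(cy_path) == count_lines(en_path)
```

For the example above, this would be called as `is_line_aligned("Marchnata.cy", "Marchnata.en")` from inside the ‘corpus’ folder. Equal line counts are a necessary check, not proof that each pair of lines is a correct translation.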

The data is now ready to be trained. You will only need a single command, noting the name and the direction of the translation (i.e. Welsh to English, or English to Welsh).

So, if you’d like to create a machine for the marketing data that translates from English to Welsh, you would type the command line as it is below:

moses@ubuntu:~/moses-smt$ python moses.py train -e Marchnata -s en -t cy

This will cause a lot of data to appear on the screen. The command, depending on the size of your original dataset, will probably take hours to complete. There is no need to follow the progress reports particularly closely, but you will need to keep an eye out for any serious error messages to check whether or not the training has succeeded.

If training succeeds, follow the on-screen prompts to edit the files of your new engine.

To start the new machine, you will need the following command:

moses@ubuntu:~/moses-smt$ python moses.py start -e Marchnata -s en -t cy

Moses and the Two Commands

Moses-SMT is an open source machine translation system that was mainly developed at Edinburgh University. This resource allows you to develop your own machine translation engines for use in your translation projects by training it with any pre-existing corpora of parallel texts.

We at the Language Technologies Unit have used Moses-SMT in order to provide machine translation in our commercial offering CyfieithuCymru, which enables and supports efficient Welsh<>English translation within institutions.

Today we are releasing these Moses-SMT translation systems to you, as well as the data which was used to train them.

We are making our machine translation engines freely available because we believe that it’s vital for Welsh translators to be able to own and develop their own machine translation infrastructure, and to master these new disruptive technologies to full effect. This ambition was explained in our previous blog post.

In order to make the package as easy as possible to use, we’ve developed a simple system which only requires two commands to operate (providing that the necessary operating system and equipment are already installed!).

Before you go ahead, however, we’d like to emphasize once more the importance of quality control – it is your responsibility to ensure that this machine translation software is used appropriately, including the use of careful post-editing (see Quality Issues).

Docker

Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications. Using Docker it will be easy to install and run Moses-SMT without adversely affecting any of your other installations.

We have uploaded our Moses-SMT image to docker.com’s central registry.

You will need a version of Docker more recent than 1.0.1 on your Linux system. We usually use Ubuntu. Here is a video on YouTube that explains how you can install Docker 1.3 on Ubuntu 14.04. If you would like to run your translation engine on a Windows computer or on Mac OS X, then you may be able to use Boot2Docker.

So, in Linux, the two commands are:

Command 1 : Installing Moses-SMT (with Docker)

$ docker pull techiaith/moses-smt

This will download and install the machine translation infrastructure into your Docker system.

When it has finished downloading, type ‘docker images’ to check that it’s been installed.

$ docker images
REPOSITORY                                        TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
techiaith/moses-smt                               latest              3dbad7f9aabf        41 hours ago        3.333 GB
$
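The same check can be scripted by scanning the listing for the repository name. This is a sketch under the assumption that the output keeps the column layout shown above, with the repository in the first column; `image_is_installed` is a hypothetical helper, not part of Docker.

```python
def image_is_installed(listing, repository="techiaith/moses-smt"):
    """Return True if the repository appears in 'docker images' output."""
    # Skip the header row, then compare the first column of each line.
    for line in listing.splitlines()[1:]:
        fields = line.split()
        if fields and fields[0] == repository:
            return True
    return False


# Sample taken from the listing shown above.
SAMPLE = """\
REPOSITORY            TAG      IMAGE ID       CREATED        VIRTUAL SIZE
techiaith/moses-smt   latest   3dbad7f9aabf   41 hours ago   3.333 GB
"""

if __name__ == "__main__":
    print("Installed:", image_is_installed(SAMPLE))
```

In practice the listing would come from running `docker images` through Python's `subprocess` module rather than from a pasted sample.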

Command 2 : Start a Translation Engine of Your Choice

The Language Technologies Unit has created translation engines by training them with data collected from open and public sources, such as the Proceedings of the Welsh Assembly and the Legislation on-line.

These engines have specific names and translation directions. The name of the engine that was trained with Assembly data is ‘CofnodYCynulliad’, while the name of the engine trained with the Legislation on-line is ‘Deddfwriaeth’.

Here is the second command, with options set to select the ‘CofnodYCynulliad’ engine that translates from English to Welsh :

$ docker run --name moses-smt-cofnodycynulliad-en-cy -p 8080:8080 -p 8008:8008 techiaith/moses-smt start -e CofnodYCynulliad -s en -t cy

The system will initially download a file (around 3GB in the case of CofnodYCynulliad) before confirming that it is ready to start translating.

If you open your browser and go to http://127.0.0.1:8008 , a simple form should appear so that you can check whether or not the engine works as intended:

[Screenshot: the Moses-SMT translation form]

Training Data

The data collected by the Language Technologies Unit, which was used to train our Moses-SMT machines, is available below:

Machine Translation for Welsh – a new strategy

A revolution is afoot in the translation world. Translators are becoming post-editors: the machine translates the first draft, and a human editor then refines it. Only literature and highly sensitive texts will avoid this fate. The trend is being driven by the need to translate large volumes of text quickly and at a reasonable price. The Language Technologies Unit has been preparing for this brave new world by developing Welsh<>English machine translation software, which will soon be made available through the Language Technologies Portal. Using this resource, anyone will be able to do the following:

  • own and support their own Welsh<>English translation system
  • use their own corpus to create and adapt specialized translation machines

Although there are machine translation tools already available for Welsh and English through companies like Google and Microsoft, there are inherent disadvantages in their use. By sharing our machine translation resources, freely and fairly, we hope to provide support to the Welsh translation industry, develop a community of machine translation practitioners, and avoid dependency on large external institutions.

We discussed our ideas on machine translation at the TILT conference in Bangor in June of last year. Here are the slides:

 

Quality Issues

Translations generated by automatic means are not yet perfect for any language pair. Laughable or embarrassing mistranslations made by machines have made headline news in the past, and may also lead to miscarriages of justice. This gives organisations trying to save costs by using them a bad reputation. However, machine translation accompanied by human post-editing is acceptable, and can be incorporated within the translation memory workflow. It is your responsibility to ensure that this machine translation software is used in the appropriate manner described here, including appropriate and meaningful post-editing, which avoids tarnishing the image of the translation industry and the Welsh language.

A comprehensive advice note can be found here:

http://techiaith.bangor.ac.uk/index.php/advice-note/?lang=en