Category Archives: Translation

Moses SMT update

When we released our machine translation resources earlier this month, we were using the first version of Moses, version 1.0. We have now updated the script to the latest version: Moses 3.0.

Everything is available either from GitHub at http://github.com/PorthTechnolegauIaith/moses-smt or from Docker.com at https://registry.hub.docker.com/u/techiaith/moses-smt/.

Moses 3.0 offers a number of improvements for translators. According to the release notes (which you can read here), these updates include features that make the decoding process quicker, free up more memory, and make Moses more effective at matching sentences correctly.

We will also be taking advantage of this update in order to improve the translation engine CofnodYCynulliad (which we’ve previously talked about here) with additional data which we will be collecting from the Welsh Assembly.

Additionally, we plan to create a new domain specific engine for translating software, with the help of data provided by Rhoslyn Prys from meddal.com.

These are good examples of the iterative nature of translation engines, where it’s possible to keep adding data in order to develop and improve them continuously. Keep an eye out for more developments on this soon.

Creating domain-specific Translation Engines

Many translators believe that there is only one translation engine within their translation infrastructure. But some translators use many engines: domain-specific translation engines.

Domain-specific translation engines are engines created in order to translate for particular topics, styles or registers. For many translators, domain-specific engines offer superior translation compared to normal machine translation systems.

Domain-specific engines are particularly effective in situations where translation memories are already being used successfully to save time and money. If you use domain-specific translation memories, a translation engine can use the same allocation and a post-editing routine to increase the efficacy and productivity of translation beyond that produced by normal translation memory systems.

Today we are releasing resources in the Language Technologies Portal and on GitHub which allow you to create, using Moses-SMT, your own domain-specific translation engines.

Be advised: you will need a Linux computer (running Ubuntu, for example) with at least 4 GB of RAM and a significant amount of parallel Welsh-English text. Our method produces domain-specific translation machines which don’t need much memory to run, but do take up several GB of space on your hard disk.

To get started, you will need to install Moses-SMT using the instructions on the following page: Installing Moses-SMT on Linux. The installation scripts include adaptations that we’ve made that make it easier for you to train Moses-SMT with your own parallel Welsh-English text.

The page Create Moses-SMT engines provides detailed instructions on how to get started. But in short, if you had a parallel text taken from your own work translating marketing (or ‘marchnata’ in Welsh) documents, you would need to do the following.

First, place the Welsh text in a file named ‘Marchnata.cy’ and the English text in ‘Marchnata.en’, and then keep these files in the sub-folder ‘corpus’ inside the folder of your machine, ‘Marchnata’, like this:

moses@ubuntu:~/moses-smt$ cd ~/moses-models/Marchnata/corpus
moses@ubuntu:~/moses-models/Marchnata/corpus$ ls
Marchnata.cy  Marchnata.en
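Before training, it can be worth confirming that the two files line up. The following is a minimal sketch, assuming the corpus is already sentence-aligned with one segment per line (the usual expectation for Moses training data). The function names `count_lines` and `check_parallel` are hypothetical helpers for illustration and are not part of the moses-smt scripts.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(corpus_dir, engine_name, source="en", target="cy"):
    """Compare the line counts of <engine_name>.<source> and <engine_name>.<target>.

    Returns (source_count, target_count, counts_match).
    Assumption: one segment per line in both files.
    """
    corpus_dir = Path(corpus_dir)
    source_count = count_lines(corpus_dir / f"{engine_name}.{source}")
    target_count = count_lines(corpus_dir / f"{engine_name}.{target}")
    return source_count, target_count, source_count == target_count
```

For the example above, you might call `check_parallel(Path.home() / "moses-models" / "Marchnata" / "corpus", "Marchnata")`. If the last value returned is `False`, the two files probably need realigning before training.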

The data is now ready to be trained. You will only need a single command, noting the name and the direction of the translation (i.e. Welsh to English, or English to Welsh).

So, if you’d like to create a machine for the marketing data that translates from English to Welsh, you would type the command below:

moses@ubuntu:~/moses-smt$ python moses.py train -e Marchnata -s en -t cy

This will cause a lot of data to appear on the screen. The command, depending on the size of your original dataset, will probably take hours to complete. There is no need to follow the progress reports particularly closely, but you will need to keep an eye out for any serious error messages to check whether or not the training has succeeded.
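Because training can run for hours, one option is to keep a copy of everything printed so that errors can be searched for afterwards. The sketch below is one way of doing that, not part of the moses-smt scripts: `run_and_log` is a hypothetical helper, and whether `moses.py` signals failure through its exit code is an assumption, so the log should still be read for error messages.

```python
import subprocess
import sys


def run_and_log(command, log_path):
    """Run a command, echo its output to the screen, save it to log_path,
    and return the command's exit code."""
    with open(log_path, "w", encoding="utf-8") as log:
        process = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge error messages into the same stream
            text=True,
        )
        for line in process.stdout:
            sys.stdout.write(line)
            log.write(line)
        return process.wait()
```

Run from the moses-smt folder, the training command above would become `run_and_log(["python", "moses.py", "train", "-e", "Marchnata", "-s", "en", "-t", "cy"], "train.log")`.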

If it is successful, follow the prompt to edit and change files of your new machine.

To start the new machine, you will need the following command:

moses@ubuntu:~/moses-smt$ python moses.py start -e Marchnata -s en -t cy

Moses and the Two Commands

Moses-SMT is an open source machine translation system that was mainly developed at Edinburgh University. This resource allows you to develop your own machine translation engines for use in your translation projects by training it with any pre-existing corpora of parallel texts.

We at the Language Technologies Unit have used Moses-SMT in order to provide machine translation in our commercial offering CyfieithuCymru, which enables and supports efficient Welsh<>English translation within institutions.

Today we are releasing these Moses-SMT translation systems to you, as well as the data which was used to train them.

We are making our machine translation engines freely available because we believe that it’s vital for Welsh translators to be able to own and develop their own machine translation infrastructure, and master these new disruptive technologies for full effect. This ambition was explained in our previous blog post.

In order to make the package as easy as possible to use, we’ve developed a simple system which only requires two commands to operate (providing that the necessary operating system and equipment are already installed!).

Before you go ahead, however, we’d like to emphasize once more the importance of quality control: it is your responsibility to ensure that this machine translation software is used appropriately, including the use of careful post-editing (see Quality Issues).

Docker

Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications. Using Docker it will be easy to install and run Moses-SMT without adversely affecting any of your other installations.

We have uploaded our Moses-SMT image to docker.com’s central registry.

You will need a version of Docker more recent than 1.0.1 on your Linux system. We usually use Ubuntu. Here is a video on YouTube that explains how you can install Docker 1.3 on Ubuntu 14.04. If you would like to run your translation engine on a Windows computer or on Mac OS X, then you may be able to use Boot2Docker.

So, in Linux, the two commands are:

Command 1 : Installing Moses-SMT (with Docker)

$ docker pull techiaith/moses-smt

This will download and install the machine translation infrastructure into your Docker system.

When it has finished downloading, type ‘docker images’ to check that it’s been installed.

$ docker images
REPOSITORY                                        TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
techiaith/moses-smt                               latest              3dbad7f9aabf        41 hours ago        3.333 GB
$
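If you would rather script this check, the listing can be searched for the image name. This is only a sketch under stated assumptions: `image_listed` is a hypothetical helper, and it assumes the column layout shown above, with the repository in the first column and the tag in the second.

```python
def image_listed(docker_images_output, repository, tag="latest"):
    """Return True if repository and tag appear together on a row of the
    text printed by `docker images`."""
    lines = docker_images_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2 and fields[0] == repository and fields[1] == tag:
            return True
    return False
```

The text to pass in could be captured with `subprocess.run(["docker", "images"], capture_output=True, text=True).stdout`, and then checked with `image_listed(output, "techiaith/moses-smt")`.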

Command 2 : Start a Translation Engine of Your Choice

The Language Technologies Unit has created translation engines by training them with data collected from open and public sources, such as the Proceedings of the Welsh Assembly and the Legislation on-line.

These engines have specific names and translation directions. The name of the engine that was trained with Assembly data is ‘CofnodYCynulliad’, while the name of the engine trained with the Legislation on-line is ‘Deddfwriaeth’.

Here is the second command, with options set to select the ‘CofnodYCynulliad’ engine that translates from English to Welsh:

$ docker run --name moses-smt-cofnodycynulliad-en-cy -p 8080:8080 -p 8008:8008 techiaith/moses-smt start -e CofnodYCynulliad -s en -t cy
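The same command can be assembled for another engine or direction. The sketch below simply generalises the single example above: `docker_run_command` is a hypothetical helper, and the container naming pattern (`moses-smt-<engine in lower case>-<source>-<target>`) is inferred from that one example rather than documented. Which engines and directions are actually published is not something this sketch checks.

```python
def docker_run_command(engine, source, target, image="techiaith/moses-smt"):
    """Build the argument list for starting a named engine in Docker.

    The container name follows the pattern seen in the example command;
    treating it as a general convention is an assumption.
    """
    name = f"moses-smt-{engine.lower()}-{source}-{target}"
    return [
        "docker", "run",
        "--name", name,
        "-p", "8080:8080",
        "-p", "8008:8008",
        image,
        "start", "-e", engine, "-s", source, "-t", target,
    ]
```

Joining the result of `docker_run_command("CofnodYCynulliad", "en", "cy")` with spaces reproduces the command shown above, and the list can be passed directly to `subprocess.run`.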

The system will initially download a file (around 3 GB in the case of CofnodYCynulliad) before confirming that it is ready to start translating.

If you open your browser and go to http://127.0.0.1:8008, a simple form should appear so that you can check whether or not the engine works as intended.
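If no browser is available on the machine, a short script can at least confirm that something is answering at that address. This is a minimal sketch: `form_is_up` is a hypothetical helper, and it only checks that the page responds with HTTP status 200, not that translations are correct.

```python
import urllib.error
import urllib.request


def form_is_up(url="http://127.0.0.1:8008", timeout=5):
    """Return True if the URL answers with HTTP status 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False
```

Since the engine downloads a large file on first start, `form_is_up()` may return `False` for some time before the form becomes available.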

Training Data

The data collected by the Language Technologies Unit, which was used to train our Moses-SMT machines, is available below:

Machine Translation for Welsh – a new strategy

A revolution is afoot in the translation world. Translators are becoming post-editors as the machine translates the first draft, and a human editor then refines it with post-editing. Only literature and highly sensitive texts will avoid this fate. The trend is being led by the need to translate large volumes of text, quickly and at a reasonable price. The Language Technologies Unit has been preparing for this brave new world by developing Welsh<>English machine translation software, which will soon be made available through the Language Technologies Portal. Using this resource, anyone will be able to do the following:

  • own and support their own Welsh<>English translation system
  • use their own corpus to create and adapt specialized translation machines

Although there are machine translation tools already available for Welsh and English through companies like Google and Microsoft, there are inherent disadvantages in their use. By sharing our machine translation resources, freely and fairly, we hope to provide support to the Welsh translation industry, develop a community of machine translation practitioners, and avoid dependency on large external institutions.

We discussed our ideas on machine translation at the TILT conference in Bangor in June of last year. Here are the slides:

 

Quality Issues

Translations generated by automatic means are not yet perfect for any language pair. Laughable or embarrassing mistranslations made by machines have made headline news in the past, and may also lead to miscarriages of justice. This gives organisations that use machine translation to save costs a bad reputation. However, machine translation accompanied by human post-editing is acceptable, and can be incorporated within the translation memory workflow. It is your responsibility to ensure that this machine translation software is used in the appropriate manner described here, including appropriate and meaningful post-editing, which avoids tarnishing the image of the translation industry and the Welsh language.

A comprehensive advice note can be found here:

http://techiaith.bangor.ac.uk/index.php/advice-note/?lang=en