Towards a Welsh ‘Siri’…

It is increasingly possible for you to speak with devices such as your phone or computer in order to command and control applications and devices as well as to receive intelligent and relevant answers to questions voiced in natural language.

Such capabilities are possible as a consequence of recent advancements in speech recognition, machine translation and natural language processing and understanding. As such they are the prime enablers for a disruptive change and a fundamental shift in how users and consumers engage with their devices and how they more widely use technology.

Seen in its wider historical context, this is simply the next step in the evolution of human-computer interaction: from keyboard, to mouse, to touch, to voice and language.

There are four main commercial platforms driving this change, namely Siri, Ok Google, Microsoft Cortana and Amazon Alexa, as well as some lesser known open platforms.

 

 

To date, these provide their powerful capabilities in English and some other major languages, with little evidence that they are likely to extend their choice of languages to the ‘long tail’ of smaller languages, including Welsh, in the near future.

The Language Technologies Unit has therefore been sponsored by the Welsh Government, through its Welsh Language Technology and Digital Media Fund, and by S4C to carry out the ‘Welsh Language Communications Infrastructure’ project, ensuring that users whose preferred language is Welsh are not left behind by these developments.

Our first deliverable as part of the project is a brief report on how we can achieve this. It concludes that the commercial offerings by the large companies do not provide any technical means at the moment for realising a Welsh language digital assistant. Thus only open alternatives such as finer grained online APIs and various open source software allow us to progress.

It is hoped that the project will lay the foundations for a range of Welsh language technologies to be used in such environments. This includes improving the work done to date on Welsh language speech recognition, as well as on machine translation for leveraging some of the capabilities provided by English language based technologies.

All of the software and resources developed by the project will be available here from the Welsh National Language Technologies Portal. The project will stimulate the development of new Welsh language software and services that could contribute to the mainstreaming of Welsh in the next phase of human-computer interaction.

In the meantime, we need your help! Please contribute your voice to our speech corpus via our Paldaruo app, available on iTunes and Google Play.

Raspberry Pi Project: Move a robot arm with your voice

At recent Eisteddfodau and Hacio’r Iaith events, we have been demonstrating our robot arms, which are connected to Raspberry Pis and respond to instructions in Welsh.

Here is a video of three arms together:

It is a very simple speech recognition system, and now, for those feeling adventurous, here are instructions on how you too can install the demo on your own Raspberry Pi.

You will need the following equipment:

If you are using an older Raspberry Pi with only two USB ports, you will need a USB hub, such as http://www.modmypi.com/raspberry-pi/accessories/usb-hubs/pihub-official-4-port-raspberry-pi-usb-hub-eu-plug-5v-3a, in order to connect everything.

The demo uses an open source speech recognition engine called ‘Julius’. It also uses acoustic models that we produced from recordings of 20 individuals reading out special prompts.

Type the following at a command line on your Raspberry Pi to install the ‘Julius’ system:

$ sudo apt-get update
$ sudo apt-get install alsa-tools alsa-oss flex zlib1g-dev libc-bin libc-dev-bin python-pexpect libasound2 libasound2-dev cvs
$ cvs -z3 -d:pserver:anonymous@cvs.sourceforge.jp:/cvsroot/julius co julius4
$ cd julius4
$ export CFLAGS="-O2 -mcpu=arm1176jzf-s -mfpu=vfp -mfloat-abi=hard -pipe -fomit-frame-pointer"
$ ./configure --with-mictype=alsa
$ sudo make
$ sudo make install
$ export ALSADEV="plughw:1,0"
$ julius

If the last line causes the following to appear, then you have installed Julius successfully!

Julius rev.4.3.1 - based on
JuliusLib rev.4.3.1 (fast) built for x86_64-unknown-linux-gnu

Copyright (c) 1991-2013 Kawahara Lab., Kyoto University
Copyright (c) 1997-2000 Information-technology Promotion Agency, Japan
Copyright (c) 2000-2005 Shikano Lab., Nara Institute of Science and Technology
Copyright (c) 2005-2013 Julius project team, Nagoya Institute of Technology

Try '-setting' for built-in engine configuration.
Try '-help' for run time options.

Next, you need to download our robot arm speech recognition files from the Language Technologies Portal for use with Julius.

$ mkdir robot
$ cd robot
$ wget http://techiaith.cymru/gallu/braichrobot.tar.gz
$ tar -zxvf braichrobot.tar.gz

Then, to get the Raspberry Pi and the robot arm to respond to spoken commands, type:

$ cd braichrobot
$ sudo python robotarm_voicectl.py

The word ‘siaradwch’ (‘speak’) should appear. Here is what you can now say to the arm:

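ysgwydd i fyny (shoulder up)
ysgwydd i lawr (shoulder down)
penelin i fyny (elbow up)
penelin i lawr (elbow down)
arddwrn i fyny (wrist up)
arddwrn i lawr (wrist down)
gafael agor (grip open)
gafael cau (grip close)
troi i’r chwith (turn left)
troi i’r dde (turn right)
golau ymlaen (light on)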
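
The phrases above follow a small ‘part + action’ pattern. As a rough illustration only, here is how recognised text could be mapped to an arm action. This is not the code in robotarm_voicectl.py: the names COMMANDS and parse_command, and the English labels, are our own assumptions for the sake of the example.

```python
# Illustrative sketch only: COMMANDS and parse_command are hypothetical names,
# not taken from robotarm_voicectl.py. The English labels are our own glosses.
COMMANDS = {
    "ysgwydd i fyny": ("shoulder", "up"),
    "ysgwydd i lawr": ("shoulder", "down"),
    "penelin i fyny": ("elbow", "up"),
    "penelin i lawr": ("elbow", "down"),
    "arddwrn i fyny": ("wrist", "up"),
    "arddwrn i lawr": ("wrist", "down"),
    "gafael agor": ("grip", "open"),
    "gafael cau": ("grip", "close"),
    "troi i'r chwith": ("turn", "left"),
    "troi i'r dde": ("turn", "right"),
    "golau ymlaen": ("light", "on"),
}


def parse_command(text):
    """Return (part, action) for a recognised phrase, or None if unknown."""
    # Normalise case, repeated spaces and typographic apostrophes before lookup.
    key = " ".join(text.lower().replace("\u2019", "'").split())
    return COMMANDS.get(key)
```

Returning None for anything outside the table means that a misrecognised phrase is simply ignored rather than moving the arm.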

We hope this little project will be fun, especially for the pupils of Ysgol Pont y Gof, Botwnnog, who won one of our robot arms in a coding competition at Coleg Meirion Dwyfor in Pwllheli during the summer.

In the meantime, thanks to sponsorship from the Welsh Government and S4C, we are continuing to develop Welsh speech recognition and to offer it free of charge within the Language Technologies Portal. Our aim is to develop more sophisticated and more useful systems.

But we need your help! Contribute your voice through our Paldaruo app, available on iTunes and Google Play.

More Welsh text-to-speech resources on GitHub

Since its launch in March, a few coders and companies have been using the cloud based Welsh language text-to-speech API service.

Very often, however, developers, particularly those working for companies, want Welsh language text-to-speech that works offline and in Microsoft Windows environments. From time to time we also receive e-mails from text-to-speech developers working on other lesser-resourced languages, asking for help with using their own voices in Microsoft Windows.

Our Welsh language text-to-speech voice is possible thanks to the superb Festival Speech Synthesis System. However, Festival, as its developers openly admit, does not support Microsoft Windows very well at all.

We think that Festival and its Welsh voice should be usable in Microsoft Windows. We have therefore published on GitHub the speech data that makes Festival talk Welsh and, as a side project, hacked together a Visual Studio solution that makes Festival run natively on Windows with a very basic COM and .NET interface.

The voice data can be found here: https://github.com/PorthTechnolegauIaith/llais_festival

Our attempt to get our Welsh text-to-speech voice running on Windows, and our contribution to improving Festival on Microsoft Windows, can be found here: https://github.com/techiaith/Festival_Windows

Without these resources there are very few, if any, options for Welsh or any Festival voice to be usable on Windows. We hope that these contributions are of great help and can be improved upon with the aid of Welsh language and international open source communities.

Coding a Welsh language robot

As part of our mission to promote the acquisition of computing skills amongst Welsh speakers, the Language Technologies Unit has been developing a series of computer science lessons aimed at primary school children.

The basis of these resources is the Raspberry Pi Foundation’s collection of Turing Test lessons. The resources were originally created in English and placed on the foundation’s website under an open license, allowing for free distribution and sharing.

Our contribution has been to translate the whole course into Welsh and to place it on GitHub, so that it is available for the public to use or adapt for any purpose they wish. We’ve also created a brand new lesson for the course that is specifically geared towards Welsh-speaking children. This special lesson introduces children to some of the resources of the Language Technologies Portal, including Welsh language text-to-speech, Cysill Ar-lein (an online spelling and grammar checker for Welsh), language detection and part-of-speech tagging, all in a fun and easy format.

Children from Garndolbenmaen Primary school enjoying their coding lesson with Dewi Bryn Jones, Patrick Robertson and Rapiro the Robot.

The lesson was trialled by Dewi Bryn Jones and Patrick Robertson at Garndolbenmaen Primary School in March, and was considered a resounding success. See this previous blog post for a video created by the children to learn more about the day’s events.

All of the resources are available on GitHub under an open license here. These include the three original lessons that were translated, the special lesson on adding Welsh features to the robot and also instructions for setting up for teachers and students.

Here is the lesson structure:


And you can find the special Welsh lesson here:

Moses SMT update

When we released our machine translation resources earlier this month, we were using the first version of Moses, version 1.0. We have now updated the script to the latest version: Moses 3.0.

Everything is available either from GitHub at http://github.com/PorthTechnolegauIaith/moses-smt or from the Docker registry at https://registry.hub.docker.com/u/techiaith/moses-smt/.

Moses 3.0 offers a number of improvements for translators. According to the release notes (which you can read here), these include features that speed up decoding, free up memory, and make Moses more effective at matching sentences correctly.

We will also be taking advantage of this update in order to improve the translation engine CofnodYCynulliad (which we’ve previously talked about here) with additional data which we will be collecting from the Welsh Assembly.

Additionally, we plan to create a new domain specific engine for translating software, with the help of data provided by Rhoslyn Prys from meddal.com.

These are good examples of the iterative nature of translation engines, where it’s possible to keep adding data in order to develop and improve them continuously. Keep an eye out for more developments on this soon.

Machine Translation on Mac OS X

Since we’ve already released our machine translation system in Docker, it’s easy enough to get it running on an OS X system!

First, you will need to install one or two pieces of software on your computer. This tutorial uses Homebrew to install the packages.
(You can look again at the original tutorial if you like.)

Installing VirtualBox

  • Docker needs VirtualBox on OS X (and Windows) to run the Linux virtual machine. Download VirtualBox from the VirtualBox website.

Installing boot2docker and docker

We will use Homebrew to install these. Open Terminal and type the following commands:

  • ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

    This will install Homebrew on your computer.

  • Next, install boot2docker and docker with the following commands:
    brew install boot2docker
    brew install docker

     

  • Initialise boot2docker (this downloads the virtual machine image) like this:
    boot2docker init

     

Increasing VirtualBox’s disk space

VirtualBox’s virtual disk will be created with a size limit of 20GB. The machine translation system (Moses SMT), including the language model file, needs more disk space than this, so the disk size will obviously need to be increased. This is unfortunately quite a long process, but the good news is that Docker have written a very simple tutorial on how to do it!

We recommend that you increase the disk size to 30GB (although the machine translation system only needs around 21GB).

Downloading and installing the translation system

Once you’ve increased the disk size in VirtualBox, you will need to start the boot2docker engine. Go back to Terminal, and write:

boot2docker up

Make a note of what is printed on the screen at the end of this command. This is important because you will need it to communicate with Docker. It should look something like this:

Writing /Users/patrick/.boot2docker/certs/boot2docker-vm/ca.pem
Writing /Users/patrick/.boot2docker/certs/boot2docker-vm/cert.pem
Writing /Users/patrick/.boot2docker/certs/boot2docker-vm/key.pem
    export DOCKER_CERT_PATH=/Users/patrick/.boot2docker/certs/boot2docker-vm
    export DOCKER_TLS_VERIFY=1
    export DOCKER_HOST=tcp://192.168.59.103:2376

The last three lines are particularly important. Copy them, and then paste them into your Terminal window so that you can run the export commands.
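
If you would rather not copy the lines by hand, the same information can be pulled out of the saved output with a few lines of Python. This is only a sketch under our own assumptions: parse_exports is a hypothetical helper name, and it presumes the output has the ‘export NAME=value’ shape shown above.

```python
def parse_exports(output):
    """Collect NAME=value pairs from 'export NAME=value' lines."""
    env = {}
    for line in output.splitlines():
        line = line.strip()
        # Ignore the 'Writing ...' lines; keep only the export statements.
        if line.startswith("export ") and "=" in line:
            name, value = line[len("export "):].split("=", 1)
            env[name.strip()] = value.strip()
    return env


# Sample text copied from the boot2docker output shown above.
SAMPLE = """
Writing /Users/patrick/.boot2docker/certs/boot2docker-vm/key.pem
    export DOCKER_CERT_PATH=/Users/patrick/.boot2docker/certs/boot2docker-vm
    export DOCKER_TLS_VERIFY=1
    export DOCKER_HOST=tcp://192.168.59.103:2376
"""

docker_env = parse_exports(SAMPLE)
```

The resulting dictionary could then be merged into os.environ before calling docker from a script.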

Docker is ready

Now, after all this work, Docker should be ready!
Download the machine translation image using the following command:

docker pull techiaith/moses-smt

And then start the engine with:

docker run --name moses-smt-cofnodycynulliad-en-cy -p 8008:8008 -p 8080:8080 techiaith/moses-smt start -e CofnodYCynulliad -s en -t cy

Note: this command downloads a translation model which is based on the Proceedings of the National Assembly for Wales corpus. You can change the name ‘CofnodYCynulliad’ after the ‘start’ command to any one of the three below:

  • CofnodYCynulliad (en-cy and cy-en) – two large models which are based on the Proceedings of the National Assembly for Wales. One is specifically for translation from English to Welsh (en-cy), and the other is for translation from Welsh to English (cy-en). Size: ~3.7GB each.
  • CofnodBachYCynulliad – a much smaller model of the proceedings corpus which is based on a sub-set of the data (we recommend this if you just want to experiment quickly). Size: ~65MB
  • Deddfwriaeth – this engine was trained with data from the Legislation corpus. Size: ~900MB
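
Since only the engine name and the language direction change between these options, the ‘docker run’ command can be assembled programmatically. The sketch below is our own illustration: docker_run_args is a hypothetical helper, and the container naming pattern is simply inferred from the single example command above.

```python
def docker_run_args(engine, source, target, image="techiaith/moses-smt"):
    """Build the argument list for starting one translation engine."""
    # Assumption: the container name is the lower-cased engine name plus the
    # language pair, mirroring 'moses-smt-cofnodycynulliad-en-cy'.
    name = "moses-smt-{}-{}-{}".format(engine.lower(), source, target)
    return [
        "docker", "run", "--name", name,
        "-p", "8008:8008", "-p", "8080:8080",
        image, "start", "-e", engine, "-s", source, "-t", target,
    ]


# Example: the small experimental engine, English to Welsh.
print(" ".join(docker_run_args("CofnodBachYCynulliad", "en", "cy")))
```

The list form can be passed directly to subprocess.run, which avoids any shell quoting issues.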

These three language models are also available for download from techiaith.org. See http://techiaith.org/moses/

It’s also important to note that you can use your own language model for this step (if you’ve already trained one)! Remember that the data we provide is a basis only, and it’s fairly simple to train your own language model. See the docs for more information on how to do this here.

See Moses working

The final ‘docker run’ command creates a server on your local computer on port 8008. To connect to this port, you will need to open ports in VirtualBox. Open the ‘VirtualBox.app’ program (in your ‘Applications’ folder), click on ‘Settings’ and then on the ‘Network’ tab. There is a button at the bottom of the screen called ‘Port Forwarding’. Add forwarding rules there for the ports published by the ‘docker run’ command (8008 and 8080).

That’s it!

Go to http://127.0.0.1:8008 in your browser and start translating!
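
If the page does not load, it can help to check whether anything is listening on the forwarded port at all. Here is a minimal sketch using only the Python standard library; port_is_open is our own hypothetical helper, and a successful connection only shows that the port is reachable, not that the translation engine has finished loading.

```python
import socket


def port_is_open(host="127.0.0.1", port=8008, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("Port 8008 reachable:", port_is_open())
```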


Thanks!!!

We would like to thank everyone who attended the Through Technological Means conference, and all those who gave presentations and contributed their time and energy towards making it a great day.

But most of all, we’d like to pass our special thanks on to the children of Garndolbenmaen primary school. They came to talk about their experiences using our synthetic voice resources in recent lessons they received on coding with the Raspberry Pi, which were provided by the Unit. They had prepared a video for the conference, but unfortunately there were technical problems when it was played. So now at last (and with apologies for those difficulties), here is the full video that was made by the children of Garndolbenmaen primary school:

The children described to the audience their experience during the lessons, where they were taught core coding skills using the Language Technologies Unit’s Welsh-medium Turing Test resources. The children also had the opportunity to meet one very special guest – the Vice-chancellor of Bangor University!


The children explained to the Vice-chancellor, Professor John Hughes, that they had thoroughly enjoyed working on the project, and that they had learnt a variety of very useful skills. One or two even said that they would like to be professional coders in the future! The children were also able to meet some of the guest speakers who had travelled from far and wide to attend the conference. Below, from left to right, are John Judge from Ireland, Dwayne Bailey from South Africa (who is currently working in London) and Kepa Sarasola from the Basque Country.


Here are the children meeting the guest speakers, as well as those members of the Language Technologies Unit who worked on the Language Technologies Portal project, not forgetting Rapiro, the little robot who speaks Welsh:


The children also shared their story with Radio Cymru:

Post Cyntaf: http://www.bbc.co.uk/programmes/b053hsb6 – at 1:16:25.

And with the BBC News programme on S4C:

http://www.bbc.co.uk/cymrufyw/31833000

There were also many positive comments on Twitter.

 

 

Welsh language synthetic voice API

Text to speech technologies are now commonly used in mobile apps, websites and desktop applications to improve user experience and understanding. Today we are pleased to launch an API service that will make it possible for anybody to insert Welsh text to speech technologies into their websites and software.

Using the open source Festival Speech Synthesis System and a Welsh language speech model we created previously, our new web API makes it easy to convert any Welsh text into audio automatically and in real time. This cloud service needs no setup on the user’s side, making it instantly and widely accessible to all.

Below, you can find an example of how this voice could be inserted into this page, with only one line of code!

You can get started with the API today by signing up to our API Centre and creating your API key.
To learn more see our Speech Technologies pages.

Creating domain-specific Translation Engines

Many translators assume that their translation infrastructure contains only one translation engine. Some translators, however, use many engines: domain-specific translation engines.

Domain-specific translation engines are engines created in order to translate for particular topics, styles or registers. For many translators, domain-specific engines offer superior translation compared to normal machine translation systems.

Domain-specific engines are particularly effective in situations where translation memories are already being used successfully to save time and money. If you use domain-specific translation memories, a translation engine can follow the same division and, combined with a post-editing routine, increase the efficiency and productivity of translation beyond what normal translation memory systems achieve.

Today we are releasing resources in the Language Technologies Portal and on GitHub which allow you to create, using Moses-SMT, your own domain-specific translation engines.

Be advised: you will need a computer running Linux, such as Ubuntu, with at least 4GB of RAM and a significant amount of parallel Welsh-English text. Our method produces domain-specific translation engines which don’t need much memory to run, but do take up some gigabytes of space on your hard disk.

To get started, you will need to install Moses-SMT using the instructions on the following page: Installing Moses-SMT on Linux. The installation scripts include adaptations we’ve made that make it easier for you to train Moses-SMT with your own parallel Welsh-English text.

The page Create Moses-SMT engines provides detailed instructions on how to get started. But in short, if you had a parallel text taken from your own work translating marketing (or ‘marchnata’ in Welsh) documents, you would need to do the following.

First, place the Welsh text in a file named ‘Marchnata.cy’ and the English text in ‘Marchnata.en’, and then keep these files in the ‘corpus’ sub-folder inside the folder of your ‘Marchnata’ engine, like this:

moses@ubuntu:~/moses-smt$ cd ~/moses-models/Marchnata/corpus
moses@ubuntu:~/moses-models/Marchnata/corpus$ ls
Marchnata.cy  Marchnata.en
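
Moses trains on sentence-aligned text, so line N of the Welsh file should be the translation of line N of the English file. Before starting a training run that may take hours, it can be worth checking that the two files at least have the same number of lines. The following is a minimal sketch of such a check; check_parallel is our own hypothetical helper and is not part of moses.py.

```python
import os


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(corpus_dir, engine, source="en", target="cy"):
    """Return the shared line count, or raise ValueError on a mismatch."""
    counts = {}
    for lang in (source, target):
        # Files are expected to be named <engine>.<lang>, e.g. Marchnata.cy
        path = os.path.join(corpus_dir, "{}.{}".format(engine, lang))
        counts[lang] = count_lines(path)
    if counts[source] != counts[target]:
        raise ValueError(
            "Line counts differ: {} has {}, {} has {}".format(
                source, counts[source], target, counts[target]
            )
        )
    return counts[source]
```

For the example above, this could be called as check_parallel(os.path.expanduser("~/moses-models/Marchnata/corpus"), "Marchnata"). Equal line counts do not prove that the sentences are correctly aligned, but a mismatch is a sure sign that something is wrong.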

The data is now ready to be trained. You will only need a single command, noting the name and the direction of the translation (i.e. Welsh to English, or English to Welsh).

So, if you’d like to create an engine for the marketing data that translates from English to Welsh, you would type the following command:

moses@ubuntu:~/moses-smt$ python moses.py train -e Marchnata -s en -t cy

This will cause a lot of data to appear on the screen. The command, depending on the size of your original dataset, will probably take hours to complete. There is no need to follow the progress reports particularly closely, but you will need to keep an eye out for any serious error messages to check whether or not the training has succeeded.

If it is successful, follow the prompt to edit and change the files of your new engine.

To start the new engine, you will need the following command:

moses@ubuntu:~/moses-smt$ python moses.py start -e Marchnata -s en -t cy

Moses and the Two Commands

Moses-SMT is an open source machine translation system that was mainly developed at Edinburgh University. This resource allows you to develop your own machine translation engines for use in your translation projects by training it with any pre-existing corpora of parallel texts.

We at the Language Technologies Unit have used Moses-SMT in order to provide machine translation in our commercial offering CyfieithuCymru, which enables and supports efficient Welsh<>English translation within institutions.

Today we are releasing these Moses-SMT translation systems to you, as well as the data which was used to train them.

We are making our machine translation engines freely available because we believe that it’s vital for Welsh translators to be able to own and develop their own machine translation infrastructure, and to master these new disruptive technologies to full effect. This ambition was explained in our previous blog post.

In order to make the package as easy as possible to use, we’ve developed a simple system which only requires two commands to operate (providing that the necessary operating system and equipment are already installed!).

Before you go ahead, however, we’d like to emphasize once more the importance of quality control. It is your responsibility to ensure that this machine translation software is used appropriately, including the use of careful post-editing (see Quality Issues).

Docker

Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications. Using Docker, it is easy to install and run Moses-SMT without adversely affecting any of your other installations.

We have uploaded our Moses-SMT image to docker.com’s central registry.

You will need a version of Docker more recent than 1.0.1 on your Linux system. We usually use Ubuntu. Here is a video on YouTube that explains how you can install Docker 1.3 on Ubuntu 14.04. If you would like to run your translation engine on a Windows computer or on Mac OS X, then you may be able to use Boot2Docker.

So, in Linux, the two commands are:

Command 1 : Installing Moses-SMT (with Docker)

$ docker pull techiaith/moses-smt

This will download and install the machine translation infrastructure into your Docker system.

When it has finished downloading, type ‘docker images’ to check that it’s been installed.

$ docker images
REPOSITORY                                        TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
techiaith/moses-smt                               latest              3dbad7f9aabf        41 hours ago        3.333 GB
$

Command 2 : Start a Translation Engine of Your Choice

The Language Technologies Unit has created translation engines by training them with data collected from open and public sources, such as the Proceedings of the Welsh Assembly and the Legislation on-line.

These engines have specific names and translation directions. The name of the engine that was trained with Assembly data is ‘CofnodYCynulliad’, while the name of the engine trained with the Legislation on-line is ‘Deddfwriaeth’.

Here is the second command, with options set to select the ‘CofnodYCynulliad’ engine that translates from English to Welsh :

$ docker run --name moses-smt-cofnodycynulliad-en-cy -p 8080:8080 -p 8008:8008 techiaith/moses-smt start -e CofnodYCynulliad -s en -t cy

The system will initially download a file (around 3GB in the case of CofnodYCynulliad) before confirming that it is ready to start translating.

If you open your browser and go to http://127.0.0.1:8008, a simple form should appear so that you can check whether or not the engine works as intended.


Training Data

The data collected by the Language Technologies Unit, which was used to train our Moses-SMT machines, is available below: