Author Archives: techiaith

Towards a Welsh ‘Siri’…..

It is increasingly possible for you to speak with devices such as your phone or computer in order to command and control applications and devices as well as to receive intelligent and relevant answers to questions voiced in natural language.

Such capabilities are possible as a consequence of recent advancements in speech recognition, machine translation and natural language processing and understanding. As such they are the prime enablers for a disruptive change and a fundamental shift in how users and consumers engage with their devices and how they more widely use technology.

Seen in its wider historical context, this is only the next step in the evolution of human-computer interaction: from keyboard, to mouse, to touch, to voice and language.

There are four main commercial platforms driving this change, namely Siri, Ok Google, Microsoft Cortana and Amazon Alexa, as well as some lesser known open platforms.

 

 

To date, these provide their powerful capabilities in English and some other major languages, with little evidence that they are likely to extend their choice of languages to the ‘long tail’ of smaller languages, including Welsh, in the near future.

The Language Technologies Unit has therefore been funded by the Welsh Government, through its Welsh Language Technology and Digital Media Fund, and by S4C to carry out the ‘Welsh Language Communications Infrastructure’ project, ensuring that users whose preferred language is Welsh are not left behind by such developments.

Our first deliverable as part of the project is a brief report on how we can achieve this. It concludes that the commercial offerings by the large companies do not provide any technical means at the moment for realising a Welsh language digital assistant. Thus only open alternatives such as finer grained online APIs and various open source software allow us to progress.

It is hoped that the project will lay the foundations for a range of Welsh language technologies to be used in such environments. This includes improving the work done to date on Welsh language speech recognition, as well as on machine translation for leveraging some of the capabilities provided by English language based technologies.

All of the software and resources developed by the project will be available here from the Welsh National Language Technologies Portal. The project will stimulate the development of new Welsh language software and services that could contribute to the mainstreaming of Welsh in the next phase of human-computer interaction.

In the meantime, we need your help! Please contribute your voice to our speech corpus via our Paldaruo app:

paldaruo

iTunes Google Play

Raspberry Pi project: Move a robot arm with your voice

At recent Eisteddfodau and Hacio’r Iaith events, we have been demonstrating our robot arms, which are connected to Raspberry Pis and respond to instructions in Welsh.

Here is a video of three arms together:

It is a very simple speech recognition system and now, for those feeling adventurous, here are instructions on how you can install the demo on your own Raspberry Pi.

You will need the following equipment:

If you are using an older Raspberry Pi with only two USB ports, you will also need a USB hub, such as http://www.modmypi.com/raspberry-pi/accessories/usb-hubs/pihub-official-4-port-raspberry-pi-usb-hub-eu-plug-5v-3a, in order to connect everything.

The demo uses an open source speech recognition engine called ‘Julius’. It also uses acoustic models that we produced from recordings of 20 individuals reading special prompts.

Type the following at the command line on your Raspberry Pi to install Julius:

$ sudo apt-get update
$ sudo apt-get install alsa-tools alsa-oss flex zlib1g-dev libc-bin libc-dev-bin python-pexpect libasound2 libasound2-dev cvs
$ cvs -z3 -d:pserver:anonymous@cvs.sourceforge.jp:/cvsroot/julius co julius4
$ cd julius4
$ export CFLAGS="-O2 -mcpu=arm1176jzf-s -mfpu=vfp -mfloat-abi=hard -pipe -fomit-frame-pointer"
$ ./configure --with-mictype=alsa
$ sudo make
$ sudo make install
$ export ALSADEV="plughw:1,0"
$ julius

If the last line causes the following to appear, then you have installed Julius successfully!

Julius rev.4.3.1 - based on
JuliusLib rev.4.3.1 (fast) built for x86_64-unknown-linux-gnu

Copyright (c) 1991-2013 Kawahara Lab., Kyoto University
Copyright (c) 1997-2000 Information-technology Promotion Agency, Japan
Copyright (c) 2000-2005 Shikano Lab., Nara Institute of Science and Technology
Copyright (c) 2005-2013 Julius project team, Nagoya Institute of Technology

Try '-setting' for built-in engine configuration.
Try '-help' for run time options.
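If you would rather check the result from a script than by eye, one option is to look for the version banner in the captured output. This is only a minimal sketch that assumes the banner format shown above; `parse_julius_version` is a hypothetical helper name and not part of Julius.

```python
import re

# Banner text copied from the sample output above.
SAMPLE_BANNER = """Julius rev.4.3.1 - based on
JuliusLib rev.4.3.1 (fast) built for x86_64-unknown-linux-gnu
"""


def parse_julius_version(banner):
    """Return the Julius version found in the banner text, or None."""
    # Match only the line that starts with "Julius rev.", not "JuliusLib rev.".
    match = re.search(r"^Julius rev\.(\d+(?:\.\d+)*)", banner, re.MULTILINE)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(parse_julius_version(SAMPLE_BANNER))
```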

Next, you need to download our robot arm speech recognition files from the Language Technologies Portal for use with Julius.

$ mkdir robot
$ cd robot
$ wget http://techiaith.cymru/gallu/braichrobot.tar.gz
$ tar -zxvf braichrobot.tar.gz

Then, to get the Raspberry Pi and the robot arm to respond to spoken commands, type:

$ cd braichrobot
$ sudo python robotarm_voicectl.py

The word ‘siaradwch’ (‘speak’) should appear. Here is what you can now say to the arm:

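ysgwydd i fyny (shoulder up)
ysgwydd i lawr (shoulder down)
penelin i fyny (elbow up)
penelin i lawr (elbow down)
arddwrn i fyny (wrist up)
arddwrn i lawr (wrist down)
gafael agor (grip open)
gafael cau (grip close)
troi i’r chwith (turn left)
troi i’r dde (turn right)
golau ymlaen (light on)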
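To give a feel for how recognised phrases might be turned into arm movements, here is a rough sketch of a lookup table built from the commands listed above. It is not taken from robotarm_voicectl.py, which may be organised quite differently; the `COMMANDS` table, the English part and action names (including treating ‘troi’ as a base rotation) and the `lookup_command` helper are all assumptions made for illustration.

```python
# Hypothetical mapping from the spoken Welsh commands to (part, action) pairs.
# The real robotarm_voicectl.py may use a different structure and names.
COMMANDS = {
    "ysgwydd i fyny": ("shoulder", "up"),
    "ysgwydd i lawr": ("shoulder", "down"),
    "penelin i fyny": ("elbow", "up"),
    "penelin i lawr": ("elbow", "down"),
    "arddwrn i fyny": ("wrist", "up"),
    "arddwrn i lawr": ("wrist", "down"),
    "gafael agor": ("grip", "open"),
    "gafael cau": ("grip", "close"),
    "troi i'r chwith": ("base", "left"),
    "troi i'r dde": ("base", "right"),
    "golau ymlaen": ("light", "on"),
}


def normalise(text):
    """Lower-case, unify apostrophes and collapse runs of whitespace."""
    return " ".join(text.replace("\u2019", "'").lower().split())


def lookup_command(recognised_text):
    """Return the (part, action) pair for a recognised phrase, or None."""
    return COMMANDS.get(normalise(recognised_text))


if __name__ == "__main__":
    print(lookup_command("penelin i fyny"))
```

Returning `None` for anything outside the table means an unrecognised phrase can simply be ignored rather than moving the arm.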

We hope this little project will be fun, especially for the pupils of Ysgol Pont y Gof, Botwnnog, who won one of our robot arms in a coding competition at Coleg Meirion Dwyfor in Pwllheli during the summer.

In the meantime, thanks to funding from the Welsh Government and S4C, we are continuing to develop Welsh speech recognition and to offer it free of charge through the Language Technologies Portal. Our aim is to develop more sophisticated and more useful systems.

But we need your help! Contribute your voice through our Paldaruo app:

paldaruo

iTunes Google Play

More Welsh text-to-speech resources on GitHub

Since its launch in March, a few coders and companies have been using the cloud based Welsh language text-to-speech API service.

Very often, however, developers (those working for companies in particular) want Welsh language text-to-speech that is available offline and in Microsoft Windows based environments. From time to time we also receive e-mails from text-to-speech developers working on other lesser resourced languages, asking for help with using their own voices in Microsoft Windows.

Our Welsh language text-to-speech voice is possible thanks to the superb Festival Speech Synthesis System. However, Festival, as its developers openly admit, does not support Microsoft Windows very well at all.

We think that Festival and its Welsh voice should be usable in Microsoft Windows. We have therefore published on GitHub the speech data that makes Festival talk Welsh and, as a side project, created a Visual Studio solution that makes Festival run natively on Windows with a very basic COM and .NET interface.

The voice data can be found here: https://github.com/PorthTechnolegauIaith/llais_festival

Our attempt to get our Welsh text-to-speech voice running on Windows, along with our contribution to improving Festival on Microsoft Windows, can be found here: https://github.com/techiaith/Festival_Windows

Without these resources there are very few, if any, options for Welsh or any Festival voice to be usable on Windows. We hope that these contributions are of great help and can be improved upon with the aid of Welsh language and international open source communities.

Coding a Welsh language robot

As part of our mission to promote the acquisition of computing skills amongst Welsh speakers, the Language Technologies Unit has been developing a series of computer science lessons aimed at primary school children.

The basis of these resources is the Raspberry Pi Foundation’s collection of Turing Test lessons. The resources were originally created in English and placed on the foundation’s website under an open license, allowing for free distribution and sharing.

Our contribution has been to translate the whole course into Welsh, and to place it on GitHub, so that it can be made accessible to the public to use or adapt towards any purpose that they wish. We’ve also created a brand new lesson for the course that is specifically geared towards Welsh speaking children. This special lesson introduces children to some of the resources of the Language Resources Portal, including Welsh language text-to-speech, Cysill Ar-lein (online spelling and grammar checker for Welsh), language detection and parts of speech tagger, all in a fun and easy format.

Children from Garndolbenmaen Primary school enjoying their coding lesson with Dewi Bryn Jones, Patrick Robertson and Rapiro the Robot.

The lesson was trialled by Dewi Bryn Jones and Patrick Robertson at Garndolbenmaen Primary School in March, and was considered a resounding success. See this previous blog post for a video created by the children to learn more about the day’s events.

All of the resources are available on GitHub under an open license here. These include the three original lessons that were translated, the special lesson on adding Welsh features to the robot and also instructions for setting up for teachers and students.

Here is the lesson structure:

Lessons

And you can find the special Welsh lesson here:

Moses SMT update

When we released our machine translation resources earlier this month, we were using the first version of Moses, version 1.0. We have now updated the script to the latest version: Moses 3.0.

Everything is available from either GitHub at http://github.com/PorthTechnolegauIaith/moses-smt or from Docker.com https://registry.hub.docker.com/u/techiaith/moses-smt/.

Moses 3.0 offers a number of improvements for translators. According to the release notes (which you can read here) these updates include features which make the decoding process quicker, release more memory, and make Moses more effective in the process of matching sentences correctly.

We will also be taking advantage of this update in order to improve the translation engine CofnodYCynulliad (which we’ve previously talked about here) with additional data which we will be collecting from the Welsh Assembly.

Additionally, we plan to create a new domain specific engine for translating software, with the help of data provided by Rhoslyn Prys from meddal.com.

These are good examples of the iterative nature of translation engines, where it’s possible to keep adding data in order to develop and improve them continuously. Keep an eye out for more developments on this soon.

Welsh language synthetic voice API

Text to speech technologies are now commonly used in mobile apps, websites and desktop applications to improve user experience and understanding. Today we are pleased to launch an API service that will make it possible for anybody to insert Welsh text to speech technologies into their websites and software.

Using the open source Festival Speech Synthesis System and a Welsh language speech model that we created previously, our new web API makes it easy to convert any Welsh text into audio automatically in real time. This cloud service needs no setup on the user’s side, making it instantly accessible to all.

Below, you can find an example of how this voice could be inserted into this page, with only one line of code!

You can get started with the API today by signing up to our API Centre and creating your API key.
To learn more see our Speech Technologies pages.

Moses and the Two Commands

Moses-SMT is an open source machine translation system that was mainly developed at Edinburgh University. This resource allows you to develop your own machine translation engines for use in your translation projects by training it with any pre-existing corpora of parallel texts.

We at the Language Technologies Unit have used Moses-SMT in order to provide machine translation in our commercial offering CyfieithuCymru, which enables and supports efficient Welsh<>English translation within institutions.

Today we are releasing these Moses-SMT translation systems to you, as well as the data which was used to train them.

We are making our machine translation engines freely available because we believe that it’s vital for Welsh translators to be able to own and develop their own machine translation infrastructure, and to master these new disruptive technologies to full effect. This ambition was explained in our previous blog post.

In order to make the package as easy as possible to use, we’ve developed a simple system which only requires two commands to operate (providing that the necessary operating system and equipment are already installed!).

Before you go ahead, however, we’d like to emphasize once more the importance of quality control: it is your responsibility to ensure that this machine translation software is used appropriately, including the use of careful post-editing (see Quality Issues).

Docker

Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications. Using Docker, it is easy to install and run Moses-SMT without adversely affecting any of your other installations.

We have uploaded our Moses-SMT image to docker.com’s central registry.

You will need a version of Docker more recent than 1.0.1 on your Linux system. We usually use Ubuntu. Here is a video on YouTube that explains how you can install Docker 1.3 on Ubuntu 14.04. If you would like to run your translation engine on a Windows computer or on Mac OS X, you may be able to use Boot2Docker.

So, in Linux, the two commands are:

Command 1 : Installing Moses-SMT (with Docker)

$ docker pull techiaith/moses-smt

This will download and install the machine translation infrastructure into your Docker system.

When it has finished downloading, type ‘docker images’ to check that it’s been installed.

$ docker images
REPOSITORY                                        TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
techiaith/moses-smt                               latest              3dbad7f9aabf        41 hours ago        3.333 GB
$
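If you want to make the same check from a script, the listing can be parsed as plain text. The following is a minimal sketch that assumes the whitespace-separated column layout shown above; `image_is_installed` is a hypothetical helper, and the sample text is copied from the listing rather than read from a live Docker installation.

```python
# Sample text copied from the `docker images` listing above.
SAMPLE_OUTPUT = """\
REPOSITORY                                        TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
techiaith/moses-smt                               latest              3dbad7f9aabf        41 hours ago        3.333 GB
"""


def image_is_installed(docker_images_output, repository, tag="latest"):
    """Return True if repository:tag appears in `docker images` text output."""
    lines = docker_images_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        # The first two columns are REPOSITORY and TAG.
        if len(fields) >= 2 and fields[0] == repository and fields[1] == tag:
            return True
    return False


if __name__ == "__main__":
    print(image_is_installed(SAMPLE_OUTPUT, "techiaith/moses-smt"))
```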

Command 2 : Start a Translation Engine of Your Choice

The Language Technologies Unit has created translation engines by training them with data collected from open and public sources, such as the Proceedings of the Welsh Assembly and the Legislation on-line.

These engines have specific names and translation directions. The name of the engine that was trained with Assembly data is ‘CofnodYCynulliad’, while the name of the engine trained with the Legislation on-line is ‘Deddfwriaeth’.

Here is the second command, with options set to select the ‘CofnodYCynulliad’ engine that translates from English to Welsh :

$ docker run --name moses-smt-cofnodycynulliad-en-cy -p 8080:8080 -p 8008:8008 techiaith/moses-smt start -e CofnodYCynulliad -s en -t cy

The system will initially download a file (around 3 GB in the case of CofnodYCynulliad) before confirming that it is ready to start translating.
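Because only the engine name and the two language codes change between engines, the second command can be assembled programmatically. The sketch below is an illustration rather than part of the package: `build_run_command` is a hypothetical helper, and the container naming pattern is inferred from the single example above. Which directions are actually available for each engine is not stated here, so treat any other combination as an assumption to verify.

```python
def build_run_command(engine, source, target, image="techiaith/moses-smt"):
    """Build the `docker run` argument list for one engine and direction."""
    # Container name pattern inferred from the one example in the post.
    name = "moses-smt-{}-{}-{}".format(engine.lower(), source, target)
    return [
        "docker", "run",
        "--name", name,
        "-p", "8080:8080",
        "-p", "8008:8008",
        image,
        "start", "-e", engine, "-s", source, "-t", target,
    ]


if __name__ == "__main__":
    # Prints the same command line as the example above (without the prompt).
    print(" ".join(build_run_command("CofnodYCynulliad", "en", "cy")))
```

Note that two containers cannot bind the same host ports at once, so running a second engine alongside the first would need different values on the left-hand side of each `-p` mapping.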

If you open your browser and go to http://127.0.0.1:8008 , a simple form should appear so that you can check whether or not the engine works as intended:


Training Data

The data collected by the Language Technologies Unit, which was used to train our Moses-SMT machines, is available below:

Machine Translation for Welsh – a new strategy

A revolution is afoot in the translation world. Translators are becoming post-editors as the machine translates the first draft, and a human editor then refines it with post-editing. Only literature and highly sensitive texts will avoid this fate. The trend is being led by the need to translate large volumes of text, quickly and at a reasonable price. The Language Technologies Unit has been preparing for this brave new world by developing Welsh<>English machine translation software, which will soon be made available through the Language Technologies Portal. Using this resource, anyone will be able to do the following:

  • own and support their own Welsh<>English translation system
  • use their own corpus to create and adapt specialized translation machines

Although there are machine translation tools already available for Welsh and English through companies like Google and Microsoft, there are inherent disadvantages in their use. By sharing our machine translation resources, freely and fairly, we hope to provide support to the Welsh translation industry, develop a community of machine translation practitioners, and avoid dependency on large external institutions.

We discussed our ideas on machine translation at the TILT conference in Bangor in June of last year. Here are the slides:

 

Quality Issues

Translations generated by automatic means are not yet perfect between any language pair.  Laughable or embarrassing mistranslations made by machines have made headline news in the past, and may also lead to miscarriages of justice. This gives organisations trying to save costs by using them a bad reputation. However, machine translation accompanied by human post-editing is acceptable, and can be incorporated within the translation memory workflow. It is your responsibility to ensure that this machine translation software is used in the appropriate manner described here, including appropriate and meaningful post-editing, which avoids tarnishing the image of the translation industry and the Welsh language.

A comprehensive advice note can be found here:

http://techiaith.bangor.ac.uk/index.php/advice-note/?lang=en

Add Cysill Ar-lein to your website or code

Would you like to add Cysill Ar-lein to your web page, blog or app? You can now do this by using our widget and our new Cysill Ar-lein web service!

http://www.cysgliad.com/cysill/arlein

Cysill Ar-lein is an online Welsh language spelling and grammar checker and is the Language Technologies Unit’s most popular website. During 2014, there was a significant increase in its use, and in the number of texts corrected with it. In fact, there was an increase of 40%, with over a million pieces of Welsh language text checked by the website.

Cysill Ar-lein has a proven ability to give users who are not sure about their Welsh a strong confidence boost, and by allowing users to check their Welsh on many more websites and software packages, we hope that it will become possible to support and increase the confidence of even more users.

In accordance with the aims of the Language Technologies Unit, both the widget and the API service may be used free of charge.

 

Registering for a Cysill Ar-lein API key

By registering on our API service Centre you can get your own Cysill Ar-lein API key, to use in any way you’d like with the widget or the online API service. Go to Registering for an API key for further details.

 

Web Application widget

The Cysill Ar-lein web application widget is a feature that could be particularly useful for websites which allow users to compose text, such as comments.

The widget works via the web, so there’s no need to install any specialised software or even download files to any server or computer before getting started. All that’s required is to add the following few lines of HTML into your website:

<script>
        var CYSILL_API_KEY = "YOUR_API_KEY";
</script>
<script type="text/javascript" language="javascript" 
        src="http://api.techiaith.org/cysill/ui/CysillArlein/CysillArlein.nocache.js">
</script>

N.B. you will need to register for your API key and input it instead of ‘YOUR_API_KEY’

The widget can be placed anywhere within a web page by adding :

<div id='CysillArleinApp'></div>
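If your pages are generated by a server-side script, the two snippets above can be produced from a template so that the API key lives in one place. This is a minimal sketch under that assumption; `render_widget` is a hypothetical helper, and only the script URL, the `CYSILL_API_KEY` variable and the `CysillArleinApp` id come from the snippets above.

```python
import json

# Script URL copied from the HTML snippet above.
WIDGET_SCRIPT_URL = (
    "http://api.techiaith.org/cysill/ui/CysillArlein/CysillArlein.nocache.js"
)


def render_widget(api_key):
    """Return the HTML needed to embed the Cysill Ar-lein widget."""
    # json.dumps yields a quoted, escaped JavaScript string literal.
    key_literal = json.dumps(api_key)
    return "\n".join([
        "<script>",
        "        var CYSILL_API_KEY = {};".format(key_literal),
        "</script>",
        '<script type="text/javascript" language="javascript"',
        '        src="{}">'.format(WIDGET_SCRIPT_URL),
        "</script>",
        "<div id='CysillArleinApp'></div>",
    ])


if __name__ == "__main__":
    print(render_widget("YOUR_API_KEY"))
```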

The widget can be presented in a smaller embedded format next to some text, or on its own on a separate page. The system is flexible, and empowers you to use the spell-checker in the way which works best for you. Below is an example of the widget on Bangor University’s ‘Cymorth Cymraeg’ website.

Cysill Ar-lein within Bangor University’s Cymorth Cymraeg pages

Here’s another example of a simple Cysill Ar-lein website. Right-click on the page and choose ‘View Page Source’ to see how simple the code really is.

The following page on GitHub explains in full how to use the widget.

 

Cysill Ar-lein API Service

Another offering from Cysill Ar-lein that we are excited to announce is the ability to embed Cysill Welsh language spelling and grammar checking functionality within your software using a new API service. This is the API service already used by the widget and the official Cysill Ar-lein website.

From today we are opening access to the Cysill Ar-lein API service to anyone who wants to use it in their coding projects or integrate it with their software. This service is free.

N.B. you will need to register for your API key and input it instead of ‘YOUR_API_KEY’

We’ve loaded some examples onto GitHub showing how the API can be used with programming languages such as Python.

Go to:

https://github.com/PorthTechnolegauIaith/cysill

 

The examples contain code that :

 

Welsh Parts of Speech Tagger

One of the most important components of our Welsh grammar and spell-checking program Cysill is the parts of speech tagger. In fact, a tagger is one of the most crucial components in any application where a computer is expected to analyse and interpret natural language text.

Our tagger can identify Welsh words – even when those words are mutated, or when verbs are inflected – and provide their part of speech. This information encompasses a wide range of Welsh linguistic features which the tagger is able to recognise and tag; including nouns, adjectives, the type of mutation, and so on.

For example, the tagger will return the text “Mae hen wlad fy nhadau” as :

mae/VBF/- hen/ADJP/- wlad/NF/TM fy/PRONOUN/- nhadau/NPL/TT
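Output in this form is straightforward to split into structured records. The sketch below assumes that each token is written as word/tag/mutation and that ‘-’ means no mutation, which is an interpretation of the single example above rather than a documented format; `TaggedToken` and `parse_tagged` are hypothetical names.

```python
from collections import namedtuple

TaggedToken = namedtuple("TaggedToken", ["word", "pos", "mutation"])


def parse_tagged(text):
    """Split 'word/POS/mutation' items into TaggedToken records."""
    tokens = []
    for item in text.split():
        # Split from the right so a '/' inside the word itself is preserved.
        word, pos, mutation = item.rsplit("/", 2)
        tokens.append(
            TaggedToken(word, pos, None if mutation == "-" else mutation)
        )
    return tokens


if __name__ == "__main__":
    sample = "mae/VBF/- hen/ADJP/- wlad/NF/TM fy/PRONOUN/- nhadau/NPL/TT"
    for token in parse_tagged(sample):
        print(token)
```

A list of records like this makes it easy to, for instance, pick out only the mutated words in a sentence.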

Our parts of speech tagger is the first capability from our online API Services.  We’re excited that the tagger will now be available for everybody to use, with generous terms of use, and for free.

For more information, go to the Parts of Speech Tagger API.