Vocal man-to-machine interfacing: a state of the art
Mar 28, 2014
In the mid-2000s, a new generation of touch-based devices arrived on the market, and interfaces have had to adapt to improve man-to-machine communication. Touch handles the input (man-to-machine), while the screen remains the machine’s main channel of response.
Sci-fi movies – an important source of inspiration for innovators! – have long depicted dialogue with machines. The idea is not new, but bringing it to life requires assembling a series of technological building blocks.
But what is the vocal interface made of? Mainly two components:
- Speech recognition (Human-to-machine)
- Text-to-speech (Machine-to-human)
Here again, the idea is not new, and the technologies required to develop such solutions have been in place for about 20 years now.
Text-to-speech is fairly easy to implement, as it essentially consists of transcribing words into sound waves. This technology is already used in areas such as road navigation, PBX telephony, and personal assistants (e.g., Siri). Possible innovations here involve new data-processing algorithms that would improve diction and make the pronunciation more natural.
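As an illustration, here is a minimal text-to-speech sketch in Java using FreeTTS, the open source engine listed in the sources below. The voice name and the spoken sentence are illustrative assumptions, not part of any product mentioned above.

```java
import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;

public class SpeakDemo {
    public static void main(String[] args) {
        // "kevin16" is the 16 kHz general-purpose voice bundled with FreeTTS.
        Voice voice = VoiceManager.getInstance().getVoice("kevin16");
        if (voice == null) {
            System.err.println("Voice not found: check that freetts.jar and the voice data are on the classpath.");
            return;
        }
        voice.allocate();                                   // load the voice data
        voice.speak("Turn left in two hundred meters.");    // synthesize and play the sentence
        voice.deallocate();                                 // release audio resources
    }
}
```

A complete program fits in a dozen lines, which is precisely why the technology is so easy to embed in navigation or telephony software.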
On the other hand, speech recognition is harder to implement: the technology has to cope with the heterogeneity of human diction caused by the diversity of languages and of accents within those languages. Two families of systems exist:
- Learning-based speech recognition (single-speaker model): the machine learns the user’s pronunciation over time. This requires an adjustment phase, but afterwards it recognizes the spoken words reliably without being constrained by a grammar. Application domains: vocal dictation, personal assistants.
- Grammar-based speech recognition (multi-speaker model): no learning phase, the system can be used immediately, but the machine expects the wording to follow a predefined grammatical scheme (a small sketch follows this list). Application domains: telephony, personal assistants.
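To make the grammar-based model concrete, here is a minimal sketch using the high-level Java API of CMU Sphinx, the open source recognizer listed in the sources below. The grammar file, its location and the model paths are assumptions chosen for illustration, and the exact class names have varied across Sphinx-4 releases.

```java
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;

public class CommandListener {
    public static void main(String[] args) throws Exception {
        // Hypothetical JSGF grammar "commands.gram", placed under resources/grammars/:
        //   #JSGF V1.0;
        //   grammar commands;
        //   public <command> = (turn | switch) (on | off) the (light | radio);
        Configuration config = new Configuration();
        config.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        config.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        config.setGrammarPath("resource:/grammars");
        config.setGrammarName("commands");
        config.setUseGrammar(true);

        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(config);
        recognizer.startRecognition(true);               // start capturing from the microphone
        SpeechResult result = recognizer.getResult();    // blocks until an utterance is decoded
        System.out.println("Heard: " + result.getHypothesis());
        recognizer.stopRecognition();
    }
}
```

Because the recognizer only has to match utterances against a small grammar, it works for any speaker without a learning phase, which is exactly the trade-off described above.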
So why aren’t these tools more widely spread? Because their adoption depends on three factors:
- The first one is technology availability. It must be standardized and shareable so that software can integrate it. This is already the case thanks to Sun (now Oracle), which introduced several norms and technologies in the 2000s (JSpeech Grammar Format, JSAPI via JSR 113 for Java environments, VoiceXML, etc.). Besides, open source solutions such as FreeTTS for text-to-speech and CMU Sphinx for speech recognition have democratized these technologies (see this 60-line program available on GitHub: https://github.com/lcotonea/BaraGwuin).
- The second one is usage. Any new man-to-machine interface requires adapting the software to integrate the technology and bring real value to the user. This has been under way for the last 10 years and has become increasingly mainstream: all tablets and smartphones are now equipped with such technology.
- The third one is intelligence. Today, speech recognition relies on the grammatical analysis of the sentence being pronounced and is limited to an exhaustive vocabulary (the VoiceXML standard, for example). In brief, the machine now has ears and a mouth, but it still needs to process the information and find the best possible answer, which is much harder to implement.
To do so, predictive analysis technologies are currently being developed to compensate for this weakness and eventually create a more interactive and more intelligent machine.
Mobile OS makers (Apple, Google, Samsung) and the big robotics producers are already working in this area, and you can already see some signs of it.
Here are two examples:
- A vocal interaction example with the robot Nao, the result of a collaboration between Aldebaran Robotics (which develops Nao) and Nuance (which publishes the vocal dictation software Dragon Naturally Speaking): https://www.youtube.com/watch?v=Qw2k40NDCxg#t=99
- Interacting with Nao by Pierre Lison (Department of Informatics, University of Oslo): https://www.youtube.com/watch?v=tJbdyXimYE8.
There are multiple experiments going on, and there is no doubt the technology will soon be available. The next steps are the standardization and democratization phases. And finally, the practical applications of man-to-machine vocal interaction, probably by the dawn of the next decade!
Sources:
- JSAPI: https://jcp.org/en/jsr/detail?id=113
- JSpeech Grammar Format (JSGF): http://www.w3.org/TR/jsgf/
- Sphinx 4 – Open source recognition engine: http://cmusphinx.sourceforge.net/
- FreeTTS – Open source text to speech engine: http://freetts.sourceforge.net/docs/index.php
- Voice Extensible Markup Language (VoiceXML) Version 2.0: http://www.w3.org/TR/2004/REC-voicexml20-20040316/
- Aldebaran Robotics: http://www.aldebaran.com/en
- OpenDial – Open source domain-independent software toolkit to facilitate the development of robust and adaptive dialogue systems: https://code.google.com/p/opendial/
- Pierre Lison’s homepage: http://folk.uio.no/plison/