Man-to-vocal-machine interfacing: a state of the art

In the mid-2000s, a new generation of touch-based devices arrived on the market. Since then, interfaces have had to adapt to improve man-to-machine communication. Touch handles the input (man-to-machine), while the screen remains the machine’s main channel of response.

Source : Movie « 2001: A Space Odyssey »

Sci-fi movies, an important source of inspiration for innovators, have long depicted dialogue with machines. The idea is not new, but making it real requires assembling a series of technology building blocks.

But what is the vocal interface made of? Mainly two components:

  • Speech recognition (Human-to-machine)
  • Text-to-speech (Machine-to-human)

Again, none of this is new: the technologies required to build such solutions have been in place for about 20 years.

Text-to-speech is relatively easy to implement, as it essentially consists of converting words into sound waves. The technology is already used in areas such as road navigation, PBX telephony, and personal assistants (e.g., Siri). Possible innovations here involve new data-processing algorithms that would improve diction and make pronunciation more natural.

Speech recognition, on the other hand, is more complicated to implement. The technology has to deal with the heterogeneity of human diction, which stems from the diversity of languages and of accents within those languages. Two kinds of systems exist:

  • Learning-based speech recognition (single-speaker model): the machine learns the user’s pronunciation over time. This requires an adjustment phase, but it allows clear recognition of the spoken words regardless of grammar. Application domains: voice dictation, personal assistants.
  • Grammar-based speech recognition (multi-speaker model): there is no learning phase and the system can be used immediately, but the machine expects the words in a certain order, following a grammatical scheme. Application domains: telephony, personal assistants.
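
To make the grammar-based approach more concrete, here is a minimal sketch in Python. It assumes the audio has already been turned into words, and it uses a toy grammar invented for this illustration (the names `GRAMMAR` and `match_utterance` are hypothetical and do not come from any speech library). It only shows why word order matters in such a system.

```python
# Hypothetical toy grammar: each slot lists the words accepted at that position.
# Illustrative only; real systems describe grammars in formats such as JSGF.
GRAMMAR = [
    ("action", {"call", "text"}),
    ("contact", {"alice", "bob"}),
    ("place", {"home", "work", "mobile"}),
]


def match_utterance(utterance, grammar=GRAMMAR):
    """Return {slot: word} if the words follow the grammar's order, else None."""
    words = utterance.lower().split()
    # The utterance must fill every slot, no more and no less.
    if len(words) != len(grammar):
        return None
    result = {}
    for word, (slot, allowed) in zip(words, grammar):
        # A word outside the set allowed at this position rejects the utterance.
        if word not in allowed:
            return None
        result[slot] = word
    return result
```

With this sketch, `match_utterance("Call Alice mobile")` fills the three slots, while `match_utterance("Alice call mobile")` returns `None`: the same words in a different order no longer fit the expected scheme.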

So why aren’t these tools more widespread? It depends on three factors:

  1. The first is the availability of the technology. It must be standardized and shareable for software to integrate it. This is already largely the case: in the 2000s, Sun (now Oracle) contributed standards and technologies such as the JSpeech Grammar Format and JSAPI (via JSR 113) for Java environments, alongside standards such as VoiceXML. In addition, open source solutions such as FreeTTS for text-to-speech and CMU Sphinx for speech recognition have made these technologies widely accessible (see this 60-line program available on GitHub:
  2. The second is usage. Any new man-to-machine interface requires adapting software to integrate the technology and deliver real value to the user. This has been under way for the last 10 years and has become increasingly widespread: all tablets and smartphones are now equipped with such technology.
  3. The third factor is intelligence. Today, speech recognition depends on grammatical analysis of the spoken sentence and is limited to a closed, predefined vocabulary (as in the VoiceXML standard, for example). In short, the machine now has ears and a mouth, but it still needs to process the information and find the best possible answer, which is harder to implement.
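
The limitation described in the third factor can be illustrated with a small Python sketch. Everything in it is an assumption made for the example: the vocabulary, the table of known sentences, and the names `VOCABULARY`, `ANSWERS` and `respond` are hypothetical. The point is only that a closed vocabulary combined with a fixed table of sentences hears the words but does not really understand them.

```python
# Hypothetical closed vocabulary: any other word is rejected outright.
VOCABULARY = {"what", "is", "the", "weather", "time", "today"}

# Hypothetical table mapping exact word sequences to an action name.
ANSWERS = {
    ("what", "is", "the", "time"): "tell_time",
    ("what", "is", "the", "weather", "today"): "tell_weather",
}


def respond(utterance):
    """Return (status_or_action, unknown_words) for an already transcribed utterance."""
    words = tuple(utterance.lower().split())
    unknown = [w for w in words if w not in VOCABULARY]
    if unknown:
        # The machine "heard" something it has no entry for.
        return ("out_of_vocabulary", unknown)
    action = ANSWERS.get(words)
    if action is None:
        # Every word is known, but the sentence is not one of the expected ones.
        return ("not_understood", [])
    return (action, [])
```

Here `respond("What is the time")` yields the `tell_time` action, `respond("what is the temperature")` is rejected because of one unknown word, and `respond("the time is what")` is not understood even though every word is known. Closing that gap is what the "intelligence" factor refers to.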

To that end, predictive analysis technologies are currently being developed to compensate for this weakness and eventually create a more interactive and more intelligent machine.

Mobile OS vendors (Apple, Google, Samsung) and the major robotics manufacturers are already working in this area, and the first signs are already visible.

Here are two examples:

Source : TV series « Äkta människor »

Multiple experiments are under way, and there is no doubt the technology will soon be available. The next steps are standardization and democratization, followed finally by practical applications of man-to-machine vocal interaction, probably by the dawn of the next decade.


Loïc Cotonéa


Loïc Cotonéa has been an Architect at Sogeti Group (Sogeti ATC) since 2011. In this role, he supports technical presales and technical project delivery, and is responsible for customer advisory work.

