Saturday, August 23, 2008

the speaking computer

There have been numerous attempts to give computers some kind of voice interface. The first step was voice output through text-to-speech engines. Researchers at Bell Labs pioneered this work in the early 1960s using an IBM 704 computer, and their demonstration inspired Arthur C. Clarke to include the scene in which HAL sings a song in the movie 2001 [1]. The early engines were effective but sounded poor. Notable users of this technology include Stephen Hawking and even the rock band Pink Floyd [2].
The most common application was probably in computer games, where synthesized speech offered a storage advantage. Before text-to-speech engines arrived, a game developer who wanted to give the user vocal instructions had to record someone speaking the text and play the recording back at runtime. This approach is still used in foreign language learning applications [3]. By compressing the audio as much as possible to limit its dynamic range, resampling at very low sampling rates, and encoding each sample with the fewest bits possible, developers could pack a reasonable amount of speech onto a CD. An audio CD normally holds about 70 minutes of 16-bit stereo music sampled at 44.1 kHz. Reducing that even modestly to 8-bit mono at an 11 kHz sampling rate cuts the data rate by a factor of roughly 16, so the same CD can hold nearly 19 hours of audio. Using fewer bits per sample increases that number further.
But consider what happens when a text-to-speech engine is available. Now we can put 700 MB of text onto the CD and have the engine read it. We give up the space needed for the engine's code, of course, but in practice that is not much, and we get an enormous amount of speech in exchange.
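
The arithmetic behind these figures can be sketched in a few lines of Python. This is only a rough estimate: the 70-minute figure comes from the text above, 16-bit stereo at 44.1 kHz is the standard audio CD format, and "11 kHz" is taken to mean 11,025 Hz. The average word length (6 bytes, including the trailing space) and the reading speed (150 words per minute) are assumptions chosen purely for illustration, not properties of any particular text-to-speech engine.

```python
# Back-of-the-envelope estimate: recorded speech vs. text on one CD.

def bytes_per_second(sample_rate_hz, bytes_per_sample, channels):
    """Data rate of uncompressed PCM audio."""
    return sample_rate_hz * bytes_per_sample * channels

# Standard audio CD: 44.1 kHz, 16-bit (2 bytes), stereo, about 70 minutes.
cd_rate = bytes_per_second(44_100, 2, 2)
cd_capacity_bytes = 70 * 60 * cd_rate

# Low-quality recorded speech: 11.025 kHz, 8-bit (1 byte), mono.
speech_rate = bytes_per_second(11_025, 1, 1)
recorded_hours = cd_capacity_bytes / speech_rate / 3600

# Text read by a text-to-speech engine.
# ASSUMPTIONS (illustrative only): 6 bytes per word, 150 words per minute.
TEXT_BYTES = 700 * 10**6      # "700 MB of text", taken as decimal megabytes
BYTES_PER_WORD = 6
WORDS_PER_MINUTE = 150
tts_hours = TEXT_BYTES / BYTES_PER_WORD / WORDS_PER_MINUTE / 60

print(f"Data rate reduction: {cd_rate // speech_rate}x")          # 16x
print(f"Recorded speech:     {recorded_hours:.1f} hours")         # 18.7 hours
print(f"Synthesized speech:  {tts_hours:,.0f} hours")             # 12,963 hours
```

Under these assumptions, the same disc holds several hundred times more speech as text than as recorded audio, even before subtracting the space taken by the engine itself.
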
Continuing work on modeling the human vocal system has produced modern text-to-speech engines that are not only quite realistic but can also speak in different accents, with a clear distinction between male and female voices. The National Oceanic and Atmospheric Administration (NOAA) uses text-to-speech engines to read the weather forecasts on the NOAA weather radio channels [4]. Many companies use similar engines in their telephone support systems to read out a list of menu items, from which the caller chooses by pressing one of the number keys on the telephone keypad [5]. More recently, some of these systems have added the ability to recognize spoken language.


[1] http://en.wikipedia.org/wiki/Speech_synthesis
[2] http://www.ted.com/index.php/talks/stephen_hawking_asks_big_questions_about_the_universe.html
[3] "Computer Games for Partially Sighted and Blind Children", http://www.tpb.se/barnens_tpb/spel/projekt/report.html
[4] http://www.nws.noaa.gov/nwr/VIPstatus.htm
[5] http://www.nortel.com/products/04/ivr/collateral/nn103943.pdf
