Speech recognition makes some noise

This article appeared in InfoWorld on February 2, 1998 on page 69.

By Kimberly Patch & Eric Smalley

Mention speech recognition these days, and it’s almost inevitable that someone will point to HAL, the computer from 2001: A Space Odyssey. This illustration of where the technology is headed has lulled many IT managers into ignoring speech recognition because it’s obvious that computers that can hold an intelligent conversation will remain science fiction for a long time. The trouble is, practical, usable speech-recognition products are here now.

Systems that recognize ordinary speech are sweeping through the call-center market and are poised to dramatically alter the very nature of desktop computing. IS managers are in danger of being caught flat-footed. Most at risk are those deploying or planning for new desktop computers, client/server networks, or transaction-processing systems.

Speech-recognition systems require large amounts of resources, including processing power, memory, and network bandwidth. A failure to account for speech recognition could derail carefully laid plans for allocating those resources. Worse, it could disrupt strategic plans such as adopting thin clients.

Picture the road warriors in your company for a moment. Using laptops, they probably log in to the corporate network several times per day. But what if, instead, they used their cell phones, which seem glued to their ears anyway, to check their e-mail and their voice mail, as well as to query the order-processing system and pull documents from a file server? And that’s just in-house use. Now imagine that every telephone in every home is a Web browser. How robust did you say your extranet is?

A little further out, speech is going to dramatically broaden what constitutes data, bringing about the long-promised multimedia revolution. Speech technology will accomplish this because it will enable practical content processing: the ability to easily search for and access audio and video material online.

"You’re going to be able to annotate and index information, which is voice in nature," says Victor Zue, head of the Spoken Language Systems Group and associate director of the Laboratory for Computer Science at MIT, in Cambridge, Mass. "The vast amount of information that is in voice mail, that’s in [recorded] meetings … plus news broadcasts, plus entertainment. All those things are going to be indexable."

THE PARTS OF SPEECH

Speech-recognition technology, often incorrectly identified as voice recognition, has several components: noise-canceling input, a recognition engine, vocabularies, application interfaces, and rudimentary natural-language processing. (Voice recognition refers to voice-print security systems, commonly called voice ID.) There are two classes of speech-recognition technology: speaker-dependent, in which the user has to train the system to recognize his or her voice, and speaker-independent.

There are also two principal categories of speech recognition: keyboard and keypad. Keyboard applications allow users to speak directly to their computers, complementing or replacing the computer keyboard.

Keypad applications use speech to replace the telephone keypad as input for accessing voice mail and navigating a telephone system’s menus. More important, they also allow the telephone to act as a remote computer peripheral.

"Your phone is your personal digital assistant," says Xuedong Huang, research manager for the Speech Technology Group at Microsoft Research, in Redmond, Wash. "You don’t need to carry anything else. You can always be in touch with your computer."

Keypad applications tend to use limited vocabularies because they are focused on fairly narrow subjects. Limited vocabularies make it easier for these applications to be speaker-independent. The limited scope also allows for some elements of natural-language processing. Keyboard applications, particularly full dictation programs such as IBM’s ViaVoice and Dragon Systems’ NaturallySpeaking, tend to use larger vocabularies that, for now, require them to be speaker-dependent.

The 1997 breakthrough that has jump-started the speech-recognition market was the release of products based on large-vocabulary, continuous speech-recognition engines. Until then, large-vocabulary systems were limited by discrete speech-recognition engines that required users to pause between words.

At the same time, natural-language technology is progressing rapidly. Natural language is the capability of a computer to decipher the meaning in ordinary, everyday speech, rather than requiring users to speak in prescribed patterns. A very limited application of the technology, which relies on the computer to decipher meaning from keywords, allows users of IBM’s ViaVoice Gold to format Word documents.

"1998’s going to be the year when the flood gate opens," says Ken Landoline, area director at Giga Information Group, a market research company in Cambridge, Mass.

Within the next several years, speech input will become commonplace, according to Jackie Fenn, vice president and research director of advanced technologies at the Gartner Group, in Stamford, Conn.

"By 2001 we’ll see around 30 percent of users using speech recognition for some aspect of their daily work," Fenn says.

Financial-services companies appear to be at the forefront of adopting speech-recognition technology, both in call-center applications for customers and desktop applications for workers.

Chase Manhattan has literally removed the computer keyboards from the Global Trust Services office that processes bearer bonds, according to Nicholas Papanikolaw, senior vice president and chief operating officer of the Global Trust Services. The company uses speech-recognition technology to boost efficiency, reducing the time it takes to process a single bond from 7 minutes to less than 1 minute, Papanikolaw says.

Because the application is a narrow one, processing only one type of bond, the bank has been able to develop a speaker-independent, natural-language application based on 200 keywords. This allows workers to say phrases such as "I want IBM" and "gimme GM." Chase Manhattan developed its application using technology from UmeVoice, in Novato, Calif.
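The keyword approach described above can be illustrated with a minimal sketch. This is not Chase Manhattan's or UmeVoice's implementation: the keyword table, the `spot_keywords` function, and the symbol mappings are hypothetical, and the input is assumed to be text that a recognition engine has already produced from speech.

```python
import re

# Hypothetical keyword table mapping a spoken keyword to a bond issuer symbol.
# The application described in the article reportedly used about 200 keywords.
KEYWORDS = {
    "ibm": "IBM",
    "gm": "GM",
}


def spot_keywords(utterance):
    """Return the symbols for known keywords in an utterance, in spoken order.

    Words that are not in the keyword table ("I", "want", "gimme") are
    simply ignored, which is what lets casual phrasing work.
    """
    words = re.findall(r"[a-z0-9]+", utterance.lower())
    return [KEYWORDS[word] for word in words if word in KEYWORDS]


if __name__ == "__main__":
    print(spot_keywords("I want IBM"))
    print(spot_keywords("gimme GM"))
```

Because everything outside the table is discarded, a small vocabulary is enough to handle many different phrasings of the same request.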

"If you look at any application that has a keyboard, I believe we can replace it with voice technology," Papanikolaw says. "You can certainly look at any data-entry application in the bank."

NOT IN MY BACK OFFICE

It’s not a stretch to accept that speech recognition will quickly pervade the call-center market, particularly in the financial and travel-services industries. But for most IT managers it’s another matter when considering desktop computer users. After all, why change when the keyboard and mouse have done the job for years, repetitive strain injuries aside? And who wants to add to the noise level in cubicle-filled work environments?

Efficiency gains such as those at Chase Manhattan are certainly incentive enough for IT managers who can identify specific applications that lend themselves to spoken input. After all, speech is the most natural means humans have for conveying information. However, social factors should not be discounted when measuring resistance to new technology.

Through the ’80s, most PC users viewed the mouse-GUI combination as a tool for graphic artists and engineers working on Macintosh and high-end Unix systems. There did not seem to be a compelling reason to replace the familiar and relatively speedy command-line DOS interface with a new, awkward point-and-click interface.

But just as DOS applications hung around for years after Windows burst on the scene, no one is predicting that keyboards are going to disappear overnight when speech input takes hold. The bottom line is that the industry often accepts a technology because Microsoft incorporates it.

"The big question mark is obviously when Microsoft is going to start bundling [speech recognition] with the Office suites or the operating system," Fenn says. "That’s going to have a big impact on the rate of adoption."

Like many companies, State Farm Insurance is keeping an eye on Microsoft, according to a company representative. State Farm, in Bloomington, Ill., is evaluating current speech-recognition products in-house, and is developing a speech-enabled camera application that will allow adjusters in the field to annotate photographs, he said.

Microsoft officials declined to comment on when and how the company would offer speech-recognition technology. Publicly, the company is focusing its efforts on promoting its Speech API (SAPI).

ACCOMMODATING SPEECH

With the technology on the market and Microsoft poised once again to alter the landscape, how do IT managers meld speech recognition into corporate networks? As far as the technology is concerned, there appears to be little reason to rush a decision.

"For probably the next year at least, [speech recognition] should be viewed as a tactical investment," Fenn says. "You probably don’t want to commit to a corporate rollout until the products are hitting their second or third rounds."

But now is probably the right time to plan for the technology, especially for IT shops that are moving to network computers.

"If your strategy is to have all NCs in your next installation, you need to think about where you’re going to put the voice processing," says Amy Wohl, president of Wohl Associates, in Narberth, Pa. "We think [vendors] are going to do server-side voice processing eventually, [but] there isn’t very much of that yet."

IT managers "also need to think about bandwidth for their network because if they’re going to use server-side voice processing, that’s going to mean they’re going to ship this stuff up and down the network," Wohl says.
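A rough sense of what shipping speech "up and down the network" could mean comes from simple arithmetic. The sketch below is only a back-of-the-envelope estimate under stated assumptions: the audio format (16 kHz, 16-bit, mono, uncompressed) and the number of simultaneous speakers are illustrative figures that do not come from the article or from any vendor, and the function names are hypothetical.

```python
def raw_audio_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Bit rate of uncompressed audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def aggregate_mbps(concurrent_users, sample_rate_hz, bits_per_sample, channels=1):
    """Total bit rate in megabits per second if every user streams at once."""
    bits_per_second = concurrent_users * sample_rate_hz * bits_per_sample * channels
    return bits_per_second / 1_000_000


if __name__ == "__main__":
    # Assumed format: 16 kHz sampling, 16 bits per sample, one channel.
    per_user = raw_audio_kbps(16_000, 16)
    # Assumed load: 50 people dictating at the same moment.
    total = aggregate_mbps(50, 16_000, 16)
    print(f"per user: {per_user} kbps, 50 users: {total} Mbps")
```

Under these assumptions the aggregate load scales linearly with the number of simultaneous speakers; compression, or sending extracted features instead of raw audio, would lower the figures.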

IBM is working on a client/server version of ViaVoice, says Joe Orlando, worldwide marketing manager for IBM’s ViaVoice. To handle the bandwidth crunch, IT staffs will need to use a tiered approach in which an intermediate layer of servers handles speech processing rather than back-end data servers, he said.

For handling speech processing on the desktop, the key factors are processing power and memory. Current large-vocabulary speech-recognition products have minimum requirements of 166-MHz processors and 32MB of memory, although users are finding that 200-MHz processors and 64MB of memory are the threshold for adequate performance.

So, the current installed base of 90-MHz, 120-MHz, and 133-MHz desktop systems is unable to support speech recognition, but this should be a short-term problem. Better compression will boost speech-recognition products’ efficiency, and the installed base of desktop computers will eventually roll over to higher-performance systems.

Transaction processing is another area in which IT managers will have to account for speech recognition. The technology is improving the efficiency of call centers, which allows companies to expand business, thereby increasing transaction volume.

American Express is rolling out a speech-recognition system that will allow its corporate travel customers to get information and book their flights. The company expects the system, which is designed to augment human agents instead of replacing them, to reduce the ratio of calls to transactions because callers will sort out their options before talking to an agent, according to David Pereira, senior manager for Corporate Services Interactive at American Express.

PLATFORM SPEECH

Perhaps the biggest impact speech recognition will have in the short term is in software development.

Initially, speech-recognition vendors integrated their products with individual applications. Microsoft’s SAPI and Java Speech API from Sun now allow application developers to "speech-enable" their products.

"The third phase will be when applications are designed from the first day taking into consideration that speech is one of the modalities of interaction with them," says David Nahamoo, senior manager of the Human Language Technologies Department at IBM Research. "That will have tremendous impact on how applications are designed and developed."

This highlights the possibility of a user interface that bypasses, or at least minimizes, the importance of Windows. The race to develop such a speech-dominated interface is already on, raising the specter of a renewed battle between IBM and Microsoft for control of the desktop.

"We’re not [saying] that we’re going to go out and replace the Windows interface," says IBM’s Orlando. "What we need to figure out is where to take a leadership position in creating a voice-user interface. That’s a whole new ball game."

Speech Futures

Experts predict when and how speech recognition will take hold.

Jackie Fenn, Gartner Group

30 percent of desktop users use speech recognition every day: 3 years

User interface assumes voice input: 5 years

HAL: 50+ years

Ezra Gottheil, Hurwitz Group

Instant transcriptions of audio- and videoconferences: 3 years

Ken Landoline, Giga Information Group

Continuous speech recognition: 2 years

Commonplace in telephony: 3-5 years

Speech-enabled appliances use speech recognition to sort through multiple databases a la Star Trek: 8-10 years

Amy Wohl, Wohl Associates

Limited natural-language processing in specific applications: 2 years

A general natural-language model: 5 years

HAL: decades

Victor Zue, Massachusetts Institute of Technology

Limited applications of a conversational interface: 2-3 years

HAL: decades

Koen Bouwers, Lernout & Hauspie

In the operating system: 1-3 years

Handheld PC dictation: 2-3 years

Roger Matus, Dragon Systems

Microphones as common as mice: 3 years

In the operating system: 5 years

sidebar:

Speech attracts a crowd

Speech-recognition vendors fall into four categories: speech-to-text dictation, computer command-and-control, telephony, and electronic assistants. The players range from IBM, Microsoft, and Philips Electronics to a plethora of start-ups.

Dictation products for professionals, particularly doctors and lawyers, have been on the market for years. But the spotlight has been on the vendors that offer large-vocabulary, general-purpose dictation software. Four companies dominate this field: IBM, Dragon Systems, Philips and, through its acquisition of Kurzweil Applied Intelligence, Lernout & Hauspie (L&H).

All four offer their products to VARs and developers, and all but Philips sell shrink-wrapped versions of their products. The continuous speech version is not due until early this year.

The latest strategy is bundling. IBM is bundling ViaVoice Gold with Lotus SmartSuite and has a deal with AST Research. Dragon Systems has deals with Micron and Digital Equipment.

The mother of all bundling deals, of course, would be with Microsoft. In September, L&H announced an alliance with Microsoft, which has a substantial speech-recognition effort of its own, but the companies declined to discuss plans.

Products that allow users to control Windows and other desktop OSes have long been on the market. Companies in the field include Advanced Recognition Technologies, Applied Voice Recognition, Command, and Verbex Voice Systems.

A large number of vendors focus on vertical markets, usually developing applications that incorporate recognition engines developed by one of the major companies. UmeVoice is an example of a vendor in the financial-services market.

Speech-recognition technology is also rapidly transforming the telephony market. Applied Language Technologies developed a reservation system for United Airlines. Nuance Communications developed a stock quote system for Charles Schwab & Co. and a travel information system for American Express. Other companies in the field include PureSpeech and Voice Control Systems.

And in the emerging field of electronic assistants, Wildfire Communications uses speech recognition in its Enterprise Wildfire call-management system.
