Published on developer.* Blogs (http://www.developerdotstar.com/community)

My Experience with Speech Recognition: From Yesteryear to Tomorrow (Part 1)

By Gunish Rai Chawla
Created 2005-06-22 13:06

Here is the first of a two-part series on speech recognition patterns. I really should apologize for not being prompt in producing this article; I have learned the hard way that it really is not easy being a writer. There are so many uncertainties about how people might react to what you write, and this really bugs me on the personal front. The content of the article is divided into 8 sections, 4 of which I present here in the first part and the remaining 4 in the subsequent second part. So here goes...

Contents:
1. Introduction
2. When it all began
3. Ambitions
4. Classical Model
5. Current Status
6. The aspirations: Neural Model for Speech Recognition
7. The very near future
8. Conclusion

Introduction

Okay, here comes the long-awaited and much-needed literature on voice recognition. In here I will write everything I have ever known about voice recognition, and maybe sometimes more than I have known. I have never been a writer before, but after reading Edward G. Nilges’s book ‘Build Your Own .NET Language and Compiler’, I found that a programmer’s life experiences woven into a serious subject can be stimulating enough to make further reading more of an addiction than a requirement, so I am venturing out into writing myself. I know I deserve kicks, shouts and thrashing from ‘The Authors’ community, as I most certainly know that I am not of an ‘Author’s’ class, but what the heck, I feel like giving it a shot anyway, so here I am doing it.

But I most certainly believe that I am being reborn here; another face of my personality is starting to open up, as I have started to write in a way I never thought I would.

The need for this literature arose after my blog entry on ‘Making a Computer do anything...’ [1], in which I suggested using voice recognition for next-generation software operations and to give a human-like personality to a desktop machine.

Many people (both users and programmers) believe that voice recognition is still too far from perfect to be used as a feasible input stream source. This literature is meant to change the way they think by introducing them to the newer technologies that have come our way for implementing voice recognition systems.

I am in no way suggesting that voice recognition systems have become perfect; what I am implying is that we have started to tread the right path, and hence we can see successful implementations even today that use the new and improved speech recognition engines.

In here I also explain my own ‘conceptual’ understanding of a way in which voice recognition can be improved.

Here I repeat that I am NOT an author by any means; I am just a lone ranger who has had a ‘get it all out’ feeling stuck in his head for far too long to be controllable any further. I just need to satisfy my hunger to tell people about the ideas that I have, and an equal hunger to know people’s reactions to my ideas. So I will most certainly cherish all your comments on this ‘adventure’ of mine of foraying into a writer’s world.

When it all began

Okay, so let’s get to business now. Speech recognition (or voice recognition) became a subject of R&D in the early 1970s, alongside the introduction of text-to-speech synthesizers. NETtalk, a text-to-speech system that dates from the 1980s, was an early acceptable implementation of bringing speech / voice to a computer. But the subject was still far away from desktop machines, as it required more than the available processing power.

But in the 1990s desktop machines started to have enough power to accommodate speech recognition engines. This was around when the first speech recognition engine for a desktop machine was built, and supposedly Microsoft was the first to do it.

With the introduction of Windows 3.1, Microsoft also introduced the concept of APIs (application programming interfaces), which came embedded within the operating system to give application developers a direct interface to the operating system’s integrated functionality.

As the Windows operating system has evolved, several versions of the Windows API have been published. Windows 3.1 uses the Win16 API. The Microsoft® Windows NT®, Windows 95, and Windows 98 platforms use the Microsoft® Win32® API. And all of us programmers must be hoping, with all our 20 fingers crossed, that it will soon be time to lay our hands on the 64-bit APIs.

Only after this was something called SAPI (Speech Application Programming Interface) introduced, and newer versions keep popping up each year with increased usability and greater accuracy. SAPI is still believed to be the only solution by many, many developers who have not come across SASDK.

We will discuss the details of SASDK later, but for now we just need to know that SASDK stands for Speech Application Software Development Kit, an add-on development kit that installs over Visual Studio 2003 and gives it speech functionality by adding Speech Controls to it.

The latest version of SAPI is 5.1 and that of SASDK is 1.1 (beta); both are available as free downloads from Microsoft’s web site.

Ambitions

For technology evangelists, SAPI was an ambitious move to enable programmers to foray into something new and exciting, but there have always been fundamental problems with speech as a robust Input stream. Writing speech applications can be complicated. Since speech is the predominant form of communication among humans, there is a very high set of user expectations concerning the naturalness and efficiency of spoken dialog. Consequently, the goal of an application author is to make a speech application as efficient as possible for the user. In addition it should feel natural and be easy to use.

Only very few people have really been able to create acceptable speech-enabled software with the tools that were available with SAPI. But these applications were still not acceptable enough, simply because of the ‘expectations’ factor.

I was first introduced to speech recognition about 2 years ago, when a freak midnight dream drove me to switch on my computer and look for speech-enabled software. That was when I found Dragon NaturallySpeaking, and I still remember very clearly the desperation that developed inside me to somehow integrate speech into the scope of development.

The first idea that came to my mind was to try to trigger events of (pre-built) software using the interface of the Dragon NaturallySpeaking software. My ambitions were beyond rational reasoning; hence they crashed with a big bang. Then I somehow collected the pieces of my shattered soul from the ground and started all over again, only to find SAPI version 4.0. This time I was lucky. I successfully developed an application, specifically an MIS which I named Smart Business Management System, with a speech interface and a nice MS-Agent character. Although it looked pretty, I was not at all satisfied with the deal that I got.

It did the work it was supposed to do, but it was imperfect. It had a romantic affair with background noise, and the height of it came when something as stupid as this happened during a major demonstration.

The scene was something like this: I was using a desktop computer connected to a table-top microphone and a small set of speakers placed alongside the monitor. The application ran like this:

1. On detecting speech input, the recognized ‘phonemes’ were parsed through a simple CFG (Context Free Grammar), which actually was simply a list of words that could be recognized, written in a text file (apparently that was the way earlier grammars were created, and you would most certainly have been delighted, in my place, to see the current way of creating grammars with the SASDK).

2. If a matching word was found, then a ‘speech recognized’ event would be triggered, with the matching word picked up from the grammar as its argument.

3. Otherwise a ‘not recognized’ event would be triggered; here I made a small provision for playing a message which said, “I am sorry, I could not understand you, could you say that again?”
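
The three steps above can be sketched roughly as follows. This is only an illustration in Python, not the original application's code and not the SAPI interface: the grammar words, the function names and the returned strings are all made up for the example.

```python
# Hypothetical word-list grammar: one recognizable word per line,
# standing in for the text file described above.
GRAMMAR_TEXT = """open
close
save
print
"""

NOT_RECOGNIZED_PROMPT = (
    "I am sorry, I could not understand you, could you say that again?"
)


def load_grammar(text):
    """Turn the word-list text into a set of lower-case words."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def on_recognized(word):
    """Stand-in for the 'speech recognized' event handler."""
    return f"recognized:{word}"


def on_not_recognized():
    """Stand-in for the 'not recognized' event: play the apology prompt."""
    return f"prompt:{NOT_RECOGNIZED_PROMPT}"


def handle_input(heard, grammar):
    """Fire the matching event for one piece of recognized input."""
    word = heard.strip().lower()
    if word in grammar:
        return on_recognized(word)
    return on_not_recognized()
```

Calling `handle_input("Open", load_grammar(GRAMMAR_TEXT))` goes down the recognized path, while any word missing from the list falls through to the prompt.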

Now everything was going fine until, right in the middle of a presentation, something happened that might be counted among the funniest moments of my life. Right before this happened I was pompously creating illusions among the audience about the capabilities of speech recognition, the great future it held, and how the application I had developed was a successful demonstration of that future. I was speaking into a microphone, and while I was doing this, one of my friends was busy plugging the table-top microphone into the jack and another one decided to give my software a test run. As I spoke, the microphone caught my voice from the large hall speakers, and the software thought it was an ‘INPUT’, so it very promptly played the Not-Recognized prompt that I had created to do a bit of human-like error handling.

As it played the prompt, the sound of the prompt went into the microphone again (as I had turned up the volume of the small desktop speakers so that the audience could clearly hear ‘The Computer Talking Back’) and it again triggered the Not-Recognized event. That very instant the system went into an infinite loop (this was apparently because of the loud volume of the computer speakers, unlike when we tested it). The system had to be shut down, and the presentation became a subject of huge laughter among a large number of people (both types: the ones who were present at the presentation and the others who heard of it from the first type).

Now when I look back at that day, imagining the scene makes me laugh, and I realize how stupid and AMBITIOUS a programmer can get with a technology that is oversold in its first encounters. This gave me first-hand experience of the major problems with speech recognition technology.

The microphone cannot identify which part of the sound that goes into it is speech input and which part is not. This still remains a fundamental problem. The solution perhaps lies in more complex speech recognition engines, or maybe in a better microphone design with a physically fixed proximity range within which voices are accepted (but this is really an impossibility with my current understanding of physics, magnetism and sensor technology).
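
The loop is easy to reproduce on paper. Here is a toy simulation, assuming a hypothetical word list and a `max_rounds` cap that I added only so that the example terminates; the real system had no such cap, which is exactly why it had to be shut down.

```python
PROMPT = "I am sorry, I could not understand you, could you say that again?"
GRAMMAR = {"open", "close", "save"}  # hypothetical word list


def simulate_feedback(first_sound, speakers_reach_mic, max_rounds=5):
    """Count how many times the not-recognized prompt gets played.

    speakers_reach_mic: True when the desktop speakers are loud enough
    for the microphone to pick up the prompt as fresh input.
    max_rounds: artificial cap so that the simulation stops.
    """
    prompts_played = 0
    heard = first_sound
    while heard is not None and prompts_played < max_rounds:
        if heard in GRAMMAR:
            break  # a real command: no prompt is played
        prompts_played += 1  # not recognized: the prompt is played
        # The prompt itself becomes the next input if the mic can hear it.
        heard = PROMPT if speakers_reach_mic else None
    return prompts_played
```

With quiet speakers the prompt plays once and the system waits; with loud speakers the count runs straight up to whatever cap is set.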

Classical Model

The classical approach that worked for speech recognition was simply a store-and-find phonetic representation. It first stored the way a user would pronounce the phonetics and then retrieved them from the wave base to do the matching and recognition.

This was basically a two-step process:

1. Training
2. Recognition

In the training mode, the user was asked to read a few paragraphs of English (or the language in question) into the microphone; the training module would recognize the parts of speech that the user said and save them to a wave base with appropriate attributes attached.

This was basically an intelligent trial-and-error method, i.e. the algorithm worked by guessing what part of speech the user was speaking at the moment... I have simplified the algorithmic procedure for ease of understanding as follows.

1. A small base state was prepared. This base state held the wave patterns of those very few common pronunciations that do not differ around the world, and the test paragraph always started with one of these very common words.

2. All this was arranged in a linear pattern, i.e. the recognizer would not record the current pointer position unless the previous one had been okayed by the engine.

3. With every ‘approximate’ (i.e. trial-and-error) recognition, the pointer moved on to the next word, and to improve integrity and check the current flow, every now and then one of the universal constants would reappear in the text.

4. In the later versions of SAPI’s associated speech recognition engine there was also a recursion, i.e. an earlier recognized passage was repeated several times to update the profile, and the same words would be shifted among the three basic locations of a sentence, a) beginning, b) middle, c) ending, because each of these locations triggers a different way of speaking in human languages. Words were also asked for repeatedly at the same place, and the two wave patterns were combined with an OR operation.
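
To make the simplified procedure a little more concrete, here is a toy sketch of it. It is not how the SAPI engine is implemented: the ‘wave patterns’ are stand-in integers used as bit masks, and the `bit_overlap` measure, the `train` function and the threshold value are all assumptions made for illustration.

```python
def bit_overlap(a, b):
    """Fraction of set bits shared by two toy bit patterns."""
    union = a | b
    if union == 0:
        return 1.0
    return bin(a & b).count("1") / bin(union).count("1")


def train(profile, readings, threshold=0.5):
    """Walk linearly through the readings, updating the profile in place.

    profile:  dict of word -> stored pattern (starts as the base state).
    readings: list of (expected_word, recorded_pattern) in reading order.
    Returns the pointer position reached; the pointer only moves on
    once the current word has been okayed.
    """
    pointer = 0
    while pointer < len(readings):
        word, pattern = readings[pointer]
        known = profile.get(word)
        if known is not None and bit_overlap(known, pattern) < threshold:
            break  # not okayed: the user would be asked to repeat
        # A new word is stored as heard; a repeated word is OR-ed in.
        profile[word] = pattern if known is None else known | pattern
        pointer += 1
    return pointer
```

Starting from a base state such as `{"the": 0b1100}`, a close repetition of "the" widens the stored pattern through the OR, while a reading that drifts too far stops the pointer where it is.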

The initial state of this speech recognition engine was prepared in advance by Microsoft engineers using a tool called the Microsoft Linguistic Information Sound Editing Tool, which later came as a package for end users and developers as well. It looked something like this:

The wave table generated by this tool was saved in a wave base. Wave bases were typical databases, with all the search and sort functionality needed to recognize the speech.

A Word on ‘Threshold’

Now you might be thinking that if an entire database could be prepared, then this database could be used for speech synthesis as well. But this is not possible, because speech recognition still does not operate on an entire word.

I’ll explain this with an example:

Let’s take the picture above; in this image the text ‘This is a test’ is being broken up ‘phonemically’.

The first word – ‘This’ is broken into, ‘dh’, ‘ih’ & ‘s’.

The same word in another sentence – “What is this” will be broken down as follows: ‘dh’, ‘ih’ & ‘ez’.

Two different breakages of the same word? This is nothing new for literary people, but it is a situation, an exception, that MUST be handled by a programmer.

This problem is again solved by an approximation function. Recognition is NOT by any means exact; a full word is never recognized exactly. After the wave is converted to phonetics, a search operation is launched in the wave base, and the word which matches beyond a ‘Threshold Limit’ is taken to be the intended word.
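
A rough sketch of such a threshold search is given below. The tiny wave base, the 0.6 threshold and the use of Python’s `difflib.SequenceMatcher` as the similarity measure are my assumptions for illustration; the actual engine’s threshold functions are different.

```python
from difflib import SequenceMatcher

# Hypothetical wave base: each word maps to the phoneme breakups
# it has been seen with ('this' has the two from the example above).
WAVE_BASE = {
    "this": [["dh", "ih", "s"], ["dh", "ih", "ez"]],
    "is": [["ih", "z"]],
    "test": [["t", "eh", "s", "t"]],
}


def similarity(a, b):
    """Similarity of two phoneme sequences, between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def recognize(phonemes, threshold=0.6):
    """Return (word, score) for the best match, or (None, score)
    when even the best match stays under the threshold."""
    best_word, best_score = None, 0.0
    for word, variants in WAVE_BASE.items():
        for variant in variants:
            score = similarity(phonemes, variant)
            if score > best_score:
                best_word, best_score = word, score
    if best_score >= threshold:
        return best_word, best_score
    return None, best_score
```

Both breakups of ‘this’ resolve to the same word here, a slightly off pronunciation such as `["dh", "eh", "s"]` still clears the threshold, and an unrelated sequence is rejected rather than forced onto the nearest word.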

These search preferences can be controlled from the Control Panel.
Typically they are found under Control Panel > Speech > Speech Recognition tab > any speech profile > Settings.

It looks something like this:

On the following link you can find the list of threshold functions.

http://www.geocities.com/mr_grac/TFUNC.HTM [2]

Crude, but effective.

Those interested in going further can download SAPI 5.1 from Microsoft’s website. I have also extracted a sample SR engine from the SAPI SDK, with code written in VC++, which can be downloaded from the following link:

http://geocities.com/mr_grac/SSREC.ZIP [3]

Well, this is how things were done in the internals of speech recognition systems, and this is about as much as many people still know of it. Everything that I present after this might be new to a lot of people.

So I am signing off with this first part and a promise to put up the next part ASAP.

--Gunish Rai Chawla


Source URL:
http://www.developerdotstar.com/community/community/node/232