My Experience with Speech Recognition: From Yesteryear to Tomorrow (Part 1)
Here is the first of a two-part series on Speech Recognition Patterns. I really should apologize for not being prompt in producing this article, but I have learned the hard way that it really is not easy being a writer. There are so many uncertainties about how people might react to what you write, and this really bugs me on the personal front. The article is divided into 8 sections, 4 of which I present here in the first part and the remaining 4 in the second part. So here goes...
Contents:
1. Introduction
2. When it all began
3. Ambitions
4. Classical Model
5. Current Status
6. The Aspirations: Neural Model for Speech Recognition
7. The Very Near Future
8. Conclusion
Introduction
Okay, here comes the long-awaited and much-needed literature on Voice Recognition. In here I will write everything I have ever known about voice recognition, and maybe sometimes more than I have known. I have never been a writer before, but after reading Edward G. Nilges’s book ‘Build Your Own .NET Language and Compiler’ I found that the experiences of a programmer’s life within serious subjects can be stimulating enough to make further reading more of an addiction than a requirement, so I am venturing out into writing myself. I know I deserve kicks, shouts and a thrashing from ‘The Authors’ community, as I most certainly know that I am not of an ‘Author’s’ class, but what the heck, I feel like giving it a shot anyway, so here I am doing it.
But most certainly I believe that I am being reborn here; another face of my personality is starting to open up as I have started to write in a way I never thought I would before.
The need for this literature arose after my blog entry on ‘Making a Computer do anything...’, in which I suggested using voice recognition for next-generation software operations and to give a human-like personality to a desktop machine.
Many people (both users and programmers) believe that voice recognition is still far from being good enough to be used as a feasible Input Stream Source. This literature is meant to change the way they think by introducing them to the newer technologies that have come our way for implementing voice recognition systems.
I am in no way suggesting that voice recognition systems have become perfect. What I am implying is that we have started to tread the right path, and hence we can see successful implementations even today that use the new and improved versions of speech recognition engines.
In here I also explain my own ‘conceptual’ understanding of a way in which voice recognition can be improved.
Here I repeat that I am NOT an author by any means, but just a lone ranger who has had a ‘get it all out’ feeling stuck in his head for far too long now to keep it under control any further. I just need to satisfy my hunger to tell people about the ideas that I have, and an equal hunger to know people’s reactions to them. So I will most certainly cherish all your comments on this ‘adventure’ of mine of foraying into a writer’s world.
When it all began
Okay, so let’s get to business now. Speech recognition, or voice recognition, became a subject of R&D in the early 1970’s with the introduction of text-to-speech synthesizers. NET-Talk was the name given to it, and it was the first ever acceptable implementation of speech / voice on a computer. But the subject was still far away from desktop machines, as it required more than the available processing power.
But in the 1990’s desktop machines started having enough power to accommodate speech recognition engines. This was around when the first speech recognition engine for a desktop machine was built, and supposedly Microsoft was the first to do it.
With the introduction of Windows 3.1, Microsoft also introduced the concept of APIs, application programming interfaces, which came embedded within the operating system to give application developers a direct interface to the operating system’s integrated functionality.
As the Windows operating system has evolved, several versions of the Windows API have been published. Windows 3.1 uses the Win16 API. The Microsoft® Windows NT®, Windows 95, and Windows 98 platforms use the Microsoft® Win32® API. And all of us programmers must be hoping, with all our 20 fingers crossed, that it will soon be time we get to lay our hands on the 64-bit APIs.
Only after this was something called SAPI (Speech Application Programming Interface) introduced, and newer versions keep popping up each year with increased usability and greater accuracy. SAPI is still believed to be the only solution by many, many developers who have not come across the SASDK.
We will discuss the details of the SASDK later, but for now we just need to know that SASDK stands for Speech Application Software Development Kit. It is an add-on development kit that installs over Visual Studio 2003 and provides it with speech functionality by adding Speech Controls to it.
The latest version of SAPI is 5.1 and that of SASDK is 1.1 (beta); both are available as free downloads from Microsoft’s web site.
Ambitions
For technology evangelists, SAPI was an ambitious move to enable programmers to foray into something new and exciting, but there have always been fundamental problems with speech as a robust Input stream. Writing speech applications can be complicated. Since speech is the predominant form of communication among humans, there is a very high set of user expectations concerning the naturalness and efficiency of spoken dialog. Consequently, the goal of an application author is to make a speech application as efficient as possible for the user. In addition it should feel natural and be easy to use.
Only a very few people have really been able to create acceptable speech-enabled software with the tools that were available with SAPI. But these applications were still not acceptable enough, simply because of the ‘expectations’ factor.
I was first introduced to speech recognition about 2 years ago, when a freak midnight dream drove me to switch on my computer and look for speech-enabled software. That was when I found Dragon NaturallySpeaking, and I still remember very clearly the desperation that developed inside me to somehow integrate speech into the scope of development.
The first idea that came to my mind was to try to trigger events of (pre-built) software using the interface of the Dragon NaturallySpeaking software. My ambitions were beyond rational reasoning; hence they crashed with a big bang. Then I somehow collected the pieces of my tattered soul from the ground and started all over again, only to find SAPI ver 4.0. This time I was lucky. I successfully developed an application, specifically an MIS which I named Smart Business Management System, with a speech interface and a nice MS-Agent character. Although it looked pretty, I was not at all satisfied with the deal that I got.
It did the work it was supposed to do, but it was imperfect. It had a romantic affair with background noise, and things came to a head when something truly stupid happened during a major demonstration.
The scene was something like this: I was using a desktop computer which was connected to a table-top microphone and a small set of speakers placed alongside the monitor. The application ran like this:
1. On detecting speech input, the recognized ‘phonemes’ were parsed through a simple CFG (Context Free Grammar), which actually was simply a list of words that could be recognized, written in a text file (apparently that was the way earlier grammars were created, and you would most certainly have been delighted, had you been in my place, to see the current way of creating grammars with the SASDK).
2. If a matching word was found, then a ‘speech recognized’ event would be triggered, with the matching word picked up from the grammar as its argument.
3. Otherwise a ‘not recognized’ event would be triggered. Here I made a small provision to play a message which said, “I am sorry, I could not understand you, could you say that again.”
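The three steps above can be sketched roughly as follows. This is only a minimal illustration in Python, not the SAPI 4.0 code the application actually used: the names `load_grammar`, `WordListRecognizer`, `hear` and the sample command words are all hypothetical, and this stand-in ‘engine’ receives already-decoded text instead of audio.

```python
NOT_RECOGNIZED_PROMPT = (
    "I am sorry, I could not understand you, could you say that again."
)


def load_grammar(lines):
    """Build the word-list 'grammar' from text lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


class WordListRecognizer:
    """Hypothetical stand-in for the engine: fires one of two events per utterance."""

    def __init__(self, grammar, on_recognized, on_not_recognized):
        self.grammar = grammar
        self.on_recognized = on_recognized          # called with the matching word
        self.on_not_recognized = on_not_recognized  # called with no arguments

    def hear(self, utterance):
        word = utterance.strip().lower()
        if word in self.grammar:
            self.on_recognized(word)      # step 2: 'speech recognized' event
        else:
            self.on_not_recognized()      # step 3: play the apology prompt


# Example wiring: events are collected in a list instead of being spoken aloud.
events = []
recognizer = WordListRecognizer(
    load_grammar(["open", "save", "", "Print"]),   # assumed command words
    on_recognized=lambda word: events.append(("recognized", word)),
    on_not_recognized=lambda: events.append(("prompt", NOT_RECOGNIZED_PROMPT)),
)
recognizer.hear("Save")
recognizer.hear("hello there")
```

Note that in this design anything that is not in the word list, including noise, ends up in the ‘not recognized’ branch and triggers the prompt.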
Now everything was going fine until, right in the middle of a presentation, something happened that might count as one of the funniest moments of my life. Right before it happened I was pompously creating illusions among the audience about the capabilities of speech recognition, the great future it held, and how the application I had developed was a successful demonstration of that future. I was speaking into a microphone, and while I was doing this, one of my friends was busy plugging the table-top microphone into the jack, and another one decided to give my software a test run. As I spoke, the microphone caught my voice from the large hall speakers, and the software, thinking it was an ‘INPUT’, very promptly played the Not-Recognized prompt that I had created to do a bit of human-like error handling.
As it played the prompt, the sound of the prompt itself went into the microphone (as I had turned up the volume of the small desktop speakers so that the audience could clearly hear ‘The Computer Talking Back’), and it triggered the Not-Recognized event again. That very instant the system went into an infinite loop (apparently because of the loud voice being played by the computer speakers, unlike when we tested it). The system had to be shut down, and the presentation became the subject of a huge laugh among a large number of people (both types: the ones who were present at the presentation, and others who heard of it from the first type).
Now when I look back at that day, imagining the scene makes me laugh, and I realize how stupid and AMBITIOUS a programmer can get with a technology that is over-prescribed in its first encounters. This gave me hands-on experience with the major problems of speech recognition technology.
The microphone cannot identify which part of the sound that goes into it is speech input and which part is not. This still remains a fundamental problem. The solution is perhaps more complex speech recognition engines, or maybe a better microphone design with a proximity range for accepting voices, a range which could be fixed physically (but this is really an impossibility with my current understanding of physics, magnetism and sensor technology).
Classical Model
The classical approach that worked for speech recognition was simply a store-and-find phonetics representation. It first stored the way a user would pronounce the phonetics and then retrieved them from the wave base to do the matching and recognition.
This was basically a two-step process:
1. Training
2. Recognition
In the training mode, the user was asked to read a few paragraphs of English (or the language in question) into the microphone. The training module would recognize the parts of speech that the user said and save them to a wave base with appropriate attributes attached.
This was basically an intelligent hit-and-trial method, i.e. the algorithm worked by guessing what part of speech the user was speaking at that moment. I have simplified the algorithmic procedure, for ease of understanding, as follows.
1. A small base state was prepared. This base state held the wave patterns of those very few common pronunciations that do not differ around the world, and the test paragraph always started with one of these very common words.
2. All this was arranged in a linear pattern, i.e. the recognizer would not record the current pointer position unless the previous one had been okayed by the engine.
3. With every ‘approximate’ (i.e. hit-and-trial) recognition, the pointer moved on to the next word, and to improve integrity and check the current flow, every now and then one of the universal constants would reappear in the context.
4. In the later versions of SAPI’s associated speech recognition engine, recursion was also available, i.e. the earlier recognized version was repeated several times to update the profile, and the same words would be shifted between the three basic locations of a sentence, a) beginning, b) middle and c) ending, because these three locations trigger different ways of speaking in human languages. Words were also asked for repeatedly at the same place, and the two wave patterns were combined with an OR operation.
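The simplified procedure above can be sketched in a few lines of Python. This is a toy model under loud assumptions, not the engine’s real algorithm: wave patterns are represented as small integer bitmasks, the `similarity` function (a Jaccard overlap) and the `min_score` cut-off are stand-ins I chose for illustration, and the names `train`, `script`, `captures` and `base_state` are hypothetical.

```python
def similarity(a, b):
    """Overlap of two bitmask patterns (an assumed stand-in for acoustic matching)."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0


def train(script, captures, base_state, min_score=0.5):
    """Walk the training script linearly, consuming one captured pattern per attempt.

    script     -- the words of the training paragraph, in reading order
    captures   -- bitmask patterns captured from the user, one per attempt
    base_state -- pre-built patterns for the few 'universal' words
    """
    wave_base = dict(base_state)   # start from the pre-built base state
    pointer = 0                    # index of the word currently being trained
    for pattern in captures:
        if pointer == len(script):
            break
        word = script[pointer]
        known = wave_base.get(word)
        if known is not None and similarity(known, pattern) < min_score:
            continue               # not okayed: stay on this word and retry
        # Okayed: merge with any earlier pattern using OR, then move on.
        wave_base[word] = pattern if known is None else known | pattern
        pointer += 1
    return wave_base, pointer


# "the" is the universal anchor word; it opens the script and reappears later.
wave_base, pointer = train(
    script=["the", "cat", "the", "cat"],
    captures=[0b0001, 0b1110, 0b0110, 0b1111, 0b0111],
    base_state={"the": 0b1111},
)
```

In this run the first capture is rejected against the anchor word, so the pointer stays put until the second capture is okayed, and the repeated word "cat" ends up with the OR of its two accepted patterns.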
The initial state of this speech recognition engine was pre-prepared by Microsoft engineers using a tool called Microsoft Linguistic Information Sound Editing Tool, which later came as a package for end users and developers as well. It looked something like this:
The wave table generated by this tool was saved in a wave base. Now wave bases were typical databases with all the functionality of search and sort operations available to recognize the speech.
A Word on ‘Threshold’
Now you might be thinking that if an entire database could be prepared, then this database could be used for speech synthesis as well. But this is not possible, because speech recognition still does not work on entire words.
I’ll explain this with an example:
Let’s take up the picture above. In this image the text ‘This is a test’ is being ‘phonemically’ broken up.
The first word – ‘This’ is broken into, ‘dh’, ‘ih’ & ‘s’.
The same word in another sentence, “What is this”, will be broken down as follows: ‘dh’, ‘ih’ & ‘ez’.
Two different breakages of the same word? This is nothing new for literary people, but it is a situation, an exception, that MUST be handled by a programmer.
This problem is again solved by an approximation function. Recognition is NOT by any means accurate, and a full word is never recognized exactly. After the wave is converted to phonetics, a search operation is launched in the wave base, and the word which matches beyond a ‘Threshold Limit’ is taken to be the intended word.
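A rough sketch of such a threshold search, in Python, might look like the following. It is only an illustration under assumptions: the tiny `WAVE_BASE` dictionary and its phoneme spellings for ‘is’ and ‘test’ are made up, the similarity measure is `difflib.SequenceMatcher` from the Python standard library (swapped in here, and not one of the engine’s actual threshold functions), and the 0.6 threshold is an arbitrary value.

```python
from difflib import SequenceMatcher

# Hypothetical wave base: each word maps to its stored phoneme sequence.
WAVE_BASE = {
    "this": ["dh", "ih", "s"],
    "is": ["ih", "z"],
    "test": ["t", "eh", "s", "t"],
}


def best_match(phonemes, wave_base, threshold=0.6):
    """Return (word, score) for the closest stored word, or (None, score)
    when even the closest word does not reach the threshold."""
    best_word, best_score = None, 0.0
    for word, stored in wave_base.items():
        score = SequenceMatcher(None, phonemes, stored).ratio()
        if score > best_score:
            best_word, best_score = word, score
    if best_score >= threshold:
        return best_word, best_score
    return None, best_score


# 'this' heard at the end of a sentence, with the last phoneme coming out differently.
word, score = best_match(["dh", "ih", "ez"], WAVE_BASE)

# Something that resembles nothing in the wave base closely enough.
unknown, low_score = best_match(["k", "ae", "t"], WAVE_BASE)
```

Here the first lookup still resolves to "this" because two of its three phonemes line up with the stored entry, while the second stays below the threshold and yields no word. Raising the threshold makes the search stricter; lowering it makes it more forgiving and more error-prone.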
These search preferences can be controlled from the Control Panel.
Typically they are present in Control Panel > Speech > Speech Recognition Tab > Any Speech Profile > Settings.
It looks something like this:
On the following link you can find the list of threshold functions.
http://www.geocities.com/mr_grac/TFUNC.HTM
Crude, but effective.
Those who are further interested can download SAPI 5.1 from Microsoft’s website. I have also extracted a sample SR engine from the SAPI SDK, with code written in VC++, which can be downloaded from the following link:
http://geocities.com/mr_grac/SSREC.ZIP
Well, this is how stuff was done in the internals of speech recognition systems, and this is about all that many people still know of it. After this, everything that I present might be new to a lot of people.
So I am signing off with this first part and a promise to put up the next part ASAP.
--Gunish Rai Chawla
lot of experience with speech recognition i see!
when did u start working on it anyways
... i think i met you at a software contest held at NRI Institute ...
and i am sure that you dont remember me ... but i was pretty impressed by your imagination and futuristic vision.
i hope that you post the next part soon enough so that i can get the complete idea of it !
-Pooja
hi,
u must be the same guy who i met at OIST Software expo... because there also you had brought an AI Project that was working in human interaction through speech. well, i asked u there itself how did u make it and u said 'its a long and boring story', i guess you didnt want to talk then. i had decided that time that one day i would find you and make you teach me how to work around with AI and speech in particular, so i guess i HAVE found you and now i want you to help me out, post your contact number or something so that i can call you and get in touch with you!
-
Love ur work
Swetha
Thanks for an interesting article
I have been pondering a related problem, Gunish: accurately recognizing Chinese handwritten characters.
In the past, I thought the answer was to raise the "level" of the geometry, where geometry consists of "metric" geometry (study of properties such as length which are not invariant under simple changes), "projective" geometry (study of invariants under projection and perspective) and "topology" (study of invariants under continuous, nonbreaking changes in shape).
I thought one might have a shot at recognizing handwritten letters in various languages by considering only topological features. The trouble is that (for example) the English letter A is topologically equivalent to O, therefore you need a separate geometry for "things that stick out like limbs" and "things that come to a point with respect to the rest of the letter".
At that point it seemed that the recognizer would have to be its own mathematician and dynamically evolve mathematical theories for recognizing variations of a letter.
It would in fact have the same sort of problems I faced in learning to write Chinese, where I would make embarrassing blunders in calligraphy on the new Shenzhen subway, pointed out by six year olds. As it happens, for example, the second bar of the symbol for the number three has to be shorter than the first bar.
Yet, when you examine signage in Hong Kong, the letters are systematically distorted!
Where we've gotten to with respect to both voice and handwriting recognition is the idea of "training" the machine, but consider some kid dumped by his parents at Lo Wu in 1955, who grows up in Hong Kong, and learns to read signage because it's that, or starve.
He's forming scientific theories and testing them.
A teacher of mine at Princeton, Gil Harman, did work in Lisp on models that would form theories and test them but as far as I know this work is commercialized only in the form of neural nets.
Anyway, certainly sounds like you are on track to something. My only warning, apart from the above considerations, is that relying on Microsoft APIs to be stable over multiple releases of Windows is a bad idea. Microsoft's policy in the past has been to adopt favored companies, and give them information on changes to APIs, and new APIs, only on condition that they become official Microsoft certified sites.
This is why there are no API calls at all anywhere in the software for Build Your Own .Net Language and Compiler. Instead, the utilities.DLL and windowsUtilities.DLL do the best they can to provide needed functionality exclusively in terms of the documented, and presumably stable, behavior of Visual Basic .Net.
This has its own dangers. I use "legacy" character input and output, for example, in file2String and string2File to translate files to and from strings. There's a chance that in some future release, the functionality exposed by Microsoft.VisualBasic may in some cases be downgraded out of existence if it is inconvenient to support.
However, wrapping the function in a clear, transparent, Saran wrap like "file2String" makes it obvious that this is the goal, and, the code can be replaced.
Since Build Your Own, I have started always qualifying ANY reference to functionality that is in Microsoft.VisualBasic with "Microsoft.VisualBasic.x" so as to be able to find possible exposures. Soon, it shall be time to simply remove the reference and the imports even from VB.Net projects because in many cases the functionality is available in more reliable (and, more internationalizable) form elsewhere.
Or, to junk VB. C# is .Net "equivalent" but VB encourages legacy and USA centric ways of thinking.
But, I have decided, not before a VB version of Spinoza is complete, because it is quite late in my own personal game, and developing a compiler for a new language is my highest priority.
You and I are in the same general boat. You are relying on Microsoft software, and I am relying on VB.Net having a future in the global marketplace.
However, my experience has long been that you can drive yourself batshit by backing up continually to do things "perfectly".
Our (same) boat definitely has a HOLE!
The reason I chose Microsoft technologies over others such as Java is simple: 'to avoid reinventing the wheel'. Microsoft has always provided tremendous support for their own stuff, more than anybody else, really. And yes, dealing with APIs is really a complicated issue which certainly requires more than a 'schozmo' developer like me! lol.
Anyways, you are perfectly right in pointing toward Neural Nets as the area under review by the most developers, whether it be handwriting recognition or any other human interface parser. The issue that I have come across is that Neural Nets are STILL not used by Microsoft for speech recognition in the SASDK's Telephone English Speech Recognition Server, nor in the ON-NOTE PDA / tablet PC handwriting recognition algorithm. My contribution is that I am close to perfecting a simulation of a neural net model of speech recognition patterns. I have constantly employed neural nets in various ideas over many years, from networking to vision parsing, but everything has been miserable, not even near the mark of an acceptable range.
Instead of giving up I have tried to use them in many places, and out of all the applications that can DIGEST a neural net, speech recognition has been the most promising.
The next part of my article is supposed to talk about how the SASDK works and how it can be considerably improved by employing neural nets not to SYNTHESIZE speech but to recognize it ... I am typically using a 3-layer network based on the KOHONEN-GROSSBERG model of neural nets. It's far from acceptable standards, but I still believe that this is how future models will definitely be built!
About the language recognition... I would point you towards a very insignificant product out on the net, but one with a very effective algorithm. These are called Yahoo crackers, and they typically work to steal Yahoo ID passwords by many techniques such as brute forcing etc. But since the introduction of the "Enter the Letters Printed Above" scheme on almost all websites, including Developer Dot Star, hackers have started building algorithms that can read distorted English. I have definitely come across at least two crackers that were able to consistently brute force Yahoo by successfully overriding the "Enter the Keywords Printed Above" check.
The introduction of curvature, such as a 3D ball beneath the x-y alphabet plane, introduces a significant amount of distortion in the letters, which normal recognizers fail to process. This definitely is a challenging problem, and the results of recognition algorithms only get halfway there... the recognition ratio being around just 53%.
I am sure this is what you mean in the Chinese language recognition.
I have plans to work on this issue when I get to the part of implementing vision in desktops... moreover... I once saw a program on the Discovery Channel which showed a robot at the MIT AI Lab that would 'LEARN TO RECOGNIZE and DIFFERENTIATE' between different shapes such as a ball or a cube. This was not algorithmically implemented but was done on the basis of a self-learning expert system clubbed with a neural network... the robot eventually tries to GRAB the ball, and if the ball moves away and it is not able to reach it, it extends its arm. This also caused the robot to realize the length of its OWN arm, a very interesting situation from AI's point of view, other than the classical Sheep Dog Simulation or the 'pick up a glass of water and put it back' situation.
At this point I would certainly like to say that there is no PERFECT algorithmic way to address this problem, thus we should be working more towards a 'Writing Programs that can write programs to write Programs' methodology...
Indeed this is the only way we can fill up the hole in our BOAT: by coming up with a solution that Microsoft is NOT interested in developing or investing in.
-
Gunish
When is the Next Part gonna be out!
Hi Gunish,
I found this article very intriguing, i hope that the next part comes out soon. Although the English was not very flamboyant, it was still worth reading,
anyways, best of luck with your new job, hope to read more soon!
Garima
Thanks! But how ?
hey Garima,
Always Wonder how you never fail to amaze me,There are so many 'Hows ' that i want answered :
How are you?
( Hell i wanna know that )
How did you find me out?
(i thought we were never gonna communicate again all of our lives!)
How did u know i got a new job?
( yeah i got a great job, as a Sr. Lead Software Specialist (.Net)
at www.gatesix.com the company is not that big but its a great startup. I gotta Build a Entire .NET Development Infrastructure and Team by my self, phew!)
send me a mail at gunish.chawla@gatesix.com
dont get lost now!
-Gunish
i need urgently
hello sir how to recognize speech from engine in vb.net
speech recognition using VB.net
how to write coding for speech recognition using VB.net
Learn about Speech Recognition in VB Dot NET
Plz sir send me detailed info and some idea how to program for speech recognition in VB Dot NET
Speech recognition with ANN + GA
Now i'm doing a project for the final course of my bachelor's in Comp Sci: "Speech recognition based on Artificial Neural network and Genetic Algorithm". I use VC++ for coding, and some main modules are completed, enough to demo. But now i want to build a new module that can draw the frequency spectrum of the original .wav file and of the signal after the FFT is applied. If anyone has source code for such a module, plz help me, plz send it to me by email at letamn@gmail.com . Thanks a lot. Mapcon
Speech Recognition: Do Your Homework
Editor's note: I am closing further comment on this thread since it has unfortunately only become a home for "help me with my homework" messages. (Gunish, I think you could make a nice living helping students with their speech recognition homework assignments.)
All the best,
Dan


hey this is ankush,,
nice one ... was very informative .... i hope you will be writing the next part soon though!