Network of Excellence
The Internet Portal for Speech Recognition Technology
Speech technology is becoming increasingly important in both personal and enterprise computing, where it is used to improve existing user interfaces and to support new means of human interaction with computers. Speech technology can allow hands-free use of computers and support access to many computing facilities away from the desk and over the telephone. The VoiceXML activity of the W3C will allow voice applications on the Internet or intranets to access speech capabilities on the user's machine. This will make it possible to enhance web sites with speech and to support new ways of browsing. However, VoiceXML requires a high-performance method for the recognition of phonemes. In this project a novel time domain approach, which will result in vastly improved speech recognition, will be implemented. “To some degree, speech recognition has reached the point at which increasing recognition accuracy from the current rate of 95 to 98% will require a quantum leap in recognition technology.” /N. Cravotta: VoiceXML distributes voice to the masses. EDN EUROPE, January 2002, p. 29-33/
RecoPhone satisfies many of the requirements of Framework 6. It will “contribute to strengthening European scientific and technological excellence” /E1/. It aims to “increase innovation and competitiveness in European businesses and industry and to contribute to greater benefits for all European citizens.” /E2/.
Europe has many different countries and even more languages. However, the phonemes of different languages, and especially the methods for recognising phonemes by information technology, are nearly the same. By recognising the phonemes of different languages with very high accuracy, RecoPhone will be an important bridge to draw together the people of Europe.
The research activities in RecoPhone “will focus on the future generation of technologies in which computers and networks will be integrated into the everyday environment, rendering accessible a multitude of services and applications through easy-to-use human interfaces” /E2/. Another focus of RecoPhone “is on ‘ambient intelligence’ for a broader inclusion of citizens in the Information Society” /E2/. The objectives of RecoPhone include improvements to the performance, cost-efficiency, functionality and adaptive capabilities of communications and computing technologies. Work will also lead to the next generation Internet.
An objective of RecoPhone is to “stimulate radically new content and media services and applications.” /E2/
The development and the first applications of this method will open a broad field of further IST research and applications: RecoPhone will be a Future and Emerging Technology to “nurture invention, creativity, and bright spark ideas.”
Essential Technologies and Infrastructures address the convergence of information processing, communications and networking technologies and infrastructures by using natural speech. European competitiveness in a knowledge-based economy is an increasingly important issue and is very evident in the plans for the 6th Framework Programme. The RecoPhone initiative is strongly driven by SME involvement and sees its market as world-wide, though the test bed for development is Europe in the first instance. RecoPhone will contribute to Europe’s competitiveness through its own cutting-edge technological development. This development relates to (1) the citizen, through a new quality of communication between citizens and monitoring devices; (2) natural communication with computers, which frees users from many existing constraints on working methods and organisation, including those imposed by distance, time and language, and which enables future e-work systems, organisations and e-business; (3) the development in Europe of a new generation of electronic and mobile devices with natural speech interaction; and (4) multimedia content and tools, through new forms of communication between user and computer over the Internet, such as search engines with natural speech interfaces.
“Despite very substantial improvements in speech technology in the last 40 years, speech synthesis and speech recognition technologies still have many limitations, and often do not meet the high expectations of users familiar with natural speech communication.” /Java™ Speech API, http://java.sun.com/marketing/collateral/speech.html/. The scientific literature and the practical solutions for speech recognition have been based on the method of frequency analysis /G. Roscher: Literature study: Signalanalyse, Zeitreihenanalyse, Sprachanalyse und Methoden der KI. ICS Dr. G. Roscher GmbH. (unpublished) 2000/. Methods working in the time domain were used in the past only for pitch tracking or zero-crossing analysis, and without success. Many variants of frequency analysis were based on the physiology of hearing and on the methods of speech recognition. Examples in the scientific magazine SCIENCE demonstrate this /M. P. Stryker: Sensory Maps on the Move. SCIENCE, Vol. 284, 7 May 1999, p. 925-6/, or more recently: “It is therefore not surprising that the range of frequency information and the frequency resolution of these surface-electrode ABI’s are not satisfactory and usually do not lead to an understanding of speech even after months and years of practice.” /SCIENCE, 8 February 2002, p. 1028/ (ABIs = auditory brainstem implants). But new results suggest that speech recognition performed in the time domain will offer significant improvements /S1/.
For over 15 years one partner of the consortium has been developing a method for the real-time recognition of signals in the time domain, in contrast to the established frequency domain methods such as the Discrete Fourier Transform (DFT). This method evaluates each event in the signal, that is, each extreme of amplitude, each extreme of slope and, optionally, each extreme of curvature, and transforms each event into a data structure named Virtual Source (in German: Virtuelle Quelle, VQ). The result of this transformation is a description of the signal as a sequence of VQs. Further steps build up a hierarchical system of chained lists of VQs, named SuperPeaks, Cycles and Classes. This description of the signal can be easily manipulated by mathematical methods and easily recognised. As Albert Einstein said, “We should make things as simple as possible, but not simpler”. Previous researchers have not appreciated the sophisticated time domain performance of the hard-wired, parallel-processing human auditory system and cortex. All the established groups have used frequency domain methods! These methods are approximations which lack the accuracy and performance of the human ears and brain.
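The event extraction described above can be illustrated with a minimal sketch. It is not the consortium's implementation: the `VirtualSource` fields, the function names and the use of simple first differences as "slope" are assumptions made only for illustration, and the optional curvature events and the higher levels (SuperPeaks, Cycles, Classes) are left out.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VirtualSource:
    """Hypothetical record for one signal event (fields are assumptions)."""
    kind: str     # "amplitude" or "slope"
    index: int    # sample index at which the extreme occurs
    value: float  # value of the signal (or of its slope) at that index


def local_extrema(values: Sequence[float]) -> List[int]:
    """Return indices of strict local maxima and minima of a sequence."""
    found = []
    for i in range(1, len(values) - 1):
        rising_in = values[i] - values[i - 1]
        rising_out = values[i + 1] - values[i]
        # A sign change of the difference marks a local maximum or minimum.
        if rising_in * rising_out < 0:
            found.append(i)
    return found


def extract_events(signal: Sequence[float]) -> List[VirtualSource]:
    """Describe a sampled signal as a time-ordered sequence of events."""
    events = [VirtualSource("amplitude", i, signal[i])
              for i in local_extrema(signal)]
    # Slope is approximated by the first difference between neighbours.
    slope = [signal[i + 1] - signal[i] for i in range(len(signal) - 1)]
    events += [VirtualSource("slope", i, slope[i])
               for i in local_extrema(slope)]
    events.sort(key=lambda e: e.index)
    return events


if __name__ == "__main__":
    for event in extract_events([0, 1, 3, 2, 0, -1, 0, 2]):
        print(event)
```

The sketch works entirely on neighbouring samples, so it needs no analysis window and no transform, which is the property the time domain approach relies on for real-time use.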
If you hear a phoneme such as “m” or “n” spoken by an unknown speaker without additional context, you can differentiate between “m” and “n” with great reliability. If you see the visual representation of these signals on the computer screen, you cannot differentiate between “m” and “n” with such high reliability, especially if the pitch of the spoken phonemes or the speakers are different. If you hear an internationally well-known artist such as Heinz Rühmann in Germany or John Wayne in the USA, you can recognise with high reliability a speaker you last heard years ago. These examples serve to demonstrate the high performance of the human ears and brain. Each detail of the phonemes, words, and sentences spoken by a specific speaker can be recognised and is stored in the brain!
This high performance requires the application of time domain methods for signal recognition of the highest accuracy, which employ the latest technologies in the field of Database and Knowledge Management Systems (KMS). Each incoming signal is stored in the database system. These signals are segmented and indexed by time and by segment number. The classification is achieved using the KMS, which acquires personal knowledge through direct communication between the user and the KMS. In the first stage, the KMS is only related to phonemes and the changes between phonemes and pauses for different languages, dialects and speakers. If possible, this KMS will be extended for speech recognition.
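As a rough illustration of storing segmented signals indexed by time and segment number, the following sketch uses an in-memory SQLite database. The table layout, the function names, the fixed-length segmentation and the free-text `label` column (standing in for knowledge supplied by the user) are all assumptions; the text does not specify how RecoPhone segments signals or how the KMS represents its knowledge.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with one row per signal segment."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE segments ("
        " segment_no INTEGER PRIMARY KEY,"
        " start_time REAL NOT NULL,"
        " samples TEXT NOT NULL,"
        " label TEXT)"
    )
    conn.execute("CREATE INDEX idx_segments_time ON segments(start_time)")
    return conn


def store_signal(conn, samples, sample_rate, segment_len):
    """Split one signal into fixed-length segments and store them.

    Fixed-length segmentation is an assumption made for this sketch.
    """
    for n, start in enumerate(range(0, len(samples), segment_len)):
        chunk = list(samples[start:start + segment_len])
        conn.execute(
            "INSERT INTO segments (segment_no, start_time, samples)"
            " VALUES (?, ?, ?)",
            (n, start / sample_rate, json.dumps(chunk)),
        )
    conn.commit()


def label_segment(conn, segment_no, label):
    """Record a label supplied by the user for one segment."""
    conn.execute(
        "UPDATE segments SET label = ? WHERE segment_no = ?",
        (label, segment_no),
    )
    conn.commit()


def segments_between(conn, t0, t1):
    """Return (segment_no, start_time, label) for t0 <= start_time < t1."""
    rows = conn.execute(
        "SELECT segment_no, start_time, label FROM segments"
        " WHERE start_time >= ? AND start_time < ? ORDER BY start_time",
        (t0, t1),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_store()
    store_signal(conn, list(range(10)), sample_rate=4, segment_len=4)
    label_segment(conn, 1, "m")
    print(segments_between(conn, 0.5, 2.5))
```

Keeping both the segment number and the start time makes it possible to look up a segment either by its position in the sequence or by when it occurred, which matches the two indices named in the text.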
The time domain recognition procedure gives the improved accuracy needed to achieve the same high-performance phoneme recognition as the human ears and brain, leading to higher speech recognition accuracy. The phoneme recognition rate will asymptotically approach that of the human ear and brain (99.99%).
The RecoPhone project will result in a quantum leap in the quality of speech recognition technology!
to Submit Expressions of Interest - An opportunity for Europe’s research community to
help prepare for the first calls of FP6. GUIDE FOR SUBMITTERS. Identifier:
Annex 1: Priority
thematic areas of research in FP6.
S1: Marcia Barinaga: “New Ion Channel May Yield Clues to Hearing.” Science, 24 March 2000, Vol. 287, p. 2132-2133 and Science, 24 March 2000, Vol. 287, p. 2229-2234: “By studying the electrical currents passing through the membranes of hair cells as they are stimulated, they learned that hair-cell channels are stunningly fast, opening up within microseconds, compared to the milliseconds needed by biochemically activated channels. They are also exquisitely sensitive to the slightest movement and to direction; they open when the tip of the cell’s cilia bundle is deflected by a mere atom’s width – akin to bending the tip of the Eiffel Tower by the width of your thumb. If the cilia bundle moves one way, the channel opens; the other way and it shuts. The channels are also able to register tiny cilia movements on top of a larger constant deflection – a trait that lets us discern meaningful sounds from background noise.”
The modelling of hearing by RecoPhone will lead to a new quality of ABIs that work in the time domain!