Noise Adaptive Stream Weighting in Audio-Visual Speech Recognition
Martin Heckmann*, Frédéric Berthommier†, Kristian Kroschel*
* Institut für Nachrichtentechnik, Universität Karlsruhe, Kaiserstraße 12, 76128 Karlsruhe, Germany, {heckmann, kroschel}@int.uni-karlsruhe.de
† Institut de la Communication Parlée (ICP), Institut National Polytechnique de Grenoble, 46, Av. Félix Viallet, 38031 Grenoble, France, bertho@icp.inpg.fr
May 23, 2002
Abstract
It has been shown that the integration of acoustic and visual information, especially in noisy conditions, yields improved speech recognition results. This raises the question of how to weight the two modalities in different noise conditions. Throughout this article we develop a weighting process adaptive to various background noise situations. In the presented recognition system audio and video data are combined following a Separate Integration (SI) architecture. A hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) system is used for the experiments. The neural networks were in all cases trained on clean data. Firstly, we evaluate the performance of different weighting schemes in a manually controlled recognition task with different types of noise. Next we compare different criteria to estimate the reliability of the audio stream. Based on this, a mapping between the measurements and the free parameter of the fusion process is derived and its applicability is demonstrated. Finally, the possibilities and limitations of an adaptive weighting are compared and discussed.

Keywords: Audio-visual, Speech Recognition, Adaptive Weighting, Robust Recognition, Multi-stream Recognition, ANN/HMM

1 Introduction
The limited performance of Automatic Speech Recognition (ASR) systems in the presence of background noise still restricts their usability in many scenarios. Different attempts have been made to increase the robustness of ASR systems, but all fall short in comparison to human performance. It is well known that the movement of the lips plays an important role in speech perception [1][2]. The contribution of the lips is especially high in noisy speech [3][4]. This is due to the fact that visual speech mainly conveys information about the place of articulation, which is most easily confused in the audio modality when noise is present [5]. Motivated by these findings many researchers have tried to integrate the information transmitted by lip movement into ASR systems (see [6]-[11] and [12] for a review). Already the first systems showed noticeable improvements of the recognition scores in noise when the audio and video signal are jointly evaluated.

Figure 1: Four different fusion architectures for audio-visual recognition: a) Direct Integration, b) Separate Integration, c) Motor Recoding, d) Dominant Recoding.

Since then significant progress has been made, and currently a recognition system using both audio and video data can outperform humans having only access to the audio signal at low Signal to Noise Ratio (SNR) [13]. Despite this high performance of audio-visual ASR systems, there is still a long way to go before these systems reach performance comparable to humans in an identical task. Throughout this article we mainly want to focus on the adaptive fusion of audio and video data under different noise conditions. We start with a quick look at different possible fusion architectures and point out why we have chosen a Separate Integration architecture, where fusion takes place on a decision level. Next we present four different fusion schemes of audio and video decisions. A comparison of these fusion schemes in a wide range of noise conditions allows us to identify the best scheme. In order to be adaptive to changing noise conditions there is need for a criterion to evaluate the reliability of the audio channel. We present three different reliability criteria and compare them in different noise conditions. We conclude this article with a discussion of the results of our comparisons. Throughout this discussion special attention is paid to the question of whether adaptive weights on the audio and video stream are necessary or if it is sufficient to simply use one fixed weight for all situations.

2 Fusion of Audio and Video Data
When looking at the fusion of audio and video data for audio-visual speech recognition, the first question to be addressed is where the fusion of the data takes place. Several different architectures for the fusion process have been proposed [5][14]. The first is integration on the feature level. In this case audio and video features are directly combined into a larger feature vector which is then used to identify the corresponding phoneme. This is also referred to as Direct Integration (DI) (see also Fig. 1). In contrast to this, fusion can also take place after independent identification of each stream. Hence the fusion is rather a fusion of identification results. This is called Separate Integration (SI). Between these two extremes lies the so-called Motor Recoding (MR), in which the input features are first transformed into a common representation and the classification is then based upon the combined features in this representation. As the common representation, to which both audio and video features are mapped, the articulatory gesture parameters are chosen. A problematic point when using Motor Recoding is the choice of the representation of the articulatory gestures. In the fourth fusion architecture one stream is dominant. In this case the decision is based on the dominant stream and the second stream is only used to rescore the identification results of the dominant stream. This is called Dominant Recoding (DR). Since the audio stream conveys much more information than the video stream, it is naturally chosen as the dominant stream. When comparing the different fusion architectures, Separate Integration shows some characteristics making it the best choice for our task. An important property is that the fusion of the two input streams can be controlled by weighting the streams. The code elements in Fig. 1 are the phonemes H_i, to which we can assign a posteriori probabilities P(H_i | x_A, x_V) for their occurrence given the acoustic feature vector x_A and the video feature vector x_V (see Fig. 2).

Figure 2: Weighting of audio and video a posteriori probabilities in a Separate Integration architecture to take the changing reliability of the input streams into account.

These a posteriori probabilities, or to be more precise their estimates \hat{P}, are generated by an Artificial Neural Network (ANN) [15] in each time frame. Therefore the SI architecture, in combination with an ANN, allows an adaptive weighting of the input streams depending on their reliability. Adaptation of the weights can be done once per scenario as well as for each single frame. Furthermore, comparisons of SI with other architectures showed superior performance of SI [16][17][18]¹. For these reasons we decided to use an SI architecture for our recognition experiments. Once we have chosen the SI architecture, the next question to tackle is how the fusion of the identification results takes place. The quality of the estimate of the a posteriori probabilities is related to the match of the training and test conditions. As training was in all cases performed on clean data, the reliability of these estimates, particularly in the case of the audio path, strongly depends on the noise present in the test condition. In order to cope with the changing reliability, a weighting of the audio and video probabilities is desirable.

2.1 Unweighted Bayesian Product
The simplest way to combine audio and video data is to follow Bayes' rule and multiply the audio and video a posteriori probabilities to derive the combined probabilities. This approach is valid in a probabilistic sense if the audio and video data are independent. Perceptual studies showed that in human speech perception audio and video data are treated as class conditionally independent [19][20]. Under this hypothesis

    p(x_A, x_V | H_i) = p(x_A | H_i) \, p(x_V | H_i)    (1)

When applying Bayes' rule we can write the desired a posteriori probability of the phoneme H_i as:

    P(H_i | x_A, x_V) = \frac{P(H_i | x_A) \, P(H_i | x_V)}{P(H_i)} \cdot \frac{p(x_A) \, p(x_V)}{p(x_A, x_V)}    (2)

¹ Regarding the comparison of DI and SI, these results were confirmed by our own experiments but are not reported here.

Replacing the probabilities P by their estimates \hat{P} leads to the representation of the, as we want to call it, Unweighted Bayesian Product (UBP)

    \hat{P}_{UBP}(H_i | x_A, x_V) = \frac{1}{S} \, \frac{\hat{P}(H_i | x_A) \, \hat{P}(H_i | x_V)}{\hat{P}(H_i)}    (3)

where the terms independent of the actual phoneme are replaced by the normalization factor

    S = \sum_{j=1}^{N} \frac{\hat{P}(H_j | x_A) \, \hat{P}(H_j | x_V)}{\hat{P}(H_j)}    (4)

with N being the number of phonemes. This fusion scheme is also the core of the Fuzzy Logical Model of Perception (FLMP) [21], which is used to model human perception.
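To make Eqs. 3 and 4 concrete, the following Python sketch fuses two posterior vectors under the class-conditional independence assumption. The function and variable names, as well as the uniform priors in the toy example, are our own illustrative choices and not part of the original system.

```python
import numpy as np

def unweighted_bayesian_product(p_audio, p_video, priors):
    """Fuse audio and video posteriors following Eqs. 3/4 (UBP).

    p_audio, p_video : estimated posteriors P(H_i | x_A), P(H_i | x_V)
    priors           : estimated a priori probabilities P(H_i)
    """
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    priors = np.asarray(priors, dtype=float)

    unnormalized = p_audio * p_video / priors      # numerator of Eq. 3
    return unnormalized / unnormalized.sum()       # division by S (Eq. 4)

# Toy example with three phoneme classes and uniform priors.
p_a = np.array([0.7, 0.2, 0.1])
p_v = np.array([0.5, 0.4, 0.1])
print(unweighted_bayesian_product(p_a, p_v, np.full(3, 1 / 3)))
```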

2.2 Standard Weighted Product

In order to deal with varying reliability levels of the input streams, different authors introduced a weighted fusion where different weights are put on the audio and video channel. The weighting of the a posteriori probabilities proposed in the literature [17][18] follows (we want to refer to this as Standard Weighted Product)

    \hat{P}_{SWP}(H_i | x_A, x_V) = \frac{1}{S} \, \hat{P}(H_i | x_A)^{\lambda} \, \hat{P}(H_i | x_V)^{1-\lambda}    (5)

The assumption of conditional independence is approached for equal a priori probabilities of the phonemes or words, respectively, depending on the place of fusion. It is not actually fulfilled, since equal weights on both streams correspond to weights of 1/2 instead of 1. In addition to the intermediate setting when the audio and video stream contribute equally to the recognition, a further two distinct settings of the weights exist. When the SNR is very low, the estimation in the audio path completely fails. Therefore the final a posteriori probability should only depend on the video features, which is achieved with λ = 0:

    \hat{P}_{SWP}(H_i | x_A, x_V) = \hat{P}(H_i | x_V)    (6)

Similarly, for very high SNR the estimation in the audio path is in general much better than the one in the video path, and consequently

    \hat{P}_{SWP}(H_i | x_A, x_V) = \hat{P}(H_i | x_A)    (7)

with λ = 1. The most common recognition systems are based on Gaussian Mixture Hidden Markov Models (GM/HMM). These produce likelihoods instead of a posteriori probabilities. Weighting of these likelihoods corresponds to a weighting of Eq. 1 [16][9][10]. This approximates the assumption of conditional independence independently of the a priori probabilities. Equal weights of 1/2 instead of 1 entail that not the product of the probabilities but the square root of the product is evaluated when both the audio and the video stream have the same weight. To resolve this problem we modify the parameterization of the Standard Weighted Product. We introduce the parameters α and β, which both depend on a third parameter c according to

    \alpha = \max(0, \min(1, 1 + c)), \qquad \beta = \max(0, \min(1, 1 - c))    (8)

yielding

    \hat{P}_{SWP_{\alpha\beta}}(H_i | x_A, x_V) = \frac{1}{S} \, \hat{P}(H_i | x_A)^{\alpha} \, \hat{P}(H_i | x_V)^{\beta}    (9)

Similar to λ in the previous parameterization, the parameter c varies with the SNR, and it determines the contribution of the audio and video stream to the final probability.
Figure 3: Dependence of the parameters α and β on the fusion parameter c.

When the a posteriori probabilities from the audio and video path both have the same weight, i.e. for c = 0, α = 1 and β = 1 (see also Fig. 3). For c = −1, at very low SNR, α = 0 and β = 1, and when only the audio signal carries information, for c = 1, α = 1 and β = 0. Hence this takes those situations into account where we only want to rely on one of the two streams. In contrast to the original parameterization of the Standard Weighted Product, to which we want to refer as SWP_λ, this implementation will be referred to as SWP_αβ.

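A minimal sketch of the SWP_αβ fusion of Eq. 9 is given below, using the piecewise-linear α(c) and β(c) of Eq. 8 as we reconstructed them from Fig. 3 (saturation at 0 and 1); the exact functional form used by the authors may differ in detail.

```python
import numpy as np

def alpha_beta(c):
    """Map the fusion parameter c to the exponents (alpha, beta) of Eq. 8,
    as reconstructed from Fig. 3: alpha = beta = 1 at c = 0, video only
    (alpha = 0, beta = 1) at c = -1, audio only (alpha = 1, beta = 0) at c = +1."""
    alpha = np.clip(1.0 + c, 0.0, 1.0)
    beta = np.clip(1.0 - c, 0.0, 1.0)
    return alpha, beta

def swp_alpha_beta(p_audio, p_video, c):
    """Standard Weighted Product with the alpha/beta parameterization (Eq. 9)."""
    alpha, beta = alpha_beta(c)
    unnormalized = p_audio ** alpha * p_video ** beta
    return unnormalized / unnormalized.sum()

p_a = np.array([0.7, 0.2, 0.1])
p_v = np.array([0.5, 0.4, 0.1])
for c in (-1.0, 0.0, 1.0):   # video only, equal weights, audio only
    print(c, swp_alpha_beta(p_a, p_v, c))
```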

2.3 Geometric Weighting

A concept integrating the class conditional independence of audio and video data expressed in Eq. 1 and the idea of noise dependent stream weighting expressed in Eq. 5 is the Geometric Weighting [22]:

    \hat{P}_{GW}(H_i | x_A, x_V) = \frac{1}{S} \, \frac{\hat{P}(H_i | x_A)^{\alpha} \, \hat{P}(H_i | x_V)^{\beta}}{\hat{P}(H_i)^{\alpha + \beta - 1}}    (10)

The normalization factor S is determined by evaluating the condition \sum_i \hat{P}(H_i | x_A, x_V) = 1. Factors only dependent on x_A and x_V are eliminated by the normalization:

    S = \sum_{j=1}^{N} \frac{\hat{P}(H_j | x_A)^{\alpha} \, \hat{P}(H_j | x_V)^{\beta}}{\hat{P}(H_j)^{\alpha + \beta - 1}}    (11)

The result of the sum in Eq. 11 is independent of H_i and hence only depends on the fusion weights α and β. For the Geometric Weighting we solely employed the parameterization with α and β as defined in Eq. 8. Consequently, for c = 0 the assumption of conditional independence as stated in Eq. 1 is fulfilled when equal weight is put on the audio and video stream. Similar to the description in the previous section, for c = −1 the final probability only depends on the a posteriori probability of the video stream, and for c = 1 it only depends on the audio stream (see also Fig. 3).
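As an illustration of Eqs. 10 and 11, the following hedged Python sketch implements the Geometric Weighting; the posterior and prior vectors are placeholders and the α/β mapping follows the reconstruction of Eq. 8.

```python
import numpy as np

def geometric_weighting(p_audio, p_video, priors, c):
    """Geometric Weighting (Eq. 10/11): exponents alpha, beta from Eq. 8,
    prior correction with exponent alpha + beta - 1, normalization over classes."""
    alpha = np.clip(1.0 + c, 0.0, 1.0)
    beta = np.clip(1.0 - c, 0.0, 1.0)
    unnormalized = (p_audio ** alpha) * (p_video ** beta) / priors ** (alpha + beta - 1.0)
    return unnormalized / unnormalized.sum()

# At c = 0 this reduces to the Unweighted Bayesian Product, at c = -1 to the
# video posteriors and at c = +1 to the audio posteriors.
p_a = np.array([0.7, 0.2, 0.1])
p_v = np.array([0.5, 0.4, 0.1])
print(geometric_weighting(p_a, p_v, np.full(3, 1 / 3), 0.0))
```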

2.4 Full Combination

Findings in human speech perception showed that the error rate for phoneme recognition using the full frequency range is approximately equal to the product of the error rates using only non-overlapping frequency sub-bands [23][24]. This is known as the so-called Product of Errors (POE) Rule. Motivated by this rule, multi-stream recognition systems were built which decompose the speech signal into multiple sub-bands, perform an identification of the phoneme for each sub-band and then combine the results [25]. In general the performance gain of this approach was not very high in noise and was paid for by a loss of performance on clean speech. The loss on clean speech is alleviated by the so-called Full Combination (FC) approach [26]. Here phoneme identification is performed for all combinations of sub-bands, including also the full frequency range, and the identification results are then combined linearly. When applying this concept to audio-visual recognition we have to consider two input streams. Taking all combinations of the input streams plus the empty stream containing only the a priori probabilities into account, we have a total of four streams: the audio, the video, the combined audio-visual and the empty stream. Hence three ANNs have to be trained to generate the corresponding probabilities. The weighting of the streams is performed by a linear combination of the a priori and a posteriori probabilities according to

    \hat{P}_{FC}(H_i | x_A, x_V) = w_{AV} \, \hat{P}(H_i | x_A, x_V) + w_A \, \hat{P}(H_i | x_A) + w_V \, \hat{P}(H_i | x_V) + w_0 \, \hat{P}(H_i)    (12)

In order to reduce the number of neural networks to be trained on each independent stream (which grows exponentially with the number of streams), the so-called Full Combination Approximation (FCA) was introduced [26]. Here class conditional independence is assumed between the streams, and hence the identification result for a combination of streams can be derived from the identification results of the individual streams (compare Eq. 2). Then the a posteriori probability of the combined audio-visual stream is evaluated according to

    \hat{P}_{FCA}(H_i | x_A, x_V) = w_{AV} \, \frac{\hat{P}(H_i | x_A) \, \hat{P}(H_i | x_V)}{S \, \hat{P}(H_i)} + w_A \, \hat{P}(H_i | x_A) + w_V \, \hat{P}(H_i | x_V) + w_0 \, \hat{P}(H_i)    (13)

with S as defined in Eq. 4. The first term in Eq. 13 results from the postulation of class conditional independence, and the other terms ensure the same behavior as Geometric Weighting when only one of the streams is reliable. The w are the weights with which the individual streams contribute to the final probability. They are set to w_AV = αβ, w_A = α(1 − β), w_V = (1 − α)β and w_0 = (1 − α)(1 − β), with α and β as given in Eq. 8. When the estimation process for the different probabilities is not consistent, and hence the sum over all probabilities does not equal one, an independent normalization for each stream is necessary. At c = 0 the assumption of conditional independence is fulfilled. Similarly, for c = 1 and c = −1 all weight is put on the audio or video stream, respectively. In our implementation the degrees of freedom of the FCA and the Geometric Fusion are limited to one. This might not be optimal, but a multidimensional optimization with multiple degrees of freedom would be much more costly to perform.
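The following sketch illustrates the FCA combination of Eq. 13. The bilinear assignment of the stream weights (w_AV = αβ and so on) is our reconstruction of the garbled original; it satisfies the special cases stated in the text (all weight on the combined stream at c = 0, on a single stream at c = ±1) but should be treated as an assumption rather than the authors' exact choice.

```python
import numpy as np

def fca_fusion(p_audio, p_video, priors, c):
    """Full Combination Approximation (Eq. 13) with the reconstructed
    bilinear weights: w_AV = alpha*beta, w_A = alpha*(1-beta),
    w_V = (1-alpha)*beta, w_0 = (1-alpha)*(1-beta)."""
    alpha = np.clip(1.0 + c, 0.0, 1.0)
    beta = np.clip(1.0 - c, 0.0, 1.0)

    # Combined audio-visual stream approximated via class-conditional
    # independence (first term of Eq. 13, normalized as in Eq. 4).
    p_av = p_audio * p_video / priors
    p_av /= p_av.sum()

    w_av = alpha * beta
    w_a = alpha * (1.0 - beta)
    w_v = (1.0 - alpha) * beta
    w_0 = (1.0 - alpha) * (1.0 - beta)

    fused = w_av * p_av + w_a * p_audio + w_v * p_video + w_0 * priors
    return fused / fused.sum()

p_a = np.array([0.7, 0.2, 0.1])
p_v = np.array([0.5, 0.4, 0.1])
print(fca_fusion(p_a, p_v, np.full(3, 1 / 3), 0.3))
```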

3 The Recognition Task
As a common task to evaluate the presented fusion schemes we have chosen the recognition of continuously uttered English numbers. This task comprises many of the problems of continuous speech recognition whilst still being not too costly to implement. One of the distinct features of a continuous recognition task is the necessity to discriminate between speech segments and silence passages, which is especially problematic in noisy speech. Due to the very limited availability of audio-visual speech data we had to record a new database to set up our system.

3.1 The audio-visual database
For the recording of the database, selected utterances from NUMBERS95 [27] were chosen and repeated by a single native English-speaking male subject. The database contains 1712 sentences or 6432 words. It was subdivided into two subsets of similar size for training and final recognition. Synchronous recordings of the speech signal and video images of the head and mouth region at 50 frames per second were taken. Recordings were made on BETACAM video and standard audio tapes and were A/D converted off-line.

3.2 The recognition system
Our audio-visual speech recognition system is based on a hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) structure. ANN/HMM hybrid systems represent an alternative concept for continuous speech recognition to pure HMM systems, giving competitive recognition results [28]. As already mentioned in the previous section, our system follows an SI architecture (see Fig. 4). The implementation of our system was carried out using the tool STRUT from the TCTS lab, Mons, Belgium [29].
Figure 4: Implementation of the SI audio-visual speech recognition system: RASTA-PLP audio features and chroma-key lip features are classified by separate ANNs, whose outputs are fused (controlled by the stream reliability) and decoded by an HMM.

The emphasis of our research lies on the fusion of the audio and video data during the recognition process, which requires large amounts of data to obtain meaningful results. Therefore, following [16], we rely on geometric lip features and simplified the extraction of the features significantly by a chroma-key process. The chroma-key process requires coloring the speaker's lips with blue lipstick. Due to the coloring, the lips can then be located easily and their movement parameters can be extracted in real time. As lip parameters, the outer lip width, inner lip width, outer lip height, inner lip height, lip surface area and inner mouth area surrounded by the lips were chosen. The video parameters were linearly interpolated from the original 50 Hz in order to be synchronous with the audio data. Following the interpolation, each lip parameter was low-pass filtered to remove high frequency noise introduced by the parameter extraction and to further smooth the results of the interpolation. Audio feature extraction was performed using RASTA-PLP [30].
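As an illustration of the video feature post-processing described above (interpolation of the 50 Hz lip parameters followed by low-pass smoothing), a small sketch follows; the target rate, the moving-average filter and all numeric values are assumptions for this example, not the paper's actual settings.

```python
import numpy as np

def resample_and_smooth(lip_track, fps_in, fps_out, kernel_len=5):
    """Upsample one lip parameter track (e.g. inner lip height) from the
    video frame rate to a higher feature rate by linear interpolation and
    smooth it with a simple moving-average low-pass filter."""
    n_in = len(lip_track)
    t_in = np.arange(n_in) / fps_in
    t_out = np.arange(0.0, t_in[-1], 1.0 / fps_out)
    upsampled = np.interp(t_out, t_in, lip_track)
    kernel = np.ones(kernel_len) / kernel_len
    return np.convolve(upsampled, kernel, mode="same")

track = np.array([4.0, 4.5, 6.0, 5.5, 3.0, 2.5])  # hypothetical lip heights in mm
print(resample_and_smooth(track, fps_in=50.0, fps_out=100.0)[:5])
```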

To take temporal information into account, several successive time frames of the audio and video feature vectors are presented simultaneously at the input of the corresponding ANNs. The concept of visemes was not used. Each acoustic articulation is assumed to have a synchronously generated corresponding visual articulation. Hence the recognition process is based on phonemes. Individual phonemes are modeled via left-right HMM models. The number of states of the HMMs used to represent the different phonemes was adapted to the mean length of the corresponding phoneme. Word models were generated by the concatenation of the corresponding phoneme models. Recognition is based on a dictionary with the phonetic transcription of 30 English numbers. Complete sentences containing a sequence of numbers were presented to the system during the recognition process. The sentences consist of free format numbers, making a grammar model unnecessary. Training of the ANNs was in all cases performed on clean data. During our recognition tests we added different types of environmental noise at different SNR levels to the audio signal, resulting in a range of test conditions. Adding noise to the recorded signal instead of adding it during the recordings does not take into account the changes in articulation speakers produce when background noise is present [31] and therefore generates somewhat unrealistic scenarios. On the other hand, it opens the possibility to test exactly the same utterances in different noise conditions and tremendously facilitates the recording of the data. As additive noise we have chosen white noise, noise recorded in a car at 120 km/h, babble noise, and two types of factory noise taken from the NOISEX database [32].

4 Evaluation of the Fusion Schemes
The first step in the evaluation is to compare the fusion schemes under identical conditions using a manual setting of the optimal weights.

4.1 Manual weight adaptation

Throughout this first stage of evaluation, the fusion parameter c in the Standard Weighted Product with α and β parameterization (SWP_αβ), the FCA and the Geometric Fusion was adapted manually at each SNR level in order to get the best possible recognition score. Thus the weight combination and not the weight estimation method is compared at this point. During a test in a particular noise condition, the fusion parameter was held constant over all frames. Tests in that particular noise condition with different settings of the fusion parameter were repeated until the minimum Word Error Rate (WER) was reached. For the Standard Weighted Product with its original parameterization the parameter λ instead of c was adapted to each noise scenario. In the following evaluation of the different fusion schemes, we will use the Relative Word Error Rate (RWER) instead of the WER. The reference point of the RWER is the WER resulting from a fusion according to the Standard Weighted Product with the original parameterization (SWP_λ) for the corresponding noise scenario. The RWER at a given noise type n and SNR level is defined as:

    RWER(n, SNR) = \frac{WER(n, SNR) - WER_{SWP_\lambda}(n, SNR)}{WER_{SWP_\lambda}(n, SNR)}    (14)

To take all noise conditions into account, the mean relative error for a particular fusion scheme over all noise conditions was calculated:

    \overline{RWER} = \frac{1}{N_n N_{SNR}} \sum_{n} \sum_{SNR} RWER(n, SNR)    (15)

An improvement compared to the Standard Weighted Product results in a negative RWER.
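A small helper for Eqs. 14 and 15 is sketched below; the arrays in the example are largely hypothetical (the first row borrows a few car-noise values from Tab. 2) and only illustrate the sign convention: negative values mean an improvement over the SWP_λ reference.

```python
import numpy as np

def relative_wer(wer, wer_reference):
    """Relative Word Error Rate (Eq. 14): negative values indicate an
    improvement over the SWP_lambda reference, positive values a degradation."""
    wer = np.asarray(wer, dtype=float)
    wer_reference = np.asarray(wer_reference, dtype=float)
    return (wer - wer_reference) / wer_reference

# Mean RWER over noise types and SNR levels (Eq. 15); arrays have shape
# (n_noise_types, n_snr_levels) and are mostly made up for illustration.
wer = np.array([[20.5, 8.7, 2.0], [22.0, 9.5, 2.2]])
wer_ref = np.array([[22.1, 19.8, 4.6], [23.0, 20.1, 5.0]])
print(relative_wer(wer, wer_ref).mean())
```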

Method   Audio       SWP_λ     SWP_αβ     UBP       FC        FCA        GW
Error    137.6±2.6   0.0±2.4   -0.9±2.4   9.2±2.1   -2.2±2.3  -28.9±2.0  -30.2±2.0

Table 1: Average of the relative error in percent for audio-alone recognition and the fusion schemes Standard Weighted Product (SWP) parameterized with λ (SWP_λ) and with α and β (SWP_αβ), Geometric Weighting (GW), Full Combination (FC), Full Combination Approximation (FCA), and Unweighted Bayesian Product (UBP), over all noise types and SNR levels. Additionally the ± confidence interval for the relative error is given.

Tab. 1 compares the different mean relative errors. Both the FC and the FCA were implemented, but due to the very poor performance in noise of the identification network trained on the combined clean audio and video features, resulting from training on clean data, the performance of the FC was significantly worse than that of the FCA. For the Standard Weighted Product, the parameterization with α and β is compared to the original parameterization with λ, which serves as the reference point for the evaluation of the relative error. The parameterization with α and β, which results in equal weights of 1 at c = 0 instead of 1/2 at λ = 1/2, leads to a small but consistent improvement over all noise types. The results are given in detail in Fig. 5 and Tab. 2, which show the graphical and numerical results when car noise was added to the audio signal. For comparison, the scores for the audio and video stream alone are also given. Due to its poor performance the FC is not included in this comparison. The SWP_λ is included to serve as a reference point.

Figure 5: Word error rates for each individual stream (Audio, Video) and for audio-visual recognition with different fusion schemes (UBP, SWP_λ, FCA, GW) at SNR levels from −12 dB to clean speech. The fusion parameter was set by hand. Car noise was added to the audio channel.

            -12 dB   -6 dB   0 dB   6 dB   12 dB   clean
Audio        98.4     76.9   28.4    5.9     1.5     0.8
SWP_λ        22.1     22.1   19.8    4.6     1.4     0.8
UBP          67.5     32.6    8.8    2.1     1.0     0.7
FCA          22.1     21.6    8.8    2.0     0.8     0.6
Geometric    22.1     20.5    8.7    2.0     0.8     0.6

Table 2: Comparison of the word error rates in percent for recognition with different fusion schemes when car noise is added (the video-alone WER, which is independent of the acoustic SNR, is shown in Fig. 5).

From Fig. 5 and Tabs. 1 and 2 it follows that all weighted fusion schemes are able to fulfill the basic postulation of audio-visual recognition. This postulation states that the audio-visual score should always be better than or equal to the audio or video score alone [18]. From a useful fusion scheme we further expect that it is able to generate synergy effects from the joint use of audio and video data, such that the resulting error rates are significantly lower than the error rates from either stream alone. The Standard Weighted Product yields rather poor performance and shows only little gain from the joint use of audio and video data. Geometric Weighting and FCA give very similar results, which at medium SNR are much better for audio-visual recognition than for audio or video recognition alone. For low SNRs the Geometric Weighting performs slightly, though not significantly, better than the FCA, but gives identical results for medium and high SNR. The Unweighted Bayesian Product is the only fusion scheme which does not fulfill the basic postulation. At very low SNR values its recognition scores drop below those of the video channel alone, whereas at medium and high SNR values the scores are very similar or identical to those of the Geometric Weighting or the FCA. Due to its superior performance we only employed the Geometric Weighting in the following tests.

4.2 Automatic weight adaptation
For a real-time scenario the setting of the weights has to be performed automatically depending on the noise level. A prerequisite for this is the estimation of the reliability of the audio stream during the fusion. The reliability estimation can follow two different approaches, relying either on the statistics of the a posteriori probabilities or directly on the speech signal. We will first present two measures based on the distribution of the a posteriori probabilities and will then also present a measure based on the speech signal.

4.2.1 Audio stream reliability estimation methods

Entropy of A Posteriori Probabilities

The distribution of the a posteriori probabilities at the output of the ANN carries information on the reliability of the input stream to the ANN. If one distinct phoneme class shows a very high probability and all other classes have a low probability, this signifies a reliable input, whereas when all classes have quasi equal probability the input is very unreliable. This information is captured in the entropy of the estimated a posteriori probabilities \hat{P}(H_i | x_A(t)) for the occurrence of the phoneme H_i given the acoustic feature vector x_A in time frame t [33][16][34]. The average entropy of the a posteriori probabilities over all frames is

    \bar{H} = -\frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N} \hat{P}(H_i | x_A(t)) \, \log \hat{P}(H_i | x_A(t))    (16)

where N is the number of phonemes and T the number of frames. We want to control the fusion process based on the entropy. Therefore a mapping between the value of the entropy and the fusion parameter has to be established. Experiments showed that for this mapping it is necessary to exclude segments where the pause is the most likely state, due to many false identifications of pauses at low SNR levels. Therefore only those frames where the silence state is not amongst the 4 most probable phonemes are taken into account for the calculation of the entropy.
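A sketch of the entropy criterion of Eq. 16 with the pause-exclusion rule described above is given below; the silence index and the number of top candidates checked (n_best) are assumptions of this example.

```python
import numpy as np

def mean_posterior_entropy(posteriors, n_best=4, silence_index=0):
    """Average entropy of the audio posteriors (Eq. 16), skipping frames in
    which the silence class is among the n_best most probable classes.
    `posteriors` has shape (n_frames, n_classes)."""
    entropies = []
    for frame in posteriors:
        top = np.argsort(frame)[::-1][:n_best]
        if silence_index in top:          # discard likely pause frames
            continue
        p = np.clip(frame, 1e-12, 1.0)    # avoid log(0)
        entropies.append(-np.sum(p * np.log(p)))
    return float(np.mean(entropies)) if entropies else 0.0

frames = np.array([[0.80, 0.10, 0.05, 0.03, 0.02],   # confident frame
                   [0.22, 0.21, 0.20, 0.19, 0.18]])  # unreliable frame
print(mean_posterior_entropy(frames, n_best=2, silence_index=4))
```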

Dispersion of A Posteriori Probabilities

A measure similar to the entropy is the dispersion of the a posteriori probabilities [16][34]

    \bar{D} = \frac{1}{T} \sum_{t=1}^{T} \frac{2}{K(K-1)} \sum_{j=1}^{K-1} \sum_{k=j+1}^{K} \left( \hat{P}_{(j)}(t) - \hat{P}_{(k)}(t) \right)    (17)

where the probabilities \hat{P}_{(j)}(t) are sorted in descending order beginning with the highest one. Hence the differences between the K most likely phonemes are calculated and summed up. The best value of K for our setup was determined experimentally. As for the entropy, only frames where a pause is not amongst the 4 most likely phonemes are taken into account.
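A possible implementation of the dispersion criterion as reconstructed in Eq. 17 follows; both the pairwise-difference form and the choice of K are assumptions, and the pause-exclusion step shown for the entropy would be applied here as well.

```python
import numpy as np

def posterior_dispersion(posteriors, k=4):
    """Dispersion of the posteriors (Eq. 17, reconstructed): per frame, the
    K largest posteriors are sorted in descending order and their pairwise
    differences are averaged; the frame values are then averaged over time."""
    values = []
    for frame in posteriors:
        top = np.sort(frame)[::-1][:k]
        diffs = [top[j] - top[l] for j in range(k) for l in range(j + 1, k)]
        values.append(2.0 / (k * (k - 1)) * sum(diffs))
    return float(np.mean(values))

frames = np.array([[0.70, 0.10, 0.10, 0.05, 0.05],
                   [0.25, 0.20, 0.20, 0.20, 0.15]])
print(posterior_dispersion(frames, k=4))
```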

Voicing as Audio Reliability Measure

It is known that speech contains many harmonic components, whereas in many everyday life situations background noise is non-harmonic. Thus the lower the ratio of the harmonic to the non-harmonic components, the more noise is present in the signal. A measure to assess this relation is the so-called Voicing Index [35]. First a harmonicity index is calculated from the pre-emphasized and demodulated signal via the autocorrelation function. In each time frame the waveform is rectified and filtered by a trapezoidal band-pass filter. Then the maximum value is picked from the autocorrelogram in a time window of possible pitch values. To obtain the harmonicity index, this amplitude is normalized by the zero time-lag of the autocorrelation function. From this we generate the probability of the signal being clean enough to be recognized, knowing the value of the harmonicity index. We added white noise to sentences of the database and compiled a bi-dimensional histogram of the relationship between the local SNR value in each time frame and the harmonicity index. Setting a threshold on the local SNR, the mapping function, which has a sigmoidal shape, is derived from this histogram, and we gain an estimate of the probability for the signal to be "clean enough". We call this probability, derived from the harmonicity index, the Voicing Index. Similar to the previous criteria, the Voicing Index was evaluated only in those segments where the pause was not amongst the 4 most probable phonemes. First tests of the use of the Voicing Index for the fusion in audio-visual speech recognition are reported in [36].
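A rough sketch of the harmonicity-to-Voicing-Index chain described above is given below. The pre-emphasis, the trapezoidal band-pass, the exact pitch-lag window and the fitted sigmoid parameters of the original are replaced by simple placeholders.

```python
import numpy as np

def harmonicity_index(frame, fs, f0_min=80.0, f0_max=400.0):
    """Crude harmonicity index for one frame: maximum of the autocorrelation
    within a plausible pitch-lag window, normalized by the zero-lag value."""
    frame = np.abs(frame)                    # crude rectification
    frame = frame - frame.mean()
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(acf) - 1)
    return float(acf[lag_min:lag_max].max() / (acf[0] + 1e-12))

def voicing_index(h, slope=10.0, threshold=0.5):
    """Sigmoidal mapping from harmonicity to the probability of the frame
    being 'clean enough'; slope and threshold stand in for the values
    fitted on the local-SNR histogram in the paper."""
    return 1.0 / (1.0 + np.exp(-slope * (h - threshold)))

fs = 8000
t = np.arange(256) / fs
frame = np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.randn(256)
print(voicing_index(harmonicity_index(frame, fs)))
```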

4.2.2 Evaluation of time constant audio stream weights After the de?nition of the various criteria to be used in the estimation of the reliability of the audio stream the questions at hand are: How sensitive are the recognition results to variations of the fusion parameter and how consistent are the values of the criteria over different noise types and SNR levels? To answer the ?rst question we can have a look at Fig.6. Here the recognition results are . As additive noise, car noise at SNR levels plotted for varying between ? was used. The points of minimum WER used for the manual weight adaptation in Sec. 4.1 are connected by a dotted line in Fig.6. The goal of the automatic adaptation is now to ?nd the mapping between the reliability estimation measure and the fusion parameter which results in the same minimum WERs in all noise conditions. As can be seen in the ?gure, there are large regions where the WER does not increase signi?cantly over a wide range of values of the fusion parameter . On the over hand there are also regions at low SNR where small variations of the fusion parameter have a strong impact on the WER. In general, Fig.6 tells us that the fusion is not very sensitive to the setting of and hence an automatic fusion should give reasonable results. The next question is the sensitivity of the criteria with respect to different noise types. To test SNR levels each and calculated the average value the sensitivity we used all noise types at

?

?

??

??

11

Figure 6: Relation between the fusion parameter c and the WER when adding car noise at SNR levels ranging from −12 dB to clean speech. The dashed line connects the points of minimum WER at a given SNR.

In Fig. 7 we plotted the value of the reliability measure over the different optimal settings (i.e., the minimum WER points) of the fusion parameter c. Each point of the curves corresponds to one of the SNR values and each of the first five curves corresponds to one noise type. If the criteria were independent of the noise type, all points would lie on one continuously decreasing (for the Entropy) or increasing (for the Dispersion and the Voicing Index) curve. This is obviously not the case. Nevertheless, the curves lie more or less close together, which indicates that the variation of the criteria with the noise type is rather small. Exceptions are babble noise in the case of the Dispersion and white noise for the Voicing Index. If we want to have a reliability measure which does not depend on the noise type, we have to search for a mapping between the reliability measure and the fusion parameter which is optimal in a minimum error sense. Our optimization criterion for the mapping is the minimization of the squared relative word error over all noise types n and all SNR levels [37]

    E = \sum_{n} \sum_{SNR} RWER_{man}(n, SNR)^2    (18)

with the relative word error

    RWER_{man}(n, SNR) = \frac{WER(n, SNR) - WER_{man}(n, SNR)}{WER_{man}(n, SNR)}    (19)

where WER_man denotes the minimum WER obtained with the manually adapted fusion parameter. The resulting mapping follows a sigmoidal function

    c(r) = \frac{a}{1 + e^{\,b r + d}} - 1    (20)

where r is the value of the reliability measure and a, b and d are the parameters which define the shape of the sigmoidal function and are subject to the optimization. The results of the optimization for each criterion can be seen in Fig. 7. The sigmoidal mapping function is visualized as a dashed line. During the optimization, the parameter a, which gives the maximal attainable value of the function, is adapted manually, and the parameters b and d are evaluated following a gradient descent algorithm where the unknown derivative is approximated by the difference quotient.
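A hedged sketch of fitting the sigmoidal mapping of Eq. 20 by gradient descent with difference quotients follows. For simplicity the cost here is the squared deviation from the manually found optima rather than the word-error criterion of Eqs. 18/19, and the functional form of the sigmoid follows our reconstruction.

```python
import numpy as np

def sigmoid_map(r, a, b, d):
    """Reconstructed mapping of Eq. 20 from a reliability value r to the
    fusion parameter c, bounded below by -1 and above by a - 1."""
    return a / (1.0 + np.exp(b * r + d)) - 1.0

def fit_b_d(r_values, c_opt, a, steps=2000, lr=0.05, eps=1e-4):
    """Fit b and d by gradient descent, approximating the derivative with a
    difference quotient as described in the text. The parameter a, which
    sets the maximal attainable value, is kept fixed (adapted manually)."""
    b, d = 0.0, 0.0

    def cost(b, d):
        return np.mean((sigmoid_map(r_values, a, b, d) - c_opt) ** 2)

    for _ in range(steps):
        gb = (cost(b + eps, d) - cost(b - eps, d)) / (2 * eps)
        gd = (cost(b, d + eps) - cost(b, d - eps)) / (2 * eps)
        b -= lr * gb
        d -= lr * gd
    return b, d

# Hypothetical entropy values and manually optimized fusion parameters.
r = np.array([0.10, 0.20, 0.30, 0.40])
c = np.array([0.1, -0.2, -0.6, -0.9])
print(fit_b_d(r, c, a=1.2))
```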

Figure 7: Relation between the three criteria and the fusion parameter c for five different noise types (babble, white, car, factory 1, factory 2) at varying SNR: a) Entropy, b) Dispersion, c) Voicing Index. The sigmoidal fit is shown as a dashed line.

We want to point out that the distance of the sigmoidal curve to the other curves in Fig. 7 is not a direct measure for the quality of the fit. As a consequence of the minimization of the word errors, the optimal sigmoidal fit is the one for which the deviations of the fusion parameter from its optimal value induce the smallest increase in word error. Hence, in regions where variations of c cause only a small increase of the word error, the distance between the sigmoidal curve and the curves resulting from the reliability criteria can be significant, whereas the resulting word error rates are still very close to optimal.

4.2.3 Evaluation of adaptive audio stream weights

So far, word error rates were calculated for a setting of the fusion parameter that was constant over the whole test set. This assumes that the whole test set is known at recognition time, which of course is unrealistic in a real life recognition system. Rather, it is necessary to calculate the correct setting of the fusion parameter instantaneously for each frame. This also opens the possibility to cope with non-stationary noise and variations of the SNR of the speech signal. We therefore repeated the tests of the previous section with audio stream weights adapted on a frame by frame basis. To reduce the influence of estimation errors, the values of the fusion parameter were smoothed over time with a first order recursive low-pass filter. Tab. 3 compares the results of the optimization for the different criteria when the value of the fusion parameter is fixed over the whole test set (Time Constant) and when it is varied (Frame by Frame). As for the previous recognition results, the average RWER is based on the results obtained with SWP_λ and hence evaluated according to Eqs. 14 and 15.

                   Entropy      Dispersion   Voicing Index
Time Constant      -27.5±2.0    -26.4±2.0    -27.0±2.0
Frame by Frame     -26.3±2.0    -22.6±2.0    -26.7±2.0

Table 3: Average RWER in percent for each criterion over all noise types and SNRs when fusion is done according to Geometric Weighting with either Time Constant or Frame by Frame evaluation of the criteria. Additionally the ± confidence interval for the relative error is given.

In Fig. 8 the results of the automatic fusion, the manual setting of the fusion parameter and the fusion using the Unweighted Bayesian Product are contrasted. For the automatic fusion the Voicing Index was chosen as fusion criterion and the evaluation of the criterion was performed on a frame by frame basis.

Figure 8: Word error rates for audio and video alone, for fusion with the Unweighted Bayesian Product, and for the Geometric Weighting with determination of the fusion parameter according to the Voicing Index, both time constant and frame-wise.
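Putting the pieces together, the following sketch shows a frame-by-frame adaptation loop: a reliability value per frame is mapped to c, smoothed with a first-order recursive filter, and used in the Geometric Weighting. The smoothing constant and the sigmoid parameters are placeholders for the unspecified values of the paper.

```python
import numpy as np

def smooth_recursive(values, alpha=0.95):
    """First-order recursive (exponential) smoothing of the frame-wise
    fusion parameter; alpha is an assumed smoothing constant."""
    out = np.empty_like(values, dtype=float)
    state = values[0]
    for i, v in enumerate(values):
        state = alpha * state + (1.0 - alpha) * v
        out[i] = state
    return out

def frame_wise_fusion(pa_frames, pv_frames, priors, reliability, a, b, d):
    """Frame-by-frame Geometric Weighting: map each frame's reliability value
    to c (Eq. 20), smooth c over time, then fuse the posteriors (Eq. 10)."""
    c_raw = a / (1.0 + np.exp(b * reliability + d)) - 1.0
    c_smooth = smooth_recursive(c_raw)
    fused = []
    for pa, pv, c in zip(pa_frames, pv_frames, c_smooth):
        al = np.clip(1.0 + c, 0.0, 1.0)
        be = np.clip(1.0 - c, 0.0, 1.0)
        p = pa ** al * pv ** be / priors ** (al + be - 1.0)
        fused.append(p / p.sum())
    return np.array(fused)

pa = np.array([[0.7, 0.2, 0.1], [0.4, 0.35, 0.25]])
pv = np.array([[0.5, 0.4, 0.1], [0.5, 0.4, 0.1]])
rel = np.array([0.8, 0.2])   # hypothetical per-frame reliability values
print(frame_wise_fusion(pa, pv, np.full(3, 1 / 3), rel, a=1.2, b=-8.0, d=4.0))
```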

5 Discussion
In the previous sections we presented different weight combination and estimation schemes for audio and video a posteriori probabilities in an audio-visual recognition task. Different tests were carried out to assess the performance of the different weighting schemes. In all tests we used five different types of noise at SNR levels ranging from −12 dB to clean speech in order to obtain results not limited to one special scenario.

5.1 Performance of weight combination schemes
In the first test the free parameters of the weighting schemes were adapted manually to each noise condition. Three of the presented weighting schemes, namely the Unweighted Bayesian Product, the FCA and the Geometric Weighting, are based on the assumption of class conditional independence of audio and video features. The fourth one, the Standard Weighted Product, only approximates this assumption for equal a priori probabilities of the phonemes, which was not the case in our tests. Furthermore, the original parameterization of the Standard Weighted Product is characterized by having a sum of weights equal to one. So when both streams have equal weights, the square root of the product of the two a posteriori probabilities is taken instead of the product, as for the other methods. In order to have weights equal to 1 on both streams in the equal weight condition (as for Geometric Weighting), we changed the parameterization of the Standard Weighted Product from λ to α and β. This led to a small, but consistent, improvement in comparison to the original form. Yet the main result of this first comparison is the clearly superior performance of the weighting schemes following the assumption of class conditional independence over the Standard Weighted Product, thanks to the introduction of the a priori probabilities. Especially the FCA and the Geometric Weighting showed very similar results, one reason being the similarity of the two algorithms (the FCA is based on arithmetic weighting). Both attain the pure a posteriori probabilities when all weight is put on either channel and produce the a posteriori probability following class conditional independence for equal weights. They differ only in the way the probabilities are weighted apart from these three special cases. The results indicate a small but not very significant advantage of the Geometric Weighting for low SNR values and equal performance for the other values. Therefore we only took the Geometric Weighting into consideration in the succeeding experiments.

5.2 Performance of audio stream reliability measures
The next test was designed to reveal the performance of the weighting scheme found best in the previous test in a more realistic scenario, where the adaptation of the weights is done automatically and not by hand. In the first step of the comparison we investigated a static case, where we first evaluated the reliability measure over the whole dataset and then performed the fusion with the setting of the fusion parameter corresponding to the measure. The mapping of the reliability measure to the fusion parameter took a wide range of noise conditions into account. For the mapping, a fit in the minimum error sense between the value of the measure in a particular noise condition and the corresponding optimal fusion parameter was established. The results showed that large improvements compared to audio-only recognition can be achieved under all noise conditions investigated, though for low SNR values the WERs are still too high to achieve useful recognition. An open question is how the optimized mapping generalizes to new, previously unseen noise conditions. The consistency of the results (see Fig. 7) suggests a possibility for generalization, even though final answers can only be found by tests in noise conditions not present during the design of the mapping. In the last step of the comparison we made the transition from the unrealistic static case, where the whole test set has to be known before determination of the fusion parameter, to an evaluation of the measure on a frame by frame basis. In general we expected an increase in performance, since a frame-wise fusion is able to take variations of the SNR during one utterance and from one utterance to the other into account and is capable of coping with non-stationary noise (like babble or factory noise). On the other hand, the limitation of the estimation interval of the reliability criteria to one frame has a high impact on the quality of the estimation. This effect was alleviated by smoothing the values with a first order recursive filter, although this reduces the ability to quickly adapt to intensity variations. The results of the frame-wise adaptation showed that the two effects, the larger flexibility and the lesser precision, seem to trade off against one another. The results of the frame by frame evaluation are very similar to those evaluated on the whole test set (see Tab. 3). Even though there was no performance gain from the frame-wise evaluation, the results show that the reliability estimation criteria are applicable to a realistic system. Both the Entropy and the Voicing Index showed only small deviations from the optimum values. In the Time Constant evaluation the results of the Entropy criterion were better than those of the Voicing Index, but in the more realistic frame-wise adaptation the Entropy criterion deteriorated more than the Voicing Index. The Voicing Index, however, gave less consistent results for babble noise (which contains many harmonic components) and white noise (which has no harmonic components) than the Entropy. The Dispersion, especially in the frame-wise adaptation, was not competitive with the other criteria. To summarize, the Entropy and the Voicing Index criteria can be used efficiently to control the adaptive fusion process.

5.3 Unweighted Bayesian Product vs. adaptive weights
One interesting result of our comparison is the good performance at medium to high SNR of the Unweighted Bayesian Product, which does not require any weighting and hence no reliability estimation either. As can be seen in Fig. 5, the performance of the Unweighted Bayesian Product is almost identical to that of the Geometric Weighting for medium and good SNR values (e.g. SNR ≥ 0 dB), whereas for low SNR values (SNR < 0 dB) the performance decreases sharply. For SNR ≥ 0 dB, the audio and video channels carry complementary phonetic information which is well fused by Bayes' rule [20]. For SNR < 0 dB there is a gain from the weighting principle, and Bayes' rule seems to start producing wrong results. Decreasing audio stream reliability results a priori in an increase of the entropy of the corresponding categorization results, which is also exploited in the stream reliability criterion based on the entropy. This should result in a flattening of the distribution of the probability values and a corresponding increase in its entropy. In the extreme case, where the stream under consideration does not contribute any information, the output distribution of this stream becomes a uniform distribution. During fusion the uniform distribution does not interfere with the distribution of the reliable input stream, as the product with a uniform distribution does not alter the shape of the second distribution. Consequently, the phonetic identification should not be impaired by the unreliable stream. If this is true, why can we then observe a sharp decrease of performance at low SNR? To answer this question we should have a look at the confusion matrix of the phoneme identification. In a confusion matrix the elements determine the percentage of the stimulus on the y-axis being identified as the output class on the x-axis. For the confusion matrix in Fig. 9, car noise at low SNR was added to the audio signal.² Already a first quick look at the main diagonal of the confusion matrix reveals that the distribution of errors is clearly non-uniform. There are phonemes which are identified very well and others which show only poor identification scores. The silence state "sil" obviously plays a special role. With increasing noise level more and more phonemes are confused with the silence state. This is partly taken into account in the mapping between the fusion parameter and the stream reliability measures by the fact that only segments where the silence is not amongst the 4 most likely states are used for the evaluation of the criteria. Both the entropy and the dispersion criteria are improved by this modification.

² The phonetic symbols follow the ARPABET notation.
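For reference, a row-normalized confusion matrix as used in Figs. 9 and 10 can be computed as follows; the labels in the example are arbitrary toy data.

```python
import numpy as np

def confusion_matrix_percent(true_labels, predicted_labels, n_classes):
    """Row-normalized confusion matrix: entry (i, j) is the percentage of
    frames of true class i identified as class j."""
    counts = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(true_labels, predicted_labels):
        counts[t, p] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0          # avoid division by zero
    return 100.0 * counts / row_sums

true = [0, 0, 1, 1, 2, 2, 2]
pred = [0, 2, 1, 1, 2, 0, 2]
print(confusion_matrix_percent(true, pred, n_classes=3))
```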

Figure 9: Confusion matrix of the phoneme identification from the audio stream at low SNR when car noise was added to the signal, showing the percentage of each phoneme on the y-axis identified as the phonemes on the x-axis.

Furthermore, the phonemes "s", "n", and "i" also attract many other phonemes. On the other hand, there are phonemes which are very poorly recognized and with which hardly any other phoneme is confused (e.g. "z" and "e:"). It follows from this analysis that the distribution of the a posteriori probabilities does not flatten but rather builds certain peaks at some attractor phonemes and dips at phonemes which are hardly identified or confused. However, to impair the audio-visual recognition, not only must an increase of errors in the audio stream occur, but these errors also have to be correlated with those committed in the video stream. The combination according to Bayes' rule is able to compensate for uncorrelated errors to a certain degree. Therefore, to judge the consequences of the deformation of the audio a posteriori probability distribution, it is indispensable to also look at the video stream confusion matrix visualized in Fig. 10. Comparing the two confusion matrices demonstrates that the phonemes confused in the audio stream ("s", "n", and "i") also lead to confusions in the video stream. In the video stream the silence state is also the origin of many confusions. Hence the errors of phonetic identification in the audio and video stream are correlated, and in this case Bayes' rule is not able to perform a compensation. It appears that in both confusion matrices the dominant cause for confusions is the silence state, but not equally so: at low SNR a large share of the phonemes in the audio stream are confused with the silence state, whereas in the video stream only a much smaller share are misleadingly taken as pauses.³ When fusing audio and video following the Unweighted Bayesian Product, this strong preference for pauses in noisy audio results in a correspondingly large share of the phonemes being confused with pauses.

³ The distinction of speech from pauses in the video stream is not trivial, as many non-articulatory lip movements, e.g. moisturization of the lips, are part of continuous speech and are easily confused with a real utterance.

Figure 10: Confusion matrix of the phoneme identification from the video stream, showing the percentage of each phoneme on the y-axis identified as the phonemes on the x-axis.

When Geometric Weighting is used to weight the audio and video probabilities, in contrast, this confusion with the silence state drops considerably. The weighting has the tendency to select the modality having less confusion with the silence state. Nevertheless, at medium SNR levels the performance of the Unweighted Bayesian Product is very close to that of the Geometric Weighting. To further quantify this, Tab. 4 gives, in addition to the mean RWER of the Geometric Weighting and the Unweighted Bayesian Product over all noise conditions, also the mean RWER of both schemes for SNR levels above and below 0 dB (all evaluated according to Eqs. 14 and 15).

                        GW/Voicing Index    UBP
all noise conditions    -26.7±2.0            9.2±2.1
SNR ≥ 0 dB              -39.4±2.9           -35.0±3.0
SNR < 0 dB              -1.1±1.5            97.6±1.7

Table 4: Mean relative error for GW with the Voicing Index evaluated on a frame by frame basis and for the Unweighted Bayesian Product. The errors are calculated over all noise types and SNR levels and for SNR ≥ 0 dB and SNR < 0 dB separately. Additionally the ± confidence interval for the relative error is given.

From this evaluation it can be seen that the difference in performance between the UBP and the GW increases greatly for SNR < 0 dB. Regardless of the remaining performance difference, there are applications where the SNR is typically higher than 0 dB and where a loss of performance is counterbalanced by a simple and intrinsically stable implementation.

5.4 Conclusion
Our objective was to compare a number of schemes for an adaptive combination of audio and video a posteriori probabilities estimated by an ANN for an audio-visual recognition task under different noise conditions. In a first test we looked at the effectiveness of different weight combination schemes for audio and video data. The results demonstrated that a multiplicative combination respecting class conditional independence of the streams gives the best results. Next we compared different criteria for an adaptive estimation of the audio stream reliability using the Geometric Weighting method. The performance of both the criterion based on the entropy of the a posteriori probabilities and that based on the ratio of the harmonic to the non-harmonic components in the speech signal was very close to the best achievable performance determined by a manual adjustment. We showed that an adaptive weighting scheme based on the entropy and the Voicing Index can be built yielding consistent performance in various noise conditions. Finally, we investigated whether a constant weight on the audio and video stream in all noise conditions would give comparable performance to the adaptive weighting. The tests we made showed that when the SNR is higher than 0 dB the Unweighted Bayesian Product performs as well as Geometric Weighting, so weighting, fixed or adaptive, is unnecessary, whereas for SNR values below −3 dB performance losses are tremendous if no weighting is performed. An analysis of the confusion matrices showed that the confusion of all phonemes with the silence state is the main cause of the failure of the Unweighted Bayesian Product for SNR < 0 dB. Let us remark that this is related to the continuous speech recognition task and the problem of speech detection in noise. Therefore an algorithm (namely the FCA and GW) incorporating Bayes' rule, which works well for SNR ≥ 0 dB, and a weighting principle, which becomes dominant for SNR < 0 dB, seems to be optimal. The weighting globally performs as a switch between the two modalities towards the one having fewer confusions with the silence state. This complements Bayes' rule when this type of confusion occurs. All tests are based on a database with a single male speaker whose lips were colored blue to facilitate the lip feature extraction. Most of the tests were repeated on a database with a single female speaker where no additional coloring of the lips was used [38]. The results of these tests are comparable to those reported here.

6 Acknowledgments
We want to thank Christophe Savariaux for the recording of the database, Christian Sörensen, Thorsten Wild, and Vincent Charbonneau for carrying out many simulations, Gunther Sessler for his hints on statistics and Jürgen Lüttin for his thorough review of the article. This work was partly funded by the EC program SPHEAR and is a part of the project RESPITE.

References
[1] H. McGurk and J. W. MacDonald, "Hearing lips and seeing voices," Nature, vol. 264, pp. 746–748, 1976.
[2] Q. Summerfield, "Lipreading and audio-visual speech perception," Phil. Trans. R. Soc. Lond. B, vol. 335, pp. 71–78, 1992.
[3] W. H. Sumby and I. Pollack, "Visual contribution to speech intelligibility," J. Acoust. Soc. Amer., vol. 26, pp. 212–215, 1954.

[4] N. P. Erber, "Auditory-visual perception of speech," J. Speech Hear. Disord., vol. 40, pp. 481–492, 1975.
[5] Q. Summerfield, "Some preliminaries to a comprehensive account of audio-visual speech perception," in Hearing by Eye: The Psychology of Lipreading, B. Dodd and R. Campbell, Eds., pp. 3–51. Lawrence Erlbaum Associates Ltd., Hove, UK, 1987.
[6] E. D. Petajan, Adaptive Determination of Audio and Visual Weights for Automatic Speech Recognition, Ph.D. thesis, Univ. Illinois, Urbana, 1984.
[7] D. G. Stork, G. Wolff, and E. Levine, "A neural network lipreading system for improved speech recognition," in Proc. of Int. Joint Conf. on Neural Networks, 1992, pp. 285–295.
[8] P. Duchnowski, U. Meier, and A. Waibel, "See me, hear me: Integrating automatic speech recognition and lip-reading," in Proc. of ICSLP 94, Yokohama, Japan, 1994, pp. 547–550.
[9] P. L. Silsbee and Q. Su, "Computer lipreading for improved accuracy in automatic speech recognition," IEEE Trans. Speech Audio Processing, vol. 4, pp. 337–351, 1996.
[10] S. Dupont and J. Lüttin, "Audio-visual speech modeling for continuous speech recognition," IEEE Trans. on Multimedia, vol. 2, pp. 141–151, 2000.
[11] C. Neti, G. Potamianos, J. Lüttin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, and J. Zhou, "Audio-visual speech recognition," Tech. Rep., Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, 2000.
[12] T. Chen, "Audiovisual speech processing: Lip reading and lip synchronization," IEEE Signal Processing Magazine, vol. 18, pp. 9–21, 2001.
[13] G. Potamianos, C. Neti, G. Iyengar, and E. Helmuth, "Large vocabulary audio-visual speech recognition by machines and humans," in Proc. of EUROSPEECH 2001, Aalborg, Denmark, 2001.
[14] J. L. Schwartz, J. Robert-Ribes, and P. Escudier, "Ten years after Summerfield: A taxonomy of models for audio-visual fusion in speech perception," in Hearing by Eye II, R. Campbell, B. Dodd, and D. Burnham, Eds., pp. 85–108. Taylor & Francis Books Ltd., 1998.
[15] B. Gold and N. Morgan, Speech and Audio Signal Processing, John Wiley & Sons Inc., New York, 2000.
[16] A. Adjoudani and C. Benoit, "On the integration of auditory and visual parameters in an HMM-based ASR," in Speechreading by Man and Machine: Models, Systems and Applications, D. G. Stork and M. E. Hennecke, Eds., Berlin, 1996, NATO ASI Series, pp. 461–472, Springer.
[17] A. Rogozan and P. Deléglise, "Adaptive fusion of acoustic and visual sources for automatic speech recognition," Speech Communication, vol. 26, pp. 149–161, 1998.
[18] P. Teissier, J. Robert-Ribes, J.-L. Schwartz, and A. Guérin-Dugué, "Comparing models for audiovisual fusion in a noisy-vowel recognition task," IEEE Trans. Speech Audio Processing, vol. 7, pp. 629–642, 1999.
[19] J. R. Movellan and G. Chadderdon, "Channel separability in the audio-visual integration of speech: A Bayesian approach," in Speechreading by Man and Machine: Models, Systems and Applications, D. G. Stork and M. E. Hennecke, Eds., Berlin, 1996, NATO ASI Series, pp. 473–487, Springer.

[20] D. W. Massaro and D. G. Stork, "Speech recognition and sensory integration," American Scientist, vol. 86, no. 3, 1998.
[21] D. W. Massaro, Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry, Lawrence Erlbaum Associates, Hillsdale, N.J., 1987.
[22] M. Heckmann, F. Berthommier, and K. Kroschel, "Optimal weighting of posteriors for audio-visual speech recognition," in Proc. of ICASSP 2001, Salt Lake City, Utah, 2001.
[23] H. Fletcher, "The nature of speech and its interpretation," J. Franklin Instit., vol. 193, no. 6, pp. 729–747, June 1922.
[24] J. B. Allen, "How do humans process and recognize speech?," IEEE Trans. on Speech and Audio Proc., vol. 2, no. 4, pp. 567–576, 1994.
[25] H. Bourlard and S. Dupont, "A new ASR approach based on independent processing and recombination of partial frequency bands," in Proc. of ICSLP 96, Philadelphia, 1996, pp. 422–425.
[26] A. Morris, A. Hagen, H. Glotin, and H. Bourlard, "Multi-stream adaptive evidence combination for noise robust ASR," Speech Communication, vol. 34, pp. 25–40, 2001.
[27] R. A. Cole, T. Noel, L. Lander, and T. Durham, "New telephone speech corpora at CSLU," in Proc. of EUROSPEECH 95, 1995, pp. 821–824.
[28] N. Morgan and H. Bourlard, "Continuous speech recognition," IEEE Sig. Proc. Magazine, vol. 12, no. 3, p. 24, May 1995.
[29] University of Mons, Mons, Step by Step Guide to Using the Speech Training and Recognition Unified Tool (STRUT), May 1997.
[30] H. Hermansky, N. Morgan, A. Bayya, and P. Kohn, "RASTA-PLP speech analysis technique," in Proc. of ICASSP 1992, San Francisco, 1992, vol. 1, pp. 121–124.
[31] E. Lombard, "Le signe de l'élévation de la voix," Annales des Maladies de l'Oreille, du Larynx, du Nez et du Pharynx, pp. 101–119, 1911.
[32] A. P. Varga, H. J. M. Steeneken, M. Tomlinson, and D. Jones, "The NOISEX-92 study on the effect of additive noise on automatic speech recognition," Tech. Rep., Speech Research Unit, Defence Research Agency, Malvern, U.K., 1992.
[33] C. Bregler, H. Hild, S. Manke, and A. Waibel, "Improving connected letter recognition by lipreading," in Proc. of ICASSP 93.
[34] G. Potamianos and C. Neti, "Stream confidence estimation for audio-visual speech recognition," in Proc. of ICSLP 2000, Beijing, China, 2000, pp. 746–749.
[35] F. Berthommier and H. Glotin, "A new SNR feature mapping for robust multistream speech recognition," in Proc. of ICPhS 1999, San Francisco, CA, 1999.
[36] H. Glotin, D. Vergyri, C. Neti, G. Potamianos, and J. Lüttin, "Weighting schemes for audio-visual fusion in speech recognition," in Proc. of ICASSP 2001, Salt Lake City, Utah, 2001.
[37] M. Heckmann, T. Wild, F. Berthommier, and K. Kroschel, "Comparing audio- and a-posteriori-probability-based stream confidence measures for audio-visual speech recognition," in Proc. of EUROSPEECH 2001, Aalborg, Denmark, 2001.

[38] M. Heckmann, K. Kroschel, F. Berthommier, and C. Savariaux, "DCT-based video features for audio-visual speech recognition," submitted to ICSLP 2002, Denver, Colorado, USA, 2002.


