Acoustic Modeling with Deep Neural Networks Using Raw Time Signal for LVCSR

September 2014

Conference: Interspeech

Authors:

Zoltán Tüske

RWTH Aachen University

Pavel Golik

RWTH Aachen University

Hermann Ney

RWTH Aachen University

Acoustic Modeling with Deep Neural Networks Using Raw Time Signal for LVCSR

Zoltán Tüske¹, Pavel Golik¹, Ralf Schlüter¹, Hermann Ney¹,²

¹Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, 52056 Aachen, Germany

²Spoken Language Processing Group, LIMSI CNRS, Paris, France

{tuske,golik,schlueter,ney}@cs.rwth-aachen.de

Abstract

In this paper we investigate how much feature extraction is required by a deep neural network (DNN) based acoustic model for automatic speech recognition (ASR). We decompose the feature extraction pipeline of a state-of-the-art ASR system step by step and evaluate acoustic models trained on standard MFCC features, critical band energies (CRBE), FFT magnitude spectrum and even on the raw time signal. The focus is put on raw time signal as input features, i.e. no feature extraction at all prior to DNN training. Notably, the gap in recognition accuracy between MFCC and raw time signal decreases strongly once we switch from sigmoid activation function to rectified linear units, offering a real alternative to standard signal processing. The analysis of the first layer weights reveals that the DNN can discover multiple band pass filters in time domain. Therefore we try to improve the raw time signal based system by initializing the first hidden layer weights with impulse responses of an audiologically motivated filter bank. Inspired by the multi-resolutional analysis layer learned automatically from raw time signal input, we train the DNN on a combination of multiple short-term features. This illustrates how the DNN can learn from the small differences between MFCC, PLP and Gammatone features, suggesting that it is useful to present the DNN with different views on the underlying audio.

Index Terms: acoustic modeling, raw signal, neural networks

1. Introduction

Since DNN based acoustic models have become a popular alternative to the Gaussian mixture models (GMMs), a lot of effort was put into feature engineering that aimed at finding a representation of input audio data that is most suitable for training of neural networks [1][2]. GMMs are quite sensitive to input features: the features need to be decorrelated so that a diagonal covariance matrix can be used for faster scoring and the dimension needs to be relatively low. These requirements have led to a large variety of feature extraction pipelines that build upon expert knowledge of speech production and perception. In contrast, hybrid DNN/HMM models [3] have none of these constraints and a DNN acoustic model can easily be trained on high dimensional features (several thousands) even with a large amount of correlation between components. Further, the universal approximator property of a neural network [4][5] with (multiple) hidden layers should allow the DNN to learn the necessary (non-linear) feature extraction steps from data. For this reason we investigated the question: how much feature extraction can be left for the DNN to discover?

Many groups have found the logarithm of critical band energies (CRBE), extracted e.g. from a Mel filter bank, to be most suitable for training DNNs. One of the reasons why CRBE often outperform MFCC or PLP might be that they contain somewhat more high-resolution information: while conventional MFCC and PLP range from 13 to 16 dimensions, the CRBEs used for DNN training are often 20- to 40-dimensional. Further, many steps of typical feature extraction pipelines boil down to linear projections, which should be easy to learn from data. Ultimately, to avoid a loss of information the acoustic model needs to be trained on the full magnitude spectrum, e.g. [6], or even the raw audio samples of the waveform. While the cross entropy training is still performed on frame level, the latter case allows presenting the DNN with a sequence of audio samples without any notion of frame boundaries, thus allowing the neural network to discover all kinds of non-stationary patterns. Such patterns correspond to various phonetic events that are described poorly by frame-based stationary processing such as the FFT.

The cost of processing the raw time signal is twofold. First, the high dimensionality of such feature spaces increases the number of free parameters. This issue can be counteracted by adjusting the network topology, e.g. introducing narrow matrix factorization layers [7]. Second, the raw time signal features discard the common assumption of most feature extraction pipelines that the resolution of human perception is non-linear along the frequency axis, leaving it up to the DNN to discover. Our approach to tackling this is to initialize the weights of the first layer with impulse responses of Gammatone filters that follow the audiological spacing in the frequency domain.

Further, we investigate how the DNN can be presented with an increased time-frequency resolution without leaving the framework of conventional feature extraction. This is related to the concept of feature combination, where different short-term features describe more or less the same spectral properties of the signal [8]. However, the differences between the different feature streams are themselves an additional source of information about the underlying audio signal.

Previously, a dramatic degradation of recognition accuracy by training a DNN directly on raw speech signal was reported in [9]. Instead, the authors used convolutional neural networks [10] to obtain competitive results on the TIMIT task. Some works on processing of raw speech signal make use of other models such as linear predictive models [11][12] or SVMs [13]. They mostly evaluate on small classification tasks, leaving open the question of how much a larger amount of training data would compensate for the described difficulties.

This paper is organized as follows. The different feature extraction pipelines, including the FFT and time signal features, are summarized in Section 2. The experimental setup is introduced in Section 3 and the results of our investigation are presented in Section 4. The conclusions are drawn in Section 5.

INTERSPEECH 2014, 18 September 2014, Singapore

2. Feature extraction

This section gives a brief overview of the three cepstral features, the FFT based features and the raw signal features.

2.1. Waveform — “raw” time signal

Processing the audio sampled at 16 kHz with the same 10 ms steps as in case of typical cepstral features boils down to taking 160 samples from the PCM waveform. The windows are non-overlapping so that stacking neighboring vectors does not result in discontinuities. The samples quantized with 16 bit need to be normalized to a numerically robust range by performing the mean and variance normalization either globally over the complete training data or on the per-utterance level. This can be interpreted as DC bias removal and loudness equalization and at the same time it serves numerical purposes to stabilize the DNN training with gradient descent.
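The framing and per-utterance normalization described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the signal is synthetic, and repeating the edge frames to fill the 17-frame context (the context size is taken from Section 3) is a choice made here, not something the paper specifies.

```python
import numpy as np

FRAME_LEN = 160   # 10 ms at 16 kHz
CONTEXT = 17      # number of stacked frames (see Section 3)

def normalize_utterance(pcm: np.ndarray) -> np.ndarray:
    """Per-utterance mean and variance normalization of 16-bit PCM samples."""
    x = pcm.astype(np.float64)
    x -= x.mean()                      # DC bias removal
    std = x.std()
    return x / std if std > 0 else x   # loudness equalization

def stack_frames(x: np.ndarray, frame_len: int = FRAME_LEN,
                 context: int = CONTEXT) -> np.ndarray:
    """Cut x into non-overlapping frames and stack `context` neighbours per row."""
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    half = context // 2
    # Assumption: edge frames are repeated so every frame has a full context.
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")
    windows = [padded[i : i + context].reshape(-1) for i in range(n_frames)]
    return np.stack(windows)

rng = np.random.default_rng(0)
pcm = rng.integers(-2000, 2000, size=16000).astype(np.int16)  # 1 s of synthetic samples
features = stack_frames(normalize_utterance(pcm))
print(features.shape)  # (100, 2720)
```

Because the frames do not overlap, each 2720-dimensional row is simply a contiguous stretch of 170 ms of the normalized waveform (except at the utterance edges, where frames are repeated).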

2.2. Amplitude spectrum — FFT

In contrast to raw time signal, the short-time Fourier transform (STFT) is performed on overlapping windows of 25 ms. The samples are zero-padded to a window of size 2^9 = 512 and weighted with a Hanning function, which exhibits smaller side lobes in the amplitude spectrum than a rectangular window. The 512-point FFT results in a 257-dimensional vector due to the symmetry of the amplitude spectrum. The phase spectrum is discarded.
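A minimal sketch of this step, assuming a 10 ms frame shift (as for the other features) and assuming that the Hanning window is applied to the 400 samples of each 25 ms window before zero-padding; the function name and the test tone are illustrative only.

```python
import numpy as np

SAMPLE_RATE = 16000
WIN_LEN = 400   # 25 ms at 16 kHz
HOP = 160       # 10 ms frame shift (assumed)
N_FFT = 512     # 2**9

def magnitude_spectrum(x: np.ndarray) -> np.ndarray:
    """Return one 257-dimensional amplitude spectrum per 25 ms window."""
    n_frames = 1 + (len(x) - WIN_LEN) // HOP
    window = np.hanning(WIN_LEN)
    frames = np.stack([x[i * HOP : i * HOP + WIN_LEN] * window
                       for i in range(n_frames)])
    # rfft zero-pads each frame to N_FFT and keeps only the
    # N_FFT // 2 + 1 non-redundant bins; the phase is discarded by abs().
    return np.abs(np.fft.rfft(frames, n=N_FFT, axis=1))

t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
x = np.sin(2 * np.pi * 1000.0 * t)   # 1 s of a 1 kHz tone
spec = magnitude_spectrum(x)
print(spec.shape)  # (98, 257)
```

With a bin spacing of 16000 / 512 = 31.25 Hz, the 1 kHz tone falls on bin 32, which is where each row of `spec` peaks.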

2.3. Mel-Frequency cepstral coefﬁcients — MFCC

The feature extraction is based on the STFT of the pre-emphasized speech signal [14]. The amplitude spectrum is integrated by a filterbank with the triangular filters being equidistantly spaced on the Mel scale. The MFCC features are extracted from the logarithm of the filter outputs (also referred to as CRBE) by applying a discrete cosine transform (DCT).
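The chain from amplitude spectrum to CRBE to MFCC can be sketched as follows. This is a hedged illustration, not the RASR implementation: the Mel formula 2595 log10(1 + f/700), the unnormalized DCT-II, the omission of pre-emphasis, and the default sizes (20 filters, 16 cepstral coefficients, matching the dimensions listed in Table 2) are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_bins=257, sample_rate=16000):
    """Triangular filters with centre frequencies equidistant on the Mel scale."""
    freqs = np.linspace(0.0, sample_rate / 2, n_bins)   # bin frequencies in Hz
    top = float(hz_to_mel(sample_rate / 2))
    edges = mel_to_hz(np.linspace(0.0, top, n_filters + 2))
    bank = np.zeros((n_filters, n_bins))
    for k in range(n_filters):
        lo, mid, hi = edges[k], edges[k + 1], edges[k + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_matrix(n_out, n_in):
    """Unnormalized DCT-II basis: row k is cos(pi * k * (n + 0.5) / n_in)."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / n_in)

def mfcc_from_magnitude(spec, n_filters=20, n_ceps=16):
    """Return (MFCC, CRBE) for amplitude spectra of shape (frames, bins)."""
    bank = mel_filterbank(n_filters, spec.shape[1])
    crbe = np.log(spec @ bank.T + 1e-10)   # log critical band energies
    return crbe @ dct_matrix(n_ceps, n_filters).T, crbe

rng = np.random.default_rng(0)
spec = np.abs(rng.normal(size=(5, 257)))   # stand-in for real amplitude spectra
mfcc, crbe = mfcc_from_magnitude(spec)
print(mfcc.shape, crbe.shape)  # (5, 16) (5, 20)
```

Both the filterbank and the DCT are plain matrix products, which is the sense in which such steps "boil down to linear projections"; only the logarithm in between is non-linear.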

2.4. Gammatone features — GT

Instead of the STFT based analysis, the features are extracted from an audiologically motivated filterbank realized by time-domain Gammatone filters [15]. The auditory filters are placed equidistantly on the Greenwood scale. After spectral and temporal integration the 10th root is taken instead of the logarithm and the DCT is applied for decorrelation.

2.5. Perceptual linear predictive coefﬁcients — PLP

These features are again based on the STFT of speech [16]. Simulating the critical band masking, the amplitude spectrum is integrated with trapezoid filters equally spaced on the Bark scale. The filterbank output is pre-emphasized according to the equal-loudness curve. To simulate the relation between the intensity and perceived loudness of sound, cubic root amplitude compression is performed followed by all-pole model parameter estimation (linear predictive (LP) analysis). The autoregressive coefficients are directly transformed to cepstral coefficients.

3. Experimental setup

The acoustic model training is performed w.r.t. the frame-wise cross entropy criterion on 50 hours of speech from the Quaero [17] English database train11, which amounts to ca. 16 million input vectors. The development and evaluation sets consist of ca. 3.5 hours of speech each, corresponding to about 1.2 million vectors. Some experiments are presented on a large 250 hours set from the same corpus train11. A 4-gram language model (LM) is used during the recognition.

Throughout all experiments we use 6 hidden layers with 2000 hidden units in each layer. The output layer with 4500 nodes corresponds to the generalized triphones tied by a phonetic classification and regression tree (CART). The number of trainable weights amounts to approx. 30M-35M depending on the features used. The input features always correspond to 17 stacked frames so that the overall amount of temporal context presented to the DNN at once is the same. The mini-batches of size 512 are drawn from the shuffled training set. The weights are initialized via discriminative pre-training (DPT) [1].
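As a quick sanity check of the quoted 30M-35M range, the parameter count of a plain fully connected network with this topology can be computed directly. The sketch assumes one bias per unit and no factorization layers; the per-frame dimensions (16 for MFCC, 160 for the time signal) are those listed in Table 2, and the function name is hypothetical.

```python
def dnn_parameters(input_dim, hidden_layers=6, hidden_units=2000, outputs=4500):
    """Weights plus biases of a fully connected feed-forward network."""
    sizes = [input_dim] + [hidden_units] * hidden_layers + [outputs]
    weights = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])
    return weights + biases

CONTEXT = 17  # stacked frames
for name, dim in [("MFCC", 16), ("time signal", 160)]:
    total = dnn_parameters(CONTEXT * dim)
    print(f"{name}: input {CONTEXT * dim}, {total / 1e6:.1f}M parameters")
# MFCC: input 272, 29.6M parameters
# time signal: input 2720, 34.5M parameters
```

The hidden and output layers contribute about 29M parameters regardless of the features, so the difference between the systems comes entirely from the first layer.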

The ASR baseline system is a conventional GMM/HMM based model trained on the same database w.r.t. the maximum likelihood criterion. We applied linear discriminant analysis (LDA) to 9 consecutive MFCC frames to obtain the final 45-dimensional features. The GMM with a globally pooled diagonal covariance matrix consists of approx. 660k densities, which corresponds to about 30M trainable parameters. For acoustic training and recognition we used the RASR toolkit [18].
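The figure of about 30M GMM parameters follows from the numbers above. The short calculation below assumes one mean vector and one mixture weight per density plus a single pooled variance vector; the variable names are illustrative.

```python
densities = 660_000     # approximate number of Gaussian densities
feature_dim = 45        # LDA output dimension

means = densities * feature_dim   # one 45-dimensional mean per density
mixture_weights = densities       # one weight per density
pooled_variances = feature_dim    # single globally pooled diagonal covariance

gmm_parameters = means + mixture_weights + pooled_variances
print(f"{gmm_parameters / 1e6:.1f}M")  # 30.4M
```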

4. Results

In the first experiment we compared the baseline results obtained with the GMM and DNN acoustic models on MFCC features normalized for mean/variance and the vocal tract length (VTLN). The results are shown in Table 1. Unless stated otherwise, the training is performed on 50 hours of speech. The same DNN configuration was trained on the raw time signal as described in Section 2.1. As expected, the MFCC-based DNN model outperforms the GMM, but the WER of the system trained on raw time signal is still significantly higher.

Table 1: Baseline results. WER in %.

Features      Model   dev    eval
MFCC          GMM     24.4   31.6
MFCC          DNN     19.4   25.3
time signal   DNN     29.4   36.8

In the next experiment we wanted to find out how the recognition accuracy depends on the various preprocessing steps. For this purpose we decomposed the MFCCs step by step and measured the performance. Table 2 shows the word error rate (WER) after each step. The results indicate that without mean/variance normalization and VTLN, the gap between MFCCs and FFT features decreases significantly.

4.1. Feature combination

From the results in Table 2 it is clear that the presented features differ in dimensionality by an order of magnitude. Still, MFCC outperform the high-dimensional FFT and time signal features. How can we increase the amount of information within the framework of low-dimensional features? As described in Section 2, the different short-term feature extraction pipelines cover slightly different representations of the underlying audio. Hoping that the DNN can extract useful information from these differences, we performed feature combination following the approach of [8]. The results in Table 3 confirm that a DNN, being a powerful classifier, can learn more from multiple feature streams than from any single feature set.

Table 2: Feature preprocessing and normalization for DNN AM. Dimension of a single feature vector. WER in %.

Features                            dim.   dev    eval
MFCC                                16
  + global norm.                           19.8   26.1
  + utterance norm.                        19.7   25.5
  + VTLN                                   19.4   25.3
MFCC                                20
  + VTLN + utterance norm.                 19.1   25.2
CRBE + VTLN + utterance norm.       20     19.5   25.7
                                    40     19.7   26.2
|FFT|                               257
  + global norm.                           21.3   27.8
  + 10th root                              21.0   27.5
  + utterance norm. + 10th root            20.6   26.8
time signal                         160
  + global norm.                           29.4   36.8
  + utterance norm.                        28.9   35.0

Table 3: Feature combination. WER in %.

Features           dev    eval
MFCC               19.1   25.2
PLP                19.2   24.8
GT                 19.2   25.5
MFCC + PLP + GT    18.4   24.2

4.2. Analysis of the input layer trained on time signal

Having obtained surprisingly reasonable results on the normalized raw time signal, we were curious what kind of patterns could have been learned by the neural network. Although the analysis of all parameters remains infeasible, we could detect clearly interpretable patterns within the first layer of the fully trained DNN. Figure 1 shows the weights learned by four of the hidden nodes of the first layer. Apparently, the DNN managed to learn some kind of impulse responses that correspond to band pass filters and other patterns (e.g. short bursts) purely from data. In order to illustrate the spectral properties of the discovered filters, we zero-padded every row in the weight matrix to 8000 entries, calculated the magnitude spectrum

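$$ W_i = \left| \mathrm{FFT}\{ w_{i,\cdot} \} \right| \in \mathbb{R}^{1 \times 8000}, \quad 1 \le i \le 2000 \qquad (1) $$

and sorted the rows by the location of the most prominent “blob”. The position of the blob was calculated after smoothing the spectrum with a Gaussian kernel g as

$$ \hat{W}_i = W_i * g \qquad (2) $$

$$ f_c = \operatorname*{argmax}_{1 \le j \le 8000} \hat{W}_{i,j} \qquad (3) $$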

Assuming that every row can be interpreted as a band pass impulse response, the location of the blob corresponds to the center frequency of the learned transfer function. Figure 2 shows the obtained spectra as 20 log10 W_i. It can be seen that, without any prior knowledge, the DNN has discovered a large number of band-pass-like filters that exhibit roughly the audiological distribution. That is, the number of narrow band pass filters in the lower frequency region is quite high, while with increasing center frequency, the bandwidth of the filters becomes larger. Also the distribution of the center frequencies is non-linear. The bandwidth of the transfer function can be calculated as equivalent noise bandwidth by

$$ f_b = \frac{\sum_j W_{i,j}^2}{\left( \max_j W_{i,j} \right)^2} \qquad (4) $$

Figure 3 shows the scatter plot of the approximated parameters f_c and f_b of the learned filters.

Figure 1: Four rows from the first layer weight matrix trained on raw time signal. The time range corresponds to 17 frames of 10 ms (17 · 10 ms · 16 kHz = 2720 samples). (Axes: time [samples] vs. weight.)

Figure 2: Amplitude spectra of the reordered rows from the first layer weight matrix trained on time signal. (Axes: frequency [kHz] vs. reordered hidden units.)
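The analysis in Eqs. (1) to (4) can be sketched with NumPy as below. Several details are assumptions made for this illustration and are not given in the paper: the FFT length of 16000 (chosen so that one bin of the one-sided spectrum equals 1 Hz, whereas the paper pads to 8000 entries), the width of the Gaussian kernel, and the synthetic "weights", which are just two windowed sinusoids standing in for trained rows.

```python
import numpy as np

SAMPLE_RATE = 16000

def analyse_filters(weights, n_fft=16000, sigma_bins=20.0):
    """Estimate centre frequency and equivalent noise bandwidth (Hz) per row."""
    spectra = np.abs(np.fft.rfft(weights, n=n_fft, axis=1))   # Eq. (1), zero-padded
    # Gaussian smoothing kernel g; its width is a hypothetical choice.
    radius = int(4 * sigma_bins)
    k = np.arange(-radius, radius + 1)
    g = np.exp(-0.5 * (k / sigma_bins) ** 2)
    g /= g.sum()
    smoothed = np.stack([np.convolve(row, g, mode="same")
                         for row in spectra])                 # Eq. (2)
    hz_per_bin = SAMPLE_RATE / n_fft
    centre_hz = smoothed.argmax(axis=1) * hz_per_bin          # Eq. (3)
    enbw_hz = ((spectra ** 2).sum(axis=1)
               / spectra.max(axis=1) ** 2) * hz_per_bin       # Eq. (4)
    order = np.argsort(centre_hz)   # row order for a plot like Figure 2
    return centre_hz, enbw_hz, order

# Synthetic stand-in for two rows of a trained 2720-dimensional input layer.
n = 2720
t = np.arange(n) / SAMPLE_RATE
window = np.hanning(n)
weights = np.stack([window * np.sin(2 * np.pi * 3000.0 * t),
                    window * np.sin(2 * np.pi * 500.0 * t)])

centre_hz, enbw_hz, order = analyse_filters(weights)
print(centre_hz, order)
```

Sorting the rows by `order` and plotting `20 * np.log10(spectra)` row by row would give a picture of the kind described for Figure 2.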

Remarkably, the position of the filters in time is not restricted to the center of the stacked audio samples, but is scattered across left and right context approximately uniformly. These shifts (or time offsets) are expressed in the phase spectrum and are therefore not visible in Figure 2. This distribution indicates that the DNN was able to learn different filters for different parts of the presented audio context. Also, none of the learned narrow filters exhibits multiple passbands.

4.3. Rectiﬁed linear units and large scale experiments

In the following set of experiments we investigated how much further we can reduce the gap in recognition accuracy between the various feature configurations by (a) switching the activation function and (b) increasing the amount of training data. First we compared the sigmoid activation function with rectified linear units (ReLU) [19]. From previous experience we know that ReLUs are sensitive to regularization, so we used L2-regularization with a value of 0.0001. In contrast, sigmoid non-linearities perform best with no regularization at all. The results shown in Table 4 suggest that the ReLUs have a stronger effect on the systems with high error rates, which is presumably due to a more difficult optimization problem. In addition, we repeated these experiments with DNNs trained on 250 hours of speech. Table 5 shows the obtained results. Further large scale experiments revealed that increasing the number of hidden layers up to 12 narrowed the performance gap between MFCC and raw time signal, achieving 20.9% WER on the evaluation corpus.

Table 4: Feature and activation function comparison, training on 50h. WER in %.

Features           dev               eval
                   sigmoid   ReLU    sigmoid   ReLU
MFCC               19.1      18.0    25.2      23.8
MFCC + PLP + GT    18.4      16.6    24.2      21.7
|FFT|              20.6      18.4    26.8      24.7
time signal        28.9      22.6    35.0      28.5

Table 5: Feature and activation function comparison, training on 250h. WER in %.

Features           dev               eval
                   sigmoid   ReLU    sigmoid   ReLU
MFCC               15.2      15.9    20.4      21.1
MFCC + PLP + GT    14.8      14.0    19.8      18.9
|FFT|              16.1      15.8    21.6      21.5
time signal        19.2      17.6    25.6      23.5

Figure 3: Scatter plot of approximated parameters of the learned filter bank. (Axes: center frequency [kHz] vs. bandwidth [Hz]; series: learned filters, least squares trend of the learned filters, audiological filter bank.)
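To make the narrowing of the gap concrete, the relative difference between the raw time signal and MFCC systems can be computed from the evaluation-set numbers in Tables 4 and 5. The helper name below is hypothetical and the values are copied from the tables as printed.

```python
# Evaluation-set WER in % from Table 4 (50 h) and Table 5 (250 h).
wer_eval = {
    ("50h", "sigmoid"): {"MFCC": 25.2, "time signal": 35.0},
    ("50h", "ReLU"): {"MFCC": 23.8, "time signal": 28.5},
    ("250h", "sigmoid"): {"MFCC": 20.4, "time signal": 25.6},
    ("250h", "ReLU"): {"MFCC": 21.1, "time signal": 23.5},
}

def relative_gap(mfcc_wer, raw_wer):
    """WER increase of the raw-signal system relative to MFCC, in percent."""
    return 100.0 * (raw_wer - mfcc_wer) / mfcc_wer

for (hours, activation), wer in wer_eval.items():
    gap = relative_gap(wer["MFCC"], wer["time signal"])
    print(f"{hours:>4} {activation:<7} relative gap: {gap:.1f}%")
```

On 50 hours the relative gap drops from roughly 39% with sigmoid units to roughly 20% with ReLUs, and both the switch to ReLUs and the larger training set reduce it further.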

4.4. Manual weight initialization with audiological ﬁlters

After we observed the filter shapes that have been learned from raw time signal, we investigated whether we can initialize the weights of the first hidden layer in a way that makes it easier for the DNN to discover further meaningful filters during training with gradient descent. For this purpose we calculated the real part of impulse responses of a stationary Gammatone filter bank that follows the audiological filter distribution [20]. The parameters of the 32 filters were defined as follows (with l = 24.7 and q = 9.265):

$$ f_c^i = l \cdot q \cdot \left( e^{i/q} - 1 \right) \qquad (5) $$

$$ f_b^i = l + f_c^i / q \qquad (6) $$

In order to account for different positions in time we created multiple shifted copies of each filter’s impulse response to obtain a weight matrix of the same size as the randomly initialized first layer weights in the previous experiments. Table 6 shows the comparison of three different approaches: random initialization (as in Table 4), initialization by a Gammatone filter bank with regular weight update through backpropagation, and a fixed Gammatone filter bank layer with no update throughout the training. The latter case corresponds to a fixed “feature extraction layer” where only layers above the first one are trained, so that we can compare whether the DNN can improve the weights by backpropagation upon the initialization.

Table 6: Weight initialization for learning from raw time signal. WER in %.

Weight initialization   update allowed   dev    eval
random                  yes              22.6   28.5
GT                      yes              22.4   28.7
GT                      no               24.9   31.1

It can be seen that the manually designed filter bank does not help the DNN much to discover better features compared with fully random initialization. Also, keeping the first layer weights fixed throughout the training rather hurts the recognition performance. This indicates that the initial filter bank configuration is suboptimal, presumably because of a too low frequency resolution and the lack of non-band-pass patterns.
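A rough sketch of how such an initialization matrix could be built from Eqs. (5) and (6) is given below. The paper does not state the form of the impulse response, the filter order, its length, the scaling, or the number of time shifts, so the standard Gammatone form t^(n-1) · exp(-2π f_b t) · cos(2π f_c t) with order 4, the 400-sample length, unit-peak scaling, the filter index running from 1 to 32, and the 8 evenly spaced shifts are all assumptions, and the function names are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16000
L_HZ, Q = 24.7, 9.265   # l and q from the text

def gammatone_parameters(n_filters=32):
    """Centre frequencies (Eq. 5) and bandwidths (Eq. 6) in Hz."""
    i = np.arange(1, n_filters + 1)   # assumed index range
    centre = L_HZ * Q * (np.exp(i / Q) - 1.0)
    bandwidth = L_HZ + centre / Q
    return centre, bandwidth

def gammatone_impulse_response(fc, fb, length=400, order=4):
    """Real part of a Gammatone impulse response, scaled to unit peak."""
    t = np.arange(length) / SAMPLE_RATE
    h = t ** (order - 1) * np.exp(-2 * np.pi * fb * t) * np.cos(2 * np.pi * fc * t)
    return h / np.abs(h).max()

def gammatone_weight_matrix(input_dim=2720, n_filters=32, shifts=8, length=400):
    """One row per (filter, time shift) pair, zero outside the shifted response."""
    centre, bandwidth = gammatone_parameters(n_filters)
    offsets = np.linspace(0, input_dim - length, shifts).astype(int)
    rows = []
    for fc, fb in zip(centre, bandwidth):
        h = gammatone_impulse_response(fc, fb, length)
        for off in offsets:
            row = np.zeros(input_dim)
            row[off : off + length] = h
            rows.append(row)
    return np.stack(rows)

centre, bandwidth = gammatone_parameters()
W = gammatone_weight_matrix()
print(W.shape)  # (256, 2720)
```

With these assumptions the highest centre frequency is about 7 kHz, below the 8 kHz Nyquist frequency. Filling a 2000-row first layer as in the paper would need more shifts per filter than the 8 used here.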

5. Conclusions

In this paper we have shown that hybrid DNN/HMM acoustic models make it possible to obtain reasonable recognition results even without any processing of the raw time signal. The performance gap between raw time signal and conventional MFCC features could be reduced strongly by switching from the sigmoid activation function to rectified linear units. Increasing the amount of training data reduced the gap further.

Our analysis of the learned weights suggests that without any prior knowledge, the DNN is able to learn a set of band pass filters in time domain purely from the raw time signal. We presented a way to interpret the learned parameters: by reordering the rows within the input layer weight matrix, it is possible to see the approximately audiological distribution of the filters. This again nicely confirms the result of many years of research on feature extraction. Further, this result shows a real alternative to the otherwise (mostly) stationary feature extraction pipelines: presenting the DNN with data on sampling frequency level allows the acoustic model to learn non-stationary patterns, localized in time across frame boundaries. Also, the loss of information can be reduced by processing time domain data.

Finally we presented a trade-off between feature dimensionality and level of detail of the underlying audio. By training the DNN on a combination of MFCC, PLP and Gammatone features, the resulting acoustic model outperformed all other systems, even with a large amount of training data. This suggests that the differences in these feature extraction pipelines allow the DNN to gain additional knowledge about the input data.

6. Acknowledgements

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755 (transLectures). This work has received funding from the Quaero Programme funded by OSEO, French State agency for innovation. H. Ney was partially supported by a senior chair award from DIGITEO, a French research cluster in Île-de-France. Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD / ARL) contract no. W911NF-12-C-0012. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.

7. References

[1] F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent deep neural networks for conversational speech transcription,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Hawaii, USA, Dec. 2011, pp. 24–29.

[2] D. Yu, M. L. Seltzer, J. Li, J.-T. Huang, and F. Seide, “Feature learning in deep neural networks - a study on speech recognition tasks,” in International Conference on Learning Representations, Scottsdale, AZ, USA, May 2013.

[3] H. A. Bourlard and N. Morgan, Connectionist speech recognition: a hybrid approach. Norwell, MA, USA: Kluwer Academic Publishers, 1993.

[4] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.

[5] K. Hornik, M. B. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, Jul. 1989.

[6] T. N. Sainath, B. Kingsbury, A.-r. Mohamed, and B. Ramabhadran, “Learning filter banks within a deep neural network framework,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Olomouc, Czech Republic, Dec. 2013, pp. 297–302.

[7] S. Wiesler, A. Richard, R. Schlüter, and H. Ney, “Mean-normalized stochastic gradient for large-scale deep learning,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Florence, Italy, May 2014, pp. 180–184.

[8] C. Plahl, R. Schlüter, and H. Ney, “Improved acoustic feature combination for LVCSR by neural networks,” in Proc. Interspeech, Florence, Italy, Aug. 2011, pp. 1237–1240.

[9] D. Palaz, R. Collobert, and M. Magimai.-Doss, “Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks,” in Proc. Interspeech, Lyon, France, Aug. 2013, pp. 1766–1770.

[10] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Kyoto, Japan, Mar. 2012, pp. 4277–4280.

[11] A. B. Poritz, “Linear predictive hidden Markov models and the speech signal,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 7, Paris, France, May 1982, pp. 1291–1294.

[12] Y. Ephraim and W. J. J. Roberts, “Revisiting autoregressive hidden Markov modeling of speech signals,” IEEE Signal Processing Letters, vol. 12, no. 2, pp. 166–169, Feb. 2005.

[13] J. Yousafzai, Z. Cvetković, and P. Sollich, “Subband acoustic waveform front-end for robust speech recognition using support vector machines,” in Proc. Interspeech, Brighton, UK, Sep. 2009, pp. 2679–2682.

[14] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, Aug. 1980.

[15] R. Schlüter, I. Bezrukov, H. Wagner, and H. Ney, “Gammatone features and feature combination for large vocabulary speech recognition,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, Apr. 2007, pp. 649–652.

[16] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.

[17] Quaero Programme. http://www.quaero.org.

[18] D. Rybach, S. Hahn, P. Lehnen, D. Nolden, M. Sundermeyer, Z. Tüske, S. Wiesler, R. Schlüter, and H. Ney, “RASR - the RWTH Aachen university open source speech recognition toolkit,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Hawaii, USA, Dec. 2011.

[19] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. of the 27th Int. Conf. on Machine Learning, Haifa, Israel, Jun. 2010, pp. 807–814.

[20] B. R. Glasberg and B. C. J. Moore, “Derivation of auditory filter shapes from notched-noise data,” Hearing Research, vol. 47, no. 1-2, pp. 103–138, Aug. 1990.

Feature Replacement and Combination for Hybrid ASR Systems

Preprint

Apr 2021

Acoustic modeling of raw waveform and learning feature extractors as part of the neural network classifier has been the goal of many studies in the area of automatic speech recognition (ASR). Recently, one line of research has focused on frameworks that can be pre-trained on audio-only data in an unsupervised fashion and aim at improving downstream ASR tasks. In this work, we investigate the usefulness of one of these front-end frameworks, namely wav2vec, for hybrid ASR systems. In addition to deploying a pre-trained feature extractor, we explore how to make use of an existing acoustic model (AM) trained on the same task with different features as well. Another neural front-end which is only trained together with the supervised ASR loss as well as traditional Gammatone features are applied for comparison. Moreover, it is shown that the AM can be retrofitted with i-vectors for speaker adaptation. Finally, the described features are combined in order to further advance the performance. With the final best system, we obtain a relative improvement of 4% and 6% over our previous best model on the LibriSpeech test-clean and test-other sets.

An exploration of semi-supervised and language-adversarial transfer learning using hybrid acoustic model for hindi speech recognition

Article

Full-text available

Apr 2021

Semi-supervised training and language adversarial transfer learning are two different techniques to improve the Automatic Speech Recognition (ASR) performance in limited resource conditions. In this paper, we combined these two techniques and presented a common framework for the Hindi ASR system. For acoustic modeling, we proposed a hybrid architecture of SincNet-Convolutional Neural Network (CNN)-Light Gated Recurrent Unit (LiGRU), which shows increased interpretability, high accuracy, and fewer parameter size. We investigate the impact of the proposed hybrid model on monolingual Hindi ASR with semi-supervised training, and multilingual Hindi ASR with language adversarial transfer learning. In this work, we have chosen three Indian languages (Hindi, Marathi, Bengali) of the same Indo-Aryan family for multilingual training. All experiments were conducted using Kaldi and Py-Torch Kaldi toolkits. The proposed model with combined learning strategies helps to get the lowest 5.5% Word Error Rate (WER) for Hindi ASR.

Representation Learning For Speech Recognition Using Feedback Based Relevance Weighting

Preprint

Full-text available

Feb 2021

In this work, we propose an acoustic embedding based approach for representation learning in speech recognition. The proposed approach involves two stages comprising of acoustic filterbank learning from raw waveform, followed by modulation filterbank learning. In each stage, a relevance weighting operation is employed that acts as a feature selection module. In particular, the relevance weighting network receives embeddings of the model outputs from the previous time instants as feedback. The proposed relevance weighting scheme allows the respective feature representations to be adaptively selected before propagation to the higher layers. The application of the proposed approach for the task of speech recognition on Aurora-4 and CHiME-3 datasets gives significant performance improvements over baseline systems on raw waveform signal as well as those based on mel representations (average relative improvement of 15% over the mel baseline on Aurora-4 dataset and 7% on CHiME-3 dataset).

Speech recognition based on Convolutional neural networks and MFCC algorithm

Article

Jan 2021

In this paper, an automatic speech recognition system based on convolutional neural networks and MFCC is proposed. We investigated several deep model architectures with various hyperparameter options, such as dropout rate and learning rate. The dataset used in this paper was collected from the Kaggle TensorFlow Speech Recognition Challenge. Each audio file in the dataset contains one word and is one second long. The words in the dataset span 30 categories, with one category for background noise. The dataset contains 64,721 files, separated into 51,088 for the training set, 6,798 for the validation set and 6,835 for the testing set. We evaluated 3 models with different hyperparameter configurations in order to choose the model with the highest accuracy. The highest accuracy achieved is 88.21%.

Learning Filterbanks from Raw Waveform for Accent Classification

Conference Paper

Jul 2020

Robust Raw Waveform Speech Recognition Using Relevance Weighted Representations

Conference Paper

Oct 2020

Audio Classification - Feature Dimensional Analysis

Chapter

Mar 2021

An audio signal is an analogue signal represented as a one-dimensional function x(t), where t is the continuous variable denoting time. Such signals, generated from diverse sources, can be discerned as music, speech, noise or any combination. For machines to understand them, these audio signals must be represented through extracted features, which describe the composition of the audio signal and its behavior over time. Audio feature extraction can enhance the efficacy of audio processing and hence benefits numerous applications. We present an emotion classification analysis with reference to audio representation (1-dimensional and 2-dimensional), with a focus on audio recordings available in the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset. Classification is based on eight (8) different emotions. We examine the accuracy evaluation metric, averaged over five (5) iterations, for each audio signal representation (raw audio, normalized raw audio and spectrogram). This covers the extraction of features in 1D and 2D as input to the Convolutional Neural Network (CNN). A single-factor analysis of variance (ANOVA) was carried out to test the hypotheses on the obtained accuracy values and to show significance between the different audio signal representations of the dataset. The obtained F-ratio is greater than the critical F-ratio; hence this value lies in the critical region. Thus, there is evidence at the 0.05 significance level that the true means of the varied dataset do differ.

Deep learning-based high-frequency source depth estimation using a single sensor

Article

Mar 2021
J ACOUST SOC AM

The sensitivity of underwater propagation models to acoustic and environmental variability increases with the signal frequency; therefore, realizing accurate acoustic propagation predictions is difficult. Owing to this mismatch between the model and actual scenarios, achieving high-frequency source localization using model-based methods is generally difficult. To address this issue, we propose a deep learning approach trained on real data. In this study, we focused on depth estimation. Several 18-layer residual neural networks were trained on a normalized log-scaled spectrogram that was measured using a single hydrophone. The algorithm was evaluated using measured data transmitted from the linear frequency modulation chirp probe (11–31 kHz) in the shallow-water acoustic variability experiment 2015. The signal was received through two vertical line arrays (VLAs). The proposed method was applied to all 16 sensors of the VLA to determine the estimation performance with respect to the receiver depth. Furthermore, frequency-difference matched field processing was applied to the experimental data for comparison. The results indicate that ResNet can determine complicated features of high-frequency signals and predict depths, regardless of the receiver depth, while exhibiting robustness to environmental and positional variability.

Mobile Phones Know Your Keystrokes through the Sounds from Finger’s Tapping on the Screen

Conference Paper

Nov 2020

Interpretable Representation Learning for Speech and Audio Signals Based on Relevance Weighting

Article

Jan 2020

The learning of interpretable representations from raw data presents significant challenges for time series data like speech. In this work, we propose a relevance weighting scheme that allows the interpretation of the speech representations during the forward propagation of the model itself. The relevance weighting is achieved using a sub-network approach that performs the task of feature selection. A relevance sub-network, applied on the output of the first layer of a convolutional neural network model operating on raw speech signals, acts as an acoustic filterbank (FB) layer with relevance weighting. A similar relevance sub-network applied on the second convolutional layer performs modulation filterbank learning with relevance weighting. The full acoustic model, consisting of relevance sub-networks, convolutional layers and feed-forward layers, is trained for a speech recognition task on noisy and reverberant speech in the Aurora-4, CHiME-3 and VOiCES datasets. The proposed representation learning framework is also applied for the task of sound classification in the UrbanSound8K dataset. A detailed analysis of the relevance weights learned by the model reveals that the relevance weights capture information regarding the underlying speech/audio content. In addition, speech recognition and sound classification experiments reveal that the incorporation of relevance weighting in the neural network architecture improves the performance significantly.

Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition

Conference Paper

May 2012
Acoust Speech Signal Process

Convolutional Neural Networks (CNN) have shown success in achieving translation invariance for many image processing tasks. The success is largely attributed to the use of local filtering and max-pooling in the CNN architecture. In this paper, we propose to apply CNN to speech recognition within the framework of the hybrid NN-HMM model. We propose to use local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance. In our method, a pair of local filtering and max-pooling layers is added at the lowest end of the neural network (NN) to normalize spectral variations of speech signals. In our experiments, the proposed CNN architecture is evaluated in a speaker-independent speech recognition task using the standard TIMIT data sets. Experimental results show that the proposed CNN method can achieve over 10% relative error reduction in the core TIMIT test sets compared with a regular NN using the same number of hidden layers and weights. Our results also show that the best result of the proposed CNN model is better than previously published results on the same TIMIT test sets that use a pre-trained deep NN model.

Subband acoustic waveform front-end for robust speech recognition using support vector machines

Article

Dec 2010

A subband acoustic waveform front-end for robust speech recognition using support vector machines (SVMs) is developed. The primary issues of kernel design for subband components of acoustic waveforms and combination of the individual subband classifiers using stacked generalization are addressed. Experiments performed on the TIMIT phoneme classification task demonstrate the benefits of classification in frequency subbands: the subband classifier outperforms the cepstral classifiers in the presence of noise for signal-to-noise ratios (SNR) below 12 dB.

Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription

Article

Dec 2011

We investigate the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective. Recently, we had shown that for speaker-independent transcription of phone calls (NIST RT03S Fisher data), CD-DNN-HMMs reduced the word error rate by as much as one third—from 27.4%, obtained by discriminatively trained Gaussian-mixture HMMs with HLDA features, to 18.5%—using 300+ hours of training data (Switchboard), 9000+ tied triphone states, and up to 9 hidden network layers.

RASR - The RWTH Aachen University Open Source Speech Recognition Toolkit

Conference Paper

Dec 2011

Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences

Article

Jan 1980
IEEE Trans Acoust Speech Signal Process

Mean-normalized stochastic gradient for large-scale deep learning

Conference Paper

May 2014

Deep neural networks are typically optimized with stochastic gradient descent (SGD). In this work, we propose a novel second-order stochastic optimization algorithm. The algorithm is based on analytic results showing that a non-zero mean of features is harmful for the optimization. We prove convergence of our algorithm in a convex setting. In our experiments we show that our proposed algorithm converges faster than SGD. Further, in contrast to earlier work, our algorithm allows for training models with a factorized structure from scratch. We found this structure to be very useful not only because it accelerates training and decoding, but also because it is a very effective means against overfitting. Combining our proposed optimization algorithm with this model structure, model size can be reduced by a factor of eight and still improvements in recognition error rate are obtained. Additional gains are obtained by improving the Newbob learning rate strategy.

Learning filter banks within a deep neural network framework

Conference Paper

Dec 2013

Mel-filter banks are commonly used in speech recognition, as they are motivated by theory related to speech production and perception. While features derived from mel-filter banks are quite popular, we argue that this filter bank is not really an appropriate choice as it is not learned for the objective at hand, i.e. speech recognition. In this paper, we explore replacing the filter bank with a filter bank layer that is learned jointly with the rest of a deep neural network. Thus, the filter bank is learned to minimize cross-entropy, which is more closely tied to the speech recognition objective. On a 50-hour English Broadcast News task, we show that we can achieve a 5% relative improvement in word error rate (WER) using the filter bank learning approach, compared to having a fixed set of filters.

Multilayer feedforward networks are universal approximators

Article

Jan 1989
NEURAL NETWORKS

Estimating Phoneme Class Conditional Probabilities from Raw Speech Signal using Convolutional Neural Networks

Article

Apr 2013

In a hybrid hidden Markov model/artificial neural network (HMM/ANN) automatic speech recognition (ASR) system, the phoneme class conditional probabilities are estimated by first extracting acoustic features from the speech signal based on prior knowledge, such as speech perception and/or speech production knowledge, and then modeling the acoustic features with an ANN. Recent advances in machine learning techniques, more specifically in the field of image processing and text processing, have shown that such a divide-and-conquer strategy (i.e., separating feature extraction and modeling steps) may not be necessary. Motivated by these studies, in the framework of convolutional neural networks (CNNs), this paper investigates a novel approach, where the input to the ANN is the raw speech signal and the output is phoneme class conditional probability estimates. On the TIMIT phoneme recognition task, we study different ANN architectures to show the benefit of CNNs and compare the proposed approach against the conventional approach, where the spectral-based MFCC feature is extracted and modeled by a multilayer perceptron. Our studies show that the proposed approach can yield comparable or better phoneme recognition performance when compared to the conventional approach. It indicates that CNNs can learn features relevant for phoneme classification automatically from the raw speech signal.

Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks

Article

Jan 2013

Recent studies have shown that deep neural networks (DNNs) perform significantly better than shallow networks and Gaussian mixture models (GMMs) on large vocabulary speech recognition tasks. In this paper, we argue that the improved accuracy achieved by the DNNs is the result of their ability to extract discriminative internal representations that are robust to the many sources of variability in speech signals. We show that these representations become increasingly insensitive to small perturbations in the input with increasing network depth, which leads to better speech recognition performance with deeper networks. We also show that DNNs cannot extrapolate to test samples that are substantially different from the training examples. If the training data are sufficiently representative, however, internal features learned by the DNN are relatively stable with respect to speaker differences, bandwidth differences, and environment distortion. This enables DNN-based recognizers to perform as well as or better than state-of-the-art systems based on GMMs or shallow networks without the need for explicit model adaptation or feature normalization.
