Acoustic Modeling with Deep Neural Networks Using Raw Time Signal for
LVCSR
Zoltán Tüske1, Pavel Golik1, Ralf Schlüter1, Hermann Ney1,2
1Human Language Technology and Pattern Recognition, Computer Science Department,
RWTH Aachen University, 52056 Aachen, Germany
2Spoken Language Processing Group, LIMSI CNRS, Paris, France
{tuske,golik,schlueter,ney}@cs.rwth-aachen.de
Abstract
In this paper we investigate how much feature extraction is re-
quired by a deep neural network (DNN) based acoustic model
for automatic speech recognition (ASR). We decompose the
feature extraction pipeline of a state-of-the-art ASR system step
by step and evaluate acoustic models trained on standard MFCC
features, critical band energies (CRBE), FFT magnitude spec-
trum and even on the raw time signal. The focus is put on raw
time signal as input features, i.e. with essentially zero feature extraction prior to DNN training. Notably, the gap in recog-
nition accuracy between MFCC and raw time signal decreases
strongly once we switch from sigmoid activation function to
rectified linear units, offering a real alternative to standard sig-
nal processing. The analysis of the first layer weights reveals
that the DNN can discover multiple band pass filters in time do-
main. Therefore we try to improve the raw time signal based
system by initializing the first hidden layer weights with im-
pulse responses of an audiologically motivated filter bank. In-
spired by the multi-resolutional analysis layer learned automati-
cally from raw time signal input, we train the DNN on a combi-
nation of multiple short-term features. This illustrates how the
DNN can learn from the small differences between MFCC, PLP
and Gammatone features, suggesting that it is useful to present
the DNN with different views on the underlying audio.
Index Terms: acoustic modeling, raw signal, neural networks
1. Introduction
Since DNN based acoustic models have become a popular alter-
native to the Gaussian mixture models (GMMs), a lot of effort
was put into feature engineering that aimed at finding a repre-
sentation of input audio data that is most suitable for training
of neural networks [1][2]. GMMs are quite sensitive to input
features: the features need to be decorrelated so that a diag-
onal covariance matrix can be used for faster scoring and the
dimension needs to be relatively low. These requirements have
led to a large variety of feature extraction pipelines that build
upon expert knowledge of speech production and perception.
In contrast, hybrid DNN/HMM models [3] have none of these
constraints and a DNN acoustic model can easily be trained on
high dimensional features (several thousands) even with a large
amount of correlation between components. Further, the uni-
versal approximator property of a neural network [4][5] with
(multiple) hidden layers should allow the DNN to learn the nec-
essary (non-linear) feature extraction steps from data. For this
reason we investigated the question: how much feature extrac-
tion can be left for the DNN to discover?
Many groups have found the logarithm of critical band energies (CRBE), extracted e.g. from a Mel filter bank, to be most
suitable for training DNNs. One of the reasons why CRBE of-
ten outperform MFCC or PLP might be the fact that they con-
tain somewhat more high resolution information: while con-
ventional MFCC and PLP range from 13 to 16 dimensions, the
CRBEs used for DNN training are often 20- to 40-dimensional.
Further, many steps of typical feature extraction pipelines boil
down to linear projections, which should be easy to learn from
data. Ultimately, to avoid a loss of information the acoustic
model needs to be trained on full magnitude spectrum, e.g. [6],
or even the raw audio samples of the waveform. While the cross
entropy training is still performed on frame level, the latter case
allows presenting the DNN with a sequence of audio samples without any notion of frame boundaries, thus allowing the neural network to discover all kinds of non-stationary patterns. Such pat-
terns correspond to various phonetic events that are described
poorly with frame-based stationary processing such as FFT.
The cost for processing raw time signal is twofold. First, the
high dimensionality of such feature spaces increases the number
of free parameters. This issue can be counteracted by adjusting
the network topology, e.g. introducing narrow matrix factoriza-
tion layers [7]. Second, the raw time signal features discard the assumption, common to most feature extraction pipelines, that human perceptual resolution is non-linear along the frequency axis, leaving it up to the DNN to discover this non-linearity. Our approach to
tackle this is to initialize the weights of the first layer by im-
pulse responses of Gammatone filters that follow the audiolog-
ical spacing in the frequency domain.
Further, we investigate how the DNN can be presented
with an increased time-frequency resolution without leaving the
framework of conventional feature extraction. This is related to
the concept of feature combination, where different short-term
features describe more or less the same spectral properties of the
signal [8]. However, the differences between the different fea-
ture streams are themselves an additional source of information
about the underlying audio signal.
Previously, a dramatic degradation of recognition accuracy
by training a DNN directly on raw speech signal was reported
in [9]. Instead, the authors used convolutional neural networks
[10] to obtain competitive results on the TIMIT task. Some
works on processing of raw speech signal make use of other
models such as linear predictive models [11][12] or SVMs [13].
They mostly evaluate on small classification tasks, leaving open the question of how far a larger amount of training data would compensate for the described difficulties.
This paper is organized as follows. The different feature ex-
traction pipelines, including the FFT and time signal features,
are summarized in Section 2. The experimental setup is intro-
duced in Section 3 and the results of our investigation are pre-
sented in Section 4. The conclusions are drawn in Section 5.
2. Feature extraction
This section gives a brief overview of the three cepstral features,
the FFT based features and the raw signal features.
2.1. Waveform — “raw” time signal
Processing the audio sampled at 16 kHz with the same 10 ms
steps as in case of typical cepstral features boils down to taking
160 samples from the PCM waveform. The windows are non-
overlapping so that stacking neighboring vectors does not result
in discontinuities. The samples, quantized with 16 bits, need to be normalized to a numerically robust range by performing mean and variance normalization either globally over the com-
plete training data or on the per-utterance level. This can be
interpreted as DC bias removal and loudness equalization and
at the same time it serves numerical purposes to stabilize the
DNN training with gradient descent.
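As a minimal sketch of this preprocessing, assuming the waveform is available as a 1-D NumPy array of 16 kHz PCM samples (the 160-sample frames and the two normalization options follow the description above; everything else is illustrative):

```python
import numpy as np

def raw_signal_features(samples, frame_len=160, per_utterance=True,
                        global_mean=None, global_std=None):
    """Cut a 16 kHz waveform into non-overlapping 10 ms frames (160 samples)
    and apply mean/variance normalization, per utterance or with global stats."""
    samples = samples.astype(np.float32)
    if per_utterance:
        mean, std = samples.mean(), samples.std()
    else:
        mean, std = global_mean, global_std      # precomputed over the training data
    samples = (samples - mean) / (std + 1e-8)    # DC bias removal + loudness equalization
    n_frames = len(samples) // frame_len
    return samples[:n_frames * frame_len].reshape(n_frames, frame_len)
```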
2.2. Amplitude spectrum — FFT
In contrast to raw time signal, the short-time Fourier transform
(STFT) is performed on overlapping windows of 25 ms. The
samples are zero-padded to a window of size 2^9 = 512 and weighted
with a Hanning function, which exhibits smaller side lobes in
the amplitude spectrum than a rectangular window. The 512-
FFT results in a 257-dimensional vector due to the symmetry of
the amplitude spectrum. The phase spectrum is discarded.
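A sketch of this STFT front end with the stated parameters (25 ms Hanning window, 10 ms shift, zero-padding to 512 samples, 257 magnitude bins); the NumPy implementation is only illustrative and not the toolkit actually used:

```python
import numpy as np

def fft_magnitude_features(samples, sr=16000, win_ms=25, hop_ms=10, n_fft=512):
    """25 ms Hanning-windowed frames with 10 ms shift, zero-padded to 512 samples;
    keep the 257-dimensional magnitude spectrum and discard the phase."""
    win_len, hop = int(sr * win_ms / 1000), int(sr * hop_ms / 1000)
    window = np.hanning(win_len)
    frames = []
    for start in range(0, len(samples) - win_len + 1, hop):
        frame = samples[start:start + win_len] * window
        spec = np.fft.rfft(frame, n=n_fft)       # zero-pads to n_fft samples
        frames.append(np.abs(spec))              # 257 = n_fft/2 + 1 bins
    return np.array(frames)
```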
2.3. Mel-Frequency cepstral coefficients — MFCC
The feature extraction is based on the STFT of the pre-
emphasized speech signal [14]. The amplitude spectrum is inte-
grated by a filterbank with the triangular filters being equidis-
tantly spaced on Mel-scale. The MFCC features are extracted
from the logarithm filter outputs (also referred to as CRBE) by
applying discrete cosine transform (DCT).
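The final MFCC steps (log Mel filter bank outputs, i.e. CRBE, followed by a DCT) could look roughly as follows; the precomputed triangular Mel filter bank matrix `mel_fbank` and the number of kept coefficients are assumptions for illustration:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_spectrum(mag_spec, mel_fbank, n_ceps=16):
    """CRBE = log of Mel filter bank outputs applied to the amplitude spectrum;
    MFCC = DCT of the CRBE, keeping the first n_ceps coefficients."""
    crbe = np.log(np.maximum(mag_spec @ mel_fbank.T, 1e-10))   # (frames, n_filters)
    return dct(crbe, type=2, axis=-1, norm='ortho')[:, :n_ceps]
```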
2.4. Gammatone features — GT
Instead of the STFT based analysis, the features are extracted
from an audiologically motivated filterbank realized by time-
domain Gammatone filters [15]. The auditory filters are placed
equidistantly on Greenwood-scale. After spectral and temporal
integration the 10th root is taken instead of the logarithm and
the DCT is applied for decorrelation.
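A hedged sketch of the compression and decorrelation steps, assuming the spectrally and temporally integrated Gammatone filter outputs are already available as `fbank_energies`:

```python
import numpy as np
from scipy.fftpack import dct

def gammatone_cepstra(fbank_energies, n_ceps=16):
    """Gammatone features: 10th-root compression of the integrated filter
    outputs instead of the logarithm, followed by a DCT for decorrelation."""
    compressed = np.power(np.maximum(fbank_energies, 1e-10), 0.1)
    return dct(compressed, type=2, axis=-1, norm='ortho')[:, :n_ceps]
```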
2.5. Perceptual linear predictive coefficients — PLP
These features are again based on the STFT of speech [16].
Simulating the critical band masking, the amplitude spectrum is
integrated with trapezoid filters equally spaced on Bark-scale.
The filterbank output is pre-emphasized according to equal-
loudness curve. To simulate the relation between the intensity
and perceived loudness of sound, cubic root amplitude com-
pression is performed followed by all-pole model parameter es-
timation (linear predictive (LP) analysis). The autoregressive
coefficients are directly transformed to cepstral coefficients.
3. Experimental setup
The acoustic model training is performed w.r.t. frame-wise
cross entropy criterion on 50 hours of speech from the
Quaero [17] English database train11, which amounts to ca.
16 million input vectors. The development and evaluation sets
consist of ca. 3.5 hours of speech each, corresponding to about
1.2 million vectors. Some experiments are presented on a large
250 hours set from the same corpus train11. A 4-gram language
model (LM) is used during the recognition.
Throughout all experiments we use 6 hidden layers with
2000 hidden units in each layer. The output layer with 4500
nodes corresponds to the generalized triphones tied by a pho-
netic classification and regression tree (CART). The number of
trainable weights amounts to approx. 30M-35M depending on
the features used. The input features always correspond to 17
stacked frames so that the overall amount of temporal context
presented to the DNN at once is the same. The mini-batches of
size 512 are drawn from the shuffled training set. The weights
are initialized via discriminative pre-training (DPT) [1].
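For illustration only, a comparable topology could be written in PyTorch as below; the authors trained with the RASR toolkit, so only the layer sizes and the input dimensions follow the description above, everything else is an assumption:

```python
import torch.nn as nn

def make_dnn(feat_dim, context=17, hidden=2000, layers=6, outputs=4500,
             activation=nn.Sigmoid):
    """Feed-forward DNN over 17 stacked frames: 6 hidden layers of 2000 units,
    4500 CART-tied triphone state outputs, trained with frame-wise cross entropy."""
    dims = [feat_dim * context] + [hidden] * layers
    modules = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        modules += [nn.Linear(d_in, d_out), activation()]
    modules.append(nn.Linear(hidden, outputs))   # softmax is applied inside the CE loss
    return nn.Sequential(*modules)

# e.g. raw time signal input: 160 samples per frame, 17 frames -> 2720 inputs
# model = make_dnn(feat_dim=160, activation=nn.ReLU)
```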
The ASR baseline system is a conventional GMM/HMM
based model trained on the same database w.r.t. the maximum
likelihood criterion. We applied linear discriminant analysis
(LDA) to 9 consecutive MFCC frames to obtain the final 45-
dimensional features. The GMM with a globally pooled diago-
nal covariance matrix consists of approx. 660k densities, which
corresponds to about 30M trainable parameters. For acoustic
training and recognition we used the RASR toolkit [18].
4. Results
In the first experiment we compared the baseline results ob-
tained with the GMM and DNN acoustic models on MFCC fea-
tures normalized for mean/variance and the vocal tract length
(VTLN). The results are shown in Table 1. Unless stated oth-
erwise, the training is performed on 50 hours of speech. The
same DNN configuration was trained on the raw time signal
as described in Section 2.1. As expected, the MFCC-based
DNN model outperforms the GMM, but the WER of the sys-
tem trained on raw time signal is still significantly higher.
Table 1: Baseline results. WER in %.
Features model dev eval
MFCC GMM 24.4 31.6
MFCC DNN 19.4 25.3
time signal DNN 29.4 36.8
In the next experiment we wanted to figure out how the recognition accuracy depends on the various preprocessing
steps. For this purpose we decomposed the MFCCs step by
step and measured the performance. Table 2 shows the word
error rate (WER) after each step. The results indicate that with-
out mean/variance normalization and VTLN, the gap between
MFCCs and FFT features decreases significantly.
4.1. Feature combination
From the results in Table 2 it is clear that the presented features differ in dimensionality by an order of magnitude. Still, MFCC outperforms the high-dimensional FFT and time
signal features. How can we increase the amount of informa-
tion within the framework of low-dimensional features? As de-
scribed in Section 2, the different short-term feature extraction
pipelines cover slightly different representations of the underly-
ing audio. Hoping that the DNN can extract useful information
from these differences we performed feature combination fol-
lowing the approach of [8]. The results in Table 3 confirm that
a DNN, being a powerful classifier, can learn more from multiple feature streams than from any single feature set.
Table 2: Feature preprocessing and normalization for DNN AM.
Dimension of a single feature vector. WER in %.
Features                                   dim.  dev   eval
MFCC        + global norm.                  16   19.8  26.1
            + utterance norm.               16   19.7  25.5
            + VTLN                          16   19.4  25.3
MFCC        + VTLN + utterance norm.        20   19.1  25.2
CRBE        + VTLN + utterance norm.        20   19.5  25.7
CRBE        + VTLN + utterance norm.        40   19.7  26.2
|FFT|       + global norm.                 257   21.3  27.8
            + 10th root                    257   21.0  27.5
            + utterance norm. + 10th root  257   20.6  26.8
time signal + global norm.                 160   29.4  36.8
            + utterance norm.              160   28.9  35.0
Table 3: Feature combination. WER in %.
Features dev eval
MFCC 19.1 25.2
PLP 19.2 24.8
GT 19.2 25.5
MFCC + PLP + GT 18.4 24.2
4.2. Analysis of the input layer trained on time signal
Having obtained surprisingly reasonable results on the normal-
ized raw time signal, we were curious what kind of patterns
could have been learned by the neural network. Although the
analysis of all parameters remains infeasible, we could detect
clearly interpretable patterns within the first layer of the fully
trained DNN. Figure 1 shows the weights learned by four of
the hidden nodes of the first layer. Apparently, the DNN man-
aged to learn some kind of impulse responses that correspond
to band pass filters and other patterns (e.g. short bursts) purely
from data. In order to illustrate the spectral properties of the dis-
covered filters, we zero-padded every row in the weight matrix
to 8000 entries, calculated the magnitude spectrum
$W_i = |\mathrm{FFT}\{w_{i,\cdot}\}| \in \mathbb{R}^{1 \times 8000}, \quad 1 \le i \le 2000$   (1)
and sorted the rows by the location of the most prominent
“blob”. The position of the blob was calculated after smoothing
the spectrum with a Gaussian kernel $g$ as
$\hat{W}_i = W_i * g$   (2)
$f_c^i = \operatorname{argmax}_{1 \le j \le 8000} \{\hat{W}_{i,j}\}$   (3)
Assuming that every row can be interpreted as a band pass im-
pulse response, the location of the blob corresponds to the cen-
ter frequency of the learned transfer function. Figure 2 shows
the obtained spectra as $20 \log_{10} W_i$. It can be seen that, without
any prior knowledge, the DNN has discovered a large number
of band pass like filters that exhibit roughly the audiological
distribution. This means that the number of narrow band pass filters in the lower frequency region is quite high, while the bandwidth of the filters becomes larger with increasing center frequency.
Also the distribution of the center frequencies is non-linear.
Figure 1: Four rows (w1, w2, w3, w4) from the first layer weight matrix trained on raw time signal, plotted as weight value over time in samples. The time range corresponds to 17 frames of 10 ms (17 · 10 ms · 16 kHz = 2720 samples).
Figure 2: Amplitude spectra of the reordered rows from the first layer weight matrix trained on time signal (frequency 0-8 kHz vs. reordered hidden units; magnitude in dB).
The bandwidth of the transfer function can be calculated as the equivalent noise bandwidth by
$f_b^i = \frac{\sum_j W_{i,j}^2}{(\max_j W_{i,j})^2}$   (4)
Figure 3 shows a scatter plot of the approximated parameters $f_c$ and $f_b$ of the learned filters.
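A sketch of this analysis following equations (1)-(4), assuming W is the trained first-layer weight matrix (2000 x 2720); the width of the Gaussian smoothing kernel is not specified in the paper and is an assumption:

```python
import numpy as np

def analyze_first_layer(W, n_pad=8000, sigma=20):
    """Approximate center frequency (argmax of the smoothed magnitude spectrum,
    eqs. 1-3) and equivalent noise bandwidth (eq. 4) for each weight row, in FFT bins."""
    spectra = np.abs(np.fft.fft(W, n=n_pad, axis=1))           # eq. (1): zero-padded rows
    t = np.arange(-4 * sigma, 4 * sigma + 1)
    g = np.exp(-0.5 * (t / sigma) ** 2)
    g /= g.sum()                                                # normalized Gaussian kernel
    centers, bandwidths = [], []
    for Wi in spectra:
        Wi_hat = np.convolve(Wi, g, mode='same')                # eq. (2): smoothing
        centers.append(np.argmax(Wi_hat))                       # eq. (3): most prominent "blob"
        bandwidths.append((Wi ** 2).sum() / Wi.max() ** 2)      # eq. (4): ENB in bins
    return np.array(centers), np.array(bandwidths)
```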
Remarkably, the position of the filters in time is not re-
stricted to the center of the stacked audio samples, but is scat-
tered across left and right context approximately uniformly.
These shifts (or time offsets) are expressed in the phase spec-
trum and are therefore not visible in Figure 2. This distribution
indicates that the DNN was able to learn different filters for dif-
ferent parts of the presented audio context. Also, none of the
learned narrow filters exhibits multiple passbands.
4.3. Rectified linear units and large scale experiments
In the following set of experiments we investigated how much we can further reduce the gap in recognition accuracy between
the various feature configurations by (a) switching the activa-
tion function and (b) increasing the amount of training data.
First we compared the sigmoid activation function with rectified linear units (ReLU) [19].
Table 4: Feature and activation function comparison, training
on 50h. WER in %.
Features dev eval
sigmoid ReLU sigmoid ReLU
MFCC 19.1 18.0 25.2 23.8
MFCC + PLP + GT 18.4 16.6 24.2 21.7
|FFT| 20.6 18.4 26.8 24.7
time signal 28.9 22.6 35.0 28.5
Figure 3: Scatter plot of the approximated parameters of the learned filter bank (center frequency in kHz vs. bandwidth in Hz), showing the learned filters, their least-squares trend, and the audiological filter bank.
Table 5: Feature and activation function comparison, training
on 250h. WER in %.
Features dev eval
sigmoid ReLU sigmoid ReLU
MFCC 15.2 15.9 20.4 21.1
MFCC + PLP + GT 14.8 14.0 19.8 18.9
|FFT| 16.1 15.8 21.6 21.5
time signal 19.2 17.6 25.6 23.5
From previous experience we know that ReLUs are sensitive to regularization, so we used L2-regularization with a value of 0.0001. In contrast, sigmoid
non-linearities perform best with no regularization at all. The
results shown in Table 4 suggest that the ReLUs have a stronger
effect on the systems with high error rates, which is presumably
due to a more difficult optimization problem. In addition, we
repeated these experiments with DNNs trained on 250 hours of
speech. Table 5 shows the obtained results. Further large scale
experiments revealed that increasing the number of hidden lay-
ers up to 12 narrowed the performance gap between MFCC and
raw time signal, achieving 20.9% WER on the evaluation cor-
pus.
4.4. Manual weight initialization with audiological filters
After we observed the filter shapes that have been learned from
raw time signal, we investigated, whether we can initialize the
weights of the first hidden layer in a way that makes it easier for
the DNN to discover further meaningful filters during training
with gradient descent. For this purpose we calculated the real
part of impulse responses of a stationary Gammatone filter bank
that follows the audiological filter distribution [20]. The param-
eters of the 32 filters were defined as follows (with $l = 24.7$ and $q = 9.265$):
$f_c^i = l \cdot q \cdot (e^{i/q} - 1)$   (5)
$f_b^i = l + f_c^i / q$   (6)
In order to account for different positions in time we created
multiple shifted copies of each filter’s impulse response to ob-
tain a weight matrix of the same size as the randomly initial-
ized first layer weights in the previous experiments.
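A rough sketch of how such an initialization matrix could be constructed; equations (5) and (6) are taken from above, while the filter order, the 1.019·ERB bandwidth scaling, the normalization, and the exact shifting scheme are assumptions made only for illustration:

```python
import numpy as np

def gammatone_init(n_units=2000, n_inputs=17 * 160, n_filters=32, sr=16000,
                   l=24.7, q=9.265, order=4):
    """First-layer weight matrix initialized with real Gammatone impulse responses
    on the audiological (ERB) scale, replicated at multiple time shifts."""
    i = np.arange(1, n_filters + 1)
    fc = l * q * (np.exp(i / q) - 1.0)            # eq. (5): center frequencies in Hz
    fb = l + fc / q                               # eq. (6): ERB bandwidths in Hz
    t = np.arange(n_inputs) / sr
    n_shifts = n_inputs // 160                    # one offset per 10 ms frame
    W = np.zeros((n_units, n_inputs), dtype=np.float32)
    for row in range(n_units):
        k = row % n_filters                       # cycle through the 32 filters
        shift = ((row // n_filters) % n_shifts) * 160
        env = t ** (order - 1) * np.exp(-2 * np.pi * 1.019 * fb[k] * t)
        ir = env * np.cos(2 * np.pi * fc[k] * t)  # real part of the impulse response
        W[row] = np.roll(ir / (np.abs(ir).max() + 1e-8), shift)
    return W
```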
Table 6: Weight initialization for learning from raw time signal.
WER in %.
Weight initialization   update allowed   dev    eval
random                  yes              22.6   28.5
GT                      yes              22.4   28.7
GT                      no               24.9   31.1
Table 6 shows the comparison of three different approaches: random initialization (as in Table 4), initialization by a Gammatone filter
bank with regular weight update through backpropagation, and
a fixed Gammatone filter bank layer with no update through-
out the training. The latter case corresponds to a fixed “fea-
ture extraction layer” where only layers above the first one are
trained, so that we can assess whether the DNN can improve upon the initialized weights through backpropagation.
It can be seen that the manually designed filter bank does
not help the DNN much to discover better features compared
with fully random initialization. Also, keeping the first layer
weights fixed throughout the training rather hurts the recogni-
tion performance. This indicates that the initial filter bank con-
figuration is suboptimal, presumably because of a too low fre-
quency resolution and the lack of non-band-pass patterns.
5. Conclusions
In this paper we have shown that using hybrid DNN/HMM
acoustic models allows obtaining reasonable recognition results
even without any processing of the raw time signal. The per-
formance gap between raw time signal and conventional MFCC
features could be reduced strongly by switching from sigmoid
activation function to rectified linear units. The amount of train-
ing data further reduced the gap.
Our analysis of the learned weights suggests that without
any prior knowledge, the DNN is able to learn a set of band
pass filters in time domain purely from the raw time signal. We
presented a way to interpret the learned parameters: by reorder-
ing the rows within the input layer weight matrix, it is possible
to see the approximately audiological distribution of the filters.
This again nicely confirms the result of many years of research
on feature extraction. Further, this result shows a real alternative
to the otherwise (mostly) stationary feature extraction pipelines:
presenting the DNN with data at the sampling frequency level al-
lows the acoustic model to learn non-stationary patterns, local-
ized in time across frame boundaries. Also, the loss of informa-
tion can be reduced by processing time domain data.
Finally we presented a trade-off between feature dimen-
sionality and level of detail of the underlying audio. By training
the DNN on a combination of MFCC, PLP and Gammatone fea-
tures, the resulting acoustic model outperformed all other sys-
tems, even with a large amount of training data. This suggests
that the differences in these feature extraction pipelines allow
the DNN to gain additional knowledge about the input data.
6. Acknowledgements
The research leading to these results has received funding
from the European Union Seventh Framework Programme
(FP7/2007-2013) under grant agreement no. 287755 (transLec-
tures). This work has received funding from the Quaero Pro-
gramme funded by OSEO, French State agency for innovation.
H. Ney was partially supported by a senior chair award from
DIGITEO, a French research cluster in Île-de-France. Sup-
ported by the Intelligence Advanced Research Projects Activ-
ity (IARPA) via Department of Defense U.S. Army Research
Laboratory (DoD / ARL) contract no. W911NF-12-C-0012.
The U.S. Government is authorized to reproduce and distribute
reprints for Governmental purposes notwithstanding any copy-
right annotation thereon. Disclaimer: The views and con-
clusions contained herein are those of the authors and should
not be interpreted as necessarily representing the official poli-
cies or endorsements, either expressed or implied, of IARPA,
DoD/ARL, or the U.S. Government.
7. References
[1] F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering
in context-dependent deep neural networks for conversational
speech transcription,” in Proc. IEEE Automatic Speech Recogni-
tion and Understanding Workshop (ASRU), Hawaii, USA, Dec.
2011, pp. 24–29.
[2] D. Yu, M. L. Seltzer, J. Li, J.-T. Huang, and F. Seide, “Feature
learning in deep neural networks - a study on speech recognition
tasks,” in International Conference on Learning Representations,
Scottsdale, AZ, USA, May 2013.
[3] H. A. Bourlard and N. Morgan, Connectionist speech recognition:
a hybrid approach. Norwell, MA, USA: Kluwer Academic Pub-
lishers, 1993.
[4] G. Cybenko, “Approximation by superpositions of a sigmoidal
function,” Mathematics of Control, Signals and Systems, vol. 2,
no. 4, pp. 303–314, 1989.
[5] K. Hornik, M. B. Stinchcombe, and H. White, “Multilayer feed-
forward networks are universal approximators,” Neural Networks,
vol. 2, no. 5, pp. 359–366, Jul. 1989.
[6] T. N. Sainath, B. Kingsbury, A.-r. Mohamed, and B. Ramabhad-
ran, “Learning filter banks within a deep neural network frame-
work,” in Proc. IEEE Automatic Speech Recognition and Un-
derstanding Workshop (ASRU), Olomouc, Czech Republic, Dec.
2013, pp. 297–302.
[7] S. Wiesler, A. Richard, R. Schlüter, and H. Ney, “Mean-
normalized stochastic gradient for large-scale deep learning,” in
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Process-
ing, Florence, Italy, May 2014, pp. 180–184.
[8] C. Plahl, R. Schlüter, and H. Ney, “Improved acoustic feature
combination for LVCSR by neural networks,” in Proc. Inter-
speech, Florence, Italy, Aug. 2011, pp. 1237–1240.
[9] D. Palaz, R. Collobert, and M. Magimai.-Doss, “Estimating
phoneme class conditional probabilities from raw speech signal
using convolutional neural networks,” in Proc. Interspeech, Lyon,
France, Aug. 2013, pp. 1766–1770.
[10] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Apply-
ing convolutional neural networks concepts to hybrid NN-HMM
model for speech recognition,” in Proc. IEEE Int. Conf. on Acous-
tics, Speech and Signal Processing, Kyoto, Japan, Mar. 2012, pp.
4277–4280.
[11] A. B. Poritz, “Linear predictive hidden Markov models and the
speech signal,” in Proc. IEEE Int. Conf. on Acoustics, Speech and
Signal Processing, vol. 7, Paris, France, May 1982, pp. 1291–
1294.
[12] Y. Ephraim and W. J. J. Roberts, “Revisiting autoregressive hid-
den Markov modeling of speech signals,” IEEE Signal Processing
Letters, vol. 12, no. 2, pp. 166–169, Feb. 2005.
[13] J. Yousafzai, Z. Cvetković, and P. Sollich, “Subband acoustic
waveform front-end for robust speech recognition using support
vector machines,” in Proc. Interspeech, Brighton, UK, Sep. 2009,
pp. 2679–2682.
[14] S. B. Davis and P. Mermelstein, “Comparison of parametric rep-
resentations for monosyllabic word recognition in continuously
spoken sentences,” IEEE Transactions on Acoustics, Speech, and
Signal Processing, vol. 28, no. 4, pp. 357–366, Aug. 1980.
[15] R. Schlüter, I. Bezrukov, H. Wagner, and H. Ney, “Gamma-
tone features and feature combination for large vocabulary speech
recognition,” in Proc. IEEE Int. Conf. on Acoustics, Speech and
Signal Processing, Honolulu, Hawaii, USA, Apr. 2007, pp. 649–
652.
[16] H. Hermansky, “Perceptual linear predictive (PLP) analysis of
speech,” Journal of the Acoustical Society of America, vol. 87,
no. 4, pp. 1738–1752, 1990.
[17] Quaero Programme. http://www.quaero.org.
[18] D. Rybach, S. Hahn, P. Lehnen, D. Nolden, M. Sundermeyer,
Z. Tüske, S. Wiesler, R. Schlüter, and H. Ney, “RASR -
the RWTH Aachen university open source speech recognition
toolkit,” in Proc. IEEE Automatic Speech Recognition and Un-
derstanding Workshop (ASRU), Hawaii, USA, Dec. 2011.
[19] V. Nair and G. E. Hinton, “Rectified linear units improve restricted
Boltzmann machines,” in Proc. of the 27th Int. Conf. on Machine
Learning, Haifa, Israel, Jun. 2010, pp. 807–814.
[20] B. R. Glasberg and B. C. J. Moore, “Derivation of auditory filter
shapes from notched-noise data,” Hearing Research, vol. 47, no.
1-2, pp. 103–138, Aug. 1990.