Multi source neural networks based on fixed and multiple resolution analysis for speech recognition

PEGORARO, PAOLO ATTILIO
2001-01-01

Abstract

This paper reports the results obtained by an Automatic Speech Recognition (ASR) system when Mel-Frequency Cepstral Coefficients (MFCCs), J-RASTA Perceptual Linear Prediction (J-RASTA PLP) coefficients, and energies from a Multi Resolution Analysis (MRA) tree of filters are used as input features to a hybrid system in which a Neural Network (NN) provides the observation probabilities for a network of Hidden Markov Models (HMMs). The paper also compares the performance of the system across various combinations of these features. Combining J-RASTA PLP coefficients with the energies computed at the outputs of the leaves of an MRA filter tree reduces the word error rate (WER) by 20% relative to J-RASTA PLP coefficients alone. Such a combination is practical because the NN architecture is designed to integrate multiple features, exploiting the ability of NNs to mix several input parameters without any assumption about their statistical independence. Recognition is performed on a very large test set that includes many speakers uttering proper names from different locations of the Italian public telephone network.
2001
0780370449
Software

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/191875