Noise-Robust Speech Transcription with Quantized Language Model Correction for Industrial Settings
Reforgiato Recupero Diego; Scarpi G.
2025-01-01
Abstract
In this paper, we propose a robust and computationally efficient pipeline for transcribing speech in noisy environments, such as workshops and industrial settings. The pipeline is designed to operate offline, making it suitable for resource-constrained scenarios. It begins with a noise filtering module that preprocesses audio recordings to suppress background noise and enhance speech clarity. The filtered audio is then passed to an Automatic Speech Recognition (ASR) model, which generates an initial transcription. Because challenging acoustic conditions can introduce transcription errors, we incorporate a quantized Small Language Model (SLM), trained on an ontology of defects related to the industrial environment, to post-process and correct these errors. Quantizing the SLM significantly reduces its computational footprint while maintaining correction accuracy, enabling the pipeline to run effectively on low-resource devices. Experimental evaluations demonstrate that the proposed approach improves transcription quality in noisy conditions, highlighting its practicality for offline and resource-limited applications. In particular, preliminary validation on a synthetic dataset of 200 sentences in Italian and English showed consistent F1 scores of 87.04% on Italian sentences and 91.25% on English sentences at SNRs as challenging as -5 dBW (decibel watts), using the least computationally expensive version of Whisper (Whisper Tiny) together with the SLM correction.
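The abstract does not say how the noisy synthetic recordings were produced or how the F1 score was computed. As a rough illustration only, the sketch below assumes that noise is scaled to reach a target SNR (treated here as a power ratio in decibels) and that F1 is a bag-of-words overlap between a reference and a transcribed sentence. The function names, the sine-wave "speech", the Gaussian noise, and the example sentences are all hypothetical and are not taken from the paper.

```python
import math
import random


def mean_power(samples):
    """Average power of a signal given as a list of float samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals `snr_db`, then add it."""
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixed):
    """Recover the SNR in dB from the clean speech and the noisy mixture."""
    residual = [m - s for m, s in zip(mixed, speech)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


def token_f1(reference, hypothesis):
    """Bag-of-words F1 between a reference transcript and a hypothesis (assumed metric)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    remaining = list(ref)
    overlap = 0
    for tok in hyp:
        if tok in remaining:
            remaining.remove(tok)  # each reference token can be matched only once
            overlap += 1
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical stand-ins: a 220 Hz tone as "speech" and Gaussian noise, 1 s at 16 kHz.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=-5.0)

print(round(measured_snr_db(speech, noisy), 2))
print(token_f1("crack on the weld seam", "crack on the well seam"))
```

At -5 dB the noise carries roughly three times the power of the speech, which gives a sense of how demanding the reported test condition is. The paper's actual metric may be defined differently, for example over recognized defect terms rather than over all words.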


