QoE Estimation of WebRTC-based Audio-visual Conversations from Facial and Speech Features

Bingol, Gulnaziye; Porcu, Simone; Floris, Alessandro; Atzori, Luigi
2024-01-01

Abstract

The utilization of users’ facial- and speech-related features for the estimation of the Quality of Experience (QoE) of multimedia services is still under-investigated despite its potential. Currently, only the use of either facial or speech features individually has been proposed, and only limited experiments have been performed in this regard. To advance in this respect, in this study we focused on WebRTC-based videoconferencing, where it is often possible to capture both the facial expressions and the vocal speech characteristics of the users. First, we performed a thorough statistical analysis to identify the most significant facial- and speech-related features for QoE estimation, which we extracted from the participants’ audio-video data collected during a subjective assessment. Second, we trained individual machine learning-based QoE estimation models on the separate facial and speech datasets. Finally, we employed data fusion techniques to combine the facial and speech datasets into a single dataset, so as to enhance the QoE estimation performance through the integrated knowledge provided by the fusion of facial and speech features. The obtained results demonstrate that the data fusion technique based on Improved Centered Kernel Alignment (ICKA) reaches a mean QoE estimation accuracy of 0.93, whereas values of 0.78 and 0.86 are reached when using only facial or speech features, respectively.
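The abstract outlines a three-step pipeline: statistical selection of facial- and speech-related features, per-modality machine-learning models, and a fused multimodal model. The following is a minimal illustrative sketch of that workflow in Python with scikit-learn, using synthetic placeholder data and simple concatenation-based feature fusion; it does not reproduce the paper's datasets, its selected features, or its ICKA-based fusion technique.

```python
# Minimal sketch of the workflow described in the abstract:
# (1) statistical feature selection, (2) per-modality models, (3) fused model.
# Synthetic data and naive concatenation stand in for the paper's actual
# facial/speech datasets and its ICKA-based fusion (not reproduced here),
# so the reported accuracies are meaningless placeholders.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 500
facial = rng.normal(size=(n, 20))   # placeholder facial-expression features
speech = rng.normal(size=(n, 15))   # placeholder speech features
qoe = rng.integers(1, 6, size=n)    # placeholder QoE labels (e.g., 5-point scale)

def select_features(X, y, k=10):
    """Keep the k features most associated with the labels (ANOVA F-test)."""
    return SelectKBest(f_classif, k=k).fit_transform(X, y)

facial_sel = select_features(facial, qoe)
speech_sel = select_features(speech, qoe)
fused = np.hstack([facial_sel, speech_sel])  # naive feature-level fusion

def evaluate(X, y, name):
    """Train a classifier on a hold-out split and print its accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print(f"{name}: accuracy = {accuracy_score(y_te, model.predict(X_te)):.2f}")

evaluate(facial_sel, qoe, "facial only")
evaluate(speech_sel, qoe, "speech only")
evaluate(fused, qoe, "fused (concatenation)")
```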
Keywords: Information systems; Multimedia streaming; Quality of Experience; WebRTC; Facial Expressions; Speech; Machine Learning; Data Fusion
Files in this item:

File: qoe_estim_acm.pdf (open access)
Description: online article
Type: published version (VoR)
Size: 2.1 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/390924