Low-Latency Sentiment and Emotion Mining from Streaming Voice Transcriptions
- DOI
- 10.2991/978-94-6239-713-2_44
- Keywords
- ASR; Multimodal Fusion; BERT; Latency-Accuracy Tradeoff; Word Error Rate (WER); Prosodic Feature Extraction
- Abstract
Real-time speech emotion analysis is crucial in domains such as call center analytics and human-computer interaction. Many existing emotion recognition systems achieve high accuracy, but they often operate offline and overlook the effect of transcription errors and processing delays in real-time settings. This work presents a low-latency, multimodal framework that combines Whisper-based Automatic Speech Recognition (ASR), a fine-tuned BERT-based emotion classifier, and real-time prosodic feature extraction to detect emotion in streaming voice calls. The system processes audio in chunks for real-time prediction and is evaluated on Word Error Rate (WER), emotion accuracy, and overall pipeline latency. A key contribution of this study is an analysis of the correlation between ASR accuracy and emotion prediction confidence, along with the role of three acoustic features (pitch estimated with the YIN algorithm, energy, and zero-crossing rate (ZCR)) in making classification more robust to transcription noise. The proposed multimodal framework achieved an emotion classification accuracy of 32.00% and a weighted F1-score of 0.2280 on the CREMA-D dataset. The optimized pipeline, which uses Whisper Tiny and DistilBERT, reported an average end-to-end latency of 9.78 ms, substantially lower than the 200 ms typically cited as the human perception threshold in conversation.
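The abstract does not give formulas for its metrics or features, so the following is only a minimal sketch of the standard textbook definitions of WER, short-time energy, and ZCR, computed per audio chunk. The function names are hypothetical and are not taken from the paper, the chunk is assumed to be a plain sequence of float samples, and pitch estimation with YIN is left out because it needs a longer implementation or an audio library.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def short_time_energy(chunk: Sequence[float]) -> float:
    """Mean squared amplitude of one chunk (assumed definition)."""
    if not chunk:
        raise ValueError("chunk must not be empty")
    return sum(x * x for x in chunk) / len(chunk)


def zero_crossing_rate(chunk: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(chunk) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(chunk, chunk[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(chunk) - 1)


if __name__ == "__main__":
    # Illustrative inputs only; these are not data from the paper.
    print(word_error_rate("i am very happy today", "i am happy to day"))
    print(short_time_energy([0.5, -0.5, 0.5, -0.5]))
    print(zero_crossing_rate([0.5, -0.5, 0.5, -0.5]))
```

In a streaming setting these functions would be called once per chunk, with the energy and ZCR values passed to the classifier alongside the transcript; how the paper actually fuses them is not stated in the abstract.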
- Copyright
- © 2026 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY - CONF
AU - Mopuri Rishitha
AU - K. Madhumita
AU - M. Chandraleka
AU - M. Rahul Raj
PY - 2026
DA - 2026/06/25
TI - Low-Latency Sentiment and Emotion Mining from Streaming Voice Transcriptions
BT - Proceedings of the International Conference on Advances in Computing Technology and Artificial Intelligence (COMPUTATIA 2026)
PB - Atlantis Press
SP - 591
EP - 606
SN - 2589-4919
UR - https://doi.org/10.2991/978-94-6239-713-2_44
DO - 10.2991/978-94-6239-713-2_44
ID - Rishitha2026
ER -