Multimodal Deep Learning Framework for Video Summarization Using TVSum and SumMe Datasets
- DOI
- 10.2991/978-94-6239-674-6_37
- Keywords
- Video Summarization; Multimodal Deep Learning; TVSum; SumMe; Transformers; Audio-Visual Fusion; Keyshot Selection
- Abstract
Video content on digital platforms is growing rapidly, creating strong demand for automated systems that produce brief, meaningful summaries without losing the underlying information. This paper proposes a multimodal deep learning framework for video summarization, evaluated on the publicly available TVSum and SumMe datasets. The model combines visual, audio, and textual cues to capture richer semantic context than single-modality models. Visual features are extracted with a pretrained CNN backbone, audio cues are modeled with log-mel spectrogram embeddings, and textual cues are obtained by passing automatically generated speech transcripts through a sequence encoder. These multimodal representations are fused by a transformer-based attention mechanism that learns segment importance while preserving temporal coherence. Diversity regularization and coverage constraints are applied during training to reduce redundancy and balance the selected keyshots. Experiments on TVSum and SumMe show higher F1-scores than established baselines, indicating that the framework generalizes across video genres. The findings confirm that multimodal fusion improves semantic interpretation, particularly when visual data is unclear or incomplete. The study provides an effective, scalable, and explainable method for summarizing consumer videos, tutorials, and real-world recordings, with greater informativeness and lower computational cost.
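The abstract reports F1-scores but does not say how they are computed. The following is a minimal sketch, assuming the frame-level overlap convention often used for keyshot summaries on TVSum and SumMe: a predicted summary and a reference summary are each a binary mask over frames, and F1 is the harmonic mean of precision and recall on the overlapping frames. The function name `keyshot_f1` and the toy masks are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def keyshot_f1(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Frame-level F1 between two binary summaries (1 = frame is kept).

    Assumption: both masks cover the same video, frame by frame.
    """
    if len(predicted) != len(reference):
        raise ValueError("summaries must cover the same number of frames")

    # Frames selected by both the predicted and the reference summary.
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    n_pred = sum(1 for p in predicted if p)
    n_ref = sum(1 for r in reference if r)

    # No shared frames (or an empty summary) gives a score of zero.
    if overlap == 0:
        return 0.0

    precision = overlap / n_pred  # share of predicted frames that are correct
    recall = overlap / n_ref      # share of reference frames that are recovered
    return 2 * precision * recall / (precision + recall)


# Toy example: 10 frames, 3 selected in each summary, 2 in common.
predicted_mask = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
reference_mask = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
score = keyshot_f1(predicted_mask, reference_mask)
```

In this toy case precision and recall are both 2/3, so the F1-score is also 2/3. How scores are aggregated over multiple annotators per video is a separate protocol choice that the abstract does not specify.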
- Copyright
- © 2026 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY  - CONF
AU  - Hridyesh Kumar
AU  - Ashish Sharma
AU  - Abhishek Kumar Gupta
AU  - Ankit Upadhyay
AU  - Methily Johri
PY  - 2026
DA  - 2026/05/28
TI  - Multimodal Deep Learning Framework for Video Summarization Using TVSum and SumMe Datasets
BT  - Proceedings of the International Conference on Sustainable Computing and Artificial Intelligence (ICSCAI 2025)
PB  - Atlantis Press
SP  - 445
EP  - 458
SN  - 2352-5401
UR  - https://doi.org/10.2991/978-94-6239-674-6_37
DO  - 10.2991/978-94-6239-674-6_37
ID  - Kumar2026
ER  -