Proceedings of the International Conference on Sustainable Computing and Artificial Intelligence (ICSCAI 2025)

Multimodal Deep Learning Framework for Video Summarization Using TVSum and SumMe Datasets

Authors
Hridyesh Kumar¹, Ashish Sharma², Abhishek Kumar Gupta³, Ankit Upadhyay¹, Methily Johri⁴,*
¹D.S. College, Aligarh, India, 202001
²Department of Technology, JIET, Jodhpur, India, 342008
³New Delhi Institute of Management, New Delhi, India, 110062
⁴School of Computer Science and Engineering, Galgotias University, Greater Noida, India, 201306
*Corresponding author. Email: methily.johri@gmail.com
Corresponding Author
Methily Johri
Available Online 28 May 2026.
DOI
10.2991/978-94-6239-674-6_37
Keywords
Video Summarization; Multimodal Deep Learning; TVSum; SumMe; Transformers; Audio-Visual Fusion; Keyshot Selection
Abstract

Video content on digital platforms is growing rapidly, creating strong demand for automated systems that produce brief, meaningful summaries without losing the underlying information. This paper proposes a multimodal deep learning framework for video summarization using the freely available TVSum and SumMe datasets. The model combines visual, audio, and textual cues to capture richer semantic context than single-modality models. Visual features are extracted with a trained CNN backbone, audio cues are modeled with log-mel spectrogram embeddings, and textual cues are derived from auto-generated speech transcripts passed through a sequence encoder. These multimodal representations are fused through a transformer-based attention mechanism that learns segment importance while preserving temporal coherence. Diversity regularization and coverage constraints are applied during training to reduce redundancy and balance the selected keyshots. Experiments on TVSum and SumMe show higher F1-scores than established baselines, indicating that the framework generalizes across video genres. The findings confirm that multimodal fusion improves semantic interpretation, particularly when visual data is unclear or incomplete. The study provides an effective, scalable, and explainable method for summarizing consumer videos, tutorials, and real-world recordings with higher informativeness and lower computational cost.
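
As a rough illustration of two ideas named in the abstract, the sketch below shows attention-weighted fusion of per-modality embeddings for a single segment, and an overlap-based F1 score between a predicted and a reference keyshot summary. It is not the authors' implementation: the function names, the single query vector, the toy two-dimensional embeddings, and the frame-index representation of summaries are all assumptions made for illustration, and a real transformer would use learned projections and multiple heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_segment(query, modality_embeddings):
    """Attention-weighted fusion of per-modality embeddings for one segment.

    query: hypothetical query vector (learned in a real model).
    modality_embeddings: dict of modality name -> vector, same length as query.
    Returns the fused vector and the attention weight given to each modality.
    """
    names = list(modality_embeddings)
    dim = len(query)
    # Scaled dot-product score between the query and each modality embedding.
    scores = [
        sum(q * x for q, x in zip(query, modality_embeddings[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # Fused vector is the weighted sum of the modality embeddings.
    fused = [
        sum(w * modality_embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


def keyshot_f1(predicted, reference):
    """F1 between two summaries given as collections of selected frame indices."""
    predicted, reference = set(predicted), set(reference)
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)
```

For example, calling `fuse_segment([1.0, 0.0], {"visual": [2.0, 0.0], "audio": [0.0, 1.0], "text": [0.0, 0.0]})` gives the visual embedding the largest weight because it aligns best with the query, and `keyshot_f1(range(0, 10), range(5, 15))` returns 0.5, since half of each summary overlaps the other.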

Copyright
© 2026 The Author(s)
Open Access
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

Volume Title
Proceedings of the International Conference on Sustainable Computing and Artificial Intelligence (ICSCAI 2025)
Series
Advances in Engineering Research
Publication Date
28 May 2026
ISBN
978-94-6239-674-6
ISSN
2352-5401

Cite this article

TY  - CONF
AU  - Hridyesh Kumar
AU  - Ashish Sharma
AU  - Abhishek Kumar Gupta
AU  - Ankit Upadhyay
AU  - Methily Johri
PY  - 2026
DA  - 2026/05/28
TI  - Multimodal Deep Learning Framework for Video Summarization Using TVSum and SumMe Datasets
BT  - Proceedings of the International Conference on Sustainable Computing and Artificial Intelligence (ICSCAI 2025)
PB  - Atlantis Press
SP  - 445
EP  - 458
SN  - 2352-5401
UR  - https://doi.org/10.2991/978-94-6239-674-6_37
DO  - 10.2991/978-94-6239-674-6_37
ID  - Kumar2026
ER  -