Multimodal Deep Learning Framework for Video Summarization Using TVSum and SumMe Datasets

Hridyesh Kumar; Ashish Sharma; Abhishek Kumar Gupta; Ankit Upadhyay; Methily Johri

doi:10.2991/978-94-6239-674-6_37

Authors

Hridyesh Kumar¹, Ashish Sharma², Abhishek Kumar Gupta³, Ankit Upadhyay¹, Methily Johri⁴,*

¹D.S. College, Aligarh, India, 202001

²Department of Technology, JIET, Jodhpur, India, 342008

³New Delhi Institute of Management, New Delhi, India, 110062

⁴School of Computer Science and Engineering, Galgotias University, Greater Noida, India, 201306

*Corresponding author. Email: methily.johri@gmail.com

Corresponding Author

Methily Johri

Available Online 28 May 2026.

DOI: 10.2991/978-94-6239-674-6_37
Keywords: Video Summarization; Multimodal Deep Learning; TVSum; SumMe; Transformers; Audio-Visual Fusion; Keyshot Selection
Abstract: Video content on digital platforms is growing rapidly, creating strong demand for automated systems that produce brief, meaningful summaries without losing the underlying information. This paper proposes a multimodal deep learning framework for video summarization, evaluated on the publicly available TVSum and SumMe datasets. The model combines visual, audio, and textual cues to capture richer semantic context than single-modality models. Visual features are extracted with a trained CNN backbone, audio cues are modeled with log-mel spectrogram embeddings, and textual cues are obtained by passing auto-generated speech transcripts through a sequence encoder. These multimodal representations are fused by a transformer-based attention mechanism that learns segment importance while retaining temporal coherence. Diversity regularization and coverage constraints are applied during training to reduce redundancy and balance the selected keyshots. Experiments on TVSum and SumMe show higher F1-scores than established baselines, indicating that the framework generalizes across video genres. The findings confirm that multimodal fusion improves semantic interpretation, particularly when visual data is unclear or incomplete. The study provides an effective, scalable, and explainable method for summarizing consumer videos, tutorials, and real-world recordings with greater informativeness and lower computational cost.
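The abstract reports results as F1-scores on keyshot summaries. As a rough illustration only, the sketch below computes an F1-score from the frame-level overlap between a predicted summary and a single reference summary. The function name `keyshot_f1`, the binary-mask representation, and the toy masks are assumptions made for this example; the paper's actual evaluation protocol (for instance, how multiple annotators are aggregated) is not described on this page.

```python
def keyshot_f1(predicted, reference):
    """F1 between two binary frame-selection masks of equal length.

    Hypothetical helper: 1 marks a frame included in the summary, 0 a
    frame left out. Not taken from the paper's code.
    """
    if len(predicted) != len(reference):
        raise ValueError("masks must cover the same number of frames")
    # Frames selected by both the prediction and the reference.
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    if overlap == 0:
        return 0.0
    n_pred = sum(1 for p in predicted if p)
    n_ref = sum(1 for r in reference if r)
    precision = overlap / n_pred
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)


# Toy 8-frame video: both summaries pick 3 frames and agree on 2 of them.
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
reference = [1, 0, 0, 0, 1, 1, 0, 0]
score = keyshot_f1(predicted, reference)
print(round(score, 3))
```

Here precision and recall are both 2/3, so the F1-score is also 2/3. A summary that shares no frames with the reference scores 0.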
Open Access: Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

Volume Title: Proceedings of the International Conference on Sustainable Computing and Artificial Intelligence (ICSCAI 2025)
Series: Advances in Engineering Research
Publication Date: 28 May 2026
ISBN: 978-94-6239-674-6
ISSN: 2352-5401

Cite this article

TY  - CONF
AU  - Hridyesh Kumar
AU  - Ashish Sharma
AU  - Abhishek Kumar Gupta
AU  - Ankit Upadhyay
AU  - Methily Johri
PY  - 2026
DA  - 2026/05/28
TI  - Multimodal Deep Learning Framework for Video Summarization Using TVSum and SumMe Datasets
BT  - Proceedings of the International Conference on Sustainable Computing and Artificial Intelligence (ICSCAI 2025)
PB  - Atlantis Press
SP  - 445
EP  - 458
SN  - 2352-5401
UR  - https://doi.org/10.2991/978-94-6239-674-6_37
DO  - 10.2991/978-94-6239-674-6_37
ID  - Kumar2026
ER  -