Proceedings of the First International Conference on Artificial Intelligence, Smart Technologies and Communications (AISTC 2025)

Linking Language and Vision: A Deep Learning Method for Captioning Images

Authors
Houda Benaliouche1, *, Lina Ines Filali1, Oualid Bouhaddi1
1University Abdelhamid Mehri Constantine 2, Faculty of New Information and Communication Technologies, Constantine, Algeria
*Corresponding author. Email: houda.benaliouche@univ-constantine2.dz
Available Online 5 August 2025.
DOI
10.2991/978-94-6463-805-9_9
Keywords
Image Captioning; Multi-Modal Alignment Loss; VGG16; Flickr30k; LSTM; CNN; Attention Heatmaps
Abstract

Image captioning bridges visual understanding and language comprehension by enabling machines to describe images in natural language. This paper details an approach that fuses Convolutional Neural Networks (CNNs) with attention-augmented Bidirectional Long Short-Term Memory (BiLSTM) networks for improved captioning. Key contributions comprise a Multi-Modal Alignment Loss for tighter alignment of textual and visual representations, attention heatmaps for better model interpretability, and advanced data augmentation for greater robustness. A pre-trained VGG16 model extracts visual features, and performance is evaluated on the Flickr30k dataset with metrics including BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. These enhancements improve the accuracy, readability, and versatility of generated captions, with applications in assistive technologies for people with disabilities and in automatic content creation.
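The abstract names two concrete components: VGG16 feature extraction and a Multi-Modal Alignment Loss. The paper's exact loss definition is not reproduced on this page, so the sketch below illustrates one common realization of such an alignment objective, a bidirectional hinge (triplet-style) loss over cosine similarities in a shared image-text embedding space. PyTorch is assumed, and the projection layer, embedding dimension, and margin are illustrative choices, not necessarily the authors' design.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Visual features: 4096-d fc7 activations from a frozen, pre-trained VGG16.
cnn = vgg16(weights=VGG16_Weights.IMAGENET1K_V1)
cnn.classifier = nn.Sequential(*list(cnn.classifier.children())[:-1])  # drop the 1000-way classifier
cnn.eval()
for p in cnn.parameters():
    p.requires_grad = False

def multimodal_alignment_loss(img_emb, txt_emb, margin=0.2):
    # Bidirectional hinge loss over cosine similarities: matched
    # image/caption pairs (the diagonal of `sim`) should score higher
    # than any mismatched pair in the batch by at least `margin`.
    # This is an assumed formulation, not the paper's verbatim loss.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t()                       # (B, B) similarity matrix
    pos = sim.diag().unsqueeze(1)             # matched-pair similarities
    cost_cap = (margin + sim - pos).clamp(min=0)      # penalize wrong captions
    cost_img = (margin + sim - pos.t()).clamp(min=0)  # penalize wrong images
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_cap.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()

# Usage sketch: project CNN features into a shared 256-d space and align
# them with sentence-level states from the caption decoder. The linear
# projection and the random stand-in for BiLSTM states are hypothetical.
images = torch.randn(8, 3, 224, 224)
img_emb = nn.Linear(4096, 256)(cnn(images))   # image side of the joint space
txt_emb = torch.randn(8, 256)                 # stand-in for BiLSTM sentence states
loss = multimodal_alignment_loss(img_emb, txt_emb)

In practice this auxiliary term would be added to the usual cross-entropy captioning loss, encouraging the encoder and decoder to share a representation in which an image and its caption land close together.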

Copyright
© 2025 The Author(s)
Open Access
This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.


Volume Title
Proceedings of the First International Conference on Artificial Intelligence, Smart Technologies and Communications (AISTC 2025)
Series
Advances in Intelligent Systems Research
Publication Date
5 August 2025
ISBN
978-94-6463-805-9
ISSN
1951-6851
DOI
10.2991/978-94-6463-805-9_9

Cite this article

TY  - CONF
AU  - Houda Benaliouche
AU  - Lina Ines Filali
AU  - Oualid Bouhaddi
PY  - 2025
DA  - 2025/08/05
TI  - Linking Language and Vision: A Deep Learning Method for Captioning Images
BT  - Proceedings of the First International Conference on Artificial Intelligence, Smart Technologies and Communications (AISTC 2025)
PB  - Atlantis Press
SP  - 65
EP  - 75
SN  - 1951-6851
UR  - https://doi.org/10.2991/978-94-6463-805-9_9
DO  - 10.2991/978-94-6463-805-9_9
ID  - Benaliouche2025
ER  -