Combinatorial Video Captioning With Deep Learning

Authors

  • Angadi Ahalya and Cheelam Aishwarya Laxmi, B.Tech Students, Department of CSE, Bhoj Reddy Engineering College for Women, India

Abstract

Video captioning is the task of generating natural language descriptions for videos by analyzing visual scenes, objects, and actions. Unlike video subtitling, which transcribes spoken dialogue, video captioning provides a comprehensive interpretation of all visual elements. Traditional approaches relied on rule-based and feature-based methods, which struggled with complex videos due to their rigidity and lack of contextual understanding.
Modern techniques leverage deep learning models, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), to extract video features and generate captions. Recent advancements focus on weakly supervised dense video captioning, which generates descriptions without predefined key events. This approach is particularly useful for long, untrimmed videos where multiple overlapping events occur, improving event recognition and caption accuracy. By combining event captioning with caption localization, this method enhances both contextual understanding and flexibility in video captioning tasks.
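The CNN-plus-RNN pipeline described above can be sketched in miniature: per-frame CNN features are pooled into a single video vector, which initialises a recurrent decoder that emits caption tokens greedily. This is an illustrative toy, not the paper's implementation; the vocabulary, feature sizes, and randomly initialised weights below are all stand-ins for what a trained system would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; a real captioner learns embeddings over thousands of words.
VOCAB = ["<bos>", "<eos>", "a", "dog", "runs"]
V = len(VOCAB)
FEAT = 8   # per-frame CNN feature size (real CNN backbones emit 512-2048 dims)
HID = 8    # decoder hidden size (kept equal to FEAT so pooled features can seed it)

# Random weights stand in for trained parameters (assumed, for illustration only).
W_xh = rng.normal(scale=0.1, size=(V, HID))     # token one-hot row -> hidden input
W_hh = rng.normal(scale=0.1, size=(HID, HID))   # hidden-to-hidden recurrence
W_hy = rng.normal(scale=0.1, size=(HID, V))     # hidden state -> vocabulary logits

def encode(frames):
    """Mean-pool per-frame CNN features into one video vector (simplest encoder)."""
    return frames.mean(axis=0)

def decode(video_vec, max_len=5):
    """Greedy Elman-RNN decoding: start from <bos>, stop at <eos> or max_len."""
    h = np.tanh(video_vec)              # seed the hidden state from the video vector
    token = VOCAB.index("<bos>")
    caption = []
    for _ in range(max_len):
        h = np.tanh(W_xh[token] + h @ W_hh)   # one recurrent step
        logits = h @ W_hy
        token = int(np.argmax(logits))        # greedy choice; beam search is common too
        if VOCAB[token] == "<eos>":
            break
        caption.append(VOCAB[token])
    return caption

frames = rng.normal(size=(16, FEAT))    # 16 frames of fake CNN features
caption = decode(encode(frames))
```

With untrained weights the emitted tokens are arbitrary, but the control flow matches the standard encoder-decoder recipe: visual encoding, sequential generation, and a stop condition. Dense captioning extends this by localising multiple event segments and running a decoder per segment.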

Published

2025-04-24

Section

Articles

How to Cite

Combinatorial Video Captioning With Deep Learning. (2025). International Journal of Engineering and Science Research, 15(2s), 21-26. https://www.ijesr.org/index.php/ijesr/article/view/259