Image Captioning Using CNN and LSTM
Keywords:
Image captioning, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM)
Abstract
The Image Caption Generator presents an approach to automatically describing image content by integrating computer vision and natural language processing (NLP). Leveraging recent advances in neural networks, NLP, and computer vision, the model combines Convolutional Neural Networks (CNNs), specifically the pre-trained Xception model, for image feature extraction with Recurrent Neural Networks (RNNs) built on Long Short-Term Memory (LSTM) cells for coherent sentence generation. A Beam Search algorithm and an Attention mechanism further improve the accuracy and relevance of the generated captions by exploring multiple candidate caption sequences and dynamically focusing on different parts of the image. Trained over multiple epochs on the 8,000 images of the Flickr8K dataset, each paired with human-annotated descriptions, the model achieves a substantial reduction in training loss. It also incorporates a text-to-speech module based on the pyttsx3 library that speaks the generated captions aloud, improving accessibility for visually impaired users and for those who prefer audio output. Evaluation with the BLEU and METEOR metrics confirms the model's ability to produce coherent and contextually accurate image captions, marking a notable advance in image captioning technology.
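As a rough illustration of the pipeline described above, the following sketch assumes a Keras/TensorFlow implementation. The layer sizes, vocabulary size, and maximum caption length are illustrative placeholders rather than values reported in this work, and the attention and beam-search components are omitted for brevity.

```python
# Minimal sketch of the described CNN-LSTM captioning pipeline.
# Assumes TensorFlow/Keras; hyperparameters below are illustrative only.
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Dropout, add
from tensorflow.keras.models import Model

# 1) Image encoder: pre-trained Xception without its classification head,
#    globally pooled to a single 2048-d feature vector per image.
encoder = Xception(include_top=False, pooling="avg", weights="imagenet")

def extract_features(image_path):
    img = img_to_array(load_img(image_path, target_size=(299, 299)))
    img = preprocess_input(np.expand_dims(img, axis=0))
    return encoder.predict(img)              # shape: (1, 2048)

# 2) Caption decoder: the image feature vector and the partial caption are
#    merged, and an LSTM predicts the next word of the caption.
vocab_size, max_len = 8000, 34               # placeholders, dataset-dependent
img_in = Input(shape=(2048,))
img_feat = Dense(256, activation="relu")(Dropout(0.5)(img_in))
txt_in = Input(shape=(max_len,))
txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_feat = LSTM(256)(Dropout(0.5)(txt_emb))
merged = Dense(256, activation="relu")(add([img_feat, txt_feat]))
out = Dense(vocab_size, activation="softmax")(merged)
decoder = Model([img_in, txt_in], out)
decoder.compile(loss="categorical_crossentropy", optimizer="adam")

# 3) Text-to-speech: speak a generated caption aloud with pyttsx3.
import pyttsx3
engine = pyttsx3.init()
engine.say("a dog runs across the grass")    # example caption text
engine.runAndWait()
```

In this merge-style design, the image is encoded once and reused at every decoding step; during inference, the decoder would be called repeatedly (greedily or with beam search) to extend the caption one word at a time until an end token is produced.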