World Journal of Advanced Research and Reviews

eISSN: 2581-9615 || CODEN: WJARAI || Impact Factor 8.2

Multimodal AI framework for image captioning, story generation and natural speech narration

Ashwani Attri, Priyanka Gudeboyena, Vaishnavi Chigurla *, Soumika Moluguri and Nithin Kasoju

Department of Computer Science and Engineering (Data Science), ACE Engineering College, Telangana, India.

Research Article

World Journal of Advanced Research and Reviews, 2025, 26(02), 1037-1044

Article DOI: 10.30574/wjarr.2025.26.2.1685

DOI url: https://doi.org/10.30574/wjarr.2025.26.2.1685

Received on 27 March 2025; revised on 03 May 2025; accepted on 06 May 2025

With the increasing ubiquity of digital imagery, there is a growing need for intelligent systems capable of understanding visual content and expressing that understanding in human-like language. This paper presents an AI-based pipeline that generates captions from images, constructs vivid stories based on those captions, and finally delivers them in a human voice. The proposed system integrates four components: a Convolutional Neural Network (VGG16) for extracting visual features, an LSTM-based sequence model for caption generation, GPT-2 for creative story generation, and Google Text-to-Speech (gTTS) for voice synthesis. The result is a multimodal AI framework capable of transforming static images into rich, spoken narratives. This approach has applications in assistive technologies, interactive storytelling, content automation, and education. The proposed model is trained and evaluated on the Flickr8k dataset, demonstrating a viable path for automated visual storytelling.
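
The four-stage flow described in the abstract (visual features, caption, story, speech) can be illustrated with a minimal sketch. This is not the authors' implementation: the class `NarrationPipeline` and the `stub_*` functions are hypothetical names introduced here, and the stubs return fixed placeholder values instead of loading VGG16, an LSTM decoder, GPT-2, or gTTS. The sketch only shows how the output of each stage might feed the next.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NarrationPipeline:
    """Chains the four stages named in the abstract; each stage is pluggable."""
    extract_features: Callable[[bytes], List[float]]  # e.g. a VGG16 encoder
    generate_caption: Callable[[List[float]], str]    # e.g. an LSTM decoder
    generate_story: Callable[[str], str]              # e.g. GPT-2 prompted with the caption
    synthesize_speech: Callable[[str], bytes]         # e.g. gTTS producing audio bytes

    def run(self, image: bytes) -> dict:
        # Each stage consumes only the previous stage's output.
        features = self.extract_features(image)
        caption = self.generate_caption(features)
        story = self.generate_story(caption)
        audio = self.synthesize_speech(story)
        return {"caption": caption, "story": story, "audio": audio}


# Placeholder stages (hypothetical, for illustration only; no real models are loaded).
def stub_features(image: bytes) -> List[float]:
    return [len(image) / 100.0]


def stub_caption(features: List[float]) -> str:
    return "a dog runs on the beach"


def stub_story(caption: str) -> str:
    return f"Once upon a time, {caption}. The day had only begun."


def stub_speech(text: str) -> bytes:
    return text.encode("utf-8")  # stands in for synthesized audio


pipeline = NarrationPipeline(stub_features, stub_caption, stub_story, stub_speech)
result = pipeline.run(b"\x00" * 50)
print(result["caption"])
print(result["story"])
```

Keeping each stage behind a plain callable means a real encoder, decoder, language model, or speech engine could be swapped in without changing `run`, under the assumption that the stages communicate only through features, caption text, and story text.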

Image Captioning; CNN-LSTM; VGG16; GPT-2; Text-to-Speech (gTTS); Image-to-Story Generation; Natural Language Processing (NLP)

https://wjarr.com/sites/default/files/fulltext_pdf/WJARR-2025-1685.pdf

Ashwani Attri, Priyanka Gudeboyena, Vaishnavi Chigurla, Soumika Moluguri and Nithin Kasoju. Multimodal AI framework for image captioning, story generation and natural speech narration. World Journal of Advanced Research and Reviews, 2025, 26(2), 1037-1044. Article DOI: https://doi.org/10.30574/wjarr.2025.26.2.1685

Copyright © Author(s). All rights reserved. This article is published under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as appropriate credit is given to the original author(s) and source, a link to the license is provided, and any changes made are indicated.


All statements, opinions, and data contained in this publication are solely those of the individual author(s) and contributor(s). The journal, editors, reviewers, and publisher disclaim any responsibility or liability for the content, including accuracy, completeness, or any consequences arising from its use.
