AI-powered Storybook Generation from Images
This repository contains the implementation of TalesRunner, a multimodal system that generates a coherent storybook (PDF) from user-provided images. The project integrates image captioning, structured metadata extraction, and Transformer-based story generation to produce narrative paragraphs conditioned on the visual and contextual input.
- Overview
- Project Timeline
- Key Features
- System Architecture
- Implementation Details
- Model Training
- Demo
- Team
TalesRunner is a full-stack AI project that transforms visual input into text-based stories. Users upload images (up to 10), optionally provide additional scene information, and the system automatically:
- Generates captions using BLIP
- Merges captions with structured metadata
- Produces narrative paragraphs using a fine-tuned KoT5 language model
- Compiles images + text into a PDF storybook
The project focuses on building a practical multimodal pipeline using pre-trained models, fine-tuned LLMs, and a user-friendly demo interface.
Project timeline: Jan–Feb 2025 (5 weeks)
- Dataset construction and preprocessing
- BLIP caption generation for 50,001 images
- KoT5 fine-tuning with optimized decoding
- Evaluation with text-generation metrics
- Streamlit-based demo development
- End-to-end UI + inference pipeline integration
Key features:
- Multimodal story generation pipeline combining BLIP captions and KoT5 text generation
- Structured metadata extraction from AI Hub annotations
- Custom input format with special tokens to guide narrative generation
- Fine-tuned KoT5 model with Bayesian hyperparameter optimization
- Streamlit demo enabling interactive storybook creation
- PDF export for final story compilation
The pipeline runs in five steps (a minimal end-to-end sketch follows the list):
- Image Upload: The user provides 1–10 images in order.
- Captioning (BLIP): BLIP generates an initial natural-language caption for each image.
- Metadata Integration: User-provided fields and extracted annotations are combined with the BLIP captions.
- Story Generation (KoT5): The fine-tuned KoT5 model outputs a paragraph for each scene.
- PDF Assembly: Images and story paragraphs are compiled into a downloadable PDF.
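This sketch is illustrative rather than the shipped code: the BLIP checkpoint, the fine-tuned KoT5 checkpoint path, and the input string (task prefix and `<caption>`/`<name>` tokens) are assumptions; the actual input format is described under the implementation details below.

```python
# Minimal end-to-end sketch of the pipeline (illustrative, not the shipped app.py).
# The BLIP checkpoint, KoT5 checkpoint path, and input format are assumptions.
from PIL import Image
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          BlipForConditionalGeneration, BlipProcessor)

blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

kot5_tokenizer = AutoTokenizer.from_pretrained("checkpoints/kot5-talesrunner")  # hypothetical path
kot5_model = AutoModelForSeq2SeqLM.from_pretrained("checkpoints/kot5-talesrunner")

def caption_image(path: str) -> str:
    """Step 2: generate a BLIP caption for one uploaded image."""
    image = Image.open(path).convert("RGB")
    inputs = blip_processor(images=image, return_tensors="pt")
    out = blip_model.generate(**inputs, max_new_tokens=40)
    return blip_processor.decode(out[0], skip_special_tokens=True)

def generate_paragraph(model_input: str) -> str:
    """Step 4: generate one story paragraph from the caption + metadata string."""
    input_ids = kot5_tokenizer(model_input, return_tensors="pt").input_ids
    out = kot5_model.generate(input_ids, num_beams=3, max_new_tokens=256)
    return kot5_tokenizer.decode(out[0], skip_special_tokens=True)

# Steps 2-4 for a single scene; the real input format is described below.
caption = caption_image("scene_01.jpg")
paragraph = generate_paragraph(f"이야기 생성: <caption> {caption} <name> 토끼")
print(paragraph)
```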
Dataset and preprocessing:
- Source: AI Hub Fairy Tale Illustration Dataset (50,001 samples)
- Each sample includes:
  - an image (`.jpg`)
  - a metadata file (`.json`)
- BLIP generates captions for all images
- Annotation fields extracted:
  - Required: caption, name, i_action, classification
  - Optional: character, setting, emotion, causality, outcome, prediction
- Combined to create `dataset_train.csv` and `dataset_val.csv` (assembly sketched below)
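The merge into the two CSVs can be sketched as follows, assuming a directory layout, JSON field names, and a 90/10 split that are not specified in this README.

```python
# Sketch: merge BLIP captions with the AI Hub JSON annotations into the CSVs.
# Directory layout, JSON field names, and the split ratio are assumptions.
import json
from pathlib import Path
import pandas as pd

REQUIRED = ["caption", "name", "i_action", "classification"]
OPTIONAL = ["character", "setting", "emotion", "causality", "outcome", "prediction"]

captions: dict[str, str] = {}  # file stem -> BLIP caption, produced in the captioning step

def load_sample(json_path: Path) -> dict:
    """Read one annotation file and attach its BLIP caption."""
    meta = json.loads(json_path.read_text(encoding="utf-8"))
    row = {"image": json_path.stem + ".jpg",
           "blip_caption": captions.get(json_path.stem, "")}
    for field in REQUIRED + OPTIONAL:
        row[field] = meta.get(field, "")  # optional fields may be absent
    return row

rows = [load_sample(p) for p in sorted(Path("data/annotations").glob("*.json"))]
df = pd.DataFrame(rows)

train = df.sample(frac=0.9, random_state=42)  # 90/10 split is an assumption
val = df.drop(train.index)
train.to_csv("dataset_train.csv", index=False)
val.to_csv("dataset_val.csv", index=False)
```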
Model input format:
- Special tokens mark structured fields
- Required fields validated for completeness
- Optional fields replaced with `<empty>` if missing
- Field order randomized per sample to prevent positional bias
- Row-wise seed ensures reproducibility
- Task prefix added to guide KoT5 generation
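The exact special tokens are not listed here, so the sketch below uses hypothetical `<field>` tokens and a hypothetical task prefix to illustrate the rules above: required-field validation, `<empty>` placeholders, per-row shuffling with a row-wise seed, and the task prefix.

```python
# Sketch of the model-input formatting; special tokens and task prefix are hypothetical.
import random

REQUIRED = ["caption", "name", "i_action", "classification"]
OPTIONAL = ["character", "setting", "emotion", "causality", "outcome", "prediction"]
TASK_PREFIX = "이야기 생성: "  # assumed task prefix ("story generation")

def build_input(row: dict, row_idx: int) -> str:
    # Required fields are validated for completeness.
    for field in REQUIRED:
        if not row.get(field):
            raise ValueError(f"missing required field: {field}")

    # Optional fields fall back to <empty> when missing.
    fields = {f: row[f] for f in REQUIRED}
    fields.update({f: row.get(f) or "<empty>" for f in OPTIONAL})

    # Field order is shuffled per sample; seeding with the row index keeps it reproducible.
    order = list(fields)
    random.Random(row_idx).shuffle(order)

    return TASK_PREFIX + " ".join(f"<{name}> {fields[name]}" for name in order)

example = {"caption": "a rabbit runs through the forest", "name": "토끼",
           "i_action": "달리기", "classification": "동물"}
print(build_input(example, row_idx=0))
```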
Model selection:
- Baseline models reviewed: KoGPT-2, KoT5
- KoT5 selected due to stronger generalization and encoder–decoder flexibility
Tokenizer and embeddings:
- Added special tokens to the tokenizer vocabulary
- Aligned embedding matrix with extended vocabulary
- Applied masking so structural tokens do not affect attention scores
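A sketch of these tokenizer and embedding changes using the Hugging Face API; the token list and checkpoint path are placeholders, and the attention-mask trick at the end is only one possible reading of the masking step.

```python
# Sketch: register structural tokens, resize embeddings, and mask them out of attention.
# The token list and checkpoint path are placeholders.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/kot5-base")  # replace with the KoT5 checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/kot5-base")

special_tokens = ["<caption>", "<name>", "<i_action>", "<classification>", "<empty>"]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})

# Align the embedding matrix with the extended vocabulary.
model.resize_token_embeddings(len(tokenizer))

# One way to keep structural tokens from influencing attention scores (an assumption
# about the approach): zero their positions in the attention mask.
enc = tokenizer("<caption> a rabbit runs through the forest <empty>", return_tensors="pt")
special_ids = torch.tensor(tokenizer.convert_tokens_to_ids(special_tokens))
enc["attention_mask"][torch.isin(enc["input_ids"], special_ids)] = 0
```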
Training setup:
- Hyperparameter search using Bayesian Optimization
- Optimizer: AdamW
- Scheduler: Warmup + Linear decay
- Early stopping applied
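One way to realize this setup, sketched under assumptions: Optuna (whose default TPE sampler performs Bayesian-style optimization) drives the search, and a Hugging Face `Trainer` supplies AdamW, the warmup + linear-decay schedule, and early stopping. The search ranges, trial count, and tuned metric are guesses, and `model`, `train_dataset`, and `val_dataset` are assumed from the earlier steps.

```python
# Sketch: Bayesian-style hyperparameter search with Optuna around a Trainer run.
# Search ranges, trial count, and the tuned metric are assumptions; `model`,
# `train_dataset`, and `val_dataset` come from the preprocessing steps above.
import optuna
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

def objective(trial: optuna.Trial) -> float:
    args = TrainingArguments(
        output_dir=f"runs/trial-{trial.number}",
        learning_rate=trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True),
        per_device_train_batch_size=trial.suggest_categorical("batch_size", [8, 16, 32]),
        num_train_epochs=trial.suggest_int("epochs", 3, 10),
        warmup_ratio=trial.suggest_float("warmup_ratio", 0.0, 0.1),
        lr_scheduler_type="linear",      # warmup + linear decay
        optim="adamw_torch",             # AdamW
        eval_strategy="epoch",           # `evaluation_strategy` in older transformers
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
    )
    trainer.train()
    return trainer.evaluate()["eval_loss"]

study = optuna.create_study(direction="minimize")  # TPE sampler by default
study.optimize(objective, n_trials=20)
print(study.best_params)
```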
Best decoding parameters found during evaluation (applied in the sketch below):
- `num_beams = 3`
- `length_penalty = 0.8`
- `repetition_penalty = 1.5`
- `no_repeat_ngram_size = 3`
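These values plug directly into `generate()`; the sketch below assumes the fine-tuned model, tokenizer, and formatted input from the earlier steps, and the length cap is an assumption.

```python
# Sketch: apply the selected decoding parameters with the fine-tuned KoT5 model.
# `kot5_model`, `kot5_tokenizer`, and `model_input` come from the earlier sketches.
input_ids = kot5_tokenizer(model_input, return_tensors="pt").input_ids
output_ids = kot5_model.generate(
    input_ids,
    num_beams=3,
    length_penalty=0.8,
    repetition_penalty=1.5,
    no_repeat_ngram_size=3,
    max_new_tokens=256,  # length cap is an assumption
)
print(kot5_tokenizer.decode(output_ids[0], skip_special_tokens=True))
```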
Evaluation metrics:
- BERTScore
- METEOR
- CIDEr
- SPICE
KoT5 outperformed KoGPT-2 in narrative quality, coherence, and content relevance.
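BERTScore and METEOR can be computed with the Hugging Face `evaluate` library as sketched below; CIDEr and SPICE typically require the `pycocoevalcap` toolkit and are omitted here. The prediction/reference pair is a placeholder, not a project output.

```python
# Sketch: score generated paragraphs with BERTScore and METEOR via `evaluate`.
# CIDEr/SPICE (pycocoevalcap) are omitted; the example texts are placeholders.
import evaluate

predictions = ["토끼가 숲 속을 신나게 달려갑니다."]
references = ["토끼는 숲 속을 즐겁게 달리고 있었습니다."]

bertscore = evaluate.load("bertscore")
meteor = evaluate.load("meteor")

bs = bertscore.compute(predictions=predictions, references=references, lang="ko")
mt = meteor.compute(predictions=predictions, references=references)

print("BERTScore F1:", sum(bs["f1"]) / len(bs["f1"]))
print("METEOR:", mt["meteor"])
```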
A Streamlit demo provides an interactive interface for story generation.
- Image upload page
- Metadata auto-filling and keyword suggestion
- Real-time inference using the fine-tuned KoT5 model
- PDF generation
To run locally:
`streamlit run app.py`

A GPU is recommended because inference relies on large pre-trained models.
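For orientation, here is a minimal structural sketch of such a demo, assuming the `caption_image` and `generate_paragraph` helpers from the pipeline sketch above; the widget labels, the input string, and the PDF step (via `fpdf2` with a Korean-capable font) are illustrative rather than the actual `app.py`.

```python
# Sketch of a minimal Streamlit front end (illustrative; not the shipped app.py).
# Assumes caption_image() / generate_paragraph() from the pipeline sketch above;
# fpdf2 needs a Unicode (Korean-capable) TTF font registered for the story text.
import streamlit as st
from fpdf import FPDF

st.title("TalesRunner: AI Storybook Generator")
uploads = st.file_uploader("Upload 1-10 images", type=["jpg", "jpeg", "png"],
                           accept_multiple_files=True)
extra = st.text_input("Optional scene information (characters, setting, ...)")

if uploads and st.button("Generate storybook"):
    pdf = FPDF()
    pdf.add_font("korean", fname="NanumGothic.ttf")  # font path is an assumption
    for i, upload in enumerate(uploads[:10], start=1):
        path = f"scene_{i:02d}.jpg"
        with open(path, "wb") as f:
            f.write(upload.getbuffer())
        caption = caption_image(path)
        model_input = f"이야기 생성: <caption> {caption} <setting> {extra or '<empty>'}"
        paragraph = generate_paragraph(model_input)
        # One page per scene: image on top, generated paragraph below.
        pdf.add_page()
        pdf.image(path, x=10, y=10, w=120)
        pdf.set_font("korean", size=12)
        pdf.set_y(120)
        pdf.multi_cell(0, 8, paragraph)
        st.image(path, caption=paragraph)
    pdf.output("storybook.pdf")
    with open("storybook.pdf", "rb") as f:
        st.download_button("Download PDF", f, file_name="storybook.pdf")
```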
Team:
- Doeun Kim — Dataset construction, KoT5 fine-tuning, model training & validation
- Yujin Shin — Annotation preprocessing, inference pipeline, Streamlit demo
- Junga Woo — Baseline model experiments (KoGPT/KoT5), decoding parameter search
- Soobin Cha (PM) — Project management, model baselines, inference UI