🔮 TalesRunner

AI-powered Storybook Generation from Images

This repository contains the implementation of TalesRunner, a multimodal system that generates a coherent storybook (PDF) from user-provided images. The project integrates image captioning, structured metadata extraction, and Transformer-based story generation to produce narrative paragraphs conditioned on the visual and contextual input.

Table of Contents

  1. Overview
  2. Project Timeline
  3. Key Features
  4. System Architecture
  5. Implementation Details
  6. Model Training
  7. Demo
  8. Team

Overview

TalesRunner is a full-stack AI project that transforms visual input into text-based stories. Users upload images (up to 10), optionally provide additional scene information, and the system automatically:

  1. Generates captions using BLIP
  2. Merges captions with structured metadata
  3. Produces narrative paragraphs using a fine-tuned KoT5 language model
  4. Compiles images + text into a PDF storybook

The project focuses on building a practical multimodal pipeline using pre-trained models, fine-tuned LLMs, and a user-friendly demo interface.

Project Timeline

Jan–Feb 2025 (5 weeks)

  • Dataset construction and preprocessing
  • BLIP caption generation for 50,001 images
  • KoT5 fine-tuning with optimized decoding
  • Evaluation with text-generation metrics
  • Streamlit-based demo development
  • End-to-end UI + inference pipeline integration

Key Features

  • Multimodal story generation pipeline combining BLIP captions and KoT5 text generation
  • Structured metadata extraction from AI Hub annotations
  • Custom input format with special tokens to guide narrative generation
  • Fine-tuned KoT5 model with Bayesian hyperparameter optimization
  • Streamlit demo enabling interactive storybook creation
  • PDF export for final story compilation

System Architecture

High-level Flow

  1. Image Upload: the user provides 1–10 images in order.
  2. Captioning (BLIP): BLIP generates an initial natural-language caption for each image.
  3. Metadata Integration: user-provided fields and extracted annotations are combined with the BLIP captions.
  4. Story Generation (KoT5): the fine-tuned KoT5 model outputs a paragraph for each scene.
  5. PDF Assembly: images and story paragraphs are compiled into a downloadable PDF.
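The five stages above can be sketched as a thin pipeline. All helper names here (generate_caption, build_model_input, generate_paragraph, make_storybook) are illustrative placeholders, not the repository's actual API:

```python
# Minimal pipeline sketch; each helper is a stand-in for the real
# BLIP / KoT5 / PDF components in this repository.

def generate_caption(image_path: str) -> str:
    # Placeholder for BLIP captioning.
    return f"a caption for {image_path}"

def build_model_input(caption: str, metadata: dict) -> str:
    # Placeholder for merging the caption with structured metadata.
    fields = " ".join(f"<{k}> {v}" for k, v in metadata.items())
    return f"generate story: <caption> {caption} {fields}"

def generate_paragraph(model_input: str) -> str:
    # Placeholder for fine-tuned KoT5 inference.
    return f"Once upon a time... ({model_input[:30]}...)"

def make_storybook(images: list[str], metadata: list[dict]) -> list[tuple[str, str]]:
    # One (image, paragraph) pair per scene; PDF assembly consumes these.
    assert 1 <= len(images) <= 10, "the system accepts 1-10 images"
    pages = []
    for img, meta in zip(images, metadata):
        caption = generate_caption(img)
        paragraph = generate_paragraph(build_model_input(caption, meta))
        pages.append((img, paragraph))
    return pages

pages = make_storybook(["scene1.jpg"], [{"name": "Mina", "setting": "forest"}])
```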

Implementation Details

Dataset Construction

  • Source: AI Hub Fairy Tale Illustration Dataset (50,001 samples)

  • Each sample includes:

    • an image (.jpg)
    • a metadata file (.json)
  • BLIP generates captions for all images

  • Annotation fields extracted:

    • Required: caption, name, i_action, classification
    • Optional: character, setting, emotion, causality, outcome, prediction
  • Combined to create:

    • dataset_train.csv
    • dataset_val.csv
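As a rough illustration, one (image, JSON) sample might be flattened into a CSV row as below. The field lists follow the annotation fields above, but the exact schema of the AI Hub JSON is an assumption:

```python
import csv
import io
import json

REQUIRED = ["caption", "name", "i_action", "classification"]
OPTIONAL = ["character", "setting", "emotion", "causality", "outcome", "prediction"]

def make_row(blip_caption: str, annotation_json: str) -> dict:
    """Flatten one sample into a CSV row (JSON schema is an assumption)."""
    meta = json.loads(annotation_json)
    row = {"caption": blip_caption}  # caption comes from BLIP, not the JSON
    for field in REQUIRED[1:]:
        if field not in meta:
            raise ValueError(f"missing required field: {field}")
        row[field] = meta[field]
    for field in OPTIONAL:
        row[field] = meta.get(field, "")  # left blank if absent
    return row

sample = json.dumps({"name": "Mina", "i_action": "running",
                     "classification": "fantasy", "setting": "forest"})
row = make_row("a girl running through a forest", sample)

# Write a dataset_train.csv-style file in memory.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=REQUIRED + OPTIONAL)
writer.writeheader()
writer.writerow(row)
```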

Input Encoding

  • Special tokens mark structured fields
  • Required fields validated for completeness
  • Optional fields replaced with <empty> if missing
  • Field order randomized per sample to prevent positional bias
  • Row-wise seed ensures reproducibility
  • Task prefix added to guide KoT5 generation
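The encoding rules above can be sketched in plain Python. The special-token spelling (`<name>`, `<empty>`, …) and the task prefix string are assumptions, since the README does not list the exact vocabulary:

```python
import random

REQUIRED = ["caption", "name", "i_action", "classification"]
OPTIONAL = ["character", "setting", "emotion", "causality", "outcome", "prediction"]
TASK_PREFIX = "generate story: "  # assumed prefix text

def encode_input(sample: dict, row_index: int) -> str:
    # 1) Required fields are validated for completeness.
    for field in REQUIRED:
        if not sample.get(field):
            raise ValueError(f"missing required field: {field}")
    # 2) Missing optional fields fall back to the <empty> token.
    parts = {f: sample[f] for f in REQUIRED}
    parts.update({f: sample.get(f) or "<empty>" for f in OPTIONAL})
    # 3) Field order is randomized per sample, seeded by the row index
    #    so the shuffle is reproducible.
    fields = list(parts)
    random.Random(row_index).shuffle(fields)
    # 4) Each field is marked with a special token; the task prefix leads.
    body = " ".join(f"<{f}> {parts[f]}" for f in fields)
    return TASK_PREFIX + body

enc = encode_input(
    {"caption": "a girl in a forest", "name": "Mina",
     "i_action": "running", "classification": "fantasy"},
    row_index=7,
)
```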

Model Training

Model Choices

  • Baseline models reviewed: KoGPT-2, KoT5
  • KoT5 selected due to stronger generalization and encoder–decoder flexibility

Tokenizer & Model Customization

  • Added special tokens to tokenizer vocab
  • Aligned embedding matrix with extended vocabulary
  • Applied masking so structural tokens do not affect attention scores
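In Hugging Face transformers, the first two steps are typically `tokenizer.add_special_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. The underlying mechanics — extend the vocabulary, then grow the embedding matrix to stay aligned — can be sketched with a toy vocab and NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocab and embedding matrix standing in for the KoT5 tokenizer/model.
vocab = {"<pad>": 0, "안녕": 1, "이야기": 2}
emb = rng.normal(size=(len(vocab), 8))  # shape: (vocab_size, hidden_dim)

def add_special_tokens(vocab, emb, tokens):
    """Append unseen tokens and grow the embedding matrix to match."""
    new = [t for t in tokens if t not in vocab]
    for t in new:
        vocab[t] = len(vocab)  # next free id
    extra = rng.normal(size=(len(new), emb.shape[1]))  # randomly initialized rows
    return vocab, np.vstack([emb, extra])

special = ["<caption>", "<name>", "<i_action>", "<empty>"]
vocab, emb = add_special_tokens(vocab, emb, special)
```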

Training & Optimization

  • Hyperparameter search using Bayesian Optimization
  • Optimizer: AdamW
  • Scheduler: Warmup + Linear decay
  • Early stopping applied
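The warmup-plus-linear-decay schedule is just a function of the step count; in PyTorch it would typically be a `LambdaLR` wrapped around `AdamW`. The shape, with an illustrative base learning rate (the actual value came from the Bayesian search), looks like:

```python
def lr_lambda(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplicative LR factor: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 3e-4  # illustrative; the real value was tuned by Bayesian optimization
schedule = [base_lr * lr_lambda(s, warmup_steps=10, total_steps=100)
            for s in range(100)]
```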

Decoding Optimization

Best decoding parameters found during evaluation:

num_beams = 3
length_penalty = 0.8
repetition_penalty = 1.5
no_repeat_ngram_size = 3
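Collected as keyword arguments for Hugging Face's `generate()`, these become the following. The `max_new_tokens` cap is an assumption (the README does not state one), and the call itself is only sketched:

```python
# Decoding parameters reported above, as generate() keyword arguments.
GEN_KWARGS = dict(
    num_beams=3,              # beam search width
    length_penalty=0.8,       # < 1.0 slightly favors shorter sequences
    repetition_penalty=1.5,   # penalize re-emitting recent tokens
    no_repeat_ngram_size=3,   # forbid repeating any 3-gram
    max_new_tokens=200,       # assumed cap, not stated in the README
)

# Usage (sketch): output_ids = model.generate(**inputs, **GEN_KWARGS)
```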

Evaluation Metrics Used

  • BERTScore
  • METEOR
  • CIDEr
  • SPICE

KoT5 outperformed KoGPT-2 in narrative quality, coherence, and content relevance.

Demo

A Streamlit demo provides an interactive interface for story generation.

Demo Features

  • Image upload page
  • Metadata auto-filling and keyword suggestion
  • Real-time inference using the fine-tuned KoT5 model
  • PDF generation

To run locally:

streamlit run app.py

A GPU is recommended, since inference relies on large pre-trained models.

Team

  • Doeun Kim — Dataset construction, KoT5 fine-tuning, model training & validation
  • Yujin Shin — Annotation preprocessing, inference pipeline, Streamlit demo
  • Junga Woo — Baseline model experiments (KoGPT/KoT5), decoding parameter search
  • Soobin Cha (PM) — Project management, model baselines, inference UI
