
AfterImage

Synthetic conversational dataset generation for LLM fine-tuning.

Generate multi-turn chat data, DPO preference pairs, and structured outputs — from a single YAML file or a composable Python API.



Demo — Credit Risk Management Q&A Bot: AfterImage simulates both sides of the conversation, generating a document-grounded Q&A dataset from BIS credit risk principles in ShareGPT format.


News

April 23, 2026 — OpenSimula

OpenSimula is an experimental, open implementation of mechanism-design ideas from Simula (Davidson et al., TMLR; see also Google’s research blog on the framing). It covers LLM-built factor taxonomies, weighted mix sampling over those factors, meta-prompt diversification (with optional complexification), requirement critics with refinement, and an independent double-critic gate for verifiable multiple-choice items. Checkpoints live under an opensimula/ subtree (manifest, taxonomy bundle, sampling strategy); you can stream datapoints to JSONL, hook GenerationMonitor into OpenSimula, or bridge scenarios into ConversationGenerator via SimulaInstructionGeneratorCallback.

This module is not affiliated with Google and is not a reference port of internal systems — it is an independent take on the published Simula recipe.

Try it: walkthrough and CLI notes in examples/simula/README.md, scripts in examples/simula/, package overview in afterimage/simula/README.md. Narrative + monitoring notes: OpenSimula · autodoc: Simula / OpenSimula API.


Why AfterImage

Fine-tuning a model requires data. Real conversations are slow to collect, expensive to label, and almost never domain-specific enough.

AfterImage flips the problem: you define what the data should look like, and it generates it for you using any LLM you already have access to.

Your documents  +  LLM  →  Realistic, diverse, quality-filtered training data

What you get:

  • Multi-turn conversations that read like real interactions — not templated Q&A pairs
  • Document-grounded datasets tied to your corpus (RAG-style)
  • DPO / RLHF preference pairs without a single manual label
  • Data already formatted for the training framework you use

Features

| Category | What's included |
|---|---|
| Generation | Multi-turn chat · Document-grounded QA · Persona-driven diversity · Structured output · Tool-calling |
| Preference Data | DPO · RLHF · UltraFeedback · Anthropic HH · ORPO |
| Quality | LLM-as-judge · Embedding-based metrics · Auto-improve retries · Composite scoring |
| Providers | Gemini · OpenAI · DeepSeek · OpenRouter · Local (vLLM / Ollama / llama.cpp) |
| Export | ShareGPT · Alpaca · Messages · LLaMA Factory · Oumi · OpenAI fine-tune · DPO · Raw |
| Storage | JSONL (default) · SQLite · PostgreSQL · MySQL |
| Scale | Async-first · Concurrent generation · Smart API key rotation with rate limiting |
| Observability | Real-time metrics · Configurable alerts · HTML analytics reports |
| Interface | CLI · Python API · FastAPI REST server · Gradio demo UI |

Installation

If you want an agent to handle installation for you, copy and paste the following prompt to it:

Read https://afterimage.altai.dev/llms.txt and follow it for installing AfterImage, documentation links, and examples.

If you are doing it yourself:

pip install afterimage

# or with uv (recommended)
uv add afterimage

Requires Python 3.11+

Optional extras:

| Extra | What it adds |
|---|---|
| embeddings-local | Local embeddings via sentence-transformers for Qdrant workflows and embedding-based quality checks |
| server | FastAPI REST server (afterimage-server CLI entry point) |
| training | PyTorch / TRL stack, Gradio UI, and training scripts under examples/ |

pip install "afterimage[server]"
pip install "afterimage[embeddings-local,server,training]"

Quickstart — CLI

Set your API key and run one command:

export GEMINI_API_KEY=your_key_here
afterimage generate -c examples/configs/basic.yaml

Preview the plan without spending any API credits:

afterimage generate -c examples/configs/basic.yaml --dry-run

Export to your training framework:

# List all available formats
afterimage export --list-formats

# Export to multiple formats in one shot
afterimage export -i output/dataset.jsonl -f sharegpt -f messages -f alpaca

# Create a train/val split automatically
afterimage export -i output/dataset.jsonl -f messages --split 0.9

# Push directly to Hugging Face Hub
afterimage push -c your_config.yaml --repo-id your-org/your-dataset
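The --split 0.9 flag above partitions the dataset into train and validation sets. As a rough sketch of what a 90/10 split does (this mirrors the idea with a deterministic shuffle; AfterImage's actual splitting logic may differ):

```python
import random

def split_jsonl(records, train_fraction=0.9, seed=42):
    """Shuffle records deterministically, then cut into train/val lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Illustrative records standing in for lines of output/dataset.jsonl
records = [{"id": i, "messages": []} for i in range(10)]
train, val = split_jsonl(records, train_fraction=0.9)
print(len(train), len(val))  # 9 1
```

A fixed seed makes the split reproducible across runs, which matters when you re-export the same dataset.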

Generate DPO preference pairs:

afterimage preference -c your_config.yaml

Analyze your dataset:

afterimage analyze -i output/dataset.jsonl -o report.html

Quickstart — Python API

The CLI is powered by the same composable Python API. Drop into it whenever you need a custom pipeline.

Minimal conversation generation:

import asyncio
import os
from afterimage import ConversationGenerator

async def main():
    gen = ConversationGenerator(
        respondent_prompt="You are a helpful AI assistant. Answer clearly and concisely.",
        api_key=os.environ["GEMINI_API_KEY"],
        model_name="gemini-2.5-flash",
    )
    await gen.generate(num_dialogs=50, max_turns=4, max_concurrency=5)
    print(f"Generated {len(gen.load_conversations())} conversations.")

asyncio.run(main())

Document-grounded generation with personas:

import asyncio
import os
from afterimage import (
    ConversationGenerator,
    PersonaGenerator,
    PersonaInstructionGeneratorCallback,
    InMemoryDocumentProvider,
    WithContextRespondentPromptModifier,
)

DOCUMENTS = [
    "Pour-over coffee is brewed by pouring hot water over grounds through a filter. "
    "Key variables are grind size, water temperature (90–96 °C), and pour rate.",
    "Espresso is brewed at 9 bar pressure through finely-ground beans. "
    "It is the base for lattes, cappuccinos, and macchiatos.",
]

async def main():
    api_key = os.environ["GEMINI_API_KEY"]
    docs = InMemoryDocumentProvider(DOCUMENTS)

    # Generate diverse user personas from your documents
    persona_gen = PersonaGenerator(api_key=api_key)
    await persona_gen.generate_from_documents(docs)

    gen = ConversationGenerator(
        respondent_prompt="You are a coffee expert. Answer questions based on the provided context.",
        api_key=api_key,
        model_name="gemini-2.5-flash",
        instruction_generator_callback=PersonaInstructionGeneratorCallback(
            api_key=api_key,
            documents=docs,
            num_random_contexts=1,
        ),
        respondent_prompt_modifier=WithContextRespondentPromptModifier(),
    )

    await gen.generate(num_dialogs=100, max_turns=3, max_concurrency=5)

asyncio.run(main())

Generate DPO preference pairs:

import asyncio
import os
from afterimage import ConversationGenerator
from afterimage.preference.generator import PreferenceGenerator
from afterimage.evaluator import ConversationJudge

async def main():
    api_key = os.environ["GEMINI_API_KEY"]

    base_gen = ConversationGenerator(
        respondent_prompt="You are a helpful assistant.",
        api_key=api_key,
        model_name="gemini-2.5-flash",
    )

    judge = ConversationJudge(api_key=api_key, model_name="gemini-2.5-flash")

    pref_gen = PreferenceGenerator(conversation_generator=base_gen, judge=judge)
    await pref_gen.generate(num_pairs=200, max_concurrency=4)

asyncio.run(main())

More complete examples live under examples/. Full API reference is at afterimage.altai.dev.


Supported LLM Providers

| Provider | provider key | Model examples | Notes |
|---|---|---|---|
| Google Gemini | gemini | gemini-2.5-flash, gemini-2.0-flash | Default in CLI configs |
| OpenAI | openai | gpt-4o, gpt-4o-mini | Full API support |
| DeepSeek | deepseek | deepseek-chat, deepseek-reasoner | Captures chain-of-thought reasoning |
| OpenRouter | openrouter | Any model via OpenRouter | Access 100+ models with one key |
| Local | local | Any OpenAI-compatible server | vLLM, Ollama, llama.cpp — zero API cost |

Providers can be mixed freely — use a fast/cheap model to simulate the user (correspondent) and a stronger model to generate answers (respondent).

Scale beyond rate limits with SmartKeyPool — automatic key rotation across concurrent requests:

from afterimage.key_management import SmartKeyPool
from afterimage import ConversationGenerator

pool = SmartKeyPool(["key_1", "key_2", "key_3"])

gen = ConversationGenerator(
    respondent_prompt="You are a helpful assistant.",
    api_key=pool,
    model_name="gemini-2.5-flash",
)

Export Formats

One command converts your raw JSONL to any fine-tuning format:

| Format | --format flag | Target framework |
|---|---|---|
| ShareGPT | sharegpt | LLaMA Factory · FastChat · Axolotl |
| Alpaca | alpaca | LLaMA Factory · many community trainers |
| HuggingFace Messages | messages | TRL SFTTrainer · HuggingFace ecosystem |
| LLaMA Factory | llama_factory | LLaMA Factory native format |
| Oumi | oumi | Oumi training framework |
| OpenAI Fine-tune | openai_finetune | OpenAI fine-tuning API |
| DPO | dpo | TRL DPOTrainer · preference training |
| Raw | raw | Custom pipelines — minimal processing |

# Export and split into train/val in one shot
afterimage export -i output/dataset.jsonl -f sharegpt -f messages --split 0.9
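For orientation, the ShareGPT convention (as consumed by LLaMA Factory and Axolotl) wraps each dialog in a conversations list of from/value turns. The record below shows that shape, not AfterImage's exact exporter output:

```python
import json

# One dialog in ShareGPT form: alternating "human" / "gpt" turns.
sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "What water temperature is best for pour-over?"},
        {"from": "gpt", "value": "Around 90-96 °C works well for most roasts."},
    ]
}

# Exporters typically write one such record per line (JSONL).
print(json.dumps(sharegpt_record, ensure_ascii=False))
```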

How AfterImage Works

AfterImage runs a two-agent loop per dialog:

AfterImage Dialog-Level Workflow

  1. Correspondent generates user questions — driven by personas, document context, or custom instruction callbacks
  2. Respondent answers — with optional RAG context injected per turn
  3. Quality gate scores each dialog using LLM-as-judge + embedding metrics; retries below-threshold dialogs automatically
  4. Storage writes each dialog incrementally — crash-safe, resumable
  5. Export converts the raw JSONL to any training format in a single CLI command
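The per-dialog loop can be sketched with stub agents in place of real LLM calls (a simplified illustration of the flow above — the actual implementation is async and streams to storage incrementally):

```python
def run_dialog(correspondent, respondent, judge,
               max_turns=3, threshold=0.5, max_retries=2):
    """Sketch of the two-agent loop: alternate question/answer turns,
    then gate the dialog on a judge score, retrying if it falls short."""
    for _ in range(max_retries + 1):
        messages = []
        for _ in range(max_turns):
            question = correspondent(messages)   # simulated user turn
            messages.append({"role": "user", "content": question})
            answer = respondent(messages)        # assistant turn
            messages.append({"role": "assistant", "content": answer})
        if judge(messages) >= threshold:         # quality gate
            return messages
    return None  # dialog discarded after exhausting retries

# Stub agents standing in for real LLM calls
dialog = run_dialog(
    correspondent=lambda msgs: f"Question {len(msgs) // 2 + 1}?",
    respondent=lambda msgs: "An answer.",
    judge=lambda msgs: 1.0,
    max_turns=2,
)
print(len(dialog))  # 4 messages: 2 user + 2 assistant
```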

Configuration Reference

The fastest path to generation is a YAML config:

# examples/configs/basic.yaml

generation:
  num_dialogs: 100
  max_turns: 4
  max_concurrency: 5

model:
  provider: gemini              # gemini | openai | deepseek | openrouter | local
  model_name: gemini-2.5-flash
  api_key_env: GEMINI_API_KEY   # environment variable name

respondent:
  system_prompt: |
    You are an expert assistant. Answer clearly and concisely.

# Optional: document grounding (RAG)
# documents:
#   provider: directory         # directory | file | jsonl | memory | qdrant
#   path: ./my_docs/

# Optional: persona diversity
# personas:
#   enabled: true

# Optional: context-grounded instruction generation
# context:
#   enabled: true
#   num_random_contexts: 2

# Optional: quality gate
# quality:
#   auto_improve: true

output:
  path: ./output/dataset.jsonl
  storage: jsonl                # jsonl | sql

# Validate config before running
afterimage validate -c examples/configs/basic.yaml

# Run
afterimage generate -c examples/configs/basic.yaml
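The validate command checks the config before any API call is made. A minimal sketch of that kind of check, mirroring the YAML above as a Python dict (AfterImage's real schema and error messages may differ):

```python
# Python dict mirroring examples/configs/basic.yaml
config = {
    "generation": {"num_dialogs": 100, "max_turns": 4, "max_concurrency": 5},
    "model": {"provider": "gemini", "model_name": "gemini-2.5-flash",
              "api_key_env": "GEMINI_API_KEY"},
    "respondent": {"system_prompt": "You are an expert assistant."},
    "output": {"path": "./output/dataset.jsonl", "storage": "jsonl"},
}

def validate(cfg):
    """Collect structural problems instead of failing on the first one."""
    errors = []
    for section in ("generation", "model", "respondent", "output"):
        if section not in cfg:
            errors.append(f"missing section: {section}")
    if cfg.get("model", {}).get("provider") not in (
        "gemini", "openai", "deepseek", "openrouter", "local"
    ):
        errors.append("model.provider must be a supported provider")
    if cfg.get("generation", {}).get("num_dialogs", 0) <= 0:
        errors.append("generation.num_dialogs must be positive")
    return errors

print(validate(config))  # []
```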

All YAML options and their defaults are documented at afterimage.altai.dev.


Repository Layout

afterimage/              Core library
├── providers/           LLM, document, and embedding providers
├── callbacks/           Instruction generators, stopping criteria, prompt modifiers
├── evaluation/          LLM-as-judge and embedding-based evaluators
├── preference/          DPO / RLHF preference pair generation
├── integrations/        Export format adapters (ShareGPT, Alpaca, Messages, …)
├── analytics/           Dataset analytics engine and HTML report generator
└── server/              FastAPI REST server with SSE progress streaming

examples/
├── configs/             Ready-to-run YAML configs (basic, RAG, local, budget)
├── caselaw_rag/         Qdrant + HF CAP embeddings tutorial (index + generate)
├── demo_ui/             Gradio web UI — interactive generation + fine-tuning
└── *.py                 Python API usage examples

docs/                    Sphinx sources (hosted at afterimage.altai.dev)
tests/                   pytest suite — Python 3.11, 3.12, 3.13

Contributing

Contributions are welcome. Read DESIGN.md for architecture notes before opening a large PR.

# Clone and install all extras
git clone https://github.com/altaidevorg/afterimage
cd afterimage
uv sync --all-extras

# Run the test suite
pytest

# Check style
ruff check .
ruff format .

Open an issue before submitting significant changes — it helps align on design direction early and avoids wasted effort.


License

Apache License 2.0


Built by Altai Dev · Documentation · PyPI · Blog
