
AfterImage

Synthetic conversational dataset generation for LLM fine-tuning.

Generate multi-turn chat data, DPO preference pairs, and structured outputs — from a single YAML file or a composable Python API.



Demo — Credit Risk Management Q&A Bot: AfterImage simulates both sides of the conversation, generating a document-grounded Q&A dataset from BIS credit risk principles in ShareGPT format.


News

April 23, 2026 — OpenSimula

OpenSimula is an experimental, open implementation of mechanism-design ideas from Simula (Davidson et al., TMLR; see also Google’s research blog on the framing). It covers LLM-built factor taxonomies, weighted mix sampling over those factors, meta-prompt diversification (with optional complexification), requirement critics with refinement, and an independent double-critic gate for verifiable multiple-choice items. Checkpoints live under an opensimula/ subtree (manifest, taxonomy bundle, sampling strategy); you can stream datapoints to JSONL, hook GenerationMonitor into OpenSimula, or bridge scenarios into ConversationGenerator via SimulaInstructionGeneratorCallback.

This module is not affiliated with Google and is not a reference port of internal systems — it is an independent take on the published Simula recipe.

Try it: walkthrough and CLI notes in examples/simula/README.md, scripts in examples/simula/, package overview in afterimage/simula/README.md. Narrative + monitoring notes: OpenSimula · autodoc: Simula / OpenSimula API.


Why AfterImage

Fine-tuning a model requires data. Real conversations are slow to collect, expensive to label, and almost never domain-specific enough.

AfterImage flips the problem: you define what the data should look like, and it generates it for you using any LLM you already have access to.

Your documents  +  LLM  →  Realistic, diverse, quality-filtered training data

What you get:

  • Multi-turn conversations that read like real interactions — not templated Q&A pairs
  • Document-grounded datasets tied to your corpus (RAG-style)
  • DPO / RLHF preference pairs without a single manual label
  • Data already formatted for the training framework you use

Features

| Category | What's included |
|---|---|
| Generation | Multi-turn chat · Document-grounded QA · Persona-driven diversity · Structured output · Tool-calling |
| Preference Data | DPO · RLHF · UltraFeedback · Anthropic HH · ORPO |
| Quality | LLM-as-judge · Embedding-based metrics · Auto-improve retries · Composite scoring |
| Providers | Gemini · OpenAI · DeepSeek · OpenRouter · Local (vLLM / Ollama / llama.cpp) |
| Export | ShareGPT · Alpaca · Messages · LLaMA Factory · Oumi · OpenAI fine-tune · DPO · Raw |
| Storage | JSONL (default) · SQLite · PostgreSQL · MySQL |
| Scale | Async-first · Concurrent generation · Smart API key rotation with rate limiting |
| Observability | Real-time metrics · Configurable alerts · HTML analytics reports |
| Interface | CLI · Python API · FastAPI REST server · Gradio demo UI |

Installation

If you want an agent to handle installation for you, copy and paste the following prompt to it:

Read https://afterimage.altai.dev/llms.txt and follow it for installing AfterImage, documentation links, and examples.

If you are doing it yourself:

pip install afterimage

# or with uv (recommended)
uv add afterimage

Requires Python 3.11+

Optional extras:

| Extra | What it adds |
|---|---|
| embeddings-local | Local embeddings via sentence-transformers for Qdrant workflows and embedding-based quality checks |
| server | FastAPI REST server (afterimage-server CLI entry point) |
| training | PyTorch / TRL stack, Gradio UI, and training scripts under examples/ |

pip install "afterimage[server]"
pip install "afterimage[embeddings-local,server,training]"

Quickstart — CLI

Set your API key and run one command:

export GEMINI_API_KEY=your_key_here
afterimage generate -c examples/configs/basic.yaml

Preview the plan without spending any API credits:

afterimage generate -c examples/configs/basic.yaml --dry-run

Export to your training framework:

# List all available formats
afterimage export --list-formats

# Export to multiple formats in one shot
afterimage export -i output/dataset.jsonl -f sharegpt -f messages -f alpaca

# Create a train/val split automatically
afterimage export -i output/dataset.jsonl -f messages --split 0.9

# Push directly to Hugging Face Hub
afterimage push -c your_config.yaml --repo-id your-org/your-dataset
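The --split 0.9 flag above partitions the dataset into train and validation sets. As a rough sketch of what a 90/10 split does (this mirrors the idea with a deterministic shuffle; AfterImage's actual splitting logic may differ):

```python
import random

def split_jsonl(records, train_fraction=0.9, seed=42):
    """Shuffle records deterministically, then cut into train/val lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Illustrative records standing in for lines of output/dataset.jsonl
records = [{"id": i, "messages": []} for i in range(10)]
train, val = split_jsonl(records, train_fraction=0.9)
print(len(train), len(val))  # 9 1
```

A fixed seed makes the split reproducible across runs, which matters when you re-export the same dataset.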

Generate DPO preference pairs:

afterimage preference -c your_config.yaml

Analyze your dataset:

afterimage analyze -i output/dataset.jsonl -o report.html

Quickstart — Python API

The CLI is powered by the same composable Python API. Drop into it whenever you need a custom pipeline.

Minimal conversation generation:

import asyncio
import os
from afterimage import ConversationGenerator

async def main():
    gen = ConversationGenerator(
        respondent_prompt="You are a helpful AI assistant. Answer clearly and concisely.",
        api_key=os.environ["GEMINI_API_KEY"],
        model_name="gemini-2.5-flash",
    )
    await gen.generate(num_dialogs=50, max_turns=4, max_concurrency=5)
    print(f"Generated {len(gen.load_conversations())} conversations.")

asyncio.run(main())

Document-grounded generation with personas:

import asyncio
import os
from afterimage import (
    ConversationGenerator,
    PersonaGenerator,
    PersonaInstructionGeneratorCallback,
    InMemoryDocumentProvider,
    WithContextRespondentPromptModifier,
)

DOCUMENTS = [
    "Pour-over coffee is brewed by pouring hot water over grounds through a filter. "
    "Key variables are grind size, water temperature (90–96 °C), and pour rate.",
    "Espresso is brewed at 9 bar pressure through finely-ground beans. "
    "It is the base for lattes, cappuccinos, and macchiatos.",
]

async def main():
    api_key = os.environ["GEMINI_API_KEY"]
    docs = InMemoryDocumentProvider(DOCUMENTS)

    # Generate diverse user personas from your documents
    persona_gen = PersonaGenerator(api_key=api_key)
    await persona_gen.generate_from_documents(docs)

    gen = ConversationGenerator(
        respondent_prompt="You are a coffee expert. Answer questions based on the provided context.",
        api_key=api_key,
        model_name="gemini-2.5-flash",
        instruction_generator_callback=PersonaInstructionGeneratorCallback(
            api_key=api_key,
            documents=docs,
            num_random_contexts=1,
        ),
        respondent_prompt_modifier=WithContextRespondentPromptModifier(),
    )

    await gen.generate(num_dialogs=100, max_turns=3, max_concurrency=5)

asyncio.run(main())

Generate DPO preference pairs:

import asyncio
import os
from afterimage import ConversationGenerator
from afterimage.preference.generator import PreferenceGenerator
from afterimage.evaluator import ConversationJudge

async def main():
    api_key = os.environ["GEMINI_API_KEY"]

    base_gen = ConversationGenerator(
        respondent_prompt="You are a helpful assistant.",
        api_key=api_key,
        model_name="gemini-2.5-flash",
    )

    judge = ConversationJudge(api_key=api_key, model_name="gemini-2.5-flash")

    pref_gen = PreferenceGenerator(conversation_generator=base_gen, judge=judge)
    await pref_gen.generate(num_pairs=200, max_concurrency=4)

asyncio.run(main())

More complete examples live under examples/. Full API reference is at afterimage.altai.dev.


Supported LLM Providers

| Provider | provider key | Model examples | Notes |
|---|---|---|---|
| Google Gemini | gemini | gemini-2.5-flash, gemini-2.0-flash | Default in CLI configs |
| OpenAI | openai | gpt-4o, gpt-4o-mini | Full API support |
| DeepSeek | deepseek | deepseek-chat, deepseek-reasoner | Captures chain-of-thought reasoning |
| OpenRouter | openrouter | Any model via OpenRouter | Access 100+ models with one key |
| Local | local | Any OpenAI-compatible server | vLLM, Ollama, llama.cpp — zero API cost |

Providers can be mixed freely — use a fast/cheap model to simulate the user (correspondent) and a stronger model to generate answers (respondent).

Scale beyond rate limits with SmartKeyPool — automatic key rotation across concurrent requests:

from afterimage.key_management import SmartKeyPool
from afterimage import ConversationGenerator

pool = SmartKeyPool(["key_1", "key_2", "key_3"])

gen = ConversationGenerator(
    respondent_prompt="You are a helpful assistant.",
    api_key=pool,
    model_name="gemini-2.5-flash",
)

Export Formats

One command converts your raw JSONL to any fine-tuning format:

| Format | --format flag | Target framework |
|---|---|---|
| ShareGPT | sharegpt | LLaMA Factory · FastChat · Axolotl |
| Alpaca | alpaca | LLaMA Factory · many community trainers |
| HuggingFace Messages | messages | TRL SFTTrainer · HuggingFace ecosystem |
| LLaMA Factory | llama_factory | LLaMA Factory native format |
| Oumi | oumi | Oumi training framework |
| OpenAI Fine-tune | openai_finetune | OpenAI fine-tuning API |
| DPO | dpo | TRL DPOTrainer · preference training |
| Raw | raw | Custom pipelines — minimal processing |

# Export and split into train/val in one shot
afterimage export -i output/dataset.jsonl -f sharegpt -f messages --split 0.9
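For orientation, the ShareGPT convention (as consumed by LLaMA Factory and Axolotl) wraps each dialog in a conversations list of from/value turns. The record below shows that shape, not AfterImage's exact exporter output:

```python
import json

# One dialog in ShareGPT form: alternating "human" / "gpt" turns.
sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "What water temperature is best for pour-over?"},
        {"from": "gpt", "value": "Around 90-96 °C works well for most roasts."},
    ]
}

# Exporters typically write one such record per line (JSONL).
print(json.dumps(sharegpt_record, ensure_ascii=False))
```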

How AfterImage Works

AfterImage runs a two-agent loop per dialog:

AfterImage Dialog-Level Workflow

  1. Correspondent generates user questions — driven by personas, document context, or custom instruction callbacks
  2. Respondent answers — with optional RAG context injected per turn
  3. Quality gate scores each dialog using LLM-as-judge + embedding metrics; retries below-threshold dialogs automatically
  4. Storage writes each dialog incrementally — crash-safe, resumable
  5. Export converts the raw JSONL to any training format in a single CLI command
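The per-dialog loop can be sketched with stub agents in place of real LLM calls (a simplified illustration of the flow above — the actual implementation is async and streams to storage incrementally):

```python
def run_dialog(correspondent, respondent, judge,
               max_turns=3, threshold=0.5, max_retries=2):
    """Sketch of the two-agent loop: alternate question/answer turns,
    then gate the dialog on a judge score, retrying if it falls short."""
    for _ in range(max_retries + 1):
        messages = []
        for _ in range(max_turns):
            question = correspondent(messages)   # simulated user turn
            messages.append({"role": "user", "content": question})
            answer = respondent(messages)        # assistant turn
            messages.append({"role": "assistant", "content": answer})
        if judge(messages) >= threshold:         # quality gate
            return messages
    return None  # dialog discarded after exhausting retries

# Stub agents standing in for real LLM calls
dialog = run_dialog(
    correspondent=lambda msgs: f"Question {len(msgs) // 2 + 1}?",
    respondent=lambda msgs: "An answer.",
    judge=lambda msgs: 1.0,
    max_turns=2,
)
print(len(dialog))  # 4 messages: 2 user + 2 assistant
```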

Configuration Reference

The fastest path to generation is a YAML config:

# examples/configs/basic.yaml

generation:
  num_dialogs: 100
  max_turns: 4
  max_concurrency: 5

model:
  provider: gemini              # gemini | openai | deepseek | openrouter | local
  model_name: gemini-2.5-flash
  api_key_env: GEMINI_API_KEY   # environment variable name

respondent:
  system_prompt: |
    You are an expert assistant. Answer clearly and concisely.

# Optional: document grounding (RAG)
# documents:
#   provider: directory         # directory | file | jsonl | memory | qdrant
#   path: ./my_docs/

# Optional: persona diversity
# personas:
#   enabled: true

# Optional: context-grounded instruction generation
# context:
#   enabled: true
#   num_random_contexts: 2

# Optional: quality gate
# quality:
#   auto_improve: true

output:
  path: ./output/dataset.jsonl
  storage: jsonl                # jsonl | sql

# Validate config before running
afterimage validate -c examples/configs/basic.yaml

# Run
afterimage generate -c examples/configs/basic.yaml
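The validate command checks the config before any API call is made. A minimal sketch of that kind of check, mirroring the YAML above as a Python dict (AfterImage's real schema and error messages may differ):

```python
# Python dict mirroring examples/configs/basic.yaml
config = {
    "generation": {"num_dialogs": 100, "max_turns": 4, "max_concurrency": 5},
    "model": {"provider": "gemini", "model_name": "gemini-2.5-flash",
              "api_key_env": "GEMINI_API_KEY"},
    "respondent": {"system_prompt": "You are an expert assistant."},
    "output": {"path": "./output/dataset.jsonl", "storage": "jsonl"},
}

def validate(cfg):
    """Collect structural problems instead of failing on the first one."""
    errors = []
    for section in ("generation", "model", "respondent", "output"):
        if section not in cfg:
            errors.append(f"missing section: {section}")
    if cfg.get("model", {}).get("provider") not in (
        "gemini", "openai", "deepseek", "openrouter", "local"
    ):
        errors.append("model.provider must be a supported provider")
    if cfg.get("generation", {}).get("num_dialogs", 0) <= 0:
        errors.append("generation.num_dialogs must be positive")
    return errors

print(validate(config))  # []
```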

All YAML options and their defaults are documented at afterimage.altai.dev.


Repository Layout

afterimage/              Core library
├── providers/           LLM, document, and embedding providers
├── callbacks/           Instruction generators, stopping criteria, prompt modifiers
├── evaluation/          LLM-as-judge and embedding-based evaluators
├── preference/          DPO / RLHF preference pair generation
├── integrations/        Export format adapters (ShareGPT, Alpaca, Messages, …)
├── analytics/           Dataset analytics engine and HTML report generator
└── server/              FastAPI REST server with SSE progress streaming

examples/
├── configs/             Ready-to-run YAML configs (basic, RAG, local, budget)
├── caselaw_rag/         Qdrant + HF CAP embeddings tutorial (index + generate)
├── demo_ui/             Gradio web UI — interactive generation + fine-tuning
└── *.py                 Python API usage examples

docs/                    Sphinx sources (hosted at afterimage.altai.dev)
tests/                   pytest suite — Python 3.11, 3.12, 3.13

Contributing

Contributions are welcome. Read DESIGN.md for architecture notes before opening a large PR.

# Clone and install all extras
git clone https://github.com/altaidevorg/afterimage
cd afterimage
uv sync --all-extras

# Run the test suite
pytest

# Check style
ruff check .
ruff format .

Open an issue before submitting significant changes — it helps align on design direction early and avoids wasted effort.


License

Apache License 2.0


Built by Altai Dev · Documentation · PyPI · Blog
