# AfterImage

Synthetic conversational dataset generation for LLM fine-tuning.

Generate multi-turn chat data, DPO preference pairs, and structured outputs — from a single YAML file or a composable Python API.
A demonstration of a typical conversational dataset generation run, where AfterImage simulates both sides of the conversation.
Generating a document-grounded Q&A dataset from BIS credit risk principles → ShareGPT format
## News

OpenSimula is an experimental, open implementation of mechanism-design ideas from Simula (Davidson et al., TMLR; see also Google’s research blog on the framing). It covers LLM-built factor taxonomies, weighted mix sampling over those factors, meta-prompt diversification (with optional complexification), requirement critics with refinement, and an independent double-critic gate for verifiable multiple-choice items. Checkpoints live under an opensimula/ subtree (manifest, taxonomy bundle, sampling strategy); you can stream datapoints to JSONL, hook GenerationMonitor into OpenSimula, or bridge scenarios into ConversationGenerator via SimulaInstructionGeneratorCallback.
This module is not affiliated with Google and is not a reference port of internal systems—it is an independent take on the published Simula recipe.
Try it: walkthrough and CLI notes in examples/simula/README.md, scripts in examples/simula/, package overview in afterimage/simula/README.md. Narrative and monitoring notes live on the OpenSimula docs page, and the autodoc covers the Simula / OpenSimula API.
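As a quick taste, the sketch below bridges OpenSimula scenarios into ConversationGenerator via SimulaInstructionGeneratorCallback. The import path and constructor arguments are assumptions, not the exact signature; see examples/simula/README.md for the real walkthrough.

```python
import asyncio
import os

from afterimage import ConversationGenerator
from afterimage.simula import SimulaInstructionGeneratorCallback  # import path assumed

async def main():
    gen = ConversationGenerator(
        respondent_prompt="You are a helpful assistant.",
        api_key=os.environ["GEMINI_API_KEY"],
        model_name="gemini-2.5-flash",
        # Bridge OpenSimula scenarios into the conversation loop.
        instruction_generator_callback=SimulaInstructionGeneratorCallback(
            api_key=os.environ["GEMINI_API_KEY"],  # hypothetical argument
        ),
    )
    await gen.generate(num_dialogs=10, max_turns=3, max_concurrency=2)

asyncio.run(main())
```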
- News
- Why AfterImage
- Features
- Installation
- Quickstart — CLI
- Quickstart — Python API
- Supported LLM Providers
- Export Formats
- How It Works
- Configuration Reference
- Repository Layout
- Contributing
- License
## Why AfterImage

Fine-tuning a model requires data. Real conversations are slow to collect, expensive to label, and almost never domain-specific enough.

AfterImage flips the problem: you define what the data should look like, and it generates it for you using any LLM you already have access to.
Your documents + LLM → Realistic, diverse, quality-filtered training data
What you get:
- Multi-turn conversations that read like real interactions — not templated Q&A pairs
- Document-grounded datasets tied to your corpus (RAG-style)
- DPO / RLHF preference pairs without a single manual label
- Data already formatted for the training framework you use
## Features

| Category | What's included |
|---|---|
| Generation | Multi-turn chat · Document-grounded QA · Persona-driven diversity · Structured output · Tool-calling |
| Preference Data | DPO · RLHF · UltraFeedback · Anthropic HH · ORPO |
| Quality | LLM-as-judge · Embedding-based metrics · Auto-improve retries · Composite scoring |
| Providers | Gemini · OpenAI · DeepSeek · OpenRouter · Local (vLLM / Ollama / llama.cpp) |
| Export | ShareGPT · Alpaca · Messages · LLaMA Factory · Oumi · OpenAI fine-tune · DPO · Raw |
| Storage | JSONL (default) · SQLite · PostgreSQL · MySQL |
| Scale | Async-first · Concurrent generation · Smart API key rotation with rate limiting |
| Observability | Real-time metrics · Configurable alerts · HTML analytics reports |
| Interface | CLI · Python API · FastAPI REST server · Gradio demo UI |
## Installation

If you want your agent to do it for you, just copy and paste the following to your agent:

```
Read https://afterimage.altai.dev/llms.txt and follow it for installing AfterImage, documentation links, and examples.
```
If you are doing it yourself:

```bash
pip install afterimage

# or with uv (recommended)
uv add afterimage
```

Requires Python 3.11+.
Optional extras:

| Extra | What it adds |
|---|---|
| `embeddings-local` | Local embeddings via sentence-transformers for Qdrant workflows and embedding-based quality checks |
| `server` | FastAPI REST server (`afterimage-server` CLI entry point) |
| `training` | PyTorch / TRL stack, Gradio UI, and training scripts under examples/ |

```bash
pip install "afterimage[server]"
pip install "afterimage[embeddings-local,server,training]"
```

## Quickstart — CLI

Set your API key and run one command:
```bash
export GEMINI_API_KEY=your_key_here
afterimage generate -c examples/configs/basic.yaml
```

Preview the plan without spending any API credits:

```bash
afterimage generate -c examples/configs/basic.yaml --dry-run
```

Export to your training framework:

```bash
# List all available formats
afterimage export --list-formats

# Export to multiple formats in one shot
afterimage export -i output/dataset.jsonl -f sharegpt -f messages -f alpaca

# Create a train/val split automatically
afterimage export -i output/dataset.jsonl -f messages --split 0.9

# Push directly to Hugging Face Hub
afterimage push -c your_config.yaml --repo-id your-org/your-dataset
```

Generate DPO preference pairs:

```bash
afterimage preference -c your_config.yaml
```

Analyze your dataset:

```bash
afterimage analyze -i output/dataset.jsonl -o report.html
```

## Quickstart — Python API

The CLI is powered by the same composable Python API. Drop into it whenever you need a custom pipeline.
Minimal conversation generation:
import asyncio
import os
from afterimage import ConversationGenerator
async def main():
gen = ConversationGenerator(
respondent_prompt="You are a helpful AI assistant. Answer clearly and concisely.",
api_key=os.environ["GEMINI_API_KEY"],
model_name="gemini-2.5-flash",
)
await gen.generate(num_dialogs=50, max_turns=4, max_concurrency=5)
print(f"Generated {len(gen.load_conversations())} conversations.")
asyncio.run(main())Document-grounded generation with personas:
```python
import asyncio
import os

from afterimage import (
    ConversationGenerator,
    PersonaGenerator,
    PersonaInstructionGeneratorCallback,
    InMemoryDocumentProvider,
    WithContextRespondentPromptModifier,
)

DOCUMENTS = [
    "Pour-over coffee is brewed by pouring hot water over grounds through a filter. "
    "Key variables are grind size, water temperature (90–96 °C), and pour rate.",
    "Espresso is brewed at 9 bar pressure through finely-ground beans. "
    "It is the base for lattes, cappuccinos, and macchiatos.",
]

async def main():
    api_key = os.environ["GEMINI_API_KEY"]
    docs = InMemoryDocumentProvider(DOCUMENTS)

    # Generate diverse user personas from your documents
    persona_gen = PersonaGenerator(api_key=api_key)
    await persona_gen.generate_from_documents(docs)

    gen = ConversationGenerator(
        respondent_prompt="You are a coffee expert. Answer questions based on the provided context.",
        api_key=api_key,
        model_name="gemini-2.5-flash",
        instruction_generator_callback=PersonaInstructionGeneratorCallback(
            api_key=api_key,
            documents=docs,
            num_random_contexts=1,
        ),
        respondent_prompt_modifier=WithContextRespondentPromptModifier(),
    )
    await gen.generate(num_dialogs=100, max_turns=3, max_concurrency=5)

asyncio.run(main())
```
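To ground generation in your own corpus instead of inline strings, one option is to read files yourself and hand the text to InMemoryDocumentProvider (the YAML config also supports directory, file, jsonl, memory, and qdrant document providers):

```python
from pathlib import Path

from afterimage import InMemoryDocumentProvider

# Load every Markdown file in a folder into the in-memory provider.
# "./my_docs" is a placeholder path.
docs = InMemoryDocumentProvider(
    [p.read_text(encoding="utf-8") for p in sorted(Path("./my_docs").glob("*.md"))]
)
```

Generate DPO preference pairs: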
```python
import asyncio
import os

from afterimage import ConversationGenerator
from afterimage.preference.generator import PreferenceGenerator
from afterimage.evaluator import ConversationJudge

async def main():
    api_key = os.environ["GEMINI_API_KEY"]

    base_gen = ConversationGenerator(
        respondent_prompt="You are a helpful assistant.",
        api_key=api_key,
        model_name="gemini-2.5-flash",
    )
    judge = ConversationJudge(api_key=api_key, model_name="gemini-2.5-flash")

    pref_gen = PreferenceGenerator(conversation_generator=base_gen, judge=judge)
    await pref_gen.generate(num_pairs=200, max_concurrency=4)

asyncio.run(main())
```
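The resulting pairs can then be converted for TRL's DPOTrainer with the same export command used for conversations (adjust the input path to match your output config):

```bash
afterimage export -i output/dataset.jsonl -f dpo
```

More complete examples live under examples/. Full API reference is at afterimage.altai.dev.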
## Supported LLM Providers

| Provider | `provider` key | Model examples | Notes |
|---|---|---|---|
| Google Gemini | `gemini` | `gemini-2.5-flash`, `gemini-2.0-flash` | Default in CLI configs |
| OpenAI | `openai` | `gpt-4o`, `gpt-4o-mini` | Full API support |
| DeepSeek | `deepseek` | `deepseek-chat`, `deepseek-reasoner` | Captures chain-of-thought reasoning |
| OpenRouter | `openrouter` | Any model via OpenRouter | Access 100+ models with one key |
| Local | `local` | Any OpenAI-compatible server | vLLM, Ollama, llama.cpp — zero API cost |
Providers can be mixed freely — use a fast/cheap model to simulate the user (correspondent) and a stronger model to generate answers (respondent).
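As a sketch only, the YAML below shows what such a split might look like. The `correspondent:` block and its keys are hypothetical, illustrating the idea rather than the exact schema; the real option names are documented at afterimage.altai.dev.

```yaml
# Illustrative only -- the correspondent block below is a hypothetical key layout.
model:                        # respondent: stronger model for answers
  provider: openai
  model_name: gpt-4o
  api_key_env: OPENAI_API_KEY

correspondent:                # hypothetical: cheap local model to simulate the user
  provider: local
  model_name: llama-3.1-8b-instruct
```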
Scale beyond rate limits with SmartKeyPool — automatic key rotation across concurrent requests:
```python
from afterimage import ConversationGenerator
from afterimage.key_management import SmartKeyPool

pool = SmartKeyPool(["key_1", "key_2", "key_3"])

gen = ConversationGenerator(
    respondent_prompt="You are a helpful assistant.",
    api_key=pool,
    model_name="gemini-2.5-flash",
)
```
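In practice you would load keys from the environment rather than hard-coding them. A minimal pattern (the GEMINI_API_KEYS variable name is just an example convention):

```python
import os

from afterimage.key_management import SmartKeyPool

# Hypothetical convention: comma-separated keys in one environment variable.
pool = SmartKeyPool(os.environ["GEMINI_API_KEYS"].split(","))
```

## Export Formats

One command converts your raw JSONL to any fine-tuning format: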
| Format | `--format` flag | Target framework |
|---|---|---|
| ShareGPT | `sharegpt` | LLaMA Factory · FastChat · Axolotl |
| Alpaca | `alpaca` | LLaMA Factory · many community trainers |
| HuggingFace Messages | `messages` | TRL SFTTrainer · HuggingFace ecosystem |
| LLaMA Factory | `llama_factory` | LLaMA Factory native format |
| Oumi | `oumi` | Oumi training framework |
| OpenAI Fine-tune | `openai_finetune` | OpenAI fine-tuning API |
| DPO | `dpo` | TRL DPOTrainer · preference training |
| Raw | `raw` | Custom pipelines — minimal processing |

```bash
# Export and split into train/val in one shot
afterimage export -i output/dataset.jsonl -f sharegpt -f messages --split 0.9
```

## How It Works

AfterImage runs a two-agent loop per dialog:
- Correspondent generates user questions — driven by personas, document context, or custom instruction callbacks
- Respondent answers — with optional RAG context injected per turn
- Quality gate scores each dialog using LLM-as-judge + embedding metrics; retries below-threshold dialogs automatically
- Storage writes each dialog incrementally — crash-safe, resumable
- Export converts the raw JSONL to any training format in a single CLI command
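Conceptually, each dialog follows the shape sketched below. This is illustrative pseudocode, not AfterImage's internal API; `correspondent`, `respondent`, `judge`, and their methods are stand-ins for the real components:

```python
# Illustrative sketch of the per-dialog loop -- not the library's actual internals.
async def generate_dialog(correspondent, respondent, judge, max_turns, threshold):
    messages = []
    for _ in range(max_turns):
        # Correspondent plays the user, driven by a persona or document context.
        question = await correspondent.next_question(messages)
        # Respondent answers, optionally with RAG context injected for this turn.
        answer = await respondent.answer(messages, question)
        messages += [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    # Quality gate: LLM-as-judge plus embedding metrics; a low score triggers a retry.
    score = await judge.score(messages)
    return messages if score >= threshold else None
```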
## Configuration Reference

The fastest path to generation is a YAML config:
```yaml
# examples/configs/basic.yaml
generation:
  num_dialogs: 100
  max_turns: 4
  max_concurrency: 5

model:
  provider: gemini            # gemini | openai | deepseek | openrouter | local
  model_name: gemini-2.5-flash
  api_key_env: GEMINI_API_KEY # environment variable name

respondent:
  system_prompt: |
    You are an expert assistant. Answer clearly and concisely.

# Optional: document grounding (RAG)
# documents:
#   provider: directory       # directory | file | jsonl | memory | qdrant
#   path: ./my_docs/

# Optional: persona diversity
# personas:
#   enabled: true

# Optional: context-grounded instruction generation
# context:
#   enabled: true
#   num_random_contexts: 2

# Optional: quality gate
# quality:
#   auto_improve: true

output:
  path: ./output/dataset.jsonl
  storage: jsonl              # jsonl | sql
```

```bash
# Validate config before running
afterimage validate -c examples/configs/basic.yaml

# Run
afterimage generate -c examples/configs/basic.yaml
```
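For a document-grounded run with personas and the quality gate enabled, uncomment the optional blocks and fill them in; for example (the path is a placeholder):

```yaml
documents:
  provider: directory
  path: ./my_docs/

personas:
  enabled: true

context:
  enabled: true
  num_random_contexts: 2

quality:
  auto_improve: true
```

All YAML options and their defaults are documented at afterimage.altai.dev.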
## Repository Layout

```
afterimage/            Core library
├── providers/         LLM, document, and embedding providers
├── callbacks/         Instruction generators, stopping criteria, prompt modifiers
├── evaluation/        LLM-as-judge and embedding-based evaluators
├── preference/        DPO / RLHF preference pair generation
├── integrations/      Export format adapters (ShareGPT, Alpaca, Messages, …)
├── analytics/         Dataset analytics engine and HTML report generator
└── server/            FastAPI REST server with SSE progress streaming

examples/
├── configs/           Ready-to-run YAML configs (basic, RAG, local, budget)
├── caselaw_rag/       Qdrant + HF CAP embeddings tutorial (index + generate)
├── demo_ui/           Gradio web UI — interactive generation + fine-tuning
└── *.py               Python API usage examples

docs/                  Sphinx sources (hosted at afterimage.altai.dev)
tests/                 pytest suite — Python 3.11, 3.12, 3.13
```
## Contributing

Contributions are welcome. Read DESIGN.md for architecture notes before opening a large PR.

```bash
# Clone and install all extras
git clone https://github.com/altaidevorg/afterimage
cd afterimage
uv sync --all-extras

# Run the test suite
pytest

# Check style
ruff check .
ruff format .
```

Open an issue before submitting significant changes — it helps align on design direction early and avoids wasted effort.
Built by Altai Dev · Documentation · PyPI · Blog

