This repository trains a LoRA adapter on the Alignment Research Dataset using Unsloth.
I use an up-to-date, filtered version of the Alignment Research Dataset:
- Re-ran the author's code to get the latest posts
- Added resources from Anthropic, Apollo Research, and Redwood Research blogs
- Removed some sources entirely: agentmodels, arbital, distill, agisf
- Cleaned the text: removed links, converted HTML to Markdown, stripped CSS artifacts, and fixed encoding issues (see scripts/clean_data.py)
I kept only the text field to keep things simple and uploaded it to Hugging Face: toiwuo87/alignment-data-filtered
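The cleaning steps above can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the actual implementation; the real logic lives in scripts/clean_data.py and handles more cases:

```python
import html
import re

def clean_text(text: str) -> str:
    """Illustrative cleaning pass (hypothetical; see scripts/clean_data.py
    for the real implementation)."""
    # Fix HTML entities left over from scraping (&amp; -> &, etc.)
    text = html.unescape(text)
    # Convert simple HTML emphasis tags to their Markdown equivalents
    text = re.sub(r"</?(b|strong)>", "**", text)
    text = re.sub(r"</?(i|em)>", "*", text)
    # Replace Markdown links with their anchor text
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    # Drop bare URLs
    text = re.sub(r"https?://\S+", "", text)
    # Collapse runs of spaces/tabs left behind by the removals
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```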
I used QLoRA with Unsloth to fine-tune Llama 3.3 70B Instruct on the cleaned dataset in a continued pre-training fashion.
First, I ran prepare_data.py to filter out documents exceeding 8000 tokens, then split the data into training (~13.5k documents) and validation (1000 documents).
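The filter-and-split step can be sketched like this. The function below is a simplified stand-in for prepare_data.py: `tokenize` is any callable returning a token list (the real run would use the Llama tokenizer; the example substitutes whitespace splitting), and the split sizes are parameters rather than the actual ~13.5k/1000:

```python
import random

def filter_and_split(docs, tokenize, max_tokens=8000, n_val=1000, seed=0):
    """Drop over-length documents, then carve off a validation split.

    Sketch of the prepare_data.py logic described above; not the
    actual implementation.
    """
    kept = [d for d in docs if len(tokenize(d)) <= max_tokens]
    rng = random.Random(seed)
    rng.shuffle(kept)
    return kept[n_val:], kept[:n_val]

# Example with whitespace splitting as a stand-in tokenizer:
docs = [" ".join(["tok"] * n) for n in (10, 50, 9001)]
train, val = filter_and_split(docs, str.split, max_tokens=8000, n_val=1)
```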
To train:
```bash
python src/train.py --config config.yaml
```

Training took 11 hours on a single GH200.
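A config.yaml for a QLoRA run of this kind might look like the following. Every field name and value here is an illustrative assumption, not the repository's actual configuration:

```yaml
# Hypothetical config sketch -- field names and values are assumptions,
# not the repo's real config.yaml
model_name: unsloth/Llama-3.3-70B-Instruct-bnb-4bit
max_seq_length: 8192
load_in_4bit: true

lora:
  r: 16
  alpha: 16
  dropout: 0.0

training:
  learning_rate: 2.0e-5
  num_train_epochs: 1
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 8
```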
The adapters are available at: toiwuo87/llama-3.3-70B-ft-safetydocs
At this scale, fine-tuning didn't produce strong behavioral effects. Compared to the base model, the fine-tuned model shows somewhat greater awareness of AI safety and security considerations and tends to structure responses more like LessWrong posts (stating definitions and assumptions upfront), but the overall impact was limited.
Transcripts and evaluations coming soon.
For a more rigorous exploration of how alignment-related content in training data affects model behavior, see Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment.