This repository trains a LoRA adapter on the Alignment Research Dataset using Unsloth.
I use an up-to-date, filtered version of the Alignment Research Dataset:
- Re-ran the author's code to get the latest posts
- Added resources from Anthropic, Apollo Research, and Redwood Research blogs
- Removed some sources entirely: agentmodels, arbital, distill, agisf
- Cleaned the text: removed links, converted HTML to Markdown, stripped CSS artifacts, and fixed encoding issues (see scripts/clean_data.py)
I kept only the text field to keep things simple and uploaded it to Hugging Face: toiwuo87/alignment-data-filtered
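The cleaning steps above can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the actual implementation; the real logic lives in scripts/clean_data.py and handles more cases:

```python
import html
import re

def clean_text(text: str) -> str:
    """Illustrative cleaning pass (hypothetical; see scripts/clean_data.py
    for the real implementation)."""
    # Fix HTML entities left over from scraping (&amp; -> &, etc.)
    text = html.unescape(text)
    # Convert simple HTML emphasis tags to their Markdown equivalents
    text = re.sub(r"</?(b|strong)>", "**", text)
    text = re.sub(r"</?(i|em)>", "*", text)
    # Replace Markdown links with their anchor text
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    # Drop bare URLs
    text = re.sub(r"https?://\S+", "", text)
    # Collapse runs of spaces/tabs left behind by the removals
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```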
I used QLoRA with Unsloth to fine-tune Llama 3.3 70B Instruct on the cleaned dataset in a continued pre-training fashion.
First, I ran prepare_data.py to filter out documents exceeding 8000 tokens, then split the data into training (~13.5k documents) and validation (1000 documents).
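The filter-and-split step can be sketched like this. The function below is a simplified stand-in for prepare_data.py: `tokenize` is any callable returning a token list (the real run would use the Llama tokenizer; the example substitutes whitespace splitting), and the split sizes are parameters rather than the actual ~13.5k/1000:

```python
import random

def filter_and_split(docs, tokenize, max_tokens=8000, n_val=1000, seed=0):
    """Drop over-length documents, then carve off a validation split.

    Sketch of the prepare_data.py logic described above; not the
    actual implementation.
    """
    kept = [d for d in docs if len(tokenize(d)) <= max_tokens]
    rng = random.Random(seed)
    rng.shuffle(kept)
    return kept[n_val:], kept[:n_val]

# Example with whitespace splitting as a stand-in tokenizer:
docs = [" ".join(["tok"] * n) for n in (10, 50, 9001)]
train, val = filter_and_split(docs, str.split, max_tokens=8000, n_val=1)
```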
To train:
```bash
python src/train.py --config config.yaml
```

Training took 11 hours on a single GH200.
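A config.yaml for a QLoRA run of this kind might look like the following. Every field name and value here is an illustrative assumption, not the repository's actual configuration:

```yaml
# Hypothetical config sketch -- field names and values are assumptions,
# not the repo's real config.yaml
model_name: unsloth/Llama-3.3-70B-Instruct-bnb-4bit
max_seq_length: 8192
load_in_4bit: true

lora:
  r: 16
  alpha: 16
  dropout: 0.0

training:
  learning_rate: 2.0e-5
  num_train_epochs: 1
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 8
```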
The adapters are available at: toiwuo87/llama-3.3-70B-ft-safetydocs
At this scale, fine-tuning didn't produce strong behavioral effects. Compared to the base model, the fine-tuned model shows somewhat greater awareness of AI safety and security considerations and tends to structure responses more like LessWrong posts (stating definitions and assumptions upfront), but the overall impact was limited.
Transcripts and evaluations coming soon.
For a more rigorous exploration of how alignment-related content in training data affects model behavior, see Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment.