AyseAsude/reading-safety

Reading Safety

This repository trains a LoRA adapter on the Alignment Research Dataset using Unsloth.

Dataset

I use an up-to-date, filtered version of the Alignment Research Dataset:

  • Re-ran the author's code to get the latest posts
  • Added resources from Anthropic, Apollo Research, and Redwood Research blogs
  • Removed some sources entirely: agentmodels, arbital, distill, agisf
  • Cleaned the text: removed links, converted HTML to Markdown, stripped CSS artifacts, and fixed encoding issues (see scripts/clean_data.py)
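The cleaning steps in the last bullet can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the contents of scripts/clean_data.py: the function name `clean_text` and all regular expressions are assumptions made for the example, and the HTML-to-Markdown conversion step is left out because it usually relies on a third-party converter.

```python
import html
import re
import unicodedata

# All patterns below are illustrative assumptions, not taken from scripts/clean_data.py.
MD_LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")       # [text](url) -> keep only "text"
BARE_URL = re.compile(r"https?://\S+")               # standalone URLs
CSS_RULE = re.compile(r"[.#]?[\w-]+\s*\{[^{}]*\}")   # e.g. ".mjx-chtml {display: inline-block}"
MULTI_SPACE = re.compile(r"[ \t]{2,}")               # runs of spaces left behind by removals


def clean_text(text: str) -> str:
    """Apply a few simple cleanup passes to one document."""
    text = html.unescape(text)                 # decode entities such as &amp;
    text = unicodedata.normalize("NFC", text)  # unify composed/decomposed characters
    text = CSS_RULE.sub("", text)              # drop leaked CSS rules
    text = MD_LINK.sub(r"\1", text)            # replace Markdown links with their link text
    text = BARE_URL.sub("", text)              # drop remaining bare URLs
    text = MULTI_SPACE.sub(" ", text)          # collapse leftover whitespace
    return text.strip()
```

The CSS pattern is deliberately crude and would also remove legitimate `name {…}` snippets in prose or code, so a real pipeline would likely need something more conservative.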

For simplicity, I kept only the text field and uploaded the result to Hugging Face: toiwuo87/alignment-data-filtered

Training

I used QLoRA with Unsloth to fine-tune Llama 3.3 70B Instruct on the cleaned dataset as continued pre-training, i.e. next-token prediction on the raw documents rather than on instruction/response pairs.
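For readers unfamiliar with the setup, the standard LoRA update and the causal language-modeling loss are written out below. These are the general textbook forms, not something read from src/train.py; the rank r and scaling factor α are presumably set in config.yaml and are not stated in this README.

```latex
% LoRA: the pretrained weight W_0 stays frozen (stored in 4-bit precision under QLoRA);
% only the low-rank factors A and B are trained.
W' = W_0 + \frac{\alpha}{r} B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)

% Continued pre-training: next-token loss over every token x_1, \dots, x_T of a document,
% where \theta denotes the trainable adapter parameters.
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```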

First, I ran prepare_data.py to filter out documents exceeding 8000 tokens, then split the remaining data into a training set (~13.5k documents) and a validation set (1000 documents).
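The filter-and-split step can be sketched as below. This is an illustration of the idea, not the actual prepare_data.py: the function name `filter_and_split`, the `seed` parameter, and the shuffle-then-slice strategy are assumptions. The token counter is passed in as a callable so that the real model tokenizer can be plugged in.

```python
import random
from typing import Callable, List, Tuple


def filter_and_split(
    docs: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 8000,
    val_size: int = 1000,
    seed: int = 0,  # hypothetical: the README does not say how the split was seeded
) -> Tuple[List[str], List[str]]:
    """Drop documents longer than max_tokens, then split into (train, validation)."""
    # Keep documents at or below the limit; "exceeding 8000 tokens" means > 8000.
    kept = [doc for doc in docs if count_tokens(doc) <= max_tokens]

    # Shuffle a copy deterministically so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(kept)

    validation = kept[:val_size]
    train = kept[val_size:]
    return train, validation
```

In practice `count_tokens` would wrap the model's tokenizer, for example `lambda doc: len(tokenizer(doc)["input_ids"])` with a Hugging Face tokenizer; a whitespace split such as `lambda doc: len(doc.split())` is only a rough stand-in for quick checks.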

To train:

python src/train.py --config config.yaml

Training took 11 hours on a single GH200.

The adapters are available at: toiwuo87/llama-3.3-70B-ft-safetydocs

Results

At this scale, fine-tuning didn't produce strong behavioral effects. Compared to the base model, the fine-tuned model shows somewhat improved awareness of AI safety and security considerations. It also tends to structure responses more like LessWrong posts, stating definitions and assumptions upfront. Overall, though, the impact was limited.

Transcripts and evaluations are coming soon.

Related Work

For a more rigorous exploration of how alignment-related content in training data affects model behavior, see Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment.
