This folder contains implementations for machine unlearning methods on LLM360 models. Machine unlearning is a pre-deployment safety measure designed to remove hazardous knowledge from language models. Unlearned models are inherently safe, as they lack the knowledge to be misused.
The unlearning methods implemented so far:
| Method | Model |
|---|---|
| max_entropy | CrystalChat |
| min_posterior | CrystalChat |
| random_matching | CrystalChat |
| RMU | CrystalChat |
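To illustrate the general idea behind these methods, here is a minimal sketch of a max-entropy-style unlearning loss: on forget-set tokens, the model's next-token distribution is pushed toward uniform by minimizing its KL divergence from the uniform distribution. This is an illustrative NumPy sketch, not the repo's actual implementation; function and variable names are hypothetical.

```python
import numpy as np

def max_entropy_loss(logits):
    # Hypothetical sketch: penalize deviation from the uniform
    # distribution, i.e. KL(p || U) = sum_i p_i log p_i + log V,
    # averaged over positions. Minimizing this maximizes entropy.
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    vocab = logits.shape[-1]
    kl = (p * np.log(p + 1e-12)).sum(axis=-1) + np.log(vocab)
    return kl.mean()

uniform_logits = np.zeros((2, 8))                     # already uniform
peaked_logits = np.array([[10.0] + [0.0] * 7] * 2)    # confident prediction

print(max_entropy_loss(uniform_logits))  # close to 0
print(max_entropy_loss(peaked_logits))   # clearly positive
```

In actual training this term would be computed on the model's logits over the forget corpus, typically combined with a retain-set objective to preserve general capabilities.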
`unlearn.py` is the main entry point for running unlearning methods. It uses the Python modules in the `methods/` and `utils/` folders.
The `methods/` folder contains the implementations of the unlearning methods:
- `training.py`: all training loop implementations
- `utils.py`: loss functions and other method-related utilities
The `utils/` folder contains helper functions for model/dataset IO:
- `data_utils.py`: dataloaders for text datasets
- `model_utils.py`: model IO utilities
By default, unlearned models are saved to the `models/` folder. Please store all training datasets in the `data/` folder.
> **Note**
>
> This project uses the bio-forget-corpus from the WMDP Benchmark for unlearning training. Access to this dataset requires a separate request; please follow the instructions provided here to obtain the necessary permissions. By default, the dataloader is configured to load the dataset from `data/bio_forget.jsonl`.
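Since the forget corpus is expected as a JSONL file, loading it amounts to parsing one JSON object per line. The sketch below is illustrative only; the actual field names in `bio_forget.jsonl` (assumed `"text"` here) and the repo's real dataloader in `utils/data_utils.py` may differ.

```python
import json
import os
import tempfile

def load_jsonl(path):
    # Read one JSON object per line, skipping blank lines.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a throwaway file; the real forget data lives in
# data/bio_forget.jsonl (field name "text" is an assumption).
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"text": "sample forget document"}\n')
    path = f.name

records = load_jsonl(path)
os.remove(path)
print(records)  # [{'text': 'sample forget document'}]
```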
- Clone and enter the repo:

  ```shell
  git clone https://github.com/LLM360/Analysis360.git
  cd Analysis360/analysis/unlearning
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```
- To install `lm-eval`, please check the installation instructions in the `metrics/harness` folder.
An example usage is provided in `demo.ipynb`, which can be executed on a single A100 80G GPU.