Overview
This issue captures some of the key steps required to reproduce the Constitutional AI paper: fine-tuning an RLHF model with Reinforcement Learning from AI Feedback (RLAIF), where the feedback is generated by a model rather than by humans.
Phase One
- Gather a dataset of harmful prompts
- Create a base script to compose prompts using a base constitution (a prompt-composition sketch follows this list)
- Generate a new dataset of prompts + responses using Carper's GPT-J RLHF model, then have it review / critique and revise its own output against the constitution
- Fine-tune the original model on the revised responses using supervised learning (a fine-tuning sketch also follows this list)
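
A minimal sketch of the prompt-composition step, assuming the constitution is kept as a plain list of critique principles; the templates, helper names, and example principles below are illustrative, not the paper's exact wording:

```python
# Minimal sketch: compose critique / revision prompts from a base constitution.
# Principles and templates are illustrative placeholders.
import random

CONSTITUTION = [
    "Identify specific ways in which the assistant's last response is harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal.",
    "Explain ways the assistant's response may be encouraging illegal or unethical activity.",
]


def compose_critique_prompt(prompt: str, response: str) -> str:
    """Pair a harmful prompt and a model response with a randomly drawn principle."""
    principle = random.choice(CONSTITUTION)
    return (
        f"Human: {prompt}\n\n"
        f"Assistant: {response}\n\n"
        f"Critique request: {principle}\n\n"
        "Critique:"
    )


def compose_revision_prompt(critique_transcript: str) -> str:
    """Append a revision request after the model's own critique."""
    return (
        f"{critique_transcript}\n\n"
        "Revision request: Please rewrite the assistant's response so that it "
        "addresses the issues raised in the critique.\n\n"
        "Revision:"
    )
```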
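And a minimal sketch of the supervised fine-tuning step, assuming the critique/revision pass has been written to a JSONL file of `{"prompt", "revision"}` records; the model name, file name, and hyperparameters are placeholders:

```python
# Minimal sketch: supervised fine-tuning on the revised responses.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "EleutherAI/gpt-j-6B"  # assumption: same base model as the RLHF checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # GPT-J has no pad token by default
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Assumed file produced by the critique/revision pass above.
dataset = load_dataset("json", data_files="revised_responses.jsonl", split="train")


def tokenize(example):
    # Train on prompt + revised response as ordinary causal-LM data.
    text = example["prompt"] + "\n\n" + example["revision"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)


tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sl-cai", num_train_epochs=1, per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```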
Phase Two
- Sample the fine-tuned model on the dataset of harmful prompts to create a new dataset with multiple outputs per prompt (a sampling sketch follows this list)
- Train a "reward model" (e.g. https://github.com/Dahoas/reward-modeling) to select the best result, i.e. a fine-tuned preference model (a pairwise-loss sketch follows this list)
- Use RLAIF training against that preference model to fine-tune the model (a trlx sketch follows this list)
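
A minimal sketch of the multi-sample generation step, assuming the Phase One checkpoint was saved to a local directory; the directory name and sampling parameters are placeholders:

```python
# Minimal sketch: sample several responses per harmful prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "sl-cai"  # assumption: output of the Phase One fine-tuning run
N_SAMPLES = 4

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)
model.eval()


def sample_responses(prompt: str) -> list[str]:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,
            top_p=0.95,
            temperature=1.0,
            max_new_tokens=256,
            num_return_sequences=N_SAMPLES,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Strip the prompt tokens so only the sampled continuations remain.
    generated = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```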
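A minimal sketch of the preference ("reward") model objective, in the spirit of the Dahoas/reward-modeling repo but not its actual API: a scalar head on a causal-LM backbone trained with a pairwise ranking loss so the chosen response scores above the rejected one:

```python
# Minimal sketch: preference model with a pairwise ranking loss.
import torch
import torch.nn as nn
from transformers import AutoModel


class RewardModel(nn.Module):
    def __init__(self, base_name: str = "EleutherAI/gpt-j-6B"):  # placeholder backbone
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Score each sequence by the value of its final non-padding token.
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(last_hidden).squeeze(-1)


def preference_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # Standard pairwise objective: -log sigmoid(r_chosen - r_rejected).
    return -torch.nn.functional.logsigmoid(chosen_scores - rejected_scores).mean()
```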
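Finally, a heavily hedged sketch of the RLAIF step using CarperAI's trlx (the library behind the GPT-J RLHF checkpoint). trlx's API has shifted across releases, so the exact `trlx.train(...)` call should be checked against the installed version; the reward function here is a stand-in for scoring samples with the trained preference model:

```python
# Heavily hedged sketch: RL fine-tuning against the AI-feedback reward model via trlx.
import trlx


def reward_fn(samples, **kwargs):
    # Placeholder: in the real run this would score each prompt+response string
    # with the preference model trained above and return one float per sample.
    return [0.0 for _ in samples]


prompts = ["How do I pick a lock?"]  # stand-in for the harmful-prompt dataset

trainer = trlx.train(
    "EleutherAI/gpt-j-6B",  # assumption: swap in the SL-CAI checkpoint from Phase One
    reward_fn=reward_fn,
    prompts=prompts,
)
```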