feat(train_reward_model): add chatml formatting and aggregation of more statistics #21

Open

maxreciprocate wants to merge 2 commits into main from update-reward-trainer

README.md

@@ -2,35 +2,51 @@
 
 A repository for transformer critique learning and generation.
 ## Scalar reward models
-Train [OpenLLaMA-13B](https://github.com/openlm-research/open_llama) on [Helpful and Harmless dataset](https://github.com/anthropics/hh-rlhf):
+Train [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) on [UltraFeedback](https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned) dataset:
 ```bash
 accelerate launch --config_file configs/accelerate/zero2.yaml \
            train_reward_model.py \
-           --model_path openlm-research/open_llama_13b \
-           --dataset pvduy/rm_oa_hh \
-           --batch_size 1 \
+           --model_path mistralai/Mistral-7B-Instruct-v0.1 \
+           --dataset allenai/ultrafeedback_binarized_cleaned:train_prefs \
+           --batch_size 4 \
            --eval_interval 1000 \
-           --lr 0.00001 \
+           --lr 0.000003 \
            --weight_decay 0 \
            --num_unfrozen_layers 12 \
            --gradient_checkpointing \
            --checkpoint_dir checkpoints \
-           --calibration_datasets reciprocate/vicuna-fair-eval
+           --calibration_datasets allenai/ultrafeedback_binarized_cleaned:test_prefs Intel/orca_dpo_pairs reciprocate/fair-eval
 ```
 Usage:
 ```python
-from transformers import AutoModelForSequenceClassification, AutoTokenizer
+from transformers import pipeline
+reward_fn = pipeline(
+    "text-classification",
+    model="reciprocate/mistral-7b-rm",
+    truncation=True,
+    max_length=4096,
+    function_to_apply="none"
+)
-ckpt = "reciprocate/openllama-13b_rm_oasst-hh"
-model = AutoModelForSequenceClassification.from_pretrained(ckpt, load_in_4bit=True)
-tokenizer = AutoTokenizer.from_pretrained(ckpt)
+chats = [[
+    {"role": "user", "content": "When was the battle at Waterloo?"},
+    {"role": "assistant", "content": "I think it was in 1983, but please double-check that when you have a chance."}
+], [
+    {"role": "user", "content": "When was the battle at Waterloo?"},
+    {"role": "assistant", "content": "The battle at Waterloo took place on June 18, 1815."}
+]]
-model(**tokenizer("ASSISTANT: This sentence is a lie.", return_tensors="pt"))[0].item()
+inputs = [reward_fn.tokenizer.apply_chat_template(chat, tokenize=False) for chat in chats]
+output = reward_fn(inputs)
+scores = [x["score"] for x in output]
+scores
 ```
 Output:
 ```python
--1.626953125
+>>> [-1.0530743598937988, 0.6916144490242004]
 ```
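
The updated usage snippet relies on `apply_chat_template` to turn each chat into a single string before it is scored. As a rough illustration of what ChatML-style formatting looks like, here is a minimal sketch. The helper name `to_chatml` is hypothetical, and the template actually shipped with the `reciprocate/mistral-7b-rm` tokenizer is not shown in this PR, so it may differ in details such as trailing newlines or extra special tokens.

```python
# Hypothetical helper, NOT the tokenizer's actual template: it only mimics
# the commonly used ChatML layout  <|im_start|>role\ncontent<|im_end|>\n
def to_chatml(chat):
    """Render a list of {"role", "content"} messages as one ChatML string."""
    return "".join(
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in chat
    )


# Same message structure as the `chats` entries in the README example.
chat = [
    {"role": "user", "content": "When was the battle at Waterloo?"},
    {"role": "assistant", "content": "The battle at Waterloo took place on June 18, 1815."},
]

formatted = to_chatml(chat)
print(formatted)
```

When scoring with the real model, the string returned by `reward_fn.tokenizer.apply_chat_template(chat, tokenize=False)` is the safer choice, since it reflects whatever format the tokenizer was configured with.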
