Hi there,
New to TRL so would really appreciate any help!
I was hoping to use the GRPO Trainer for an RL project, but I also want to use a neural model as part of my reward function. Due to memory constraints, I'd like to reuse the same base model for both roles: inside the reward function I would attach a previously finetuned, frozen "Reward LoRA" adapter, while the GRPO Trainer itself runs with PEFT enabled so that training only updates a separate "RL LoRA" adapter. In other words, GRPO would train base model + "RL LoRA" (the only learnable params), while the reward function runs inference with base model + "Reward LoRA" (already trained, so fixed; no learning there). Is this possible?

PEFT does seem to offer some flexibility for swapping adapters, but my main concerns are memory and how to get access to the base model being finetuned inside the GRPO Trainer so I can pair it with my "Reward LoRA" for inference in the reward function.
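Here is a rough sketch of what I have in mind. The model path, adapter path, dataset, and the `score_completions` helper are placeholders, and I haven't verified that GRPOTrainer is happy with a model that already carries a second (frozen) adapter, so please treat this as an illustration of the intent rather than working code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import GRPOConfig, GRPOTrainer

base = AutoModelForCausalLM.from_pretrained("my-base-model")  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained("my-base-model")

# Trainable "RL LoRA" adapter that GRPO should update.
rl_lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
model = get_peft_model(base, rl_lora, adapter_name="rl")

# Previously finetuned, frozen "Reward LoRA" on the same base weights.
model.load_adapter("path/to/reward-lora", adapter_name="reward")  # placeholder path

def reward_fn(completions, **kwargs):
    # Temporarily switch to the frozen reward adapter for scoring,
    # then switch back so GRPO keeps training the "rl" adapter.
    model.set_adapter("reward")
    with torch.no_grad():
        scores = score_completions(model, tokenizer, completions)  # hypothetical helper
    model.set_adapter("rl")
    return scores

trainer = GRPOTrainer(
    model=model,  # or maybe pass the raw base model plus peft_config=rl_lora instead?
    reward_funcs=reward_fn,
    args=GRPOConfig(output_dir="grpo-out"),
    train_dataset=my_dataset,  # placeholder dataset
)
trainer.train()
```

The part I'm least sure about is the adapter switching inside `reward_fn`: whether it's safe to call `set_adapter` on the model the trainer is actively training, and whether I should be passing an already-wrapped PeftModel to GRPOTrainer at all versus letting it do the wrapping via `peft_config`.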
Would really appreciate any guidance! Thanks!