diff --git a/_posts/2025-04-22-lecture-22.md b/_posts/2025-04-22-lecture-22.md
new file mode 100644
index 000000000..a7ec9bf22
--- /dev/null
+++ b/_posts/2025-04-22-lecture-22.md
@@ -0,0 +1,149 @@
+---
+layout: distill
+title: Lecture 22 – Supervised Fine-Tuning of LLMs
+description: Reinforcement Learning, Parameter Fine-Tuning, Prompt Optimization
+date: 2025-04-22
+
+lecturers:
+  - name: Ben Lengerich
+    url: "https://lengerichlab.github.io/"
+
+authors:
+  - name: Arjun Ghelani # author's full name
+
+abstract: >
+  Diving into the optimization of large-scale LLMs and how to increase the efficiency of their outputs.
+---
+
+## Announcements
+
+- Project presentations: April 29 and May 1.
+- Submit peer review forms on Canvas each day to earn up to a 2% bonus.
+- Due by Friday, May 2.
+
+---
+
+## LLM Overview
+
+### GPT Training Objective: MLE
+
+- An LLM is an autoregressive generative model that predicts the likelihood of the token at the next position in a sequence:
+
+$$P_{\theta}(X) = \prod_{i} \prod_{t} P_{\theta}(X_{i,t} \mid X_{i,<t})$$
+
+### Supervised Fine-Tuning
+
+Show the language model how to appropriately respond to prompts of different types: "behavior cloning" (the output is a behavior that you want the LLM to reproduce).
+
+A smaller (1.3B-parameter) model can outperform a 175B-parameter model if it is fine-tuned appropriately.
+
+### Reinforcement Learning with Human Feedback
+
+Get **cheap, fast** human feedback with a rating system: after a response, the user indicates a "thumbs up" or "thumbs down", which provides reinforcement learning feedback for training and optimizing future responses.
+
+$r_{\theta}$: the reward model being trained, parameterized by $\theta$. The goal of the training process is to find the $\theta$ that minimizes the loss.
+
+The training data format:
+* $x$: prompt
+* $y_w$: winning response
+* $y_l$: losing response
+
+For each training sample $(x, y_w, y_l)$:
+* $s_w = r_{\theta}(x, y_w)$
+* $s_l = r_{\theta}(x, y_l)$
+* Loss value: $-\log(\sigma(s_w - s_l))$
+
+Goal: find $\theta$ to minimize the expected loss over all training samples, $-\mathbb{E}_x\left[\log(\sigma(s_w - s_l))\right]$.
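+
+To make this objective concrete, below is a minimal PyTorch-style sketch of the pairwise loss (not from the lecture slides). The function name `pairwise_reward_loss` and the commented-out `reward_model` interface are illustrative stand-ins; in practice $r_{\theta}$ is a transformer that scores a (prompt, response) pair with a scalar head.
+
+```python
+import torch
+import torch.nn.functional as F
+
+def pairwise_reward_loss(s_w: torch.Tensor, s_l: torch.Tensor) -> torch.Tensor:
+    """Pairwise reward-model loss: -log(sigmoid(s_w - s_l)), averaged over a batch.
+
+    s_w: scores r_theta(x, y_w) for the winning responses, shape (batch,)
+    s_l: scores r_theta(x, y_l) for the losing responses, shape (batch,)
+    """
+    # -log(sigmoid(z)) == softplus(-z); this form avoids taking the log of a
+    # sigmoid that can underflow to zero.
+    return F.softplus(-(s_w - s_l)).mean()
+
+# Illustrative usage with a hypothetical reward model:
+#   s_w = reward_model(x, y_w)   # shape (batch,)
+#   s_l = reward_model(x, y_l)   # shape (batch,)
+#   loss = pairwise_reward_loss(s_w, s_l)
+#   loss.backward()
+```
+
+Minimizing this loss pushes $r_{\theta}(x, y_w)$ above $r_{\theta}(x, y_l)$, so the reward model learns to rank the preferred response higher.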
+
+### Does human feedback reduce model hallucinations?
+
+**How to Fix with RL** – John Schulman, 2023:
+1. Adjust the output distribution so the model is allowed to express uncertainty, challenge the premise, and admit error (can use behavior cloning).
+2. Use RL to precisely learn the behavior boundary.
+
+In actuality, human feedback increases the hallucination rate compared to a baseline SFT model.
+
+## Efficient Parameter Fine-Tuning
+
+### Low-Rank Adaptation (LoRA)
+
+Hypothesis: the change in weights during model adaptation has a low "**intrinsic rank**".
+
+### Retrieval-Augmented Generation
+
+Resource access enables personalization.
+
+### More Efficient Personalization
+
+Learn to decompose the embeddings of responses into a personalized subspace and a universal subspace. Given a user's history, we can find where that user's queries tend to be represented in the universal subspace, and then project the response we were going to give into the personalized subspace to produce a personalized response.
+
+## Prompting
+
+### Few-Shot / Zero-Shot Learning
+
+One key emergent ability in GPT-2 is **zero-shot learning**: the ability to do many tasks with **no examples** and **no gradient updates**, by simply:
+- Specifying the right sequence prediction problem (e.g. question answering)
+- Comparing probabilities of sequences
+
+**"In-Context Learning"**
+Example of how to use few-shot prompting:
+- Translate English to French:
+- sea otter => loutre de mer
+- peppermint => menthe poivrée
+- plush girafe => girafe peluche
+- cheese => ____
+
+## Chain-of-Thought
+
+Essentially, just show your work: tell the LLM to state its steps, forcing it through a step-by-step process that helps with computation.
+
+## Reasoning Models
+
+In reasoning models, the chain-of-thought idea already happens "under the hood"; it is not necessary to prompt the model explicitly.
+
diff --git a/assets/img/notes/lecture-22/gpt_layout.png b/assets/img/notes/lecture-22/gpt_layout.png
new file mode 100644
index 000000000..febe2b565
Binary files /dev/null and b/assets/img/notes/lecture-22/gpt_layout.png differ
diff --git a/assets/img/notes/lecture-22/human_feedback.png b/assets/img/notes/lecture-22/human_feedback.png
new file mode 100644
index 000000000..f8cafe7fa
Binary files /dev/null and b/assets/img/notes/lecture-22/human_feedback.png differ
diff --git a/assets/img/notes/lecture-22/lora.png b/assets/img/notes/lecture-22/lora.png
new file mode 100644
index 000000000..a4414c007
Binary files /dev/null and b/assets/img/notes/lecture-22/lora.png differ
diff --git a/assets/img/notes/lecture-22/personalized.png b/assets/img/notes/lecture-22/personalized.png
new file mode 100644
index 000000000..cfd22d509
Binary files /dev/null and b/assets/img/notes/lecture-22/personalized.png differ
diff --git a/assets/img/notes/lecture-22/rag.png b/assets/img/notes/lecture-22/rag.png
new file mode 100644
index 000000000..11054a086
Binary files /dev/null and b/assets/img/notes/lecture-22/rag.png differ
diff --git a/assets/img/notes/lecture-22/reasoning.png b/assets/img/notes/lecture-22/reasoning.png
new file mode 100644
index 000000000..527f10ba3
Binary files /dev/null and b/assets/img/notes/lecture-22/reasoning.png differ
diff --git a/assets/img/notes/lecture-22/sft.png b/assets/img/notes/lecture-22/sft.png
new file mode 100644
index 000000000..7ec63913d
Binary files /dev/null and b/assets/img/notes/lecture-22/sft.png differ
diff --git a/assets/img/notes/lecture-22/unsuper_to_super.png b/assets/img/notes/lecture-22/unsuper_to_super.png
new file mode 100644
index 000000000..a8371f5aa
Binary files /dev/null and b/assets/img/notes/lecture-22/unsuper_to_super.png differ