149 changes: 149 additions & 0 deletions _posts/2025-04-22-lecture-22.md
@@ -0,0 +1,149 @@
---
layout: distill
title: Lecture 22 – Supervised Fine-Tuning of LLMs
description: Reinforcement Learning, Parameter Fine-Tuning, Prompt Optimization
date: 2025-04-22

lecturers:
- name: Ben Lengerich
url: "https://lengerichlab.github.io/"

authors:
- name: Arjun Ghelani # author's full name

abstract: >
Diving into how large-scale LLMs are optimized after pretraining and how to make fine-tuning and output generation more efficient
---

## Announcements

- Project presentations: April 29 and May 1.
- Submit peer review forms on Canvas each day to earn up to 2% bonus.
- Due by: Friday, May 2.

---

## LLM Overview

### GPT Training Objective: MLE

- An LLM is an autoregressive generative model that predicts the likelihood of the token at each position given the tokens that precede it
<img src="{{ 'assets/img/notes/lecture-22/gpt_layout.png' | relative_url }}" />

$$P_{\theta}(X) = \prod_{i} \prod_{t} P_{\theta}(X_{i,t} \mid X_{i,<t})$$

- **Probabilistic objective:** Maximize the log-likelihood of the observed sequences

$$\max_{\theta} \sum_{i} \sum_{t} \log P_{\theta}(X_{i,t} \mid X_{i,<t})$$
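
A concrete (toy) illustration of this objective: the next-token negative log-likelihood is a cross-entropy loss over shifted sequences. The shapes and random tensors below are placeholders standing in for a real model and tokenized corpus.

```python
import torch
import torch.nn.functional as F

# Toy shapes standing in for a real model: (batch, seq_len, vocab_size).
batch, seq_len, vocab = 2, 8, 100
tokens = torch.randint(0, vocab, (batch, seq_len))                # observed sequences X_i
logits = torch.randn(batch, seq_len, vocab, requires_grad=True)   # stand-in for model outputs

# Position t predicts token t+1, so the loss compares logits[:, :-1] with tokens[:, 1:].
# Minimizing this cross-entropy is the same as maximizing sum_t log P(X_t | X_<t).
nll = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),
    tokens[:, 1:].reshape(-1),
)
nll.backward()
```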

### What does MLE not do?
- No **task goals**
- No **explicit reward**
- No utility
- Dataset selection drives everything

- **Key question:** Can we fine-tune our model to be **useful** after unsupervised learning of $P(X)$?

### From Unsupervised to Supervised

<img src="{{ 'assets/img/notes/lecture-22/unsuper_to_super.png' | relative_url }}" />

### Supervised Fine-Tuning
Show the language model how to appropriately respond to prompts of different types.
This is sometimes called "behavior cloning": the target output is a behavior that you want the LLM to reproduce.

<img src="{{ 'assets/img/notes/lecture-22/sft.png' | relative_url }}" />

A smaller (1.3B-parameter) model can outperform a 175B model if it is fine-tuned properly.

### Reinforcement Learning from Human Feedback (RLHF)

<img src="{{ 'assets/img/notes/lecture-22/human_feedback.png' | relative_url }}" />

Get **cheap, fast** human feedback with a rating system: after a response, the user indicates a "thumbs up" or "thumbs down", providing a reinforcement learning signal that can be used to train the model and optimize future responses.

$r_{\theta}$: the reward model being trained, parameterized by $\theta$. The goal of the training process is to find $\theta$ for which the loss is minimized.

The training data format:
* $x$: prompt
* $y_w$: winning response
* $y_l$: losing response

For each training sample ($x$, $y_w$, $y_l$):
* $s_w$ = $r_{\theta}(x, y_w)$
* $s_l$ = $r_{\theta}(x, y_l)$
* Loss value: $-\log(\sigma(s_w - s_l))$

Goal: find $\theta$ to minimize the expected loss for all training samples: $-\mathbb{E}_x[\log(\sigma(s_w - s_l))]$
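
A minimal sketch of this pairwise loss in PyTorch; the scores below are made-up placeholders, whereas in practice $s_w$ and $s_l$ come from the reward model $r_{\theta}$.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(s_w: torch.Tensor, s_l: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss -log(sigmoid(s_w - s_l)), averaged over the batch.

    s_w and s_l are the scalar scores r_theta(x, y_w) and r_theta(x, y_l) for the
    winning and losing responses to the same prompt x.
    """
    return -F.logsigmoid(s_w - s_l).mean()

# Placeholder scores for a batch of 4 (x, y_w, y_l) triples.
s_w = torch.tensor([1.2, 0.3, 2.0, -0.5])
s_l = torch.tensor([0.1, 0.5, 1.0, -1.5])
print(reward_model_loss(s_w, s_l))  # small when s_w > s_l, large otherwise
```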

### Does human feedback reduce model hallucinations?

**How to Fix with RL** – John Schulman 2023
1. Adjust the output distribution so the model is allowed to express uncertainty, challenge the premise, and admit error (can use behavior cloning)
2. Use RL to precisely learn the behavior boundary

In practice, human feedback actually increases the hallucination rate compared to a baseline SFT model.

## Efficient Parameter Fine-Tuning

### Low-Rank Adaptation (LoRA)

Hypothesis: The change in weights during model adaptation has a low "**intrinsic rank**"

<img src="{{ 'assets/img/notes/lecture-22/lora.png' | relative_url }}" />
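
A minimal sketch of the LoRA idea on a single linear layer; the rank, scaling, and initialization below follow common practice but are illustrative assumptions rather than the lecture's exact setup.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))  # only A and B receive gradients during fine-tuning
```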

### Retrieval-Augmented Generation (RAG)

Giving the model access to external resources (documents, user data) at inference time enables personalization without changing the model's weights.

<img src="{{ 'assets/img/notes/lecture-22/rag.png' | relative_url }}" />
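
A minimal sketch of the retrieval-augmented generation loop; the documents, the toy lexical `score` function, and the prompt template are illustrative assumptions (a real system would retrieve with dense embeddings and a vector index).

```python
documents = [
    "The user's calendar shows a dentist appointment on Friday.",
    "The user prefers vegetarian restaurants.",
    "Company policy: expense reports are due by the 5th of each month.",
]

def score(query: str, doc: str) -> float:
    """Toy lexical-overlap relevance score (stand-in for embedding similarity)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

query = "When is my dentist appointment?"
top_doc = max(documents, key=lambda d: score(query, d))

# The retrieved context is prepended to the prompt before the LLM generates an answer.
prompt = f"Context: {top_doc}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```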

### More Efficient Personalization

Learn to break down response embeddings into a personalized subspace and a universal subspace.
Given a user's history, we can find where their queries tend to be represented in the universal subspace, and then project the response we were going to give into the personalized subspace to produce a personalized response.

<img src="{{ 'assets/img/notes/lecture-22/personalized.png' | relative_url }}" />
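
A rough sketch of the projection step, assuming we have already learned orthonormal bases `U_univ` and `U_pers` for the universal and personalized subspaces; both matrices and the dimensions below are placeholders, not the lecture's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 8  # embedding dimension and subspace rank (placeholders)

# Stand-ins for learned orthonormal bases; columns span each subspace.
U_univ, _ = np.linalg.qr(rng.normal(size=(d, k)))   # universal subspace
U_pers, _ = np.linalg.qr(rng.normal(size=(d, k)))   # this user's personalized subspace

response_embedding = rng.normal(size=d)             # embedding of the generic response

# Locate the response in the universal subspace, then project it into the
# personalized subspace to produce a user-specific representation.
universal_coords = U_univ.T @ response_embedding
personalized_embedding = U_pers @ (U_pers.T @ response_embedding)
```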

## Prompting

### Few-Shot / Zero-Shot Learning

One key emergent ability in GPT-2 is **zero-shot learning**: the ability to do many tasks with **no examples**, and **no gradient updates**, by simply:
- Specifying the right sequence prediction problem (e.g. question answering)
- Comparing probabilities of sequences

**"In-Context Learning"**
Example of a few-shot prompt (a code sketch for scoring candidate completions follows the list):
- Translate English to French:
- sea otter => loutre de mer
- peppermint => menthe poivrée
- plush girafe => girafe peluche
- cheese => ____
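
A runnable sketch of in-context learning by comparing sequence probabilities, using GPT-2 via Hugging Face `transformers` as an illustrative stand-in; the candidate completions are made up for the example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_logprob(text: str) -> float:
    """Sum of log P(token_t | tokens_<t) over the whole sequence."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                     # (1, T, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = ids[:, 1:]
    return logprobs.gather(-1, target.unsqueeze(-1)).sum().item()

prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush girafe => girafe peluche\n"
    "cheese => "
)

# "Classify" by comparing the probabilities of candidate continuations.
candidates = ["fromage", "chat"]
scores = {c: sequence_logprob(prompt + c) for c in candidates}
print(max(scores, key=scores.get))
```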

## Chain-of-Thought

Essentially, ask the model to show its work: prompt the LLM to state its steps, forcing it to produce a step-by-step process that helps with multi-step computation.
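
A minimal illustration of a chain-of-thought prompt; the question and the exact wording are illustrative, not taken from the lecture.

```python
# Chain-of-thought prompting: the instruction asks the model to write out
# intermediate steps before giving the final answer.
cot_prompt = (
    "Q: A cafeteria had 23 apples. It used 20 to make lunch and then bought 6 more. "
    "How many apples does it have now?\n"
    "A: Let's think step by step."
)
# An expected completion walks through the steps, e.g.:
# "23 - 20 = 3 apples remain; 3 + 6 = 9 apples. The answer is 9."
print(cot_prompt)
```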

## Reasoning Models

In reasoning models, the chain-of-thought idea already happens "under the hood", so it is not necessary to explicitly prompt the model for step-by-step reasoning.

<img src="{{ 'assets/img/notes/lecture-22/reasoning.png' | relative_url }}" />

Binary file added assets/img/notes/lecture-22/gpt_layout.png
Binary file added assets/img/notes/lecture-22/human_feedback.png
Binary file added assets/img/notes/lecture-22/lora.png
Binary file added assets/img/notes/lecture-22/personalized.png
Binary file added assets/img/notes/lecture-22/rag.png
Binary file added assets/img/notes/lecture-22/reasoning.png
Binary file added assets/img/notes/lecture-22/sft.png
Binary file added assets/img/notes/lecture-22/unsuper_to_super.png