149 changes: 149 additions & 0 deletions _posts/2025-04-22-lecture-22.md
@@ -0,0 +1,149 @@
---
layout: distill
title: Lecture 22 – Supervised Fine-Tuning of LLMs
description: Reinforcement Learning, Parameter Fine-Tuning, Prompt Optimization
date: 2025-04-22

lecturers:
- name: Ben Lengerich
url: "https://lengerichlab.github.io/"

authors:
- name: Arjun Ghelani # author's full name

abstract: >
Diving into how large-scale LLMs are optimized after pretraining and how to make fine-tuning and output generation more efficient
---

## Announcements

- Project presentations: April 29 and May 1.
- Submit peer review forms on Canvas each day to earn up to 2% bonus.
- Due by: Friday, May 2.

---

## LLM Overview

### GPT Training Objective: MLE

- An LLM is an autoregressive generative model that predicts the likelihood of the token at each position given the tokens that precede it
<img src="{{ 'assets/img/notes/lecture-22/gpt_layout.png' | relative_url }}" />

$$P_{\theta}(X) = \prod_{i} \prod_{t} P_{\theta}(X_{i,t} \mid X_{i,<t})$$

- **Probabilistic objective:** Maximize the log-likelihood of the observed sequences

$$\max_{\theta} \sum_{i} \sum_{t} \log P_{\theta}(X_{i,t} \mid X_{i,<t})$$
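
A concrete (toy) illustration of this objective: the next-token negative log-likelihood is a cross-entropy loss over shifted sequences. The shapes and random tensors below are placeholders standing in for a real model and tokenized corpus.

```python
import torch
import torch.nn.functional as F

# Toy shapes standing in for a real model: (batch, seq_len, vocab_size).
batch, seq_len, vocab = 2, 8, 100
tokens = torch.randint(0, vocab, (batch, seq_len))                # observed sequences X_i
logits = torch.randn(batch, seq_len, vocab, requires_grad=True)   # stand-in for model outputs

# Position t predicts token t+1, so the loss compares logits[:, :-1] with tokens[:, 1:].
# Minimizing this cross-entropy is the same as maximizing sum_t log P(X_t | X_<t).
nll = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),
    tokens[:, 1:].reshape(-1),
)
nll.backward()
```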

### What does MLE not do?
- No **task goals**
- No **explicit reward**
- No utility
- Dataset selection drives everything

- **Key question:** Can we fine-tune our model to be **useful** after unsupervised learning of $P(X)$?

### From Unsupervised to Supervised

<img src="{{ 'assets/img/notes/lecture-22/unsuper_to_super.png' | relative_url }}" />

### Supervised Fine-Tuning
Show the language model how to appropriately respond to prompts of different types.
This is sometimes called "behavior cloning": the target output is a behavior that you want the LLM to reproduce.

<img src="{{ 'assets/img/notes/lecture-22/sft.png' | relative_url }}" />

A smaller (1.3B-parameter) model can outperform a 175B model if it is fine-tuned properly.

### Reinforcement Learning from Human Feedback (RLHF)

<img src="{{ 'assets/img/notes/lecture-22/human_feedback.png' | relative_url }}" />

Get **cheap, fast** human feedback with a rating system: after a response, the user indicates a "thumbs up" or "thumbs down", providing a reinforcement learning signal that can be used to train the model and optimize future responses.

$r_{\theta}$: the reward model being trained, parameterized by $\theta$. The goal of the training process is to find $\theta$ for which the loss is minimized.

The training data format:
* $x$: prompt
* $y_w$: winning response
* $y_l$: losing response

For each training sample ($x$, $y_w$, $y_l$):
* $s_w$ = $r_{\theta}(x, y_w)$
* $s_l$ = $r_{\theta}(x, y_l)$
* Loss value: $-\log(\sigma(s_w - s_l))$

Goal: find $\theta$ to minimize the expected loss for all training samples: $-\mathbb{E}_x[\log(\sigma(s_w - s_l))]$
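
A minimal sketch of this pairwise loss in PyTorch; the scores below are made-up placeholders, whereas in practice $s_w$ and $s_l$ come from the reward model $r_{\theta}$.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(s_w: torch.Tensor, s_l: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss -log(sigmoid(s_w - s_l)), averaged over the batch.

    s_w and s_l are the scalar scores r_theta(x, y_w) and r_theta(x, y_l) for the
    winning and losing responses to the same prompt x.
    """
    return -F.logsigmoid(s_w - s_l).mean()

# Placeholder scores for a batch of 4 (x, y_w, y_l) triples.
s_w = torch.tensor([1.2, 0.3, 2.0, -0.5])
s_l = torch.tensor([0.1, 0.5, 1.0, -1.5])
print(reward_model_loss(s_w, s_l))  # small when s_w > s_l, large otherwise
```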

### Does human feedback reduce model hallucinations?

**How to Fix with RL** – John Schulman 2023
1. Adjust the output distribution so the model is allowed to express uncertainty, challenge the premise, and admit error (can use behavior cloning)
2. Use RL to precisely learn the behavior boundary

In practice, human feedback actually increases the hallucination rate compared to a baseline SFT model.

## Efficient Parameter Fine-Tuning

### Low-Rank Adaptation (LoRA)

Hypothesis: The change in weights during model adaptation has a low "**intrinsic rank**"

<img src="{{ 'assets/img/notes/lecture-22/lora.png' | relative_url }}" />
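
A minimal sketch of the LoRA idea on a single linear layer; the rank, scaling, and initialization below follow common practice but are illustrative assumptions rather than the lecture's exact setup.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))  # only A and B receive gradients during fine-tuning
```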

### Retrieval-Augmented Generation (RAG)

Giving the model access to external resources (documents, user data) at inference time enables personalization without changing the model's weights.

<img src="{{ 'assets/img/notes/lecture-22/rag.png' | relative_url }}" />
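
A minimal sketch of the retrieval-augmented generation loop; the documents, the toy lexical `score` function, and the prompt template are illustrative assumptions (a real system would retrieve with dense embeddings and a vector index).

```python
documents = [
    "The user's calendar shows a dentist appointment on Friday.",
    "The user prefers vegetarian restaurants.",
    "Company policy: expense reports are due by the 5th of each month.",
]

def score(query: str, doc: str) -> float:
    """Toy lexical-overlap relevance score (stand-in for embedding similarity)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

query = "When is my dentist appointment?"
top_doc = max(documents, key=lambda d: score(query, d))

# The retrieved context is prepended to the prompt before the LLM generates an answer.
prompt = f"Context: {top_doc}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```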

### More Efficient Personalization

Learn to break down response embeddings into a personalized subspace and a universal subspace.
Given a user's history, we can find where their queries tend to be represented in the universal subspace, and then project the response we were going to give into the personalized subspace to produce a personalized response.

<img src="{{ 'assets/img/notes/lecture-22/personalized.png' | relative_url }}" />
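
A rough sketch of the projection step, assuming we have already learned orthonormal bases `U_univ` and `U_pers` for the universal and personalized subspaces; both matrices and the dimensions below are placeholders, not the lecture's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 8  # embedding dimension and subspace rank (placeholders)

# Stand-ins for learned orthonormal bases; columns span each subspace.
U_univ, _ = np.linalg.qr(rng.normal(size=(d, k)))   # universal subspace
U_pers, _ = np.linalg.qr(rng.normal(size=(d, k)))   # this user's personalized subspace

response_embedding = rng.normal(size=d)             # embedding of the generic response

# Locate the response in the universal subspace, then project it into the
# personalized subspace to produce a user-specific representation.
universal_coords = U_univ.T @ response_embedding
personalized_embedding = U_pers @ (U_pers.T @ response_embedding)
```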

## Prompting

### Few-Shot / Zero-Shot Learning

One key emergent ability in GPT-2 is **zero-shot learning**: the ability to do many tasks with **no examples**, and **no gradient updates**, by simply:
- Specifying the right sequence prediction problem (e.g. question answering)
- Comparing probabilities of sequences

**"In-Context Learning"**
Example of a few-shot prompt (a code sketch for scoring candidate completions follows the list):
- Translate English to French:
- sea otter => loutre de mer
- peppermint => menthe poivrée
- plush girafe => girafe peluche
- cheese => ____
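
A runnable sketch of in-context learning by comparing sequence probabilities, using GPT-2 via Hugging Face `transformers` as an illustrative stand-in; the candidate completions are made up for the example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_logprob(text: str) -> float:
    """Sum of log P(token_t | tokens_<t) over the whole sequence."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                     # (1, T, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = ids[:, 1:]
    return logprobs.gather(-1, target.unsqueeze(-1)).sum().item()

prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush girafe => girafe peluche\n"
    "cheese => "
)

# "Classify" by comparing the probabilities of candidate continuations.
candidates = ["fromage", "chat"]
scores = {c: sequence_logprob(prompt + c) for c in candidates}
print(max(scores, key=scores.get))
```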

## Chain-of-Thought

Essentially, ask the model to show its work: prompt the LLM to state its steps, forcing it to produce a step-by-step process that helps with multi-step computation.
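
A minimal illustration of a chain-of-thought prompt; the question and the exact wording are illustrative, not taken from the lecture.

```python
# Chain-of-thought prompting: the instruction asks the model to write out
# intermediate steps before giving the final answer.
cot_prompt = (
    "Q: A cafeteria had 23 apples. It used 20 to make lunch and then bought 6 more. "
    "How many apples does it have now?\n"
    "A: Let's think step by step."
)
# An expected completion walks through the steps, e.g.:
# "23 - 20 = 3 apples remain; 3 + 6 = 9 apples. The answer is 9."
print(cot_prompt)
```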

## Reasoning Models

In reasoning models, the chain-of-thought idea already happens "under the hood", so it is not necessary to explicitly prompt the model for step-by-step reasoning.

<img src="{{ 'assets/img/notes/lecture-22/reasoning.png' | relative_url }}" />

Binary file added assets/img/notes/lecture-22/gpt_layout.png
Binary file added assets/img/notes/lecture-22/human_feedback.png
Binary file added assets/img/notes/lecture-22/lora.png
Binary file added assets/img/notes/lecture-22/personalized.png
Binary file added assets/img/notes/lecture-22/rag.png
Binary file added assets/img/notes/lecture-22/reasoning.png
Binary file added assets/img/notes/lecture-22/sft.png
Binary file added assets/img/notes/lecture-22/unsuper_to_super.png