diff --git a/authors.yaml b/authors.yaml index 65df76987d..a4bc501da1 100644 --- a/authors.yaml +++ b/authors.yaml @@ -370,4 +370,9 @@ joshbickett: lupie: name: "Lucie Lozinski" website: "https://twitter.com/thisloops" - avatar: "https://avatars.githubusercontent.com/u/6293148" \ No newline at end of file + avatar: "https://avatars.githubusercontent.com/u/6293148" + +alexl-oai: + name: "Alex Lowden" + website: "https://www.linkedin.com/in/alex-lowden01/" + avatar: "https://avatars.githubusercontent.com/u/215167546" \ No newline at end of file diff --git a/examples/Fine_tuning_direct_preference_optimization_guide.ipynb b/examples/Fine_tuning_direct_preference_optimization_guide.ipynb new file mode 100644 index 0000000000..0f3e759bcb --- /dev/null +++ b/examples/Fine_tuning_direct_preference_optimization_guide.ipynb @@ -0,0 +1,698 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Fine-Tuning Techniques: Choosing Between SFT, DPO, and RFT (Including a Guide to DPO)\n", + " \n", + "*This guide is for developers and ML practitioners who have some experience with OpenAIʼs APIs and wish to use their fine-tuned models for research or other appropriate uses. OpenAI’s services are not intended for the personalized treatment or diagnosis of any medical condition and are subject to our [applicable terms](https://openai.com/policies/).*\n", + " \n", + "This guide discusses fine-tuning methods supported by OpenAI, specifically highlighting what each method is best for and not best for, to help you identify the most suitable technique for your use case. It then provides an in-depth look at one particular method — Direct Preference Optimization (DPO) — and provides links to existing guides for the other techniques.\n", + " \n", + "**What is fine-tuning?** Fine-tuning is the process of continuing training on a smaller, domain-specific dataset to optimize a model for a specific task. There are two main reasons why we would typically fine-tune:\n", + "1. Improve model performance on a specific task \n", + "2. Improve model efficiency (reduce the number of tokens needed, distill expertise into a smaller model, etc.)\n", + " \n", + "Currently, the OpenAI platform supports four fine-tuning methods:\n", + "- **Supervised fine-tuning (SFT):** this technique employs traditional supervised learning using input-output pairs to adjust model parameters. The training process adjusts model weights to minimize the difference between predicted and target outputs across the provided examples. The model will replicate features that it finds in provided pairs. \n", + "- **Vision fine-tuning:** this technique extends supervised fine-tuning to multimodal data by processing both text and image in a unified training framework. The training process adjusts model weights to minimize errors across text-image pairs and as a result improve the model's understanding of image inputs. \n", + "- **Direct preference optimization (DPO):** this technique uses pairwise comparisons (e.g., preferred and rejected example responses) to optimize a model to favor certain outputs over others. The model learns to replicate the preference patterns found in the provided comparison data. \n", + "- **Reinforcement fine-tuning (RFT):** this technique uses reinforcement learning with a reward signal (via a grader or reward model) to fine-tune the model for complex objectives. In RFT, the model generates outputs for given prompts during training, and each output is evaluated for quality. 
The model's parameters are then updated to maximize the reward, reinforcing behaviors that lead to better outcomes. This iterative feedback loop encourages the model to improve reasoning or decision-making strategies. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To help you select the appropriate fine-tuning technique, the table below summarizes the scenarios each method is best suited for, as well as those for which it is not well suited:\n", + "\n", + "| **Technique** | **Good For** | **Not Good For** |\n", + "| ---------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- |\n", + "| **Supervised fine-tuning (SFT)** | Emphasizing knowledge already present in the model.
Customizing response structure or tone.
Generating content in a specific format.
Teaching complex instructions or correcting instruction-following failures.
Optimizing cost/latency (reducing prompt tokens or distilling expertise into a smaller model). | Adding entirely new knowledge (consider RAG instead).
Tasks with subjective quality. |\n", + "| **Vision fine-tuning** | Specialized visual recognition tasks (e.g., image classification).
Domain-specific image understanding.
Correcting failures in instruction following for complex prompts. | Purely textual tasks.
Generalized visual tasks without specific context.
General image understanding. |\n", + "| **Direct preference optimization (DPO)** | Aligning model outputs with subjective preferences (tone, politeness).
Refining outputs via human-rated feedback.
Achieving nuanced behavioral alignment. | Learning completely new tasks.
Tasks without clear human preference signals. |\n", + "| **Reinforcement fine-tuning (RFT)** | Complex domain-specific tasks that require advanced reasoning.
Refining existing partial capabilities (fostering emergent behaviors).
Tasks with measurable feedback.
Scenarios with limited explicit labels where reward signals can be defined. | Tasks where the model has no initial skill.
Tasks without clear feedback or measurable signals. |\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Today, there are pre-existing Cookbooks for: \n", + "\n", + "- Supervised fine-tuning (SFT): (1) [How to fine-tune chat models](https://cookbook.openai.com/examples/how_to_finetune_chat_models) (2) [Leveraging model distillation to fine-tune a model](https://cookbook.openai.com/examples/leveraging_model_distillation_to_fine-tune_a_model)\n", + "- Vision fine-tuning: [Vision fine-tuning on GPT-4o for visual question answering](https://cookbook.openai.com/examples/multimodal/vision_fine_tuning_on_gpt4o_for_visual_question_answering)\n", + "- Reinforcement fine-tuning (RFT): (1) [Reinforcement fine-tuning (RFT)](https://cookbook.openai.com/examples/reinforcement_fine_tuning), (2) [Reinforcement fine-tuning for healthbench QA](https://cookbook.openai.com/examples/fine-tuned_qa/reinforcement_finetuning_healthbench)\n", + "\n", + "Direct preference optimization (DPO) will be covered in this guide." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## **Guide to Direct Preference Optimization**\n", + " \n", + "As mentioned above, [Direct Preference Optimization (DPO)](https://platform.openai.com/docs/guides/direct-preference-optimization) is an alignment technique for fine-tuning language models using pairwise preference data (e.g., ranked pairs of responses). DPO directly optimizes a model to favor certain outputs over others using explicit pairwise comparisons, typically from human preferences. This approach simplifies alignment and eliminates the need for a separate reward model or complex reinforcement learning procedures, making DPO a lightweight alternative to techniques such as Reinforcement Learning from Human Feedback (RLHF).\n", + " \n", + "When should you use DPO? DPO excels in scenarios when response quality is subjective, cannot be measured objectively, or when nuanced criteria such as tone, style, appropriateness, or clarity matter - typically cases where multiple valid outputs exist. Example applications where DPO is particularly effective in aligning AI responses include: \n", + "- Enhancing Conversational AI Responses\n", + "- Improving Code Generation Quality & Style\n", + "- Ensuring Compliance with Legal, Ethical & Safety Standards \n", + "- Controlling Brand Voice, Professionalism, & Tone\n", + "- Customizing Creative Outputs & User Experience\n", + "\n", + "By fine-tuning on explicit pairs of preferred vs non‑preferred completions, DPO aligns model outputs to these nuanced preferences. The below table gives examples of pairwise preference data for a fictional AI assistant that represents an organization, where preferred responses are clear, professional, and aligned with brand standards.\n", + " \n", + "| **Example Question** | **Chosen Response** | **Rejected Response** |\n", + "|------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|\n", + "| **Q1:** *How do I review your product?* | To submit a product review, please visit your account dashboard, select the product, and click ‘Write a review.’ Share your honest experience, rate key features, and submit when ready. | Yo, just leave some quick stars or whatever, it’s chill! 
|\n", + "| **Q2:** *How do I review your product?* | We welcome your feedback! In the ‘Reviews’ section on the product page, click ‘Leave a Review,’ rate it, and add your comments about what you liked or areas for improvement. | Just scribble something—doesn’t matter what, honestly. |\n", + "| **Q3:** *How to troubleshoot this particular error?* | To address the error ‘X101,’ first clear your cache, then verify your internet connection. If the issue remains, follow our step-by-step guide at [Support → Troubleshooting → Error X101]. | Just reboot it, I guess. If it doesn't work, you're on your own! |\n", + " \n", + "In this guide, weʼll walk through how to apply DPO using the fine-tuning API. You will learn key steps to take in order to successfully run preference fine-tuning jobs for your use-cases.\n", + " \n", + "Here’s what we’ll cover:\n", + " \n", + "- **1. Recommended Workflow**\n", + "- **2. Demonstration Scenario**\n", + "- **3. Generating the Dataset**\n", + "- **4. Benchmarking the Base Model**\n", + "- **5. Fine-Tuning**\n", + "- **6. Using your Fine-Tuned Model**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## **1. Recommended Workflow**\n", + " \n", + "OpenAI recommends the following workflow: \n", + "1. Performing Supervised Fine-Tuning (SFT) on a subset of your preferred responses. \n", + "2. Using the SFT fine-tuned model as the starting point, apply DPO using preference comparison data. \n", + " \n", + "Performing Supervised Fine-Tuning (SFT) before Direct Preference Optimization (DPO) enhances model alignment and overall performance by establishing a robust initial policy, ensuring the model already prefers correct responses. This reduces the magnitude of weight updates during DPO, stabilizing training and preventing overfitting by allowing DPO to efficiently refine subtle nuances. Consequently, the combined SFT-then-DPO workflow converges faster and yields higher-quality results.\n", + "\n", + "In this guide, we'll focus exclusively on applying Direct Preference Optimization (DPO). However, depending on your use case, you may find performance gains from first performing Supervised Fine-Tuning (SFT). If so, you can follow the SFT guide linked above, save the resulting model ID, and use that as the starting point for your DPO job." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## **2. Demonstration Scenario**\n", + "\n", + "To make things concrete, let’s walk through fine-tuning a customer-facing AI assistant to follow a fictional brand’s voice and style. Imagine Good Vibes Corp, an organization that prides itself on a friendly, enthusiastic tone with a personal touch. \n", + " \n", + "They want their customer AI assistant to answer queries in a way that reflects these brand guidelines (e.g. an upbeat attitude, polite language, and a friendly sign-off), and prefer those responses over more generic or curt answers. This is a good scenario for DPO: there’s no objectively correct answer format, but there is a preferred style.\n", + " \n", + "DPO will help the model learn from comparisons which style is preferred. We'll outline the steps to: (1) generate a synthetic preference dataset of prompts with paired responses (one in the desired brand voice and one not). (2) Evaluate base model performance using the OpenAI evals API. (3) Prepare and upload the data in the required JSONL format for preference fine-tuning. (4) Fine-tune the model with DPO using the OpenAI fine-tuning API. 
(5) Evaluate the fine-tuned model using the OpenAI evals API to show how the brand-style preference improved.\n", + "\n", + "We are going to synthesize a dataset for this demonstration. First, let’s create a seed bank of questions to generate more variations from.\n", + "\n", + "Let’s get started!" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "! pip install openai nest-asyncio --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "PROMPT_SEED_POOL = [\n", + " \"Hi, I ordered a gadget last week. When will it arrive?\",\n", + " \"Your product stopped working after two days. Can I get help?\",\n", + " \"Do you offer discounts for long-term customers?\",\n", + " \"Can I change the shipping address for my order?\",\n", + " \"What is your return policy for damaged items?\",\n", + " \"My tracking number hasn't updated in three days—can you check the status?\",\n", + " \"How long is the warranty on your products, and how do I submit a claim?\",\n", + " \"Can I add gift wrapping to my order before it ships?\",\n", + " \"Do you accept PayPal or other alternative payment methods?\",\n", + " \"Is there an option to expedite shipping if my order hasn't left the warehouse yet?\",\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## **3. Generating the Dataset**\n", + "\n", + "Next, we’ll define functions to take each prompt from our seed bank and generate related questions. We’ll create a dataset of preference pairs by first generating these prompt variations, then producing both a preferred and a rejected response for every prompt. \n", + "\n", + "This dataset is synthetic and serves to illustrate the mechanics of Direct Preference Optimization — when developing your own application you should collect or curate a high-quality, preference dataset. Note: the volume of data required for DPO depends on the use case; generally more is better (thousands to tens of thousands), and for preference pairs the ordering logic should be consistent (e.g. if A > B and B > C, then A > C)." 
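For reference, each record we build below (and later upload as a JSONL file, one record per line) uses the pairwise preference format expected by the fine-tuning API. A single record, pretty-printed here with illustrative placeholder text, looks like this:

```json
{
  "input": {
    "messages": [
      {"role": "system", "content": "You are a customer-support assistant."},
      {"role": "user", "content": "Hi, I ordered a gadget last week. When will it arrive?"}
    ]
  },
  "preferred_output": [
    {"role": "assistant", "content": "An upbeat, on-brand reply goes here."}
  ],
  "non_preferred_output": [
    {"role": "assistant", "content": "A terse, off-brand reply goes here."}
  ]
}
```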
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "from openai import AsyncOpenAI\n", + "from typing import List, Dict, Any\n", + "\n", + "async_client = AsyncOpenAI()\n", + "\n", + "SYSTEM_PROMPT = \"You are a customer-support assistant.\"\n", + "\n", + "\n", + "async def _generate_related_questions_from_prompt(\n", + " prompt: str, k: int, sem: asyncio.Semaphore, *, model: str\n", + ") -> List[str]:\n", + " \"\"\"Return *k* distinct customer-service questions related to the given prompt.\"\"\"\n", + " out: List[str] = []\n", + " async with sem:\n", + " for _ in range(k):\n", + " resp = await async_client.responses.create(\n", + " model=model,\n", + " input=[\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": (\n", + " \"Return ONE distinct, realistic customer-service question \"\n", + " \"related in topic or theme to the following question, \"\n", + " \"but NOT a direct paraphrase.\"\n", + " ),\n", + " },\n", + " {\"role\": \"user\", \"content\": prompt},\n", + " ],\n", + " temperature=0.9,\n", + " max_output_tokens=60,\n", + " )\n", + " out.append(resp.output_text.strip())\n", + " return out\n", + "\n", + "\n", + "async def expand_prompt_pool(\n", + " prompts: List[str], *, k: int = 3, concurrency: int = 32, model: str\n", + ") -> List[str]:\n", + " \"\"\"Expand each prompt into *k* related questions using the given model.\"\"\"\n", + " sem = asyncio.Semaphore(concurrency)\n", + " tasks = [\n", + " _generate_related_questions_from_prompt(p, k, sem, model=model) for p in prompts\n", + " ]\n", + " results = await asyncio.gather(*tasks)\n", + " return [v for sub in results for v in sub]\n", + "\n", + "\n", + "async def _generate_preference_pair(\n", + " prompt: str, sem: asyncio.Semaphore, *, model: str\n", + ") -> Dict[str, Any]:\n", + " \"\"\"Generate a preference pair for the given prompt.\"\"\"\n", + " async with sem:\n", + " friendly_task = async_client.responses.create(\n", + " model=model,\n", + " input=[\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": (\n", + " \"You are Good Vibes Corp's exceptionally energetic, outrageously friendly and \"\n", + " \"enthusiastic support agent.\"\n", + " ),\n", + " },\n", + " {\"role\": \"user\", \"content\": prompt},\n", + " ],\n", + " temperature=0.7, # higher temperature to increase creativity & on-brand tone adherence\n", + " max_output_tokens=80,\n", + " )\n", + " blunt_task = async_client.responses.create(\n", + " model=model,\n", + " input=[\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a terse, factual support agent with no empathy or politeness.\",\n", + " },\n", + " {\"role\": \"user\", \"content\": prompt},\n", + " ],\n", + " temperature=0.3, # lower temperature to limit creativity & emphasize tonal difference\n", + " max_output_tokens=80,\n", + " )\n", + " friendly, blunt = await asyncio.gather(friendly_task, blunt_task)\n", + " return {\n", + " \"input\": {\n", + " \"messages\": [\n", + " {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n", + " {\"role\": \"user\", \"content\": prompt},\n", + " ]\n", + " },\n", + " \"preferred_output\": [\n", + " {\"role\": \"assistant\", \"content\": friendly.output_text}\n", + " ],\n", + " \"non_preferred_output\": [\n", + " {\"role\": \"assistant\", \"content\": blunt.output_text}\n", + " ],\n", + " }" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, using these defined functions we'll build our dataset by generating friendly versus 
blunt response pairs. The friendly responses reflect the brand's desired communication style. We'll do this asynchronously for efficiency, creating a dataset suited for Direct Preference Optimization." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Dataset ready with 500 pairs.\n" + ] + } + ], + "source": [ + "import math\n", + "import nest_asyncio\n", + "\n", + "\n", + "async def build_dataset(\n", + " *,\n", + " pair_count: int = 500,\n", + " concurrency: int = 8,\n", + " expand_prompt_pool_model: str,\n", + " generate_preference_pair_model: str,\n", + ") -> List[Dict[str, Any]]:\n", + " \"\"\"Return *pair_count* preference pairs (single-shot expansion).\"\"\"\n", + "\n", + " seed = PROMPT_SEED_POOL\n", + " deficit = max(0, pair_count - len(seed))\n", + " k = max(1, math.ceil(deficit / len(seed)))\n", + "\n", + " expanded = await expand_prompt_pool(\n", + " seed,\n", + " k=k,\n", + " concurrency=concurrency,\n", + " model=expand_prompt_pool_model,\n", + " )\n", + " prompt_bank = (seed + expanded)[:pair_count]\n", + "\n", + " sem = asyncio.Semaphore(concurrency)\n", + " tasks = [\n", + " _generate_preference_pair(p, sem, model=generate_preference_pair_model)\n", + " for p in prompt_bank\n", + " ]\n", + " return await asyncio.gather(*tasks)\n", + "\n", + "\n", + "nest_asyncio.apply()\n", + "pairs = await build_dataset(\n", + " pair_count=500,\n", + " concurrency=8,\n", + " expand_prompt_pool_model=\"gpt-4.1-mini-2025-04-14\",\n", + " generate_preference_pair_model=\"gpt-4.1-mini-2025-04-14\",\n", + ")\n", + "print(f\"Dataset ready with {len(pairs)} pairs.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## **4. Benchmarking the Base Model**\n", + "\n", + "Below, we split our dataset into training, validation, and testing sets. We also show a sample from the training dataset, which demonstrates a clear difference between the preferred (friendly, on-brand) and non-preferred (blunt, neutral) responses for that input pair." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'input': {'messages': [{'role': 'system',\n", + " 'content': 'You are a customer-support assistant.'},\n", + " {'role': 'user',\n", + " 'content': 'Hi, I ordered a gadget last week. When will it arrive?'}]},\n", + " 'preferred_output': [{'role': 'assistant',\n", + " 'content': 'Hey there, awesome friend! 🌟 Thanks a bunch for reaching out! I’d LOVE to help you track down your gadget so you can start enjoying it ASAP! 🎉 Could you please share your order number or the email you used to place the order? Let’s make this delivery magic happen! 
🚀✨'}],\n", + " 'non_preferred_output': [{'role': 'assistant',\n", + " 'content': 'Provide your order number for delivery status.'}]}" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# set dataset sizes\n", + "n = len(pairs)\n", + "n_train = int(0.8 * n)\n", + "n_val = int(0.1 * n)\n", + "n_test = n - n_train - n_val\n", + "\n", + "# split dataset into train, test & validation\n", + "train_pairs = pairs[:n_train]\n", + "val_pairs = pairs[n_train : n_train + n_val]\n", + "test_pairs = pairs[n_train + n_val :]\n", + "train_pairs[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To assess the model's performance prior to fine-tuning, we'll use an automated grader (LLM-as-a-Judge) to score each response for friendliness and empathy. The grader will assign a score from 0 to 4 for each answer, allowing us to compute a mean baseline score for the base model. \n", + "\n", + "To do this, we first generate responses for the base model on the test set, then use the OpenAI evals API to create and run an evaluation with an automated grader. " + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "async def generate_responses(\n", + " testset,\n", + " model,\n", + " temperature=0.0,\n", + " max_output_tokens=80,\n", + " concurrency=8,\n", + "):\n", + " \"\"\"\n", + " Generate responses for each prompt in the testset using the OpenAI responses API.\n", + " Returns: List of dicts: [{\"prompt\": ..., \"response\": ...}, ...]\n", + " \"\"\"\n", + " async_client = AsyncOpenAI()\n", + " sem = asyncio.Semaphore(concurrency)\n", + "\n", + " async def get_response(prompt):\n", + " async with sem:\n", + " resp = await async_client.responses.create(\n", + " model=model,\n", + " input=[\n", + " {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n", + " {\"role\": \"user\", \"content\": prompt},\n", + " ],\n", + " temperature=temperature,\n", + " max_output_tokens=max_output_tokens,\n", + " )\n", + " return {\"prompt\": prompt, \"response\": resp.output_text}\n", + "\n", + " tasks = [get_response(item[\"item\"][\"input\"]) for item in testset]\n", + " results = await asyncio.gather(*tasks)\n", + " return results\n", + "\n", + "\n", + "# generate responses for the base model over the test set\n", + "base_model = \"gpt-4.1-mini-2025-04-14\"\n", + "testset = [\n", + " {\"item\": {\"input\": pair[\"input\"][\"messages\"][1][\"content\"]}} for pair in test_pairs\n", + "]\n", + "responses = await generate_responses(testset, model=base_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we'll use the OpenAI evals API to create & run an evaluation with an automated grader, starting by defining the rubric for the LLM-as-a-Judge. Note: we will access responses via data logging, so in order for this to work, you'll need to be in an org where data logging isn't disabled (through zdr, etc.). If you aren't sure if this is the case for you, go to https://platform.openai.com/logs?api=responses and see if you can see the responses you just generated." 
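If you'd like a quick sanity check that your responses are being stored before building the eval, one option (a minimal optional sketch, not required for the walkthrough) is to create a throwaway response and fetch it back by id; if your org has zero data retention enabled, the retrieval step should fail and logs-based eval runs won't have data to read:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Create a trivial stored response, then retrieve it by id to confirm it was logged.
probe = client.responses.create(model="gpt-4.1-mini-2025-04-14", input="ping")
retrieved = client.responses.retrieve(probe.id)
print("Data logging appears enabled; retrieved response:", retrieved.id)
```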
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "JUDGE_SYSTEM = \"\"\"\n", + "You judge whether a reply matches Good Vibes Corp's desired tone:\n", + "energetic, super-friendly, enthusiastic.\n", + "\n", + "Score 0-4 (higher = more energy):\n", + "\n", + "4 - Highly enthusiastic: multiple upbeat phrases / emojis / exclamations, clear empathy, proactive help.\n", + "3 - Energetic & friendly: visible enthusiasm cue (≥1 emoji OR exclamation OR upbeat phrase), warm second-person tone.\n", + "2 - Pleasant: polite & positive but lacks obvious enthusiasm cues.\n", + "1 - Neutral: correct, businesslike, minimal warmth.\n", + "0 - Rude, negative, or unhelpful.\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "from openai import OpenAI\n", + "\n", + "sync_client = OpenAI()\n", + "\n", + "# set judge model\n", + "judge_model = \"gpt-4.1-2025-04-14\"\n", + "\n", + "# create the evaluation\n", + "logs_eval = sync_client.evals.create(\n", + " name=\"Good Vibes Corp Tone Eval\",\n", + " data_source_config={\n", + " \"type\": \"logs\",\n", + " },\n", + " testing_criteria=[\n", + " {\n", + " \"type\": \"score_model\",\n", + " \"name\": \"General Evaluator\",\n", + " \"model\": judge_model,\n", + " \"input\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": JUDGE_SYSTEM,\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": (\n", + " \"**User input**\\n\"\n", + " \"{{item.input}}\\n\"\n", + " \"**Response to evaluate**\\n\"\n", + " \"{{sample.output_text}}\"\n", + " ),\n", + " },\n", + " ],\n", + " \"range\": [0, 4],\n", + " \"pass_threshold\": 2,\n", + " }\n", + " ],\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "# run the evaluation\n", + "base_run = sync_client.evals.runs.create(\n", + " name=base_model,\n", + " eval_id=logs_eval.id,\n", + " data_source={\n", + " \"type\": \"responses\",\n", + " \"source\": {\"type\": \"responses\", \"limit\": len(test_pairs)},\n", + " },\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Average score: 2.525\n" + ] + } + ], + "source": [ + "# score base model\n", + "base_data = sync_client.evals.runs.output_items.list(\n", + " eval_id=logs_eval.id, run_id=base_run.id\n", + ").data\n", + "base_scores = [s.results[0][\"score\"] for s in base_data]\n", + "print(\"Average score:\", sum(base_scores) / len(base_scores))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## **5. Fine-Tuning**\n", + "\n", + "With a baseline established, we can now fine-tune the model using the training set and DPO. This process will teach the model to prefer responses that align with our desired style, based on the preference pairs we created earlier.\n", + "\n", + "Note: **beta (β)** is a unique fine-tuning hyperparameter for Direct Preference Optimization (DPO). It’s a floating-point number ranging between 0 and 2, controlling the balance between preserving a model’s existing behavior and adapting to new, preference-aligned responses.\n", + "- High β (close to 2): makes the model more conservative, strongly favoring previous behavior. 
The fine-tuned model will show minimal deviations from its original style or characteristics, emphasizing consistency and avoiding abrupt changes.\n", + "- Moderate β (around 1): balances between adherence to prior behavior and adaptation to new preferences. Recommended as a sensible starting point for most practical scenarios.\n", + "- Low β (close to 0): encourages aggressive adaptation, causing the model to prioritize newly provided preferences more prominently. This might result in significant stylistic shifts and greater alignment with explicit preferences but could lead to unexpected or overly specialized outputs.\n", + "\n", + "Technically, beta scales the difference in log-probabilities in the DPO loss; a larger β causes the sigmoid-based loss function to saturate with smaller probability differences, yielding smaller weight updates (thus preserving old behavior). It is recommended to experiment systematically with the β value to achieve optimal results tailored to your specific use-case and desired trade-offs between stability and adaptation." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Fine-tuning job created: job_id = ftjob-5QPmA36QezFRGoXjuvIAPuAQ\n" + ] + } + ], + "source": [ + "import io\n", + "import json\n", + "\n", + "# create training file\n", + "train_buf = io.BytesIO(\"\\n\".join(json.dumps(p) for p in train_pairs).encode())\n", + "train_buf.name = \"train.jsonl\"\n", + "train_file_id = sync_client.files.create(file=train_buf, purpose=\"fine-tune\").id\n", + "\n", + "# create validation file\n", + "val_buf = io.BytesIO(\"\\n\".join(json.dumps(p) for p in val_pairs).encode())\n", + "val_buf.name = \"val.jsonl\"\n", + "val_file_id = sync_client.files.create(file=val_buf, purpose=\"fine-tune\").id\n", + "\n", + "# create a fine-tuning job\n", + "ft = sync_client.fine_tuning.jobs.create(\n", + " model=base_model,\n", + " training_file=train_file_id,\n", + " validation_file=val_file_id,\n", + " method={\n", + " \"type\": \"dpo\",\n", + " \"dpo\": {\n", + " \"hyperparameters\": {\n", + " \"n_epochs\": 2,\n", + " \"beta\": 0.1,\n", + " \"batch_size\": 8,\n", + " }\n", + " },\n", + " },\n", + ")\n", + "print(f\"Fine-tuning job created: job_id = {ft.id}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## **6. Using your Fine-Tuned Model**\n", + "\n", + "Once fine-tuning is complete, we'll evaluate the DPO-tuned model on the same test set. By comparing the mean scores before and after fine-tuning, as well as reviewing example outputs, we can see how the model's alignment with our preferences has improved." 
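Note that fine-tuning jobs can take some time to complete. A simple way to wait for the job before running the cells below (a minimal sketch; adjust the polling interval as needed) is:

```python
import time

# Poll the fine-tuning job until it reaches a terminal state.
while True:
    job = sync_client.fine_tuning.jobs.retrieve(ft.id)
    if job.status in ("succeeded", "failed", "cancelled"):
        print("Fine-tuning job finished with status:", job.status)
        break
    time.sleep(60)  # check roughly once a minute
```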
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# generate responses\n", + "job = sync_client.fine_tuning.jobs.retrieve(ft.id)\n", + "if job.status == \"succeeded\":\n", + " responses = await generate_responses(testset, model=job.fine_tuned_model)\n", + "\n", + " post_run = sync_client.evals.runs.create(\n", + " name=ft.id,\n", + " eval_id=logs_eval.id,\n", + " data_source={\n", + " \"type\": \"responses\",\n", + " \"source\": {\"type\": \"responses\", \"limit\": len(test_pairs)},\n", + " },\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Δ mean: 0.45\n", + "\n", + "=== SAMPLE COMPARISON ===\n", + "Prompt:\n", + " Can I upgrade to faster delivery if my package is still being processed?\n", + "\n", + "Base model reply: \n", + " Whether you can upgrade to express shipping while your order is still being processed depends on the store's policies. Generally, many stores allow shipping upgrades before the order is shipped. \n", + "\n", + "To assist you better, could you please provide your order number or the name of the store you ordered from? Alternatively, you can contact the store's customer service directly to request the upgrade. \n", + "\n", + "DPO-tuned model reply \n", + " Hi! I’d be happy to help with that. If your package hasn’t shipped yet, there’s a good chance we can upgrade your delivery speed. Could you please provide me with your order number? I’ll check the status and let you know the available options for faster delivery.\n" + ] + } + ], + "source": [ + "# get scores from the evaluation\n", + "post_data = sync_client.evals.runs.output_items.list(\n", + " eval_id=logs_eval.id, run_id=post_run.id\n", + ").data\n", + "post_scores = [s.results[0][\"score\"] for s in post_data]\n", + "\n", + "# print scores & a sample comparison from the test set for illustration\n", + "print(\n", + " \"Δ mean:\",\n", + " sum(t - b for b, t in zip(base_scores, post_scores)) / len(base_scores),\n", + ")\n", + "print(\"\\n=== SAMPLE COMPARISON ===\")\n", + "idx = 0\n", + "print(f\"Prompt:\\n {testset[idx]['item']['input']}\\n\")\n", + "print(f\"Base model reply: \\n {base_data[idx].sample.output[0].content} \\n\")\n", + "print(f\"DPO-tuned model reply \\n {post_data[idx].sample.output[0].content}\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv311", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/registry.yaml b/registry.yaml index 019ac9ce4e..fdc78fa2c9 100644 --- a/registry.yaml +++ b/registry.yaml @@ -4,6 +4,14 @@ # should build pages for, and indicates metadata such as tags, creation date and # authors for each page. +- title: Fine-Tuning Techniques - Choosing Between SFT, DPO, and RFT (With a Guide to DPO) + path: examples/Fine_tuning_direct_preference_optimization_guide.ipynb + date: 2025-06-18 + authors: + - alexl-oai + tags: + - fine-tuning + - title: MCP-Powered Agentic Voice Framework path: examples/partners/mcp_powered_voice_agents/mcp_powered_agents_cookbook.ipynb date: 2025-06-17