openai
diff --git a/examples/evaluation/use-cases/mcp_eval_notebook.ipynb b/examples/evaluation/use-cases/mcp_eval_notebook.ipynb
Lines changed: 26 additions & 0 deletions
diff --git a/images/mcp_eval_improved_data.png b/images/mcp_eval_improved_data.png
726 KB
diff --git a/images/mcp_eval_improved_output.png b/images/mcp_eval_improved_output.png
381 KB
@@ -450,6 +450,8 @@
    "id": "ee1f655b",
    "metadata": {},
    "source": [
+    "Note that the 4.1 model was set up to never use its tools to answer the query, so it never called the MCP server. The o4-mini model was not explicitly instructed to use its tools, but it was not forbidden from doing so either, and it called the MCP server 3 times. The 4.1 model performed worse than the o4-mini model. Also notable: the one example the o4-mini model failed was one where the MCP tool was not used.\n",
+    "\n",
     "We can also check a detailed analysis of the outputs from each model for manual inspection and further analysis."
    ]
   },
@@ -806,6 +808,30 @@
     "    print(item.sample.output[0].content)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "0936def6",
+   "metadata": {},
+   "source": [
+    "## How can we improve?\n",
+    "\n",
+    "If we add the phrase \"Always use your tools since they are the way to get the right answer in this task.\" to the system message of the o4-mini model, what do you think will happen? (Try it out.)\n",
+    "\n",
+    "<br><br><br>\n",
+    "\n",
+    "\n",
+    "If you guessed that the model would now call the MCP tool every time and get every answer correct, you are right!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cf797a91",
+   "metadata": {},
+   "source": [
+    "![Evaluation Data Tab](../../../images/mcp_eval_improved_output.png)\n",
+    "![Evaluation Data Tab](../../../images/mcp_eval_improved_data.png)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "924619e0",