diff --git a/examples/evaluation/use-cases/mcp_eval_notebook.ipynb b/examples/evaluation/use-cases/mcp_eval_notebook.ipynb
index 10f6fbf131..5448d566c0 100644
--- a/examples/evaluation/use-cases/mcp_eval_notebook.ipynb
+++ b/examples/evaluation/use-cases/mcp_eval_notebook.ipynb
@@ -450,6 +450,8 @@
"id": "ee1f655b",
"metadata": {},
"source": [
+ "Note that the 4.1 model was configured never to use its tools to answer the query, so it never called the MCP server. The o4-mini model was not explicitly instructed to use its tools, but it was not forbidden from doing so either, and it called the MCP server 3 times. The 4.1 model performed worse than the o4-mini model. Also notable: the one example that the o4-mini model failed was one where it did not use the MCP tool.\n",
+ "\n",
"We can also check a detailed analysis of the outputs from each model for manual inspection and further analysis."
]
},
@@ -806,6 +808,30 @@
" print(item.sample.output[0].content)"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "0936def6",
+ "metadata": {},
+ "source": [
+ "## How can we improve?\n",
+ "\n",
+ "Suppose we add the phrase \"Always use your tools since they are the way to get the right answer in this task.\" to the system message of the o4-mini model. What do you think will happen? (Try it out.)\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "If you guessed that the model would now call the MCP tool every time and get every answer correct, you are right!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cf797a91",
+ "metadata": {},
+ "source": [
+ "\n",
+ ""
+ ]
+ },
{
"cell_type": "markdown",
"id": "924619e0",
diff --git a/images/mcp_eval_improved_data.png b/images/mcp_eval_improved_data.png
new file mode 100644
index 0000000000..4275df0461
Binary files /dev/null and b/images/mcp_eval_improved_data.png differ
diff --git a/images/mcp_eval_improved_output.png b/images/mcp_eval_improved_output.png
new file mode 100644
index 0000000000..c153d0dc2c
Binary files /dev/null and b/images/mcp_eval_improved_output.png differ