diff --git a/examples/evaluation/use-cases/mcp_eval_notebook.ipynb b/examples/evaluation/use-cases/mcp_eval_notebook.ipynb
index 10f6fbf131..5448d566c0 100644
--- a/examples/evaluation/use-cases/mcp_eval_notebook.ipynb
+++ b/examples/evaluation/use-cases/mcp_eval_notebook.ipynb
@@ -450,6 +450,8 @@
    "id": "ee1f655b",
    "metadata": {},
    "source": [
+    "Note that the 4.1 model was constructed to never use its tools to answer the query, so it never called the MCP server. The o4-mini model wasn't explicitly instructed to use its tools either, but it wasn't forbidden from doing so, and it called the MCP server 3 times. We can see that the 4.1 model performed worse than the o4-mini model. Also notable: the one example that the o4-mini model failed was one where the MCP tool was not used.\n",
+    "\n",
     "We can also check a detailed analysis of the outputs from each model for manual inspection and further analysis."
    ]
   },
@@ -806,6 +808,30 @@
     " print(item.sample.output[0].content)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "0936def6",
+   "metadata": {},
+   "source": [
+    "## How can we improve?\n",
+    "\n",
+    "If we add the phrase \"Always use your tools since they are the way to get the right answer in this task.\" to the system message of the o4-mini model, what do you think will happen? (Try it out.)\n",
+    "\n",
+    "


\n",
+    "\n",
+    "\n",
+    "If you guessed that the model would now call the MCP tool every time and get every answer correct, you are right!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cf797a91",
+   "metadata": {},
+   "source": [
+    "![Evaluation Data Tab](../../../images/mcp_eval_improved_output.png)\n",
+    "![Evaluation Data Tab](../../../images/mcp_eval_improved_data.png)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "924619e0",
diff --git a/images/mcp_eval_improved_data.png b/images/mcp_eval_improved_data.png
new file mode 100644
index 0000000000..4275df0461
Binary files /dev/null and b/images/mcp_eval_improved_data.png differ
diff --git a/images/mcp_eval_improved_output.png b/images/mcp_eval_improved_output.png
new file mode 100644
index 0000000000..c153d0dc2c
Binary files /dev/null and b/images/mcp_eval_improved_output.png differ