Commit 6bccfd9

improve mcp_eval notebook (#1901)
1 parent 569af89 commit 6bccfd9

3 files changed

+26
-0
lines changed

examples/evaluation/use-cases/mcp_eval_notebook.ipynb

Lines changed: 26 additions & 0 deletions
@@ -450,6 +450,8 @@
 "id": "ee1f655b",
 "metadata": {},
 "source": [
+"Note that the 4.1 model was set up to never use its tools to answer the query, so it never called the MCP server. The o4-mini model wasn't explicitly instructed to use its tools either, but it wasn't forbidden from doing so, and it called the MCP server 3 times. We can see that the 4.1 model performed worse than the o4-mini model. Also notable: the one example that the o4-mini model failed was one where the MCP tool was not used.\n",
+"\n",
 "We can also check a detailed analysis of the outputs from each model for manual inspection and further analysis."
 ]
 },
@@ -806,6 +808,30 @@
 " print(item.sample.output[0].content)"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "0936def6",
+"metadata": {},
+"source": [
+"## How can we improve?\n",
+"\n",
+"If we add the phrase \"Always use your tools since they are the way to get the right answer in this task.\" to the system message of the o4-mini model, what do you think will happen? (Try it out.)\n",
+"\n",
+"<br><br><br>\n",
+"\n",
+"\n",
+"If you guessed that the model would now call the MCP tool every time and get every answer correct, you are right!"
+]
+},
+{
+"cell_type": "markdown",
+"id": "cf797a91",
+"metadata": {},
+"source": [
+"![Evaluation Data Tab](../../../images/mcp_eval_improved_output.png)\n",
+"![Evaluation Data Tab](../../../images/mcp_eval_improved_data.png)"
+]
+},
 {
 "cell_type": "markdown",
 "id": "924619e0",

images/mcp_eval_improved_data.png

726 KB

images/mcp_eval_improved_output.png

381 KB
