Commit 94d24ca

Add example downloading feedback and runs from test project (#117)
1 parent 410a7c6 commit 94d24ca

File tree

3 files changed

+381
-0
lines changed


README.md

+1
@@ -42,6 +42,7 @@ Test and benchmark your LLM systems using methods in these evaluation recipes:
 - [Unit Testing with Pytest](./testing-examples/pytest-ut/): write individual unit tests and log assertions as feedback.
 - [Evaluating Existing Runs](./testing-examples/evaluate-existing-test-project/evaluate_runs.ipynb): add ai-assisted feedback and evaluation metrics to existing run traces.
 - [Naming Test Projects](./testing-examples/naming-test-projects/naming-test-projects.md): manually name your tests with `run_on_dataset(..., project_name='my-project-name')`
+- [How to download feedback and examples from a test project](./testing-examples/download-feedback-and-examples/download_example.ipynb): export the predictions, evaluation results, and other information to programmatically add to your reports.
 
 
 ### TypeScript / JavaScript Testing Examples

testing-examples/README.md

+1
@@ -12,3 +12,4 @@ sidebar_position: 4
 - [Unit Testing with Pytest](./pytest-ut/): write individual unit tests and log assertions as feedback.
 - [Evaluating Existing Runs](./evaluate-existing-test-project/evaluate_runs.ipynb): add ai-assisted feedback and evaluation metrics to existing run traces.
 - [Naming Test Projects](./naming-test-projects/naming-test-projects.md): manually name your tests with `run_on_dataset(..., project_name='my-project-name')`
+- [How to download feedback and examples from a test project](./download-feedback-and-examples/download_example.ipynb): export the predictions, evaluation results, and other information to programmatically add to your reports.
testing-examples/download-feedback-and-examples/download_example.ipynb

+379

@@ -0,0 +1,379 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8b45a82a-a0dd-489b-bba8-c34f8e5afb96",
   "metadata": {},
   "source": [
    "# How to download feedback and examples from a test project\n",
    "\n",
    "When testing with LangSmith, all the traces, examples, and evaluation feedback are saved, so you have a full audit of what happened.\n",
    "This way you can see the aggregate metrics of the test run and compare results on an example-by-example basis. You can also download the run and evaluation result information\n",
    "to use in external reporting software.\n",
    "\n",
    "In this walkthrough, we will show how to export the feedback and examples from a LangSmith test project. The main steps are:\n",
    "\n",
    "1. Create a dataset\n",
    "2. Run testing\n",
    "3. Export feedback and examples\n",
    "\n",
    "## Setup\n",
    "\n",
    "Install langchain and any other dependencies for your chain. We will also install pandas for this walkthrough to put the retrieved data in a dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "214520b3-801a-43eb-9eb5-017c1c6a9107",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# %pip install -U langsmith langchain anthropic pandas --quiet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "e820af26-12aa-4f2c-9ffb-345e68dfc638",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import uuid\n",
    "import os\n",
    "\n",
    "unique_id = uuid.uuid4().hex[0:8]\n",
    "# Fill in your API key and endpoint (e.g., https://api.smith.langchain.com)\n",
    "os.environ[\"LANGCHAIN_API_KEY\"] = \"\"\n",
    "os.environ[\"LANGCHAIN_ENDPOINT\"] = \"\"\n",
    "os.environ[\"LANGCHAIN_PROJECT\"] = f\"Retrieve feedback and examples {unique_id}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33f813bc-14b0-4015-aed6-d4365f4b9047",
   "metadata": {},
   "source": [
    "## 1. Create a dataset\n",
    "\n",
    "We will create a simple key-value (KV) dataset with a poem topic and a constraint letter (which the model should not use)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "6594fa55-8278-453e-b923-fb5eb18b9458",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langsmith import Client\n",
    "\n",
    "client = Client()\n",
    "\n",
    "examples = [\n",
    "    (\"roses\", \"o\"),\n",
    "    (\"vikings\", \"v\"),\n",
    "    (\"planet earth\", \"e\"),\n",
    "    (\"Sirens of Titan\", \"t\"),\n",
    "]\n",
    "\n",
    "dataset_name = f\"Download Feedback and Examples {unique_id}\"\n",
    "dataset = client.create_dataset(dataset_name)\n",
    "\n",
    "for prompt, constraint in examples:\n",
    "    client.create_example(\n",
    "        {\"input\": prompt, \"constraint\": constraint},\n",
    "        dataset_id=dataset.id,\n",
    "        outputs={\"constraint\": constraint},\n",
    "    )"
   ]
  },
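  {
   "cell_type": "markdown",
   "id": "1a2b3c4d-1111-4aaa-8bbb-0a1b2c3d4e5f",
   "metadata": {},
   "source": [
    "As a quick sanity check, you can read the examples back before running the tests. A minimal sketch using the SDK's `list_examples` method (field names may vary slightly by `langsmith` version):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b3c4d5e-2222-4bbb-8ccc-1b2c3d4e5f6a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Sanity check (sketch): list the examples we just created\n",
    "for example in client.list_examples(dataset_id=dataset.id):\n",
    "    print(example.inputs, example.outputs)"
   ]
  },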
  {
   "cell_type": "markdown",
   "id": "787a8a19-986c-4334-8ef6-08e884b45bcd",
   "metadata": {},
   "source": [
    "## 2. Run testing\n",
    "\n",
    "We will use a simple custom evaluator that checks that the prediction does not contain the constraint letter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "df839dab-f0dc-435a-982d-2bcc501257ce",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from typing import Any\n",
    "\n",
    "from langchain.evaluation import StringEvaluator\n",
    "\n",
    "\n",
    "class ConstraintEvaluator(StringEvaluator):\n",
    "    @property\n",
    "    def requires_reference(self):\n",
    "        return True\n",
    "\n",
    "    def _evaluate_strings(self, prediction: str, reference: str, **kwargs: Any) -> dict:\n",
    "        # The reference in this case is the letter that should not be present\n",
    "        passed = reference not in prediction\n",
    "        return {\n",
    "            \"score\": passed,\n",
    "            \"reasoning\": f\"prediction {'does not contain' if passed else 'contains'} the letter {reference}\",\n",
    "        }"
   ]
  },
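  {
   "cell_type": "markdown",
   "id": "3c4d5e6f-3333-4ccc-8ddd-2c3d4e5f6a7b",
   "metadata": {},
   "source": [
    "You can try the evaluator on its own before running it over the whole dataset. A minimal sketch using the public `evaluate_strings` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d5e6f7a-4444-4ddd-8eee-3d4e5f6a7b8c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Sketch: the score should be True when the constrained letter is absent\n",
    "ConstraintEvaluator().evaluate_strings(\n",
    "    prediction=\"A rose by any name\",\n",
    "    reference=\"z\",\n",
    ")"
   ]
  },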
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "3dc523be-b157-44a6-a820-f2bb2d2757ba",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "View the evaluation results for project 'test-kind-prose-61' at:\n",
      "https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/9890dcfa-92af-42d0-9df2-aa3dccd13788\n",
      "[------------------------------------------------->] 4/4"
     ]
    }
   ],
   "source": [
    "from langchain import chat_models, prompts, schema\n",
    "from langchain.smith import RunEvalConfig\n",
    "\n",
    "chain = (\n",
    "    prompts.PromptTemplate.from_template(\n",
    "        \"Write a poem about {input} without using the letter {constraint}.\"\n",
    "        \" Respond directly with the poem with no explanation.\"\n",
    "    )\n",
    "    | chat_models.ChatAnthropic()\n",
    "    | schema.output_parser.StrOutputParser()\n",
    ")\n",
    "\n",
    "eval_config = RunEvalConfig(\n",
    "    custom_evaluators=[ConstraintEvaluator()],\n",
    "    input_key=\"input\",\n",
    ")\n",
    "\n",
    "test_results = client.run_on_dataset(\n",
    "    dataset_name=dataset_name,\n",
    "    llm_or_chain_factory=chain,\n",
    "    evaluation=eval_config,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75f3c533-8ee6-466b-9082-372710287873",
   "metadata": {},
   "source": [
    "## 3. Review the feedback and examples"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8462317f-bfe6-4d0f-b418-a353d51c59da",
   "metadata": {
    "tags": []
   },
   "source": [
    "If you want to use the results directly, you can access them in tabular format by calling `to_dataframe()` on `test_results`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "da614c30-63ab-4c7f-a366-583bb89c01b7",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ConstraintEvaluator</th>\n",
       "      <th>input</th>\n",
       "      <th>output</th>\n",
       "      <th>reference</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3e9a2c05-f7f5-4309-b755-abeb76719f26</th>\n",
       "      <td>False</td>\n",
       "      <td>{'input': 'Sirens of Titan', 'constraint': 't'}</td>\n",
       "      <td>Here is a poem about Sirens of Titan without ...</td>\n",
       "      <td>{'constraint': 't'}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9fddeee7-9080-46c8-b280-405b4f3d18cb</th>\n",
       "      <td>False</td>\n",
       "      <td>{'input': 'planet earth', 'constraint': 'e'}</td>\n",
       "      <td>Our orb of life, a vision grand,\\nWith oceans...</td>\n",
       "      <td>{'constraint': 'e'}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>d74ea995-6ae6-488a-8da8-21695f8949fe</th>\n",
       "      <td>False</td>\n",
       "      <td>{'input': 'vikings', 'constraint': 'v'}</td>\n",
       "      <td>Here is a poem about Vikings without using th...</td>\n",
       "      <td>{'constraint': 'v'}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>e059d4b8-3079-40c7-a6b7-16347d4fc568</th>\n",
       "      <td>False</td>\n",
       "      <td>{'input': 'roses', 'constraint': 'o'}</td>\n",
       "      <td>Here is a poem about roses without using the ...</td>\n",
       "      <td>{'constraint': 'o'}</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     ConstraintEvaluator  \\\n",
       "3e9a2c05-f7f5-4309-b755-abeb76719f26               False   \n",
       "9fddeee7-9080-46c8-b280-405b4f3d18cb               False   \n",
       "d74ea995-6ae6-488a-8da8-21695f8949fe               False   \n",
       "e059d4b8-3079-40c7-a6b7-16347d4fc568               False   \n",
       "\n",
       "                                                                                input  \\\n",
       "3e9a2c05-f7f5-4309-b755-abeb76719f26  {'input': 'Sirens of Titan', 'constraint': 't'}   \n",
       "9fddeee7-9080-46c8-b280-405b4f3d18cb     {'input': 'planet earth', 'constraint': 'e'}   \n",
       "d74ea995-6ae6-488a-8da8-21695f8949fe          {'input': 'vikings', 'constraint': 'v'}   \n",
       "e059d4b8-3079-40c7-a6b7-16347d4fc568            {'input': 'roses', 'constraint': 'o'}   \n",
       "\n",
       "                                                                                 output  \\\n",
       "3e9a2c05-f7f5-4309-b755-abeb76719f26  Here is a poem about Sirens of Titan without ...   \n",
       "9fddeee7-9080-46c8-b280-405b4f3d18cb  Our orb of life, a vision grand,\\nWith oceans...   \n",
       "d74ea995-6ae6-488a-8da8-21695f8949fe  Here is a poem about Vikings without using th...   \n",
       "e059d4b8-3079-40c7-a6b7-16347d4fc568  Here is a poem about roses without using the ...   \n",
       "\n",
       "                                                reference  \n",
       "3e9a2c05-f7f5-4309-b755-abeb76719f26  {'constraint': 't'}  \n",
       "9fddeee7-9080-46c8-b280-405b4f3d18cb  {'constraint': 'e'}  \n",
       "d74ea995-6ae6-488a-8da8-21695f8949fe  {'constraint': 'v'}  \n",
       "e059d4b8-3079-40c7-a6b7-16347d4fc568  {'constraint': 'o'}  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_results.to_dataframe()"
   ]
  },
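  {
   "cell_type": "markdown",
   "id": "5e6f7a8b-5555-4eee-8fff-4e5f6a7b8c9d",
   "metadata": {},
   "source": [
    "From here, you can save the table for use in external reporting software. A minimal sketch using pandas (the filename is just an example):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f7a8b9c-6666-4fff-8aaa-5f6a7b8c9d0e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Sketch: persist the results table for external tools (hypothetical filename)\n",
    "test_results.to_dataframe().to_csv(\"test_results.csv\")"
   ]
  },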
  {
   "cell_type": "markdown",
   "id": "fbb7ff41-d1a2-4c92-8c63-803a0cc34b9c",
   "metadata": {},
   "source": [
    "If you want to fetch the feedback and examples for a historical test project, you can use the SDK:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33a1c081-7cd6-408f-949d-f6897f4baf3c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# This can be any previous test project\n",
    "test_project = test_results[\"project_name\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4426821-bae0-4702-86e9-c5d7bbaceb20",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Fetch the top-level runs from the test project\n",
    "runs = client.list_runs(project_name=test_project, execution_order=1)\n",
    "\n",
    "rows = []\n",
    "for r in runs:\n",
    "    # Flatten each run's feedback into \"<key>.score\" / \"<key>.comment\" columns\n",
    "    feedback = {}\n",
    "    for f in client.list_feedback(run_ids=[r.id]):\n",
    "        feedback[f\"{f.key}.score\"] = f.score\n",
    "        feedback[f\"{f.key}.comment\"] = f.comment\n",
    "    rows.append(\n",
    "        {\n",
    "            \"example_id\": r.reference_example_id,\n",
    "            **r.inputs,\n",
    "            **(r.outputs or {}),\n",
    "            **feedback,\n",
    "            \"reference\": client.read_example(r.reference_example_id).outputs,\n",
    "        }\n",
    "    )\n",
    "\n",
    "df = pd.DataFrame(rows)\n",
    "df"
   ]
  },
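  {
   "cell_type": "markdown",
   "id": "7a8b9c0d-7777-4aaa-8bbb-6a7b8c9d0e1f",
   "metadata": {},
   "source": [
    "With the feedback flattened into columns, you can compute aggregate metrics for your report. A minimal sketch, assuming the feedback key is `ConstraintEvaluator` as in the table above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b9c0d1e-8888-4bbb-8ccc-7b8c9d0e1f2a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Sketch: average pass rate across the test run (assumes the feedback\n",
    "# key is \"ConstraintEvaluator\", matching the table above)\n",
    "df[\"ConstraintEvaluator.score\"].mean()"
   ]
  },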
  {
   "cell_type": "markdown",
   "id": "eba511e2-c9ec-4268-b655-5163fa882086",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "In this walkthrough, we showed how to download feedback and examples from a test project. You can use the result object returned by the test run directly, or use the SDK to fetch the results and feedback.\n",
    "Use this to analyze the results further or to programmatically add result information to your existing reports."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f637c05-80e7-43be-b44e-2a337139a183",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
