Commit d9aa2e1

Data Extraction Example (#150)

1 parent e0b04a4 commit d9aa2e1

File tree

4 files changed: +259 −0 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -41,6 +41,7 @@ Test and benchmark your LLM systems using methods in these evaluation recipes:
  - [RAG Evaluation using Fixed Sources](./testing-examples/using-fixed-sources/using_fixed_sources.ipynb): evaluate the response component of a RAG (retrieval-augmented generation) pipeline by providing retrieved documents in the dataset
  - [Evaluating an Agent's intermediate steps](./testing-examples/agent_steps/evaluating_agents.ipynb): compare the sequence of actions taken by an agent to an expected trajectory to grade effective tool use.
  - [Evaluating a Conversational Chat Bot](./testing-examples/chat-single-turn/chat_evaluation_single_turn.ipynb): Evaluate chatbots within multi-turn conversations by treating each data point as an individual dialogue turn. This guide shows how to set up a multi-turn conversation dataset and evaluate a simple chat bot on it.
+ - [Evaluating an Extraction Chain](./testing-examples/data-extraction/contract-extraction.ipynb): measure the similarity between the extracted structured content and structured labels using LangChain's json evaluators.
  - [Comparison Evals](./testing-examples/comparing-runs/comparing-qa.ipynb): use labeled preference scoring to contrast system versions and determine the most optimal outputs.
  - You can incorporate LangSmith in your existing testing framework:
  - [LangSmith in Pytest](./testing-examples/pytest/) benchmark your chain in pytest and assert aggregate metrics meet the quality bar.

testing-examples/README.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ sidebar_position: 4
  - [RAG Evaluation using Fixed Sources](./using-fixed-sources/using_fixed_sources.ipynb): evaluate the response component of a RAG pipeline by providing retrieved documents in the dataset.
  - [Evaluating an Agent's intermediate steps](./agent_steps/evaluating_agents.ipynb): compare the sequence of actions taken by an agent to an expected trajectory to grade effective tool use.
  - [Evaluating a Conversational Chat Bot](./chat-single-turn/chat_evaluation_single_turn.ipynb): Evaluate chatbots within multi-turn conversations by treating each data point as an individual dialogue turn. This guide shows how to set up a multi-turn conversation dataset and evaluate a simple chat bot on it.
+ - [Evaluating an Extraction Chain](./data-extraction/contract-extraction.ipynb): measure the similarity between the extracted structured content and structured labels using LangChain's json evaluators.
  - [Comparison Evals](./comparing-runs/comparing-qa.ipynb): use labeled preference scoring to contrast system versions and determine the most optimal outputs.
  - You can incorporate LangSmith in your existing testing framework:
  - [LangSmith in Pytest](./pytest/) benchmark your chain in pytest and assert aggregate metrics meet the quality bar.
testing-examples/data-extraction/contract-extraction.ipynb

Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@

# Evaluating an Extraction Chain

Structured data [extraction](https://python.langchain.com/docs/use_cases/extraction) from unstructured text is a core part of many LLM applications. Whether it's preparing structured rows for database insertion, deriving API parameters for function calls and forms, or building knowledge graphs, the utility is present.

This walkthrough presents a method to evaluate an extraction chain. While our example dataset revolves around legal briefs, the principles and techniques laid out here are widely applicable across various domains and use cases.

By the end of this guide, you'll be equipped to set up and evaluate extraction chains tailored to your specific needs, ensuring your applications extract information both effectively and efficiently.

![Contract Extraction Dataset](./img/contract-extraction-dataset.png)

## Prerequisites

This walkthrough requires LangChain and Anthropic. Ensure they're installed and that you've configured the necessary API keys.
```python
%pip install -U --quiet langchain langsmith langchain_experimental anthropic jsonschema
```
```python
import os
import uuid

# A unique suffix so repeated runs of this notebook create distinct dataset names
uid = uuid.uuid4()
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-***"
```
## 1. Create dataset

For this task, we will be filling out details about legal contracts from their context. We have prepared a small labeled dataset for this walkthrough based on the Contract Understanding Atticus Dataset (CUAD) ([link](https://github.com/TheAtticusProject/cuad)). You can explore the [Contract Extraction](https://smith.langchain.com/public/08ab7912-006e-4c00-a973-0f833e74907b/d) dataset at the provided link.
```python
from langsmith import Client

share_token = "08ab7912-006e-4c00-a973-0f833e74907b"
dataset_name = f"Contract Extraction - {uid}"

client = Client()
examples = list(client.list_shared_examples(share_token))
dataset = client.create_dataset(dataset_name=dataset_name)
client.create_examples(
    inputs=[e.inputs for e in examples],
    outputs=[e.outputs for e in examples],
    dataset_id=dataset.id,
)
```
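To confirm the copy succeeded, you can peek at the cloned examples. This check is illustrative and not part of the original notebook; it assumes the shared examples expose `.inputs` dicts (each containing the "context" field the chain below consumes):

```python
# Sanity check: the cloned examples should each carry the contract text
# under a "context" key.
print(f"Copied {len(examples)} examples")
print(list(examples[0].inputs.keys()))
```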
## 2. Define extraction chain

Our dataset inputs are quite long, so we will be testing out the experimental [Anthropic Functions](https://python.langchain.com/docs/integrations/chat/anthropic_functions) chain for this extraction task. This chain prompts the model to respond in XML that conforms to the provided schema.

Below, we will define the contract schema to extract.
```python
from typing import List, Optional

from pydantic import BaseModel


class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str
    country: Optional[str]


class Party(BaseModel):
    name: str
    address: Address
    type: Optional[str]


class Section(BaseModel):
    title: str
    content: str


class Contract(BaseModel):
    document_title: str
    exhibit_number: Optional[str]
    effective_date: str
    parties: List[Party]
    sections: List[Section]
```
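If you want to see exactly what the model will be asked to conform to, you can inspect the JSON schema pydantic derives from `Contract`. This peek is illustrative; `Contract.schema()` is the same pydantic v1 accessor the chain definition below uses:

```python
import json

# The JSON schema dict that will be passed to create_extraction_chain below.
print(json.dumps(Contract.schema(), indent=2)[:400])
```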
Now we can define our extraction chain. We build it below with `create_extraction_chain`, passing the contract schema, the prompt, and the experimental Anthropic functions model.
```python
from langchain import hub
from langchain.chains import create_extraction_chain
from langchain_experimental.llms.anthropic_functions import AnthropicFunctions

contract_prompt = hub.pull("wfh/anthropic_contract_extraction")

extraction_subchain = create_extraction_chain(
    Contract.schema(),
    llm=AnthropicFunctions(model="claude-2", max_tokens=20_000),
    prompt=contract_prompt,
)
# Dataset inputs have a "context" key, but this chain
# expects a dict with an "input" key
chain = (
    (lambda x: {"input": x["context"]})
    | extraction_subchain
    | (lambda x: {"output": x["text"]})
)
```
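Before launching the full evaluation, it can help to smoke-test the chain on a short snippet. This is a sketch, not part of the original notebook: the sample contract text is invented, and it assumes a langchain version where legacy chains compose with `|` and support `.invoke`, as the cell above already relies on:

```python
# A tiny invented contract snippet to exercise the chain end to end.
sample = {
    "context": (
        "EXHIBIT 10.1\n"
        "This Services Agreement, effective January 1, 2021, is entered into "
        "between Acme Corp, 123 Main St, Springfield, IL 62701, USA, and "
        "Beta LLC, 456 Oak Ave, Portland, OR 97201, USA."
    )
}
result = chain.invoke(sample)
print(result["output"])  # extracted fields, shaped like the Contract schema
```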
## 3. Evaluate

For this evaluation, we'll use the JSON edit distance evaluator, which canonicalizes the extracted entities and then computes a normalized string edit distance between the canonical versions. It is a fast way to check the similarity between two JSON objects without relying on an LLM.
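To build intuition for the metric, here is a minimal stdlib sketch of the same idea: serialize both objects in a canonical form, then take a normalized string similarity. The built-in evaluator's exact normalization and distance metric may differ:

```python
import json
from difflib import SequenceMatcher


def json_edit_similarity(pred: dict, ref: dict) -> float:
    """Canonicalize both objects, then compare as strings (1.0 = identical)."""
    a = json.dumps(pred, sort_keys=True, separators=(",", ":"))
    b = json.dumps(ref, sort_keys=True, separators=(",", ":"))
    return SequenceMatcher(None, a, b).ratio()


# Key order no longer matters after canonicalization:
assert json_edit_similarity({"a": 1, "b": 2}, {"b": 2, "a": 1}) == 1.0
```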
```python
import logging

# We will suppress any errors here since the documents are long
# and could pollute the notebook output
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
```
```python
from langchain.smith import RunEvalConfig

eval_config = RunEvalConfig(
    evaluators=["json_edit_distance"],
)
res = client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=chain,
    evaluation=eval_config,
    # In case you are rate-limited
    concurrency_level=2,
)
```

Output:

```
View the evaluation results for project 'test-extraneous-weather-30' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/06d577cc-77b2-45a2-80a5-34c232bba9af?eval=true
[------------------------------------------------->] 16/16
```
## Conclusion

In this walkthrough, we showcased a methodical approach to evaluating an extraction chain applied to template filling for legal briefs. You can use similar techniques to evaluate chains intended to return structured output.
(Notebook metadata: Python 3 (ipykernel), Python 3.11.2, nbformat 4.5.)
testing-examples/data-extraction/img/contract-extraction-dataset.png

932 KB (binary image, not rendered)

0 commit comments