| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Measuring PaperQA2 with LFRQA\n", |
| 8 | + "> This tutorial is available as a Jupyter notebook [here](https://github.com/Future-House/paper-qa/blob/main/docs/tutorials/running_on_lfrqa.md)" |
| 9 | + ] |
| 10 | + }, |
| 11 | + { |
| 12 | + "cell_type": "markdown", |
| 13 | + "metadata": {}, |
| 14 | + "source": [ |
| 15 | + "## Overview\n", |
| 16 | + "\n", |
| 17 | + "The **LFRQA dataset** was introduced in the paper [_RAG-QA Arena: Evaluating Domain Robustness for Long-Form Retrieval-Augmented Question Answering_](https://arxiv.org/pdf/2407.13998). It features **1,404 science questions** (along with other categories) that have been human-annotated with answers. This tutorial walks through the process of setting up the dataset for use and benchmarking.\n", |
| 18 | + "\n", |
| 19 | + "## Download the Annotations\n", |
| 20 | + "\n", |
| 21 | + "First, we need to obtain the annotated dataset from the official repository:\n" |
| 22 | + ] |
| 23 | + }, |
| 24 | + { |
| 25 | + "cell_type": "code", |
| 26 | + "execution_count": null, |
| 27 | + "metadata": {}, |
| 28 | + "outputs": [], |
| 29 | + "source": [ |
| 30 | + "# Create a new directory for the dataset\n", |
| 31 | + "!mkdir -p data/rag-qa-benchmarking\n", |
| 32 | + "\n", |
| 33 | + "# Get the annotated questions\n", |
| 34 | + "!curl https://raw.githubusercontent.com/awslabs/rag-qa-arena/refs/heads/main/data/\\\n", |
| 35 | + "annotations_science_with_citation.jsonl \\\n", |
| 36 | + "-o data/rag-qa-benchmarking/annotations_science_with_citation.jsonl" |
| 37 | + ] |
| 38 | + }, |
| 39 | + { |
| 40 | + "cell_type": "markdown", |
| 41 | + "metadata": {}, |
| 42 | + "source": [ |
| 43 | + "\n", |
| 44 | + "\n", |
| 45 | + "## Download the Robust-QA Documents\n", |
| 46 | + "\n", |
| 47 | + "LFRQA is built upon **Robust-QA**, so we must download the relevant documents:\n" |
| 48 | + ] |
| 49 | + }, |
| 50 | + { |
| 51 | + "cell_type": "code", |
| 52 | + "execution_count": null, |
| 53 | + "metadata": {}, |
| 54 | + "outputs": [], |
| 55 | + "source": [ |
| 56 | + "# Download the Lotte dataset, which includes the required documents\n", |
| 57 | + "!curl https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz --output lotte.tar.gz\n", |
| 58 | + "\n", |
| 59 | + "# Extract the dataset\n", |
| 60 | + "!tar -xvzf lotte.tar.gz\n", |
| 61 | + "\n", |
| 62 | + "# Move the science test collection to our dataset folder\n", |
| 63 | + "!cp lotte/science/test/collection.tsv ./data/rag-qa-benchmarking/science_test_collection.tsv\n", |
| 64 | + "\n", |
| 65 | + "# Clean up unnecessary files\n", |
| 66 | + "!rm lotte.tar.gz\n", |
| 67 | + "!rm -rf lotte" |
| 68 | + ] |
| 69 | + }, |
| 70 | + { |
| 71 | + "cell_type": "markdown", |
| 72 | + "metadata": {}, |
| 73 | + "source": [ |
| 74 | + "For more details, refer to the original paper: [_RAG-QA Arena: Evaluating Domain Robustness for Long-Form Retrieval-Augmented Question Answering_](https://arxiv.org/pdf/2407.13998)." |
| 75 | + ] |
| 76 | + }, |
| 77 | + { |
| 78 | + "cell_type": "markdown", |
| 79 | + "metadata": {}, |
| 80 | + "source": [ |
| 81 | + "\n", |
| 82 | + "\n", |
| 83 | + "## Load the Data\n", |
| 84 | + "\n", |
| 85 | + "We now load the documents into a pandas dataframe:\n" |
| 86 | + ] |
| 87 | + }, |
| 88 | + { |
| 89 | + "cell_type": "code", |
| 90 | + "execution_count": null, |
| 91 | + "metadata": {}, |
| 92 | + "outputs": [], |
| 93 | + "source": [ |
| 94 | + "import os\n", |
| 95 | + "\n", |
| 96 | + "import pandas as pd\n", |
| 97 | + "\n", |
| 98 | + "# Load questions and answers dataset\n", |
| 99 | + "rag_qa_benchmarking_dir = os.path.join(\"data\", \"rag-qa-benchmarking\")\n", |
| 100 | + "\n", |
| 101 | + "# Load documents dataset\n", |
| 102 | + "lfrqa_docs_df = pd.read_csv(\n", |
| 103 | + " os.path.join(rag_qa_benchmarking_dir, \"science_test_collection.tsv\"),\n", |
| 104 | + " sep=\"\\t\",\n", |
| 105 | + " names=[\"doc_id\", \"doc_text\"],\n", |
| 106 | + ")" |
| 107 | + ] |
| 108 | + }, |
| 109 | + { |
| 110 | + "cell_type": "markdown", |
| 111 | + "metadata": {}, |
| 112 | + "source": [ |
| 113 | + "## Select the Documents to Use\n", |
| 114 | + "RobustQA consists on 1.7M documents. Hence, it takes around 3 hours to build the whole index.\n", |
| 115 | + "\n", |
| 116 | + "To run a test, we can use 1% of the dataset. This will be accomplished by selecting the first 1% available documents and the questions referent to these documents." |
| 117 | + ] |
| 118 | + }, |
| 119 | + { |
| 120 | + "cell_type": "code", |
| 121 | + "execution_count": null, |
| 122 | + "metadata": {}, |
| 123 | + "outputs": [], |
| 124 | + "source": [ |
| 125 | + "proportion_to_use = 1 / 100\n", |
| 126 | + "amount_of_docs_to_use = int(len(lfrqa_docs_df) * proportion_to_use)\n", |
| 127 | + "print(f\"Using {amount_of_docs_to_use} out of {len(lfrqa_docs_df)} documents\")" |
| 128 | + ] |
| 129 | + }, |
| 130 | + { |
| 131 | + "cell_type": "markdown", |
| 132 | + "metadata": {}, |
| 133 | + "source": [ |
| 134 | + "## Prepare the Document Files\n", |
| 135 | + "We now create the document directory and store each document as a separate text file, so that paperqa can build the index." |
| 136 | + ] |
| 137 | + }, |
| 138 | + { |
| 139 | + "cell_type": "code", |
| 140 | + "execution_count": null, |
| 141 | + "metadata": {}, |
| 142 | + "outputs": [], |
| 143 | + "source": [ |
| 144 | + "partial_docs = lfrqa_docs_df.head(amount_of_docs_to_use)\n", |
| 145 | + "lfrqa_directory = os.path.join(rag_qa_benchmarking_dir, \"lfrqa\")\n", |
| 146 | + "os.makedirs(\n", |
| 147 | + " os.path.join(lfrqa_directory, \"science_docs_for_paperqa\", \"files\"), exist_ok=True\n", |
| 148 | + ")\n", |
| 149 | + "\n", |
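|  | +    "# Write each document as <doc_id>.txt so paperqa can index it from the files directory\n", |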
| 150 | + "for i, row in partial_docs.iterrows():\n", |
| 151 | + " doc_id = row[\"doc_id\"]\n", |
| 152 | + " doc_text = row[\"doc_text\"]\n", |
| 153 | + "\n", |
| 154 | + " with open(\n", |
| 155 | + " os.path.join(\n", |
| 156 | + " lfrqa_directory, \"science_docs_for_paperqa\", \"files\", f\"{doc_id}.txt\"\n", |
| 157 | + " ),\n", |
| 158 | + " \"w\",\n", |
| 159 | + " encoding=\"utf-8\",\n", |
| 160 | + " ) as f:\n", |
| 161 | + " f.write(doc_text)\n", |
| 162 | + "\n", |
| 163 | + " if i % int(len(partial_docs) * 0.05) == 0:\n", |
| 164 | + " progress = (i + 1) / len(partial_docs)\n", |
| 165 | + " print(f\"Progress: {progress:.2%}\")" |
| 166 | + ] |
| 167 | + }, |
| 168 | + { |
| 169 | + "cell_type": "markdown", |
| 170 | + "metadata": {}, |
| 171 | + "source": [ |
| 172 | + "## Create the Manifest File\n", |
| 173 | + "The **manifest file** keeps track of document metadata for the dataset. We need to fill some fields so that paperqa doesn’t try to get metadata using llm calls. This will make the indexing process faster." |
| 174 | + ] |
| 175 | + }, |
| 176 | + { |
| 177 | + "cell_type": "code", |
| 178 | + "execution_count": null, |
| 179 | + "metadata": {}, |
| 180 | + "outputs": [], |
| 181 | + "source": [ |
| 182 | + "manifest = partial_docs.copy()\n", |
| 183 | + "manifest[\"file_location\"] = manifest[\"doc_id\"].apply(lambda x: f\"files/{x}.txt\")\n", |
| 184 | + "manifest[\"doi\"] = \"\"\n", |
| 185 | + "manifest[\"title\"] = manifest[\"doc_id\"]\n", |
| 186 | + "manifest[\"key\"] = manifest[\"doc_id\"]\n", |
| 187 | + "manifest[\"docname\"] = manifest[\"doc_id\"]\n", |
| 188 | + "manifest[\"citation\"] = \"_\"\n", |
| 189 | + "manifest = manifest.drop(columns=[\"doc_id\", \"doc_text\"])\n", |
| 190 | + "manifest.to_csv(\n", |
| 191 | + " os.path.join(lfrqa_directory, \"science_docs_for_paperqa\", \"manifest.csv\"),\n", |
| 192 | + " index=False,\n", |
| 193 | + ")" |
| 194 | + ] |
| 195 | + }, |
| 196 | + { |
| 197 | + "cell_type": "markdown", |
| 198 | + "metadata": {}, |
| 199 | + "source": [ |
| 200 | + "## Filter and Save Questions\n", |
| 201 | + "Finally, we load the questions and filter them to ensure we only include questions that reference the selected documents:" |
| 202 | + ] |
| 203 | + }, |
| 204 | + { |
| 205 | + "cell_type": "code", |
| 206 | + "execution_count": null, |
| 207 | + "metadata": {}, |
| 208 | + "outputs": [], |
| 209 | + "source": [ |
| 210 | + "questions_df = pd.read_json(\n", |
| 211 | + " os.path.join(rag_qa_benchmarking_dir, \"annotations_science_with_citation.jsonl\"),\n", |
| 212 | + " lines=True,\n", |
| 213 | + ")\n", |
| 214 | + "partial_questions = questions_df[\n", |
| 215 | + " questions_df.gold_doc_ids.apply(\n", |
| 216 | + " lambda ids: all(_id < amount_of_docs_to_use for _id in ids)\n", |
| 217 | + " )\n", |
| 218 | + "]\n", |
| 219 | + "partial_questions.to_csv(\n", |
| 220 | + " os.path.join(lfrqa_directory, \"questions.csv\"),\n", |
| 221 | + " index=False,\n", |
| 222 | + ")\n", |
| 223 | + "\n", |
| 224 | + "print(\"Using\", len(partial_questions), \"questions\")" |
| 225 | + ] |
| 226 | + }, |
| 227 | + { |
| 228 | + "cell_type": "markdown", |
| 229 | + "metadata": {}, |
| 230 | + "source": [ |
| 231 | + "## Install paperqa\n", |
| 232 | + "From now on, we will be using the paperqa library, so we need to install it:" |
| 233 | + ] |
| 234 | + }, |
| 235 | + { |
| 236 | + "cell_type": "code", |
| 237 | + "execution_count": null, |
| 238 | + "metadata": {}, |
| 239 | + "outputs": [], |
| 240 | + "source": [ |
| 241 | + "!pip install paper-qa" |
| 242 | + ] |
| 243 | + }, |
| 244 | + { |
| 245 | + "cell_type": "markdown", |
| 246 | + "metadata": {}, |
| 247 | + "source": [ |
| 248 | + "## Index the Documents\n", |
| 249 | + "\n", |
| 250 | + "Now we will build an index for the LFRQA documents. The index is a **Tantivy index**, which is a fast, full-text search engine library written in Rust. Tantivy is designed to handle large datasets efficiently, making it ideal for searching through a vast collection of papers or documents.\n", |
| 251 | + "\n", |
| 252 | + "Feel free to adjust the concurrency settings as you like. Because we defined a manifest, we don’t need any API keys for building this index because we don't discern any citation metadata, but you do need LLM API keys to answer questions.\n", |
| 253 | + "\n", |
| 254 | + "Remember that this process is quick for small portions of the dataset, but can take around 3 hours for the whole dataset." |
| 255 | + ] |
| 256 | + }, |
| 257 | + { |
| 258 | + "cell_type": "code", |
| 259 | + "execution_count": null, |
| 260 | + "metadata": {}, |
| 261 | + "outputs": [], |
| 262 | + "source": [ |
| 263 | + "import nest_asyncio\n", |
| 264 | + "\n", |
| 265 | + "nest_asyncio.apply()" |
| 266 | + ] |
| 267 | + }, |
| 268 | + { |
| 269 | + "cell_type": "markdown", |
| 270 | + "metadata": {}, |
| 271 | + "source": [ |
| 272 | + "We add the line above to handle async code within a notebook.\n", |
| 273 | + "\n", |
| 274 | + "However, to improve compatibility and speed up the indexing process, we strongly recommend running the following code in a separate `.py` file" |
| 275 | + ] |
| 276 | + }, |
| 277 | + { |
| 278 | + "cell_type": "code", |
| 279 | + "execution_count": null, |
| 280 | + "metadata": {}, |
| 281 | + "outputs": [], |
| 282 | + "source": [ |
| 283 | + "import os\n", |
| 284 | + "\n", |
| 285 | + "from paperqa import Settings\n", |
| 286 | + "from paperqa.agents import build_index\n", |
| 287 | + "from paperqa.settings import AgentSettings, IndexSettings, ParsingSettings\n", |
| 288 | + "\n", |
| 289 | + "settings = Settings(\n", |
| 290 | + " agent=AgentSettings(\n", |
| 291 | + " index=IndexSettings(\n", |
| 292 | + " name=\"lfrqa_science_index\",\n", |
| 293 | + " paper_directory=os.path.join(\n", |
| 294 | + " \"data\", \"rag-qa-benchmarking\", \"lfrqa\", \"science_docs_for_paperqa\"\n", |
| 295 | + " ),\n", |
| 296 | + " index_directory=os.path.join(\n", |
| 297 | + " \"data\", \"rag-qa-benchmarking\", \"lfrqa\", \"science_docs_for_paperqa_index\"\n", |
| 298 | + " ),\n", |
| 299 | + " manifest_file=\"manifest.csv\",\n", |
| 300 | + " concurrency=10_000,\n", |
| 301 | + " batch_size=10_000,\n", |
| 302 | + " )\n", |
| 303 | + " ),\n", |
| 304 | + " parsing=ParsingSettings(\n", |
| 305 | + " use_doc_details=False,\n", |
| 306 | + " defer_embedding=True,\n", |
| 307 | + " ),\n", |
| 308 | + ")\n", |
| 309 | + "\n", |
| 310 | + "build_index(settings=settings)" |
| 311 | + ] |
| 312 | + }, |
| 313 | + { |
| 314 | + "cell_type": "markdown", |
| 315 | + "metadata": {}, |
| 316 | + "source": [ |
| 317 | + "After this runs, you will have an index ready to use!" |
| 318 | + ] |
| 319 | + }, |
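|  | +  { |
|  | +   "cell_type": "markdown", |
|  | +   "metadata": {}, |
|  | +   "source": [ |
|  | +    "As a quick sanity check, you can query the freshly built index with paperqa's `ask` helper. The snippet below is a minimal sketch: it assumes your LLM API keys are configured, it reuses the index settings from above, and the question is just an arbitrary placeholder." |
|  | +   ] |
|  | +  }, |
|  | +  { |
|  | +   "cell_type": "code", |
|  | +   "execution_count": null, |
|  | +   "metadata": {}, |
|  | +   "outputs": [], |
|  | +   "source": [ |
|  | +    "import os\n", |
|  | +    "\n", |
|  | +    "from paperqa import Settings, ask\n", |
|  | +    "from paperqa.settings import AgentSettings, IndexSettings\n", |
|  | +    "\n", |
|  | +    "# Query the index we just built; the question is only a placeholder.\n", |
|  | +    "answer_response = ask(\n", |
|  | +    "    \"Why do some metals glow red when heated?\",\n", |
|  | +    "    settings=Settings(\n", |
|  | +    "        agent=AgentSettings(\n", |
|  | +    "            index=IndexSettings(\n", |
|  | +    "                name=\"lfrqa_science_index\",\n", |
|  | +    "                paper_directory=os.path.join(\n", |
|  | +    "                    \"data\", \"rag-qa-benchmarking\", \"lfrqa\", \"science_docs_for_paperqa\"\n", |
|  | +    "                ),\n", |
|  | +    "                index_directory=os.path.join(\n", |
|  | +    "                    \"data\",\n", |
|  | +    "                    \"rag-qa-benchmarking\",\n", |
|  | +    "                    \"lfrqa\",\n", |
|  | +    "                    \"science_docs_for_paperqa_index\",\n", |
|  | +    "                ),\n", |
|  | +    "            )\n", |
|  | +    "        )\n", |
|  | +    "    ),\n", |
|  | +    ")\n", |
|  | +    "# The returned object contains the generated answer and its supporting contexts\n", |
|  | +    "print(answer_response)" |
|  | +   ] |
|  | +  }, |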
| 320 | + { |
| 321 | + "cell_type": "markdown", |
| 322 | + "metadata": {}, |
| 323 | + "source": [ |
| 324 | + "## Benchmark!\n", |
| 325 | + "After you have built the index, you are ready to run the benchmark. We advice running this in a separate `.py` file.\n", |
| 326 | + "\n", |
| 327 | + "To run this, you will need to have the [`ldp`](https://github.com/Future-House/ldp) and [`fhaviary[lfrqa]`](https://github.com/Future-House/aviary/blob/main/packages/lfrqa/README.md#installation) packages installed.\n" |
| 328 | + ] |
| 329 | + }, |
| 330 | + { |
| 331 | + "cell_type": "code", |
| 332 | + "execution_count": null, |
| 333 | + "metadata": {}, |
| 334 | + "outputs": [], |
| 335 | + "source": [ |
| 336 | + "!pip install ldp \"fhaviary[lfrqa]\"" |
| 337 | + ] |
| 338 | + }, |
| 339 | + { |
| 340 | + "cell_type": "code", |
| 341 | + "execution_count": null, |
| 342 | + "metadata": {}, |
| 343 | + "outputs": [], |
| 344 | + "source": [ |
| 345 | + "import asyncio\n", |
| 346 | + "import json\n", |
| 347 | + "import logging\n", |
| 348 | + "import os\n", |
| 349 | + "\n", |
| 350 | + "import pandas as pd\n", |
| 351 | + "from aviary.envs.lfrqa import LFRQAQuestion, LFRQATaskDataset\n", |
| 352 | + "from ldp.agent import SimpleAgent\n", |
| 353 | + "from ldp.alg.runners import Evaluator, EvaluatorConfig\n", |
| 354 | + "\n", |
| 355 | + "from paperqa import Settings\n", |
| 356 | + "from paperqa.settings import AgentSettings, IndexSettings\n", |
| 357 | + "\n", |
| 358 | + "logging.basicConfig(level=logging.ERROR)\n", |
| 359 | + "\n", |
| 360 | + "log_results_dir = os.path.join(\"data\", \"rag-qa-benchmarking\", \"results\")\n", |
| 361 | + "os.makedirs(log_results_dir, exist_ok=True)\n", |
| 362 | + "\n", |
| 363 | + "\n", |
| 364 | + "async def log_evaluation_to_json(lfrqa_question_evaluation: dict) -> None: # noqa: RUF029\n", |
| 365 | + " json_path = os.path.join(\n", |
| 366 | + " log_results_dir, f\"{lfrqa_question_evaluation['qid']}.json\"\n", |
| 367 | + " )\n", |
| 368 | + " with open(json_path, \"w\") as f: # noqa: ASYNC230\n", |
| 369 | + " json.dump(lfrqa_question_evaluation, f, indent=2)\n", |
| 370 | + "\n", |
| 371 | + "\n", |
| 372 | + "async def evaluate() -> None:\n", |
| 373 | + " settings = Settings(\n", |
| 374 | + " agent=AgentSettings(\n", |
| 375 | + " index=IndexSettings(\n", |
| 376 | + " name=\"lfrqa_science_index\",\n", |
| 377 | + " paper_directory=os.path.join(\n", |
| 378 | + " \"data\", \"rag-qa-benchmarking\", \"lfrqa\", \"science_docs_for_paperqa\"\n", |
| 379 | + " ),\n", |
| 380 | + " index_directory=os.path.join(\n", |
| 381 | + " \"data\",\n", |
| 382 | + " \"rag-qa-benchmarking\",\n", |
| 383 | + " \"lfrqa\",\n", |
| 384 | + " \"science_docs_for_paperqa_index\",\n", |
| 385 | + " ),\n", |
| 386 | + " )\n", |
| 387 | + " )\n", |
| 388 | + " )\n", |
| 389 | + "\n", |
| 390 | + " data: list[LFRQAQuestion] = [\n", |
| 391 | + " LFRQAQuestion(**row)\n", |
| 392 | + " for row in pd.read_csv(\n", |
| 393 | + " os.path.join(\"data\", \"rag-qa-benchmarking\", \"lfrqa\", \"questions.csv\")\n", |
| 394 | + " )[[\"qid\", \"question\", \"answer\", \"gold_doc_ids\"]].to_dict(orient=\"records\")\n", |
| 395 | + " ]\n", |
| 396 | + "\n", |
| 397 | + " dataset = LFRQATaskDataset(\n", |
| 398 | + " data=data,\n", |
| 399 | + " settings=settings,\n", |
| 400 | + " evaluation_callback=log_evaluation_to_json,\n", |
| 401 | + " )\n", |
| 402 | + "\n", |
| 403 | + " evaluator = Evaluator(\n", |
| 404 | + " config=EvaluatorConfig(batch_size=3),\n", |
| 405 | + " agent=SimpleAgent(),\n", |
| 406 | + " dataset=dataset,\n", |
| 407 | + " )\n", |
| 408 | + " await evaluator.evaluate()\n", |
| 409 | + "\n", |
| 410 | + "\n", |
| 411 | + "if __name__ == \"__main__\":\n", |
| 412 | + " asyncio.run(evaluate())\n" |
| 413 | + ] |
| 414 | + }, |
| 415 | + { |
| 416 | + "cell_type": "markdown", |
| 417 | + "metadata": {}, |
| 418 | + "source": [ |
| 419 | + "After running this, you can find the results in the `data/rag-qa-benchmarking/results` folder. Here is an example of how to read them:" |
| 420 | + ] |
| 421 | + }, |
| 422 | + { |
| 423 | + "cell_type": "code", |
| 424 | + "execution_count": null, |
| 425 | + "metadata": {}, |
| 426 | + "outputs": [], |
| 427 | + "source": [ |
| 428 | + "import glob\n", |
| 429 | + "\n", |
| 430 | + "json_files = glob.glob(os.path.join(rag_qa_benchmarking_dir, \"results\", \"*.json\"))\n", |
| 431 | + "\n", |
| 432 | + "data = []\n", |
| 433 | + "for file in json_files:\n", |
| 434 | + " with open(file) as f:\n", |
| 435 | + " json_data = json.load(f)\n", |
| 436 | + " json_data[\"qid\"] = file.split(\"/\")[-1].replace(\".json\", \"\")\n", |
| 437 | + " data.append(json_data)\n", |
| 438 | + "\n", |
| 439 | + "results_df = pd.DataFrame(data).set_index(\"qid\")\n", |
| 440 | + "results_df[\"winner\"].value_counts(normalize=True)\n" |
| 441 | + ] |
| 442 | + } |
| 443 | + ], |
| 444 | + "metadata": { |
| 445 | + "kernelspec": { |
| 446 | + "display_name": ".venv", |
| 447 | + "language": "python", |
| 448 | + "name": "python3" |
| 449 | + }, |
| 450 | + "language_info": { |
| 451 | + "codemirror_mode": { |
| 452 | + "name": "ipython", |
| 453 | + "version": 3 |
| 454 | + }, |
| 455 | + "file_extension": ".py", |
| 456 | + "mimetype": "text/x-python", |
| 457 | + "name": "python", |
| 458 | + "nbconvert_exporter": "python", |
| 459 | + "pygments_lexer": "ipython3" |
| 460 | + } |
| 461 | + }, |
| 462 | + "nbformat": 4, |
| 463 | + "nbformat_minor": 2 |
| 464 | +} |