From 293e517cc5b5fbc3f895d569402e7c06f2ae19ad Mon Sep 17 00:00:00 2001 From: vinhn Date: Wed, 21 May 2025 03:09:14 +0000 Subject: [PATCH 1/5] adding the NIM LLM TCO calculator tool --- .../TCO_calculator/TCO_calculator.ipynb | 621 ++++++++++++++++++ 1 file changed, 621 insertions(+) create mode 100644 genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb diff --git a/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb b/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb new file mode 100644 index 00000000..c36d6776 --- /dev/null +++ b/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb @@ -0,0 +1,621 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "id": "f3e4c571-624d-4db6-b4d9-ae912879967b", + "metadata": {}, + "source": [ + "# GenAI-perf -> NIM LLM TCO Calculator Data Connector\n", + "\n", + "This notebook shows you how to do LLM performance benchmarking with the NVIDIA GenAI-perf tool and then export the data to an Excel spreadsheet, which can be used to transfer the data to the NIM [spreadsheet TCO calculator tool](https://docs.google.com/spreadsheets/d/1UF_sy89kcLIkdnK0dC-6QwcAgVDUV0ANJ22JnC2dW7g/edit?gid=0#gid=0).\n", + "\n", + "Note: the NIM LLM TCO calculator is implemented as a Google spreadsheet. Please make a private copy for your own usage.\n", + "\n", + "\n", + "To execute this notebook, you can use the NVIDIA Pytorch container:\n", + "```\n", + "docker run --gpus=all --ipc=host --net=host --rm -it -v $PWD:/myworkspace nvcr.io/nvidia/pytorch:25.03-py3 bash \n", + "```\n", + "\n", + "Then from within the docker interactive session:\n", + "```\n", + "jupyter lab --ip 0.0.0.0 --port=8888 --allow-root --notebook-dir=/myworkspace\n", + "```\n", + "\n", + "First, we define some metadata fields describing the deployment environment.\n", + "\n", + "**Notes:**\n", + "- NIM engine ID provides both the backend type (e.g. TensorRT-LLM, vLLM or SGlang) and precision. You can find this information when the NIM container starts.\n", + "\n", + "- This notebook collects data corresponding to a single deployment environment described by the metadata field. " + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "93c18473-09ea-4a6f-87fa-d67fa3f7daa5", + "metadata": {}, + "outputs": [], + "source": [ + "meta_field = {\n", + " 'Model': \"meta-llama/Meta-Llama-3-8B-Instruct\",\n", + " 'GPU Type': \"H100_80GB\",\n", + " 'number_of_gpus': 1,\n", + " 'Precision': \"BF16\",\n", + " 'Execution Mode': \"NIM-TRTLLM\",\n", + "}\n" + ] + }, + { + "cell_type": "markdown", + "id": "70b3df53-c103-4de2-81f5-419aa4d65f83", + "metadata": {}, + "source": [ + "## Pre-requisite\n", + "\n", + "First, we install the GenAI-perf tool in the Pytorch container. \n", + "As a client-side LLM-focused benchmarking tool, NVIDIA GenAI-Perf provides key metrics such as time to first token (TTFT), inter-token latency (ITL), tokens per second (TPS), requests per second (RPS) and more. GenAI-Perf also supports any LLM inference service conforming to the OpenAI API specification, a widely accepted de facto standard in the industry. For this benchmarking guide, we’ll use NVIDIA NIM, a collection of inference microservices that offer high-throughput and low-latency inference for both base and fine-tuned LLMs. NIM features ease-of-use and enterprise-grade security and manageability. 
\n", + "\n", + "### Install GenAI-perf tool" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad5de6fe-8547-4259-956a-980aa8b71dce", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "%%bash\n", + "pip install genai-perf==0.0.12" + ] + }, + { + "cell_type": "markdown", + "id": "9e6351a6-a5a3-4067-831e-abe26ae53969", + "metadata": {}, + "source": [ + "### Setting up a NIM LLM server (optional)\n", + "\n", + "If you don't already have a target for benchmarking, like an openAI compatible LLM service, let's setup one. \n", + "\n", + "NVIDIA NIM provides the easiest and quickest way to put LLMs and other AI foundation models into production. Read [A Simple Guide to Deploying Generative AI with NVIDIA NIM](https://developer.nvidia.com/blog/a-simple-guide-to-deploying-generative-ai-with-nvidia-nim/) or consult the latest [NIM LLM documentation](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html) to get started, which will walk you through hardware requirements and prerequisites, including NVIDIA NGC API keys.\n", + "\n", + "For convenience, the following commands have been provided for deploying NIM and executing inference from the [Getting Started Guide](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html): \n", + "\n", + " \n", + "```\n", + "export NGC_API_KEY= \n", + "\n", + "# Choose a container name for bookkeeping\n", + "export CONTAINER_NAME=llama-3.1-8b-instruct\n", + "\n", + "# Choose a LLM NIM Image from NGC\n", + "export IMG_NAME=\"nvcr.io/nim/meta/${CONTAINER_NAME}:latest\"\n", + "\n", + "# Choose a path on your system to cache the downloaded models\n", + "export LOCAL_NIM_CACHE=./cache/nim\n", + "mkdir -p \"$LOCAL_NIM_CACHE\"\n", + "\n", + "# Start the LLM NIM\n", + "docker run -it --rm --name=$CONTAINER_NAME \\\n", + " --gpus all \\\n", + " --shm-size=16GB \\\n", + " -e NGC_API_KEY \\\n", + " -v \"$LOCAL_NIM_CACHE:/opt/nim/.cache\" \\\n", + " -u $(id -u) \\\n", + " -p 8000:8000 \\\n", + " $IMG_NAME\n", + "```\n", + "\n", + "\n", + "## Performance benchmarking script\n", + "\n", + "The next step is to define the use cases (i.e. input/output sequence length scenarios) and carry out the benchmarking." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "e8395733-ce18-4447-845c-b3579acc2067", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Writing benchmark.sh\n" + ] + } + ], + "source": [ + "%%writefile benchmark.sh\n", + "declare -A useCases\n", + "\n", + "# Populate the array with use case descriptions and their specified input/output lengths\n", + "useCases[\"Translation\"]=\"200/200\"\n", + "useCases[\"Text classification\"]=\"200/5\"\n", + "useCases[\"Text summary\"]=\"1000/200\"\n", + "useCases[\"Code generation\"]=\"200/1000\"\n", + "\n", + "# Function to execute genAI-perf with the input/output lengths as arguments\n", + "runBenchmark() {\n", + " local description=\"$1\"\n", + " local lengths=\"${useCases[$description]}\"\n", + " IFS='/' read -r inputLength outputLength <<< \"$lengths\"\n", + "\n", + " echo \"Running genAI-perf for $description with input length $inputLength and output length $outputLength\"\n", + " #Runs\n", + " for concurrency in 1 2 5 10 50 100 250; do\n", + "\n", + " local INPUT_SEQUENCE_LENGTH=$inputLength\n", + " local INPUT_SEQUENCE_STD=0\n", + " local OUTPUT_SEQUENCE_LENGTH=$outputLength\n", + " local CONCURRENCY=$concurrency\n", + " local MODEL=meta/llama-3.1-8b-instruct\n", + " \n", + " genai-perf profile \\\n", + " -m $MODEL \\\n", + " --endpoint-type chat \\\n", + " --service-kind openai \\\n", + " --streaming \\\n", + " -u localhost:8000 \\\n", + " --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \\\n", + " --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \\\n", + " --concurrency $CONCURRENCY \\\n", + " --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \\\n", + " --extra-inputs max_tokens:$OUTPUT_SEQUENCE_LENGTH \\\n", + " --extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \\\n", + " --extra-inputs ignore_eos:true \\\n", + " --tokenizer meta-llama/Meta-Llama-3-8B-Instruct \\\n", + " --measurement-interval 30000 \\\n", + " --profile-export-file ${INPUT_SEQUENCE_LENGTH}_${OUTPUT_SEQUENCE_LENGTH}.json \\\n", + " -- \\\n", + " -v \\\n", + " --max-threads=256\n", + " \n", + " done\n", + "}\n", + "\n", + "# Iterate over all defined use cases and run the benchmark script for each\n", + "for description in \"${!useCases[@]}\"; do\n", + " runBenchmark \"$description\"\n", + "done\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "603f1941-5206-4bca-a547-028e0ea50f21", + "metadata": {}, + "source": [ + "Next, we execute the bash script, which will carry out the defined benchmarking scenarios and gather the data in a default directory named `artifacts` under the current working directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6cbfacd3-5755-4c0b-ae23-3abffceebbdb", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "%%bash\n", + "bash benchmark.sh" + ] + }, + { + "cell_type": "markdown", + "id": "c480b28d-6816-4c84-9124-bdc56fc81f41", + "metadata": {}, + "source": [ + "## Reading gen-AI-perf data\n", + "\n", + "Once performance benchmarking is done, we read and collect the results in a single data frame." 
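+    "\n",
+    "GenAI-perf names each artifact directory after the model, endpoint type and concurrency level. If you are unsure which value to use for the `directory_prefix` variable below, a quick way to check (assuming the default `./artifacts` output location) is:\n",
+    "\n",
+    "```\n",
+    "# Each run creates a directory such as meta_llama-3.1-8b-instruct-openai-chat-concurrency1\n",
+    "ls ./artifacts/\n",
+    "```\n",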
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "3815b7f1-b51b-44ea-bbbc-b023ec0beeca", + "metadata": {}, + "outputs": [], + "source": [ + "gen_AI_perf_field = [\n", + " 'Inter Token 90th Percentile Latency (ms)',\n", + " 'Inter Token 99th Percentile Latency (ms)',\n", + " 'Inter Token Average Latency (ms)',\n", + " 'Time to First Token 90th Percentile Latency (ms)',\n", + " 'Time to First Token 99th Percentile Latency (ms)',\n", + " 'Time to First Token Average Latency (ms)',\n", + " 'Request 90th Percentile Latency (ms)',\n", + " 'Request 99th Percentile Latency (ms)',\n", + " 'Request Latency (ms)',\n", + " 'Requests per Second',\n", + " 'Tokens per Second']\n", + "\n", + "# Other experimental params: 'Seq Length (ISL/OSL)', 'Concurrency'," + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "ff69c986-0c9b-46a9-8f28-800cd61ab24d", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/tmp/ipykernel_203/961144172.py:38: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n", + " df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)\n" + ] + } + ], + "source": [ + "import os\n", + "import json\n", + "import pandas as pd\n", + "\n", + "root_dir = \"./artifacts\"\n", + "directory_prefix = \"meta_llama-3.1-8b-instruct-openai-chat-concurrency\" # Change this to fit the actual model deployed\n", + "\n", + "ISL_OSL_list = [\"200_5\", \"200_200\", \"1000_200\", \"200_1000\"]\n", + "concurrencies = [1, 2, 5, 10, 50, 100, 250]\n", + "df = pd.DataFrame(columns=gen_AI_perf_field)\n", + "\n", + "for con in concurrencies:\n", + " for ISL_OSL in ISL_OSL_list:\n", + " filename = os.path.join(root_dir, directory_prefix+str(con), f\"{ISL_OSL}_genai_perf.json\")\n", + " \n", + " # Open and read the file\n", + " with open(filename, 'r') as file:\n", + " data = json.load(file)\n", + " \n", + " row = {\n", + " 'Inter Token 90th Percentile Latency (ms)': data[\"inter_token_latency\"][\"p90\"],\n", + " 'Inter Token 99th Percentile Latency (ms)': data[\"inter_token_latency\"][\"p99\"],\n", + " 'Inter Token Average Latency (ms)': data[\"inter_token_latency\"][\"avg\"],\n", + " 'Time to First Token 90th Percentile Latency (ms)': data[\"time_to_first_token\"][\"p90\"],\n", + " 'Time to First Token 99th Percentile Latency (ms)': data[\"time_to_first_token\"][\"p99\"],\n", + " 'Time to First Token Average Latency (ms)': data[\"time_to_first_token\"][\"avg\"],\n", + " 'Request 90th Percentile Latency (ms)': data[\"request_latency\"][\"p90\"],\n", + " 'Request 99th Percentile Latency (ms)': data[\"request_latency\"][\"p99\"],\n", + " 'Request Latency (ms)': data[\"request_latency\"][\"avg\"],\n", + " 'Requests per Second': data[\"request_throughput\"][\"avg\"],\n", + " 'Tokens per Second': data[\"output_token_throughput\"][\"avg\"],\n", + " 'Seq Length (ISL/OSL)': ISL_OSL,\n", + " 'Concurrency': con\n", + " } \n", + " \n", + " row = meta_field | row\n", + " \n", + " df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)" + ] + }, + { + "cell_type": "markdown", + "id": "3a997b59-3d4e-4877-953c-088563aa8998", + "metadata": {}, + "source": [ + "## Exporting data to excel format\n", + "\n", + "We next export the benchmarking data to a TCO-tool compatible format, which comprises both metadata 
fields as well as performance metric fields." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "f39710a9-882c-44aa-b428-d7ed2976eb23", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Inter Token 90th Percentile Latency (ms)Inter Token 99th Percentile Latency (ms)Inter Token Average Latency (ms)Time to First Token 90th Percentile Latency (ms)Time to First Token 99th Percentile Latency (ms)Time to First Token Average Latency (ms)Request 90th Percentile Latency (ms)Request 99th Percentile Latency (ms)Request Latency (ms)Requests per SecondTokens per SecondModelGPU Typenumber_of_gpusPrecisionExecution ModeSeq Length (ISL/OSL)Concurrency
09.59422510.3844539.04113118.40917219.84372817.39371166.55711171.71656462.59936615.96136095.768158meta-llama/Meta-Llama-3-8B-InstructH100_80GB1.0BF16NIM-TRTLLM200_51.0
110.88788811.26320010.61502718.01117738.89382518.1887442195.4008602265.8675402138.4097000.46759993.865874meta-llama/Meta-Llama-3-8B-InstructH100_80GB1.0BF16NIM-TRTLLM200_2001.0
211.61893311.99843611.21038262.15880579.05302054.1334572390.4210832467.3646412294.2889860.43582987.527501meta-llama/Meta-Llama-3-8B-InstructH100_80GB1.0BF16NIM-TRTLLM1000_2001.0
311.37618411.40223711.15512419.12046519.44114418.44150711367.16659911417.40178611155.8998360.08963489.584068meta-llama/Meta-Llama-3-8B-InstructH100_80GB1.0BF16NIM-TRTLLM200_10001.0
410.99790413.01379210.07681333.62154540.71949830.21019686.358385100.11430480.59426324.799054148.794324meta-llama/Meta-Llama-3-8B-InstructH100_80GB1.0BF16NIM-TRTLLM200_52.0
\n", + "
" + ], + "text/plain": [ + " Inter Token 90th Percentile Latency (ms) \\\n", + "0 9.594225 \n", + "1 10.887888 \n", + "2 11.618933 \n", + "3 11.376184 \n", + "4 10.997904 \n", + "\n", + " Inter Token 99th Percentile Latency (ms) Inter Token Average Latency (ms) \\\n", + "0 10.384453 9.041131 \n", + "1 11.263200 10.615027 \n", + "2 11.998436 11.210382 \n", + "3 11.402237 11.155124 \n", + "4 13.013792 10.076813 \n", + "\n", + " Time to First Token 90th Percentile Latency (ms) \\\n", + "0 18.409172 \n", + "1 18.011177 \n", + "2 62.158805 \n", + "3 19.120465 \n", + "4 33.621545 \n", + "\n", + " Time to First Token 99th Percentile Latency (ms) \\\n", + "0 19.843728 \n", + "1 38.893825 \n", + "2 79.053020 \n", + "3 19.441144 \n", + "4 40.719498 \n", + "\n", + " Time to First Token Average Latency (ms) \\\n", + "0 17.393711 \n", + "1 18.188744 \n", + "2 54.133457 \n", + "3 18.441507 \n", + "4 30.210196 \n", + "\n", + " Request 90th Percentile Latency (ms) Request 99th Percentile Latency (ms) \\\n", + "0 66.557111 71.716564 \n", + "1 2195.400860 2265.867540 \n", + "2 2390.421083 2467.364641 \n", + "3 11367.166599 11417.401786 \n", + "4 86.358385 100.114304 \n", + "\n", + " Request Latency (ms) Requests per Second Tokens per Second \\\n", + "0 62.599366 15.961360 95.768158 \n", + "1 2138.409700 0.467599 93.865874 \n", + "2 2294.288986 0.435829 87.527501 \n", + "3 11155.899836 0.089634 89.584068 \n", + "4 80.594263 24.799054 148.794324 \n", + "\n", + " Model GPU Type number_of_gpus Precision \\\n", + "0 meta-llama/Meta-Llama-3-8B-Instruct H100_80GB 1.0 BF16 \n", + "1 meta-llama/Meta-Llama-3-8B-Instruct H100_80GB 1.0 BF16 \n", + "2 meta-llama/Meta-Llama-3-8B-Instruct H100_80GB 1.0 BF16 \n", + "3 meta-llama/Meta-Llama-3-8B-Instruct H100_80GB 1.0 BF16 \n", + "4 meta-llama/Meta-Llama-3-8B-Instruct H100_80GB 1.0 BF16 \n", + "\n", + " Execution Mode Seq Length (ISL/OSL) Concurrency \n", + "0 NIM-TRTLLM 200_5 1.0 \n", + "1 NIM-TRTLLM 200_200 1.0 \n", + "2 NIM-TRTLLM 1000_200 1.0 \n", + "3 NIM-TRTLLM 200_1000 1.0 \n", + "4 NIM-TRTLLM 200_5 2.0 " + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5baf8e86-c8d1-42fc-94d3-15b592a5adc9", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install openpyxl" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "125f78e6-cc51-4091-bb16-9a1d8403d6cf", + "metadata": {}, + "outputs": [], + "source": [ + "columns = [\n", + " 'Model',\n", + " 'GPU Type',\n", + " 'Seq Length (ISL/OSL)',\n", + " 'number_of_gpus',\n", + " 'Concurrency',\n", + " 'Precision',\n", + " 'Execution Mode',\n", + " 'Inter Token 90th Percentile Latency (ms)',\n", + " 'Inter Token 99th Percentile Latency (ms)',\n", + " 'Inter Token Average Latency (ms)',\n", + " 'Time to First Token 90th Percentile Latency (ms)',\n", + " 'Time to First Token 99th Percentile Latency (ms)',\n", + " 'Time to First Token Average Latency (ms)',\n", + " 'Request 90th Percentile Latency (ms)',\n", + " 'Request 99th Percentile Latency (ms)',\n", + " 'Request Latency (ms)',\n", + " 'Requests per Second',\n", + " 'Tokens per Second'\n", + " ]\n", + "df[columns].to_excel('data.xlsx', index=False)\n" + ] + }, + { + "cell_type": "markdown", + "id": "becc138b-6d92-49aa-a9a6-3ad31ad75c87", + "metadata": {}, + "source": [ + "## Importing the data to the TCO calculator\n", + "\n", + "The [NIM TCO calculator 
tool](https://docs.google.com/spreadsheets/d/1UF_sy89kcLIkdnK0dC-6QwcAgVDUV0ANJ22JnC2dW7g/edit?gid=0#gid=0) is implemented as a Google spreadsheet. You can use Google spreadsheet to open the excel file above, then simply copy the data rows into the \"data\" subsheet of the TCO calculator. That will complete the import phase and make the new data available in the TCO calculator." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d14c9fb5-244a-4471-b66e-8a78e9a97d2b", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 1b9fc8f3cf37a5cd971044ec2f20e6c739d067a3 Mon Sep 17 00:00:00 2001 From: Vinh Nguyen Date: Thu, 22 May 2025 10:14:22 +1000 Subject: [PATCH 2/5] Update genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb Co-authored-by: Anthony Casagrande --- genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb b/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb index c36d6776..c885f682 100644 --- a/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb +++ b/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb @@ -80,7 +80,7 @@ "source": [ "### Setting up a NIM LLM server (optional)\n", "\n", - "If you don't already have a target for benchmarking, like an openAI compatible LLM service, let's setup one. \n", + "If you don't already have a target for benchmarking, like an OpenAI compatible LLM service, let's setup one. \n", "\n", "NVIDIA NIM provides the easiest and quickest way to put LLMs and other AI foundation models into production. 
Read [A Simple Guide to Deploying Generative AI with NVIDIA NIM](https://developer.nvidia.com/blog/a-simple-guide-to-deploying-generative-ai-with-nvidia-nim/) or consult the latest [NIM LLM documentation](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html) to get started, which will walk you through hardware requirements and prerequisites, including NVIDIA NGC API keys.\n", "\n", From f9b898ed3536afef21207fc819287f0972f9672b Mon Sep 17 00:00:00 2001 From: Vinh Nguyen Date: Thu, 22 May 2025 10:18:07 +1000 Subject: [PATCH 3/5] Update genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb Co-authored-by: Anthony Casagrande --- genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb b/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb index c885f682..35937309 100644 --- a/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb +++ b/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb @@ -270,7 +270,7 @@ "\n", "for con in concurrencies:\n", " for ISL_OSL in ISL_OSL_list:\n", - " filename = os.path.join(root_dir, directory_prefix+str(con), f\"{ISL_OSL}_genai_perf.json\")\n", + " filename = os.path.join(root_dir, f\"{directory_prefix}{con}\", f\"{ISL_OSL}_genai_perf.json\")\n", " \n", " # Open and read the file\n", " with open(filename, 'r') as file:\n", From 7a88035a5e4f41acd50e0b0286cabbe16966dac2 Mon Sep 17 00:00:00 2001 From: vinhn Date: Thu, 22 May 2025 01:51:28 +0000 Subject: [PATCH 4/5] fix after review --- .../TCO_calculator/TCO_calculator.ipynb | 214 +++++++----------- 1 file changed, 83 insertions(+), 131 deletions(-) diff --git a/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb b/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb index 35937309..7c098879 100644 --- a/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb +++ b/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb @@ -28,12 +28,12 @@ "**Notes:**\n", "- NIM engine ID provides both the backend type (e.g. TensorRT-LLM, vLLM or SGlang) and precision. You can find this information when the NIM container starts.\n", "\n", - "- This notebook collects data corresponding to a single deployment environment described by the metadata field. " + "- This notebook collects data corresponding to a single deployment environment described by the metadata field. In this tutorial, we will make use of the `Meta-Llama-3-8B-Instruct` model. Note that NVIDIA NGC and HuggingFace model hub use slightly different identifier for this model." 
] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 2, "id": "93c18473-09ea-4a6f-87fa-d67fa3f7daa5", "metadata": {}, "outputs": [], @@ -44,7 +44,7 @@ " 'number_of_gpus': 1,\n", " 'Precision': \"BF16\",\n", " 'Execution Mode': \"NIM-TRTLLM\",\n", - "}\n" + "}" ] }, { @@ -64,9 +64,7 @@ "cell_type": "code", "execution_count": null, "id": "ad5de6fe-8547-4259-956a-980aa8b71dce", - "metadata": { - "scrolled": true - }, + "metadata": {}, "outputs": [], "source": [ "%%bash\n", @@ -90,18 +88,16 @@ "```\n", "export NGC_API_KEY= \n", "\n", - "# Choose a container name for bookkeeping\n", - "export CONTAINER_NAME=llama-3.1-8b-instruct\n", - "\n", "# Choose a LLM NIM Image from NGC\n", - "export IMG_NAME=\"nvcr.io/nim/meta/${CONTAINER_NAME}:latest\"\n", + "export CONTAINER_NAME=meta/llama-3.1-8b-instruct # NGC model name\n", + "export IMG_NAME=\"nvcr.io/nim/${CONTAINER_NAME}:latest\"\n", "\n", "# Choose a path on your system to cache the downloaded models\n", "export LOCAL_NIM_CACHE=./cache/nim\n", "mkdir -p \"$LOCAL_NIM_CACHE\"\n", "\n", "# Start the LLM NIM\n", - "docker run -it --rm --name=$CONTAINER_NAME \\\n", + "docker run -it --rm --name=llama-3.1-8b-instruct \\\n", " --gpus all \\\n", " --shm-size=16GB \\\n", " -e NGC_API_KEY \\\n", @@ -119,22 +115,19 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "id": "e8395733-ce18-4447-845c-b3579acc2067", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Writing benchmark.sh\n" - ] - } - ], + "outputs": [], "source": [ "%%writefile benchmark.sh\n", + "#!/usr/bin/env bash\n", + "\n", "declare -A useCases\n", "\n", + "export MODEL=meta/llama-3.1-8b-instruct # NGC model name\n", + "export TOKENIZER_PATH=meta-llama/Meta-Llama-3-8B-Instruct # Either a HF model or path to a local folder containing the tokenizer \n", + "\n", "# Populate the array with use case descriptions and their specified input/output lengths\n", "useCases[\"Translation\"]=\"200/200\"\n", "useCases[\"Text classification\"]=\"200/5\"\n", @@ -155,7 +148,6 @@ " local INPUT_SEQUENCE_STD=0\n", " local OUTPUT_SEQUENCE_LENGTH=$outputLength\n", " local CONCURRENCY=$concurrency\n", - " local MODEL=meta/llama-3.1-8b-instruct\n", " \n", " genai-perf profile \\\n", " -m $MODEL \\\n", @@ -170,7 +162,7 @@ " --extra-inputs max_tokens:$OUTPUT_SEQUENCE_LENGTH \\\n", " --extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \\\n", " --extra-inputs ignore_eos:true \\\n", - " --tokenizer meta-llama/Meta-Llama-3-8B-Instruct \\\n", + " --tokenizer $TOKENIZER_PATH \\\n", " --measurement-interval 30000 \\\n", " --profile-export-file ${INPUT_SEQUENCE_LENGTH}_${OUTPUT_SEQUENCE_LENGTH}.json \\\n", " -- \\\n", @@ -220,42 +212,10 @@ }, { "cell_type": "code", - "execution_count": 8, - "id": "3815b7f1-b51b-44ea-bbbc-b023ec0beeca", - "metadata": {}, - "outputs": [], - "source": [ - "gen_AI_perf_field = [\n", - " 'Inter Token 90th Percentile Latency (ms)',\n", - " 'Inter Token 99th Percentile Latency (ms)',\n", - " 'Inter Token Average Latency (ms)',\n", - " 'Time to First Token 90th Percentile Latency (ms)',\n", - " 'Time to First Token 99th Percentile Latency (ms)',\n", - " 'Time to First Token Average Latency (ms)',\n", - " 'Request 90th Percentile Latency (ms)',\n", - " 'Request 99th Percentile Latency (ms)',\n", - " 'Request Latency (ms)',\n", - " 'Requests per Second',\n", - " 'Tokens per Second']\n", - "\n", - "# Other experimental params: 'Seq Length (ISL/OSL)', 'Concurrency'," - ] - }, - { - "cell_type": "code", - 
"execution_count": 9, + "execution_count": 3, "id": "ff69c986-0c9b-46a9-8f28-800cd61ab24d", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/tmp/ipykernel_203/961144172.py:38: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n", - " df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)\n" - ] - } - ], + "outputs": [], "source": [ "import os\n", "import json\n", @@ -264,13 +224,13 @@ "root_dir = \"./artifacts\"\n", "directory_prefix = \"meta_llama-3.1-8b-instruct-openai-chat-concurrency\" # Change this to fit the actual model deployed\n", "\n", - "ISL_OSL_list = [\"200_5\", \"200_200\", \"1000_200\", \"200_1000\"]\n", - "concurrencies = [1, 2, 5, 10, 50, 100, 250]\n", - "df = pd.DataFrame(columns=gen_AI_perf_field)\n", + "ISL_OSL_LIST = [\"200_5\", \"200_200\", \"1000_200\", \"200_1000\"]\n", + "CONCURRENCIES = [1, 2, 5, 10, 50, 100, 250]\n", + "df = pd.DataFrame()\n", "\n", - "for con in concurrencies:\n", - " for ISL_OSL in ISL_OSL_list:\n", - " filename = os.path.join(root_dir, f\"{directory_prefix}{con}\", f\"{ISL_OSL}_genai_perf.json\")\n", + "for concurrency in CONCURRENCIES :\n", + " for isl_osl in ISL_OSL_LIST:\n", + " filename = os.path.join(root_dir, f\"{directory_prefix}{concurrency}\", f\"{isl_osl}_genai_perf.json\")\n", " \n", " # Open and read the file\n", " with open(filename, 'r') as file:\n", @@ -288,8 +248,8 @@ " 'Request Latency (ms)': data[\"request_latency\"][\"avg\"],\n", " 'Requests per Second': data[\"request_throughput\"][\"avg\"],\n", " 'Tokens per Second': data[\"output_token_throughput\"][\"avg\"],\n", - " 'Seq Length (ISL/OSL)': ISL_OSL,\n", - " 'Concurrency': con\n", + " 'Seq Length (ISL/OSL)': isl_osl,\n", + " 'Concurrency': concurrency\n", " } \n", " \n", " row = meta_field | row\n", @@ -304,12 +264,12 @@ "source": [ "## Exporting data to excel format\n", "\n", - "We next export the benchmarking data to a TCO-tool compatible format, which comprises both metadata fields as well as performance metric fields." + "We next export the benchmarking data to a NIM TCO Calculator compatible format, which comprises both metadata fields as well as performance metric fields." 
] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 4, "id": "f39710a9-882c-44aa-b428-d7ed2976eb23", "metadata": {}, "outputs": [ @@ -334,6 +294,11 @@ " \n", " \n", " \n", + " Model\n", + " GPU Type\n", + " number_of_gpus\n", + " Precision\n", + " Execution Mode\n", " Inter Token 90th Percentile Latency (ms)\n", " Inter Token 99th Percentile Latency (ms)\n", " Inter Token Average Latency (ms)\n", @@ -345,11 +310,6 @@ " Request Latency (ms)\n", " Requests per Second\n", " Tokens per Second\n", - " Model\n", - " GPU Type\n", - " number_of_gpus\n", - " Precision\n", - " Execution Mode\n", " Seq Length (ISL/OSL)\n", " Concurrency\n", " \n", @@ -357,6 +317,11 @@ " \n", " \n", " 0\n", + " meta-llama/Meta-Llama-3-8B-Instruct\n", + " H100_80GB\n", + " 1\n", + " BF16\n", + " NIM-TRTLLM\n", " 9.594225\n", " 10.384453\n", " 9.041131\n", @@ -368,16 +333,16 @@ " 62.599366\n", " 15.961360\n", " 95.768158\n", - " meta-llama/Meta-Llama-3-8B-Instruct\n", - " H100_80GB\n", - " 1.0\n", - " BF16\n", - " NIM-TRTLLM\n", " 200_5\n", - " 1.0\n", + " 1\n", " \n", " \n", " 1\n", + " meta-llama/Meta-Llama-3-8B-Instruct\n", + " H100_80GB\n", + " 1\n", + " BF16\n", + " NIM-TRTLLM\n", " 10.887888\n", " 11.263200\n", " 10.615027\n", @@ -389,16 +354,16 @@ " 2138.409700\n", " 0.467599\n", " 93.865874\n", - " meta-llama/Meta-Llama-3-8B-Instruct\n", - " H100_80GB\n", - " 1.0\n", - " BF16\n", - " NIM-TRTLLM\n", " 200_200\n", - " 1.0\n", + " 1\n", " \n", " \n", " 2\n", + " meta-llama/Meta-Llama-3-8B-Instruct\n", + " H100_80GB\n", + " 1\n", + " BF16\n", + " NIM-TRTLLM\n", " 11.618933\n", " 11.998436\n", " 11.210382\n", @@ -410,16 +375,16 @@ " 2294.288986\n", " 0.435829\n", " 87.527501\n", - " meta-llama/Meta-Llama-3-8B-Instruct\n", - " H100_80GB\n", - " 1.0\n", - " BF16\n", - " NIM-TRTLLM\n", " 1000_200\n", - " 1.0\n", + " 1\n", " \n", " \n", " 3\n", + " meta-llama/Meta-Llama-3-8B-Instruct\n", + " H100_80GB\n", + " 1\n", + " BF16\n", + " NIM-TRTLLM\n", " 11.376184\n", " 11.402237\n", " 11.155124\n", @@ -431,16 +396,16 @@ " 11155.899836\n", " 0.089634\n", " 89.584068\n", - " meta-llama/Meta-Llama-3-8B-Instruct\n", - " H100_80GB\n", - " 1.0\n", - " BF16\n", - " NIM-TRTLLM\n", " 200_1000\n", - " 1.0\n", + " 1\n", " \n", " \n", " 4\n", + " meta-llama/Meta-Llama-3-8B-Instruct\n", + " H100_80GB\n", + " 1\n", + " BF16\n", + " NIM-TRTLLM\n", " 10.997904\n", " 13.013792\n", " 10.076813\n", @@ -452,25 +417,27 @@ " 80.594263\n", " 24.799054\n", " 148.794324\n", - " meta-llama/Meta-Llama-3-8B-Instruct\n", - " H100_80GB\n", - " 1.0\n", - " BF16\n", - " NIM-TRTLLM\n", " 200_5\n", - " 2.0\n", + " 2\n", " \n", " \n", "\n", "" ], "text/plain": [ - " Inter Token 90th Percentile Latency (ms) \\\n", - "0 9.594225 \n", - "1 10.887888 \n", - "2 11.618933 \n", - "3 11.376184 \n", - "4 10.997904 \n", + " Model GPU Type number_of_gpus Precision \\\n", + "0 meta-llama/Meta-Llama-3-8B-Instruct H100_80GB 1 BF16 \n", + "1 meta-llama/Meta-Llama-3-8B-Instruct H100_80GB 1 BF16 \n", + "2 meta-llama/Meta-Llama-3-8B-Instruct H100_80GB 1 BF16 \n", + "3 meta-llama/Meta-Llama-3-8B-Instruct H100_80GB 1 BF16 \n", + "4 meta-llama/Meta-Llama-3-8B-Instruct H100_80GB 1 BF16 \n", + "\n", + " Execution Mode Inter Token 90th Percentile Latency (ms) \\\n", + "0 NIM-TRTLLM 9.594225 \n", + "1 NIM-TRTLLM 10.887888 \n", + "2 NIM-TRTLLM 11.618933 \n", + "3 NIM-TRTLLM 11.376184 \n", + "4 NIM-TRTLLM 10.997904 \n", "\n", " Inter Token 99th Percentile Latency (ms) Inter Token Average Latency (ms) \\\n", "0 10.384453 9.041131 \n", @@ -514,22 +481,15 @@ "3 
11155.899836 0.089634 89.584068 \n", "4 80.594263 24.799054 148.794324 \n", "\n", - " Model GPU Type number_of_gpus Precision \\\n", - "0 meta-llama/Meta-Llama-3-8B-Instruct H100_80GB 1.0 BF16 \n", - "1 meta-llama/Meta-Llama-3-8B-Instruct H100_80GB 1.0 BF16 \n", - "2 meta-llama/Meta-Llama-3-8B-Instruct H100_80GB 1.0 BF16 \n", - "3 meta-llama/Meta-Llama-3-8B-Instruct H100_80GB 1.0 BF16 \n", - "4 meta-llama/Meta-Llama-3-8B-Instruct H100_80GB 1.0 BF16 \n", - "\n", - " Execution Mode Seq Length (ISL/OSL) Concurrency \n", - "0 NIM-TRTLLM 200_5 1.0 \n", - "1 NIM-TRTLLM 200_200 1.0 \n", - "2 NIM-TRTLLM 1000_200 1.0 \n", - "3 NIM-TRTLLM 200_1000 1.0 \n", - "4 NIM-TRTLLM 200_5 2.0 " + " Seq Length (ISL/OSL) Concurrency \n", + "0 200_5 1 \n", + "1 200_200 1 \n", + "2 1000_200 1 \n", + "3 200_1000 1 \n", + "4 200_5 2 " ] }, - "execution_count": 10, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -550,7 +510,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "id": "125f78e6-cc51-4091-bb16-9a1d8403d6cf", "metadata": {}, "outputs": [], @@ -587,14 +547,6 @@ "\n", "The [NIM TCO calculator tool](https://docs.google.com/spreadsheets/d/1UF_sy89kcLIkdnK0dC-6QwcAgVDUV0ANJ22JnC2dW7g/edit?gid=0#gid=0) is implemented as a Google spreadsheet. You can use Google spreadsheet to open the excel file above, then simply copy the data rows into the \"data\" subsheet of the TCO calculator. That will complete the import phase and make the new data available in the TCO calculator." ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d14c9fb5-244a-4471-b66e-8a78e9a97d2b", - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { From 8c3a43a16e4d04c852eb3030d32f92094aa1dcb3 Mon Sep 17 00:00:00 2001 From: vinhn Date: Thu, 22 May 2025 01:59:54 +0000 Subject: [PATCH 5/5] restructure folder --- genai-perf/notebooks/README.md | 5 +++++ .../TCO_calculator => notebooks}/TCO_calculator.ipynb | 0 2 files changed, 5 insertions(+) create mode 100644 genai-perf/notebooks/README.md rename genai-perf/{genai_perf/TCO_calculator => notebooks}/TCO_calculator.ipynb (100%) diff --git a/genai-perf/notebooks/README.md b/genai-perf/notebooks/README.md new file mode 100644 index 00000000..56e22551 --- /dev/null +++ b/genai-perf/notebooks/README.md @@ -0,0 +1,5 @@ +# GenAI-Perf Utility Notebooks + +This folder contains the various utility notebooks for GenAI-Perf. + +1. [TCO_calculator.ipynb](TCO_calculator.ipynb): This notebook allows user to benchmark a NIM LLM deployment, then export the data to the NIM total cost of ownership (TCO) calculator. \ No newline at end of file diff --git a/genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb b/genai-perf/notebooks/TCO_calculator.ipynb similarity index 100% rename from genai-perf/genai_perf/TCO_calculator/TCO_calculator.ipynb rename to genai-perf/notebooks/TCO_calculator.ipynb