
Commit 0c56b57

Re-add batch inference foundation model examples with streaming test data approach (#2287)
Use a streaming approach to pull in 100 rows of data for each batch notebook.
1 parent 6142c51 · commit 0c56b57

9 files changed: +3517 −146 lines changed

sdk/python/foundation-models/system/inference/automatic-speech-recognition/asr-batch-endpoint.ipynb
+92 −20
@@ -13,11 +13,11 @@
 "`automatic-speech-recognition` (ASR) converts a speech signal to text, mapping a sequence of audio inputs to text outputs. Virtual assistants like Siri and Alexa use ASR models to help users everyday, and there are many other useful user-facing applications like live captioning and note-taking during meetings.\n",
 "\n",
 "### Model\n",
-"Models that can perform the `automatic-speech-recognition` task are tagged with `task: automatic-speech-recognition`. We will use the `openai-whisper-large` model in this notebook. If you opened this notebook from a specific model card, remember to replace the specific model name. If you don't find a model that suits your scenario or domain, you can discover and [import models from HuggingFace hub](../../import/import-model-from-huggingface.ipynb) and then use them for inference. \n",
+"Models that can perform the `automatic-speech-recognition` task are tagged with `task: automatic-speech-recognition`. We will use the `openai-whisper-large` model in this notebook. If you opened this notebook from a specific model card, remember to replace the specific model name. If you don't find a model that suits your scenario or domain, you can discover and [import models from HuggingFace hub](../../import/import_model_into_registry.ipynb) and then use them for inference. \n",
 "\n",
 "### Inference data\n",
-"We will use custom audio files that have been uploaded to the cloud. \\\n",
-"You can replace the links with any audio file stored on the cloud and verify inference.\n",
+"We will use the [Librispeech ASR](https://huggingface.co/datasets/librispeech_asr/viewer/clean/test) dataset. \\\n",
+"You can also use custom audio files stored on the cloud and verify inference.\n",
 "- Most common audio formats (m4a, wav, flac, wma, mp3, etc.) are supported.\n",
 "- The whisper model can process only 30 seconds of data at a time, so if the file you upload is longer than 30 seconds, only the first 30 seconds will be transcribed. This can be circumvented by splitting the file into 30 second chunks.\n",
 "\n",
@@ -51,7 +51,9 @@
 "source": [
 "# Import packages used by the following code snippets\n",
 "import csv\n",
+"import json\n",
 "import os\n",
+"import requests\n",
 "import time\n",
 "\n",
 "import pandas as pd\n",
@@ -174,11 +176,14 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### 3. Prepare data for inference.\n",
+"### 3. Prepare data for inference\n",
 "\n",
-"A copy of a small subset from the [Librispeech ASR](https://huggingface.co/datasets/librispeech_asr) is available in the [librispeech-dataset](./librispeech-dataset/) folder. The next few cells show basic data preparation:\n",
-"* Visualize some data rows\n",
-"* We want this sample to run quickly, so save a smaller dataset containing a fraction of the original."
+"We will test with a small subset from the [Librispeech ASR](https://huggingface.co/datasets/librispeech_asr/viewer/clean/test) dataset, saving the sample in the `librispeech-dataset` folder. The next few cells show basic data preparation:\n",
+"* Download the data.\n",
+"* Visualize some data rows.\n",
+"* Save the data.\n",
+"\n",
+"We want this sample to run quickly, so we are using a smaller dataset containing a fraction of the original."
 ]
 },
 {
@@ -189,35 +194,102 @@
 "source": [
 "# Define directories and filenames as variables\n",
 "dataset_dir = \"librispeech-dataset\"\n",
-"training_datafile = \"train_clean_100.csv\"\n",
+"test_datafile = \"test_100.csv\"\n",
 "\n",
 "batch_dir = \"batch\"\n",
 "batch_inputs_dir = os.path.join(batch_dir, \"inputs\")\n",
 "batch_input_file = \"batch_input.csv\"\n",
+"os.makedirs(dataset_dir, exist_ok=True)\n",
 "os.makedirs(batch_dir, exist_ok=True)\n",
 "os.makedirs(batch_inputs_dir, exist_ok=True)"
 ]
 },
+{
+"attachments": {},
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"#### 3.1 Download the data\n",
+"We want to get the URLs for the audio files. To do this, pull down the data using `requests` instead of the Huggingface `datasets` module. The MLflow model's signature specifies the input should be a column named `\"audio\"` and a column named `\"language\"`."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"testdata = requests.get(\n",
+"    \"https://datasets-server.huggingface.co/first-rows?dataset=librispeech_asr&config=clean&split=test&offset=0&limit=100\"\n",
+").text\n",
+"testdata = json.loads(testdata)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"audio_urls_and_text = [\n",
+"    (row[\"row\"][\"audio\"][0][\"src\"], row[\"row\"][\"text\"]) for row in testdata[\"rows\"]\n",
+"]"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"test_df = pd.DataFrame(data=audio_urls_and_text, columns=[\"audio\", \"text\"])"
+]
+},
+{
+"attachments": {},
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"These are all spoken in English, so create a `language` column of all `'en'`."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"test_df[\"language\"] = \"en\"\n",
+"test_df.to_csv(os.path.join(\".\", dataset_dir, test_datafile), index=False)"
+]
+},
+{
+"attachments": {},
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"#### 3.2 Visualize a few rows of data"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
-"# Load the ./librispeech-dataset/train_clean_100.csv file into a pandas dataframe and show the first 5 rows\n",
 "pd.set_option(\n",
 "    \"display.max_colwidth\", 0\n",
 ") # Set the max column width to 0 to display the full text\n",
-"train_df = pd.read_csv(os.path.join(\".\", dataset_dir, training_datafile))\n",
-"train_df.head()"
+"test_df.head()"
 ]
 },
 {
 "attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Save the input data to files of smaller batches for testing. The MLflow model's signature specifies the input should be a column named `\"audio\"` and a column named `\"language\"`."
+"#### 3.3 Save the data\n",
+"Save the input data to files of smaller batches for testing."
 ]
 },
 {
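Note: stitched together, the new data-preparation cells above amount to the following script. The content is taken from the added lines of the hunk; only the consolidated imports and the comments are new.

```python
import json
import os

import pandas as pd
import requests

dataset_dir = "librispeech-dataset"
test_datafile = "test_100.csv"
os.makedirs(dataset_dir, exist_ok=True)

# Stream the first 100 rows of the clean/test split from the HuggingFace
# datasets-server API instead of pulling the dataset via the `datasets` module.
testdata = requests.get(
    "https://datasets-server.huggingface.co/first-rows"
    "?dataset=librispeech_asr&config=clean&split=test&offset=0&limit=100"
).text
testdata = json.loads(testdata)

# Each row carries a hosted audio URL and the ground-truth transcript.
audio_urls_and_text = [
    (row["row"]["audio"][0]["src"], row["row"]["text"]) for row in testdata["rows"]
]
test_df = pd.DataFrame(data=audio_urls_and_text, columns=["audio", "text"])

# All clips are English; the MLflow signature also expects a "language" column.
test_df["language"] = "en"
test_df.to_csv(os.path.join(".", dataset_dir, test_datafile), index=False)
```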
@@ -226,7 +298,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"batch_df = train_df[[\"audio\", \"language\"]]\n",
+"batch_df = test_df[[\"audio\", \"language\"]]\n",
 "\n",
 "# Divide this into files of 10 rows each\n",
 "batch_size_per_predict = 10\n",
@@ -300,9 +372,9 @@
 "    instance_count=1,\n",
 "    logging_level=\"info\",\n",
 "    max_concurrency_per_instance=1,\n",
-"    mini_batch_size=10,\n",
+"    mini_batch_size=2,\n",
 "    output_file_name=\"predictions.csv\",\n",
-"    retry_settings=BatchRetrySettings(max_retries=3, timeout=300),\n",
+"    retry_settings=BatchRetrySettings(max_retries=3, timeout=600),\n",
 ")\n",
 "workspace_ml_client.begin_create_or_update(deployment).result()"
 ]
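Note: the hunk only changes `mini_batch_size` and the retry timeout. For context, a sketch of the full deployment object as it typically appears in these notebooks; `name`, `endpoint_name`, `model`, and `compute` are assumptions, not part of the diff.

```python
from azure.ai.ml.entities import BatchDeployment, BatchRetrySettings

deployment = BatchDeployment(
    name="demo",                  # assumed deployment name
    endpoint_name=endpoint_name,  # assumed: endpoint created earlier in the notebook
    model=model.id,               # assumed: the registered openai-whisper-large asset
    compute=compute_name,         # assumed: the batch compute cluster
    instance_count=1,
    logging_level="info",
    max_concurrency_per_instance=1,
    mini_batch_size=2,            # was 10: each mini batch now scores 2 input files
    output_file_name="predictions.csv",
    retry_settings=BatchRetrySettings(max_retries=3, timeout=600),  # was timeout=300
)
workspace_ml_client.begin_create_or_update(deployment).result()
```

Lowering `mini_batch_size` while doubling the timeout is consistent with the streaming approach: the audio is now fetched from hosted URLs at scoring time, so each mini batch takes longer to process.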
@@ -334,7 +406,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### 5. Run a batch inference job.\n",
+"### 5. Run a batch inference job\n",
 "Invoke the batch endpoint with the input parameter pointing to the folder containing the batch inference input. This creates a pipeline job using the default deployment in the endpoint. Wait for the job to complete."
 ]
 },
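Note: a sketch of the invocation this unchanged text describes, using the azure-ai-ml batch endpoint API; the `endpoint` variable is an assumption carried over from earlier steps of the notebook.

```python
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# Point the endpoint at the folder of batch input CSVs and wait for the job.
job = workspace_ml_client.batch_endpoints.invoke(
    endpoint_name=endpoint.name,
    input=Input(type=AssetTypes.URI_FOLDER, path=batch_inputs_dir),
)
workspace_ml_client.jobs.stream(job.name)
```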
@@ -358,7 +430,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### 6. Review inference predictions. \n",
+"### 6. Review inference predictions\n",
 "Download the predictions from the job output and review the predictions using a dataframe."
 ]
 },
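Note: a sketch of the download step described above. The output name `"score"` is the conventional one for batch scoring jobs, and the column names given to `predictions.csv` are illustrative assumptions.

```python
# Download the scoring output of the completed batch job.
workspace_ml_client.jobs.download(
    name=job.name, download_path=batch_dir, output_name="score"
)

# predictions.csv is written without a header row; these names are assumptions.
score_df = pd.read_csv(
    os.path.join(batch_dir, "named-outputs", "score", "predictions.csv"),
    header=None,
    names=["row_number_per_file", "prediction", "batch_input_file_name"],
)
score_df.head()
```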
@@ -390,7 +462,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Record the input file name and set the original index value in the `'index'` column for each input file. Join the `train_df` with ground truth into the input dataframe."
+"Record the input file name and set the original index value in the `'index'` column for each input file. Join the `test_df` containing ground truth into the input dataframe."
 ]
 },
 {
@@ -408,7 +480,7 @@
 "    input_df.append(input)\n",
 "input_df = pd.concat(input_df)\n",
 "input_df.set_index(\"index\", inplace=True)\n",
-"input_df = input_df.join(train_df.drop(columns=[\"audio\", \"language\"]))\n",
+"input_df = input_df.join(test_df.drop(columns=[\"audio\", \"language\"]))\n",
 "\n",
 "input_df.head()"
 ]
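Note: only the tail of this cell appears in the hunk. A hedged reconstruction of the loop it belongs to, assuming each batch CSV was saved with an `index` column; the glob pattern and the bookkeeping at the top are assumptions.

```python
import glob

input_df = []
for input_file in glob.glob(os.path.join(batch_inputs_dir, "*.csv")):
    input = pd.read_csv(input_file)  # assumed: "index" column saved with each batch
    input["input_file_name"] = os.path.basename(input_file)  # record the source file
    input_df.append(input)
input_df = pd.concat(input_df)
input_df.set_index("index", inplace=True)
# Join the ground-truth "text" column from test_df onto the inputs.
input_df = input_df.join(test_df.drop(columns=["audio", "language"]))

input_df.head()
```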
@@ -471,7 +543,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.13"
+"version": "3.9.12"
 },
 "vscode": {
 "interpreter": {
