|
13 | 13 | "`automatic-speech-recognition` (ASR) converts a speech signal to text, mapping a sequence of audio inputs to text outputs. Virtual assistants like Siri and Alexa use ASR models to help users every day, and there are many other useful user-facing applications like live captioning and note-taking during meetings.\n",
|
14 | 14 | "\n",
|
15 | 15 | "### Model\n",
|
16 |
| - "Models that can perform the `automatic-speech-recognition` task are tagged with `task: automatic-speech-recognition`. We will use the `openai-whisper-large` model in this notebook. If you opened this notebook from a specific model card, remember to replace the specific model name. If you don't find a model that suits your scenario or domain, you can discover and [import models from HuggingFace hub](../../import/import-model-from-huggingface.ipynb) and then use them for inference. \n", |
| 16 | + "Models that can perform the `automatic-speech-recognition` task are tagged with `task: automatic-speech-recognition`. We will use the `openai-whisper-large` model in this notebook. If you opened this notebook from a specific model card, remember to replace the specific model name. If you don't find a model that suits your scenario or domain, you can discover and [import models from HuggingFace hub](../../import/import_model_into_registry.ipynb) and then use them for inference. \n", |
17 | 17 | "\n",
|
18 | 18 | "### Inference data\n",
|
19 |
| - "We will use custom audio files that have been uploaded to the cloud. \\\n", |
20 |
| - "You can replace the links with any audio file stored on the cloud and verify inference.\n", |
| 19 | + "We will use the [Librispeech ASR](https://huggingface.co/datasets/librispeech_asr/viewer/clean/test) dataset. \\\n", |
| 20 | + "You can also use custom audio files stored on the cloud and verify inference.\n", |
21 | 21 | "- Most common audio formats (m4a, wav, flac, wma, mp3, etc.) are supported.\n",
|
22 | 22 | "- The Whisper model can process only 30 seconds of data at a time, so if the file you upload is longer than 30 seconds, only the first 30 seconds will be transcribed. You can work around this by splitting the file into 30-second chunks, as sketched below.\n",
|
23 | 23 | "\n",
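The chunking workaround mentioned above can be applied before the files are uploaded. A minimal sketch, assuming the `pydub` package (and its ffmpeg dependency) is available and using a hypothetical local file name; neither is part of this notebook:

```python
from pydub import AudioSegment

# Hypothetical input file; replace with your own long recording
audio = AudioSegment.from_file("long_recording.wav")

chunk_ms = 30 * 1000  # pydub slices by milliseconds, so 30 s = 30,000 ms
for i, start in enumerate(range(0, len(audio), chunk_ms)):
    # Export each 30-second slice as its own file, which can then be uploaded for inference
    audio[start : start + chunk_ms].export(f"long_recording_chunk_{i}.wav", format="wav")
```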
|
|
51 | 51 | "source": [
|
52 | 52 | "# Import packages used by the following code snippets\n",
|
53 | 53 | "import csv\n",
|
| 54 | + "import json\n", |
54 | 55 | "import os\n",
|
| 56 | + "import requests\n", |
55 | 57 | "import time\n",
|
56 | 58 | "\n",
|
57 | 59 | "import pandas as pd\n",
|
|
174 | 176 | "cell_type": "markdown",
|
175 | 177 | "metadata": {},
|
176 | 178 | "source": [
|
177 |
| - "### 3. Prepare data for inference.\n", |
| 179 | + "### 3. Prepare data for inference\n", |
178 | 180 | "\n",
|
179 |
| - "A copy of a small subset from the [Librispeech ASR](https://huggingface.co/datasets/librispeech_asr) is available in the [librispeech-dataset](./librispeech-dataset/) folder. The next few cells show basic data preparation:\n", |
180 |
| - "* Visualize some data rows\n", |
181 |
| - "* We want this sample to run quickly, so save a smaller dataset containing a fraction of the original." |
| 181 | + "We will test with a small subset from the [Librispeech ASR](https://huggingface.co/datasets/librispeech_asr/viewer/clean/test) dataset, saving the sample in the `librispeech-dataset` folder. The next few cells show basic data preparation:\n", |
| 182 | + "* Download the data.\n", |
| 183 | + "* Visualize some data rows.\n", |
| 184 | + "* Save the data.\n", |
| 185 | + "\n", |
| 186 | + "We want this sample to run quickly, so we are using a smaller dataset containing a fraction of the original." |
182 | 187 | ]
|
183 | 188 | },
|
184 | 189 | {
|
|
189 | 194 | "source": [
|
190 | 195 | "# Define directories and filenames as variables\n",
|
191 | 196 | "dataset_dir = \"librispeech-dataset\"\n",
|
192 |
| - "training_datafile = \"train_clean_100.csv\"\n", |
| 197 | + "test_datafile = \"test_100.csv\"\n", |
193 | 198 | "\n",
|
194 | 199 | "batch_dir = \"batch\"\n",
|
195 | 200 | "batch_inputs_dir = os.path.join(batch_dir, \"inputs\")\n",
|
196 | 201 | "batch_input_file = \"batch_input.csv\"\n",
|
| 202 | + "os.makedirs(dataset_dir, exist_ok=True)\n", |
197 | 203 | "os.makedirs(batch_dir, exist_ok=True)\n",
|
198 | 204 | "os.makedirs(batch_inputs_dir, exist_ok=True)"
|
199 | 205 | ]
|
200 | 206 | },
|
| 207 | + { |
| 208 | + "attachments": {}, |
| 209 | + "cell_type": "markdown", |
| 210 | + "metadata": {}, |
| 211 | + "source": [ |
| 212 | + "#### 3.1 Download the data\n", |
| 213 | + "We want the direct URLs of the audio files, so we pull the data down with `requests` instead of the Hugging Face `datasets` module. The MLflow model's signature specifies that the input should have a column named `\"audio\"` and a column named `\"language\"`." |
| 214 | + ] |
| 215 | + }, |
| 216 | + { |
| 217 | + "cell_type": "code", |
| 218 | + "execution_count": null, |
| 219 | + "metadata": {}, |
| 220 | + "outputs": [], |
| 221 | + "source": [ |
| 222 | + "testdata = requests.get(\n", |
| 223 | + " \"https://datasets-server.huggingface.co/first-rows?dataset=librispeech_asr&config=clean&split=test&offset=0&limit=100\"\n", |
| 224 | + ").text\n", |
| 225 | + "testdata = json.loads(testdata)" |
| 226 | + ] |
| 227 | + }, |
| 228 | + { |
| 229 | + "cell_type": "code", |
| 230 | + "execution_count": null, |
| 231 | + "metadata": {}, |
| 232 | + "outputs": [], |
| 233 | + "source": [ |
| 234 | + "audio_urls_and_text = [\n", |
| 235 | + " (row[\"row\"][\"audio\"][0][\"src\"], row[\"row\"][\"text\"]) for row in testdata[\"rows\"]\n", |
| 236 | + "]" |
| 237 | + ] |
| 238 | + }, |
| 239 | + { |
| 240 | + "cell_type": "code", |
| 241 | + "execution_count": null, |
| 242 | + "metadata": {}, |
| 243 | + "outputs": [], |
| 244 | + "source": [ |
| 245 | + "test_df = pd.DataFrame(data=audio_urls_and_text, columns=[\"audio\", \"text\"])" |
| 246 | + ] |
| 247 | + }, |
| 248 | + { |
| 249 | + "attachments": {}, |
| 250 | + "cell_type": "markdown", |
| 251 | + "metadata": {}, |
| 252 | + "source": [ |
| 253 | + "These recordings are all in English, so create a `language` column filled with `'en'`." |
| 254 | + ] |
| 255 | + }, |
| 256 | + { |
| 257 | + "cell_type": "code", |
| 258 | + "execution_count": null, |
| 259 | + "metadata": {}, |
| 260 | + "outputs": [], |
| 261 | + "source": [ |
| 262 | + "test_df[\"language\"] = \"en\"\n", |
| 263 | + "test_df.to_csv(os.path.join(\".\", dataset_dir, test_datafile), index=False)" |
| 264 | + ] |
| 265 | + }, |
| 266 | + { |
| 267 | + "attachments": {}, |
| 268 | + "cell_type": "markdown", |
| 269 | + "metadata": {}, |
| 270 | + "source": [ |
| 271 | + "#### 3.2 Visualize a few rows of data" |
| 272 | + ] |
| 273 | + }, |
201 | 274 | {
|
202 | 275 | "cell_type": "code",
|
203 | 276 | "execution_count": null,
|
204 | 277 | "metadata": {},
|
205 | 278 | "outputs": [],
|
206 | 279 | "source": [
|
207 |
| - "# Load the ./librispeech-dataset/train_clean_100.csv file into a pandas dataframe and show the first 5 rows\n", |
208 | 280 | "pd.set_option(\n",
|
209 | 281 | " \"display.max_colwidth\", 0\n",
|
210 | 282 | ") # Set the max column width to 0 to display the full text\n",
|
211 |
| - "train_df = pd.read_csv(os.path.join(\".\", dataset_dir, training_datafile))\n", |
212 |
| - "train_df.head()" |
| 283 | + "test_df.head()" |
213 | 284 | ]
|
214 | 285 | },
|
215 | 286 | {
|
216 | 287 | "attachments": {},
|
217 | 288 | "cell_type": "markdown",
|
218 | 289 | "metadata": {},
|
219 | 290 | "source": [
|
220 |
| - "Save the input data to files of smaller batches for testing. The MLflow model's signature specifies the input should be a column named `\"audio\"` and a column named `\"language\"`." |
| 291 | + "#### 3.3 Save the data\n", |
| 292 | + "Save the input data to files of smaller batches for testing." |
221 | 293 | ]
|
222 | 294 | },
|
223 | 295 | {
|
|
226 | 298 | "metadata": {},
|
227 | 299 | "outputs": [],
|
228 | 300 | "source": [
|
229 |
| - "batch_df = train_df[[\"audio\", \"language\"]]\n", |
| 301 | + "batch_df = test_df[[\"audio\", \"language\"]]\n", |
230 | 302 | "\n",
|
231 | 303 | "# Divide this into files of 10 rows each\n",
|
232 | 304 | "batch_size_per_predict = 10\n",
|
|
300 | 372 | " instance_count=1,\n",
|
301 | 373 | " logging_level=\"info\",\n",
|
302 | 374 | " max_concurrency_per_instance=1,\n",
|
303 |
| - " mini_batch_size=10,\n", |
| 375 | + " mini_batch_size=2,\n", |
304 | 376 | " output_file_name=\"predictions.csv\",\n",
|
305 |
| - " retry_settings=BatchRetrySettings(max_retries=3, timeout=300),\n", |
| 377 | + " retry_settings=BatchRetrySettings(max_retries=3, timeout=600),\n", |
306 | 378 | ")\n",
|
307 | 379 | "workspace_ml_client.begin_create_or_update(deployment).result()"
|
308 | 380 | ]
|
|
334 | 406 | "cell_type": "markdown",
|
335 | 407 | "metadata": {},
|
336 | 408 | "source": [
|
337 |
| - "### 5. Run a batch inference job.\n", |
| 409 | + "### 5. Run a batch inference job\n", |
338 | 410 | "Invoke the batch endpoint with the input parameter pointing to the folder containing the batch inference input. This creates a pipeline job using the default deployment in the endpoint. Wait for the job to complete."
|
339 | 411 | ]
|
340 | 412 | },
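For context on this step, here is a minimal sketch of the invocation pattern with the `azure-ai-ml` SDK, assuming the `workspace_ml_client`, `endpoint`, and `batch_inputs_dir` objects created in earlier cells; the notebook's own invoke cell may differ in detail:

```python
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# Point the job input at the folder of batch CSV files prepared in step 3
job = workspace_ml_client.batch_endpoints.invoke(
    endpoint_name=endpoint.name,
    input=Input(type=AssetTypes.URI_FOLDER, path=batch_inputs_dir),
)

# Stream logs until the pipeline job created by the invocation completes
workspace_ml_client.jobs.stream(job.name)
```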
|
|
358 | 430 | "cell_type": "markdown",
|
359 | 431 | "metadata": {},
|
360 | 432 | "source": [
|
361 |
| - "### 6. Review inference predictions. \n", |
| 433 | + "### 6. Review inference predictions\n", |
362 | 434 | "Download the predictions from the job output and review the predictions using a dataframe."
|
363 | 435 | ]
|
364 | 436 | },
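A sketch of how the download and review can look, assuming the `job` handle from the invoke call above and the `score` output name that batch deployments expose; the folder layout and missing column headers below are assumptions for illustration:

```python
# The invocation creates a pipeline job; the scoring output usually lives on its child job
scoring_job = list(workspace_ml_client.jobs.list(parent_job_name=job.name))[0]

# Download the "score" output, which contains the predictions.csv configured on the deployment
workspace_ml_client.jobs.download(
    name=scoring_job.name, download_path=".", output_name="score"
)

# Load the predictions for review; adjust the path and columns to match the actual output
predictions_df = pd.read_csv(
    os.path.join("named-outputs", "score", "predictions.csv"), header=None
)
predictions_df.head()
```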
|
|
390 | 462 | "cell_type": "markdown",
|
391 | 463 | "metadata": {},
|
392 | 464 | "source": [
|
393 |
| - "Record the input file name and set the original index value in the `'index'` column for each input file. Join the `train_df` with ground truth into the input dataframe." |
| 465 | + "Record the input file name and set the original index value in the `'index'` column for each input file. Join the `test_df` containing ground truth into the input dataframe." |
394 | 466 | ]
|
395 | 467 | },
|
396 | 468 | {
|
|
408 | 480 | " input_df.append(input)\n",
|
409 | 481 | "input_df = pd.concat(input_df)\n",
|
410 | 482 | "input_df.set_index(\"index\", inplace=True)\n",
|
411 |
| - "input_df = input_df.join(train_df.drop(columns=[\"audio\", \"language\"]))\n", |
| 483 | + "input_df = input_df.join(test_df.drop(columns=[\"audio\", \"language\"]))\n", |
412 | 484 | "\n",
|
413 | 485 | "input_df.head()"
|
414 | 486 | ]
|
|
471 | 543 | "name": "python",
|
472 | 544 | "nbconvert_exporter": "python",
|
473 | 545 | "pygments_lexer": "ipython3",
|
474 |
| - "version": "3.8.13" |
| 546 | + "version": "3.9.12" |
475 | 547 | },
|
476 | 548 | "vscode": {
|
477 | 549 | "interpreter": {
|
|