
Parallelise Model Loading #466

Open · wants to merge 5 commits into main
Conversation

@vovw commented Nov 16, 2024

Adds parallel model preloading capability to improve inference startup time.

Adds model preloading functionality to improve initial inference latency by allowing models to be loaded into memory before they're needed.

Key changes:

  • Added --preload-models CLI arg to specify models for preloading
  • Introduced preload_model method in inference engine interface
  • Implemented preloading in MLX engine using existing shard loading
  • Enhanced preemptive download to also preload models after download
  • Added concurrent model preloading support in StandardNode

The primary motivation is to reduce cold-start latency by loading models before they are needed, which is useful for deployments that require predictable latency.

Tested with the MLX engine and verified to work with preemptive downloads. Built on the existing shard infrastructure and maintains backward compatibility.
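To make the batch preloading concrete, here is a minimal sketch of how the concurrent part could be driven. The helper name and error handling are assumptions for illustration, not the PR's exact code; only the preload_model method name comes from the changes described above.

import asyncio

# Sketch (assumed names/signatures): load several shards concurrently so the
# first inference call finds them already in memory.
async def preload_all(engine, shards):
    results = await asyncio.gather(
        *(engine.preload_model(shard) for shard in shards),
        return_exceptions=True,  # one failed model should not abort the whole batch
    )
    for shard, result in zip(shards, results):
        if isinstance(result, Exception):
            print(f"Failed to preload {shard}: {result}")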

Test using:

exo --preload-models model1,model2
exo --preload-models llama-3.2-1b,llama-3.1-8b
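For illustration, a comma-separated flag like this could be parsed roughly as follows; the flag name matches the PR, but the parsing details are an assumption, not necessarily what exo/main.py does.

import argparse

# Sketch only: the actual handling in exo/main.py may differ.
parser = argparse.ArgumentParser()
parser.add_argument("--preload-models", type=str, default=None,
                    help="Comma-separated list of models to preload at startup")
args = parser.parse_args(["--preload-models", "llama-3.2-1b,llama-3.1-8b"])
models_to_preload = [m.strip() for m in (args.preload_models or "").split(",") if m.strip()]
print(models_to_preload)  # ['llama-3.2-1b', 'llama-3.1-8b']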

Previous PR: #360

vovw commented Nov 16, 2024

@AlexCheema PTAL

lmk if u need more changes.

vovw commented Nov 19, 2024

@AlexCheema PTAL

- Wrapped debug print statements with `DEBUG >= 2` condition for better logging control.
- Consolidated shard building and preloading into a single operation using list comprehensions.
- Improved error handling to cover all models in a batch, reducing redundancy.
- Added clearer messaging for unsupported models.
- Simplified code structure for better readability and performance.
vovw commented Nov 20, 2024

@AlexCheema, should I run the formatter over the whole codebase, or just the files I edited?

@@ -1,4 +1,4 @@
-from typing import Optional, Tuple, TYPE_CHECKING
+rom typing import Optional, Tuple, TYPE_CHECKING
Contributor:
typo?

@@ -133,17 +134,21 @@
lambda req_id, tokens, __: topology_viz.update_prompt_output(req_id, inference_engine.tokenizer.decode(tokens)) if topology_viz and hasattr(inference_engine, "tokenizer") else None
)

-def preemptively_start_download(request_id: str, opaque_status: str):
+async def preemptively_start_download(request_id: str, opaque_status: str):
Contributor:
I don't think this should be made async.
creating tasks fire-and-forget-style like it was before is better. if you need to run stuff in sequence, create a task for an asynchronous function that awaits each step.
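A rough sketch of the pattern the reviewer is describing (not the PR's code; shard_downloader, inference_engine, node, and current_shard are taken from the surrounding diff context and assumed to be in scope):

import asyncio

def preemptively_start_download(request_id: str, opaque_status: str):
    # ... existing logic that determines current_shard ...

    async def download_then_preload():
        # run the steps in sequence inside the task
        await shard_downloader.ensure_shard(current_shard, inference_engine.__class__.__name__)
        await node.preload_models([current_shard])

    # fire-and-forget: the callback itself stays synchronous and returns immediately
    asyncio.create_task(download_then_preload())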

-asyncio.create_task(shard_downloader.ensure_shard(current_shard, inference_engine.__class__.__name__))
+await shard_downloader.ensure_shard(current_shard, inference_engine.__class__.__name__)
+await node.preload_models([current_shard])
+return current_shard
Contributor:
why do we return this?

Comment on lines +240 to +243
shards = [
  shard for model in models_to_preload
  if (shard := build_base_shard(model, inference_class)) is not None
]
Contributor:
put these on one line please

Comment on lines +246 to +249
unsupported = [
  model for model in models_to_preload
  if not build_base_shard(model, inference_class)
]
Contributor:
one line

Comment on lines +236 to +237
if DEBUG >= 2:
  print(f"Preloading models: {models_to_preload}")
Contributor:
one line
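For reference, the one-line form the reviewer is asking for could look like this (same identifiers as the snippets above; a style sketch, not code from the PR):

if DEBUG >= 2: print(f"Preloading models: {models_to_preload}")
shards = [shard for model in models_to_preload if (shard := build_base_shard(model, inference_class)) is not None]
unsupported = [model for model in models_to_preload if not build_base_shard(model, inference_class)]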

@AlexCheema (Contributor) commented:

Please respond to my review @vovw

vovw commented Nov 28, 2024

> Please respond to my review @vovw

Flooded with college work rn, will address these tomorrow.
