The Problem: High-performance accelerators often suffer from low utilization in strictly online serving scenarios, or users may need to mix latency-insensitive workloads into slack capacity without impacting primary online serving.
The Value: This component enables efficient processing of requests where latency is not the primary constraint (i.e., the required SLO is on the order of minutes or longer).
By utilizing an asynchronous, queue-based approach, users can perform tasks such as product classification, bulk summarization (e.g., of forum discussion threads), or near-real-time sentiment analysis over large collections of social media posts without blocking real-time traffic.
Architecture Summary: The Async Processor is a composable component that provides services for managing these requests. It functions as an asynchronous worker that pulls jobs from a message queue and dispatches them to an inference gateway, decoupling job submission from immediate execution.
• Latency Insensitivity: Suitable for workloads where immediate response is not required.
• Capacity Optimization: Useful for filling "slack" capacity in your inference pool.
The architecture adheres to the following core principles:
- Bring Your Own Queue (BYOQ): All aspects of prioritization, routing, retries, and scaling are decoupled from the message queue implementation.
- Composability: The end user does not interact directly with the processor via an API. Instead, the processor interacts solely with the message queues, making it highly composable with offline batch processing and asynchronous workflows.
- Resilience by Design: If real-time traffic spikes or errors occur, the system triggers intelligent retries for jobs, ensuring they eventually complete without manual intervention.
- Overview
- When to use
- Design Principles
- Deployment
- Command line parameters
- Request Messages and Consumption
- Retries
- Results
- Implementations
- Development
To deploy the Async Processor into your K8S cluster, follow these steps:
- Create an `.env` file with `export` statement overrides. E.g.:

  ```
  IMAGE_TAG_BASE=<if needed to override for a private registry>
  DEPLOY_LLM_D=false
  DEPLOY_REDIS=false
  DEPLOY_PROMETHEUS=false
  AP_IMAGE_PULL_POLICY=Always
  ```

- Run:

  ```
  make deploy-ap-on-k8s
  ```

- To test a request (only for the Redis implementation):
  - Subscribe to the result channel (in a different terminal window):

    ```
    export REDIS_IP=....
    kubectl run -i -t subscriberbox --rm --image=redis --restart=Never -- /usr/local/bin/redis-cli -h $REDIS_IP SUBSCRIBE result-queue
    ```

  - Publish a request:

    ```
    export REDIS_IP=....
    kubectl run --rm -i -t publishmsgbox --image=redis --restart=Never -- /usr/local/bin/redis-cli -h $REDIS_IP PUBLISH request-queue '{"id" : "testmsg", "payload":{ "model":"food-review-1", "prompt":"Hi, good morning "}, "deadline" :"23472348233323" }'
    ```
- `igw-base-url`: Base URL of the IGW (e.g., https://localhost:30800).
- `concurrency`: The number of concurrent workers. Default is 8.
- `request-merge-policy`: Currently only the random-robin policy is supported.
- `message-queue-impl`: Implementation of the queueing system. Options are `gcp-pubsub` for GCP Pub/Sub, `redis-sortedset` for Redis Sorted Set (persisted and sorted), and `redis-pubsub` for an ephemeral Redis-based implementation.

Additional parameters may be specified for concrete message queue implementations.
The async processor expects request messages to have the following format:
```
{
  "id" : "unique identifier for result mapping",
  "deadline" : "deadline in Unix seconds",
  "payload" : {regular inference payload as a byte array}
}
```

Example:

```
{
  "id" : "19933123533434",
  "deadline" : "1764045130",
  "payload": byte[]({"model":"food-review","prompt":"hi", "max_tokens":10,"temperature":0})
}
```

The Async Processor supports multiple request message queues. A Request Merge Policy can be specified to define the merge strategy for messages from the different queues.
Currently the only policy supported is the Random Robin Policy, which randomly picks messages from the queues.
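For illustration, a publisher could construct a message matching the request format above. The sketch below is in Python and embeds the inference payload as a JSON object; the exact byte encoding of `payload` is an assumption here and depends on the concrete queue implementation:

```python
import json
import time

def build_request(msg_id: str, payload: dict, ttl_seconds: int = 3600) -> str:
    """Build an Async Processor request message (illustrative sketch).

    The payload is a regular inference payload. It is embedded here as a
    JSON object; a real publisher would encode it to bytes in whatever
    form the chosen queue implementation expects.
    """
    deadline = int(time.time()) + ttl_seconds  # deadline in Unix seconds
    return json.dumps({
        "id": msg_id,
        "deadline": str(deadline),
        "payload": payload,
    })

# Example usage:
msg = build_request(
    "19933123533434",
    {"model": "food-review", "prompt": "hi", "max_tokens": 10, "temperature": 0},
)
```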
When processing of a message fails, either because it was shed or due to a server-side error, the message will be scheduled for a retry (assuming its deadline has not passed).
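The retry decision can be sketched as follows; the exponential-backoff parameters here are illustrative assumptions, not the processor's actual configuration:

```python
import time
from typing import Optional

def next_retry_delay(deadline_unix: float, attempt: int,
                     base_delay: float = 1.0,
                     max_delay: float = 60.0) -> Optional[float]:
    """Return the backoff delay in seconds for a failed message, or None
    if its deadline has already passed and the message should be dropped.

    Illustrative exponential backoff; parameter values are assumptions.
    """
    now = time.time()
    if now >= deadline_unix:
        return None  # deadline expired: no retry
    delay = min(base_delay * (2 ** attempt), max_delay)
    # never schedule the retry past the deadline
    return min(delay, deadline_unix - now)
```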
Results will be written to the results queue and will have the following structure:

```
{
  "id" : "id mapped to the request",
  "payload" : byte[]{/*inference result payload*/},
  // or
  "error" : "error's reason"
}
```

A persisted implementation based on Redis Sorted Sets is provided.
- `redis.ss.addr`: Address of the Redis server. Default is `localhost:6379`.
- `redis.ss.request-path-url`: Request path URL (e.g., "/v1/completions"). Mutually exclusive with the `redis.ss.queues-config-file` flag.
- `redis.ss.inference-objective`: InferenceObjective to use for requests (set as the HTTP header `x-gateway-inference-objective` if not empty). Mutually exclusive with the `redis.ss.queues-config-file` flag.
- `redis.ss.request-queue-name`: The name of the sorted set for the requests. Default is `request-sortedset`. Mutually exclusive with the `redis.ss.queues-config-file` flag.
- `redis.ss.result-queue-name`: The name of the list for the results. Default is `result-list`.
- `redis.ss.queues-config-file`: The configuration file name when using multiple queues. Mutually exclusive with the `redis.ss.request-queue-name`, `redis.ss.request-path-url`, and `redis.ss.inference-objective` flags.
- `redis.ss.poll-interval-ms`: Poll interval in milliseconds. Default is 1000.
- `redis.ss.batch-size`: Number of messages to process per poll. Default is 10.
NOTE: Consider using the Redis Sorted Set implementation for production use, as it offers persistence and priority sorting.
An example implementation based on Redis channels is provided.
- Redis Channels as the request queues.
- Redis Sorted Set as the retry exponential backoff implementation.
- Redis Channel as the result queue.
- `redis.addr`: Address of the Redis server. Default is `localhost:6379`.
- `redis.request-path-url`: Request path URL (e.g., "/v1/completions"). Mutually exclusive with the `redis.queues-config-file` flag.
- `redis.inference-objective`: InferenceObjective to use for requests (set as the HTTP header `x-gateway-inference-objective` if not empty). Mutually exclusive with the `redis.queues-config-file` flag.
- `redis.request-queue-name`: The name of the channel for the requests. Default is `request-queue`. Mutually exclusive with the `redis.queues-config-file` flag.
- `redis.retry-queue-name`: The name of the channel for the retries. Default is `retry-sortedset`.
- `redis.result-queue-name`: The name of the channel for the results. Default is `result-queue`.
- `redis.queues-config-file`: The configuration file name when using multiple queues. Mutually exclusive with the `redis.request-queue-name`, `redis.request-path-url`, and `redis.inference-objective` flags.
The configuration file used with the `redis.queues-config-file` flag should have the following format:

```
[
  {
    "queue_name": "some_channel_name",
    "inference_objective": "some_inference_objective",
    "request_path_url": "e.g.: /v1/completions"
  },
  ...
]
```

The GCP PubSub implementation requires the user to configure the following:
- A Requests Topic and a Subscription with the following configurations:
  - Exactly-once delivery.
  - Retries with exponential backoff.
  - A Dead Letter Queue (DLQ).
- A Results Topic.
Note: If a DLQ is NOT configured for the request topic, retried messages will be counted multiple times in the #_of_requests metric.
- `pubsub.project-id`: The GCP project ID using the Pub/Sub API.
- `pubsub.request-path-url`: Request path URL (e.g., "/v1/completions"). Mutually exclusive with the `pubsub.topics-config-file` flag.
- `pubsub.inference-objective`: InferenceObjective to use for requests (set as the HTTP header `x-gateway-inference-objective` if not empty). Mutually exclusive with the `pubsub.topics-config-file` flag.
- `pubsub.request-subscriber-id`: The subscriber ID for the requests topic. Mutually exclusive with the `pubsub.topics-config-file` flag.
- `pubsub.result-topic-id`: The results topic ID.
- `pubsub.topics-config-file`: The configuration file name when using multiple topics. Mutually exclusive with the `pubsub.request-subscriber-id`, `pubsub.request-path-url`, and `pubsub.inference-objective` flags.
The configuration file used with the `pubsub.topics-config-file` flag should have the following format:

```
[
  {
    "subscriber_id": "some_subscriber_id",
    "inference_objective": "some_inference_objective",
    "request_path_url": "e.g.: /v1/completions"
  },
  ...
]
```

A setup based on a KIND cluster with a Redis server for the MQ is provided. To deploy everything, run:
```
make deploy-ap-emulated-on-kind
```

Then, in a new terminal window, register a subscriber:

```
kubectl exec -n redis redis-master-0 -- redis-cli SUBSCRIBE result-queue
```

Publish a message for async processing:

```
kubectl exec -n redis redis-master-0 -- redis-cli PUBLISH request-queue '{"id" : "testmsg", "payload":{ "model":"unsloth/Meta-Llama-3.1-8B", "prompt":"hi"}, "deadline" :"9999999999" }'
```
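The same publish step can be scripted. The sketch below assumes any Redis client object exposing a `publish(channel, message)` method, such as a `redis.Redis` instance from the redis-py package; the host and port in the commented usage are placeholders:

```python
import json

def publish_request(client, channel: str, msg_id: str,
                    payload: dict, deadline_unix: str) -> str:
    """Serialize a request message and publish it to a Redis channel.

    `client` is assumed to expose publish(channel, message), as
    redis-py's redis.Redis does. Returns the serialized message.
    """
    message = json.dumps({
        "id": msg_id,
        "payload": payload,
        "deadline": deadline_unix,
    })
    client.publish(channel, message)
    return message

# Example usage with a real client (placeholders for host/port):
# import redis
# r = redis.Redis(host="<redis-host>", port=6379)
# publish_request(r, "request-queue", "testmsg",
#                 {"model": "unsloth/Meta-Llama-3.1-8B", "prompt": "hi"},
#                 "9999999999")
```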

