<time datetime=2025-01-26 class=text-body-secondary>Sunday, January 26, 2025</time></div><p>With the GPT series of models shocking the world, a new era of AI innovation has begun. Beyond model training, inference is also a challenge due to large model sizes and high computational costs, not only in terms of cost but also performance and efficiency. Looking back to late 2023, many communities were building inference engines, such as vLLM, TGI, LMDeploy, and other less well-known projects. However, there was still no platform providing a unified interface to serve LLM workloads in the cloud while working smoothly with these inference engines. That was the initial idea behind llmaz. We didn’t start the work until mid-2024 due to some unavoidable commitments, but today we are proud to announce the first minor release of llmaz, v0.1.0.</p><blockquote><p>💙 To set expectations up front: v0.1.0 doesn’t have a lot of fancy features. Instead, we did a lot of groundwork to make sure it’s a workable solution, and we promise to bring more exciting features in the near future.</p></blockquote><h2 id=architecture>Architecture</h2><p>First of all, let’s take a look at the architecture of llmaz: <img alt="llmaz architecture" src=/images/infra.png></p><p>llmaz works as a platform on top of Kubernetes and provides a unified interface for various kinds of inference engines. It defines four CRDs:</p><ul><li><strong>OpenModel</strong>: the model specification, which defines the model source, inference configurations, and other metadata. It’s a cluster-scoped resource.</li><li><strong>Playground</strong>: a facade for setting the inference configurations, e.g. the model name, replicas, and scaling policies, kept as simple as possible. It’s a namespace-scoped resource.</li><li><strong>Inference Service</strong>: the full configuration for an inference workload, for cases where Playground is not flexible enough. 
Most of the time, you don’t need one: a Playground creates a Service automatically. It’s a namespace-scoped resource.</li><li><strong>BackendRuntime</strong>: represents the actual inference engines, their images, resource requirements, and boot configurations. It’s a namespace-scoped resource.</li></ul><p>With the abstraction of these CRDs, llmaz provides a simple way to deploy and manage inference workloads, offering features like:</p><ul><li><strong>Ease of Use</strong>: you can quickly deploy an LLM service with minimal configuration.</li><li><strong>Broad Backend Support</strong>: llmaz supports a wide range of advanced inference backends for different scenarios, such as <em>vLLM</em>, <em>Text-Generation-Inference</em>, <em>SGLang</em>, and <em>llama.cpp</em>. Find the full list of supported backends here.</li><li><strong>Accelerator Fungibility</strong>: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.</li><li><strong>SOTA Inference</strong>: llmaz supports running the latest cutting-edge research, such as Speculative Decoding, on Kubernetes.</li><li><strong>Various Model Providers</strong>: llmaz supports a wide range of model providers, such as HuggingFace, ModelScope, and object stores. llmaz automatically handles the model loading, requiring no effort from users.</li><li><strong>Multi-host Support</strong>: llmaz supports both single-host and multi-host scenarios from day 0.</li><li><strong>Scaling Efficiency</strong>: llmaz supports horizontal scaling with just 2–3 lines of configuration.</li></ul><p>With llmaz v0.1.0, all these features are available. 
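</p><p>To give a concrete feel for the Playground abstraction, here is a minimal manifest sketch. The API group, version, and field names below are assumptions for illustration only; please check the llmaz documentation for the authoritative schema.</p><div class=highlight><pre tabindex=0 style=background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-yaml data-lang=yaml># Illustrative sketch — field names are assumptions, see the llmaz docs.
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: my-playground        # hypothetical name
spec:
  replicas: 1                # horizontal scale of the inference workload
  modelClaim:
    modelName: my-model      # references an OpenModel by name
</code></pre></div><p>The intent is that a Playground only claims a model and a replica count, while the referenced OpenModel carries the model source and the BackendRuntime carries the engine details. </p><p>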
Next, I’ll show you how to use llmaz.</p><h2 id=quick-start>Quick Start</h2><h3 id=installation>Installation</h3><p>First, install llmaz with the Helm chart. Note that the Helm chart version differs from the llmaz version: chart version 0.0.6 corresponds to llmaz v0.1.0.</p><div class=highlight><pre tabindex=0 style=background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-cmd data-lang=cmd><span style=display:flex><span>helm repo add inftyai https://inftyai.github.io/llmaz
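</span></span><span style=display:flex><span># The install command below is assumed from the standard Helm workflow;
</span></span><span style=display:flex><span># the chart name &#34;llmaz&#34; is an assumption — see the llmaz docs for the exact command.
</span></span><span style=display:flex><span>helm repo update
</span></span><span style=display:flex><span>helm install llmaz inftyai/llmaz --version 0.0.6
</span></span></code></pre></div><p>To be clear, the last two commands are a sketch rather than a verbatim quote from the llmaz docs: only the repository alias <code>inftyai</code> and the chart version 0.0.6 come from the text above, while the release and chart name <code>llmaz</code> are assumptions, so consult the project documentation for the exact install command.</p>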