<time datetime=2025-01-26 class=text-body-secondary>Sunday, January 26, 2025</time></div><p>With the GPT series of models shocking the world, a new era of AI innovation has begun. Beyond model training, inference is also a challenge due to large model sizes and high computational costs, not only in terms of cost but also performance and efficiency. Looking back to late 2023, many communities were building inference engines, such as vLLM, TGI, LMDeploy, and other less well-known projects. However, there was still no platform providing a unified interface to serve LLM workloads in the cloud while working smoothly with these inference engines. That was the initial idea behind llmaz. We didn’t start the work until mid-2024 due to some unavoidable commitments, but today we are proud to announce the first minor release of llmaz, v0.1.0.</p><blockquote><p>💙 To set expectations up front: v0.1.0 doesn’t have a lot of fancy features. Instead, we did a lot of groundwork to make sure it’s a workable solution, and we promise to bring more exciting features in the near future.</p></blockquote><h2 id=architecture>Architecture</h2><p>First of all, let’s take a look at the architecture of llmaz: <img alt="llmaz architecture" src=/images/infra.png></p><p>llmaz works as a platform on top of Kubernetes and provides a unified interface for various kinds of inference engines. It defines four CRDs:</p><ul><li><strong>OpenModel</strong>: the model specification, which defines the model source, inference configurations, and other metadata. It’s a cluster-scoped resource.</li><li><strong>Playground</strong>: a facade for setting the inference configurations, e.g. the model name, replicas, and scaling policies, kept as simple as possible. It’s a namespace-scoped resource.</li><li><strong>Inference Service</strong>: the full configuration for an inference workload, for cases where Playground is not flexible enough. 
Most of the time, you don’t need one: a Playground creates a Service automatically. It’s a namespace-scoped resource.</li><li><strong>BackendRuntime</strong>: represents the actual inference engines, their images, resource requirements, and boot configurations. It’s a namespace-scoped resource.</li></ul><p>With the abstraction of these CRDs, llmaz provides a simple way to deploy and manage inference workloads, offering features like:</p><ul><li><strong>Ease of Use</strong>: you can quickly deploy an LLM service with minimal configuration.</li><li><strong>Broad Backend Support</strong>: llmaz supports a wide range of advanced inference backends for different scenarios, such as <em>vLLM</em>, <em>Text-Generation-Inference</em>, <em>SGLang</em>, and <em>llama.cpp</em>. Find the full list of supported backends here.</li><li><strong>Accelerator Fungibility</strong>: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.</li><li><strong>SOTA Inference</strong>: llmaz supports running the latest cutting-edge research, such as Speculative Decoding, on Kubernetes.</li><li><strong>Various Model Providers</strong>: llmaz supports a wide range of model providers, such as HuggingFace, ModelScope, and object stores. llmaz automatically handles the model loading, requiring no effort from users.</li><li><strong>Multi-host Support</strong>: llmaz supports both single-host and multi-host scenarios from day 0.</li><li><strong>Scaling Efficiency</strong>: llmaz supports horizontal scaling with just 2–3 lines of configuration.</li></ul><p>With llmaz v0.1.0, all these features are available. 
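</p><p>To give a concrete feel for the Playground abstraction, here is a minimal manifest sketch. The API group, version, and field names below are assumptions for illustration only; please check the llmaz documentation for the authoritative schema.</p><div class=highlight><pre tabindex=0 style=background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-yaml data-lang=yaml># Illustrative sketch — field names are assumptions, see the llmaz docs.
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: my-playground        # hypothetical name
spec:
  replicas: 1                # horizontal scale of the inference workload
  modelClaim:
    modelName: my-model      # references an OpenModel by name
</code></pre></div><p>The intent is that a Playground only claims a model and a replica count, while the referenced OpenModel carries the model source and the BackendRuntime carries the engine details. </p><p>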
Next, I’ll show you how to use llmaz.</p><h2 id=quick-start>Quick Start</h2><h3 id=installation>Installation</h3><p>First, install llmaz with the Helm chart. Note that the Helm chart version differs from the llmaz version: chart version 0.0.6 corresponds to llmaz v0.1.0.</p><div class=highlight><pre tabindex=0 style=background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4><code class=language-cmd data-lang=cmd><span style=display:flex><span>helm repo add inftyai https://inftyai.github.io/llmaz
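</span></span><span style=display:flex><span># The install command below is assumed from the standard Helm workflow;
</span></span><span style=display:flex><span># the chart name &#34;llmaz&#34; is an assumption — see the llmaz docs for the exact command.
</span></span><span style=display:flex><span>helm repo update
</span></span><span style=display:flex><span>helm install llmaz inftyai/llmaz --version 0.0.6
</span></span></code></pre></div><p>To be clear, the last two commands are a sketch rather than a verbatim quote from the llmaz docs: only the repository alias <code>inftyai</code> and the chart version 0.0.6 come from the text above, while the release and chart name <code>llmaz</code> are assumptions, so consult the project documentation for the exact install command.</p>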