Releases · pacoxu/AI-Infra

Published the backlog of content on the WeChat official account; the queue is now mostly cleared.

Time for a fresh start.

What's Changed

Add navigation TOC and Chinese README by @Copilot in #113
[WIP] Update Agent Sandbox with gvisor and snapshot for cold startup improvement by @Copilot in #117
Add dra-driver-cpu to DRA documentation by @Copilot in #121
Add kube-agentic-networking to Agentic Workflow section by @Copilot in #122
Add GPU Pod cold start optimization guide by @Copilot in #123
Add serverless AI inference platform documentation by @Copilot in #125
Add goal achievement chart for Cloud Native AI Infra Architect by @Copilot in #128
Add comprehensive LoRA documentation for multi-tenant LLM serving by @Copilot in #134
Add model switching documentation: Aegaeon token-level scheduling and vLLM sleep mode by @Copilot in #136
Document large-scale Kubernetes cluster technologies (KEP-2340, KEP-4988, DRANET, Spanner, Lustre) by @Copilot in #140
Add Gang Scheduling blog posts in English and Chinese by @Copilot in #142
Add bilingual blog on topology-aware scheduling: Device Plugin to DRA by @Copilot in #144
Add cgroup v2 migration blog post (bilingual) by @Copilot in #146
Add bilingual blog: JobSet In-Place Restart (Co-Evolving series) by @Copilot in #148
Add Agent Sandbox bilingual blog post (EN/ZH) by @Copilot in #150
Add AWS 10K Node EKS Ultra-Scale Clusters Blog Post (EN/ZH) by @Copilot in #152
[WIP] Refine goal achievement chart for cloud native AI infrastructure by @Copilot in #153
Add AI Native era (2025-2035) focus to RoadMap by @Copilot in #154
Add reference link for AI Native platform ideas by @pacoxu in #157
Blog: Inference orchestration solutions and convergence trends by @Copilot in #156
Add bilingual blog posts for Kubernetes safe upgrade and rollback by @Copilot in #161
Add KCD Hangzhou observability optimization blog post by @Copilot in #163
Add Ant Group 20K node cluster optimization documentation: 50% memory reduction by @Copilot in #165
Blog post on AI code attitudes in communities by @pacoxu in #167
Update Grove Mode description in inference orchestration by @pacoxu in #169
Add bilingual blog: Kubernetes community operations and AI/ML entry points by @Copilot in #173
Document Pod lifecycle enhancements: KEP-5307 Container Restart Rules and KEP-5532 RestartAllContainers by @Copilot in #176
Add Chinese translation of GKE 65K nodes blog posts by @Copilot in #180
Add bilingual Agones project introduction blog post by @Copilot in #182
Add comprehensive multi-tenancy isolation guide for AI infrastructure by @Copilot in #187
Add bilingual blog post: From SQL on CPUs to Inference on GPUs by @Copilot in #189
Add ByteDance large-scale Kubernetes solutions documentation by @Copilot in #191
Add DRANET Chinese blog post combining KubeCon NA 2025 keynote and IEEE LCN paper by @Copilot in #194
Add GPU fault detection and self-healing guide for Kubernetes by @Copilot in #196
Add vLLM 2025 Retrospective & 2026 Roadmap blog post by @Copilot in #199
Add OCI unified distribution blog post (Chinese + English) by @Copilot in #201
Revise project updates and learning path details by @pacoxu in #202
Sync README.zh-CN.md with README.md - Add missing documentation links and Goal Achievement Chart updates by @Copilot in #204
Add Chinese blog post on Ambient Global Compute from KubeCon NA 2025 by @Copilot in #207
Add comprehensive AI Agent platforms and frameworks documentation by @Copilot in #208
Add KubeCon EU 2026 Chinese blog with curated AI infrastructure sessions by @Copilot in #211
Add MLOps documentation: 7-layer architecture for repeatable ML lifecycle by @Copilot in #213

Full Changelog: v0.0.1...v0.0.2

What's Changed

Pod Lifecycle (AI): Pod startup speed optimization, cold-start, sleep mode, and offloading.

DRA updates: NVIDIA GPU Operator and DRA Driver, NRI

Workload solutions (P/D disaggregation): LWS, SGLang RBG, AIBrix StormService, Kthena, KServe, Dynamo, vLLM Production Stack, OME.

KV Cache comparison: NIXL, LMCache, Mooncake

Scheduling: Volcano, NVIDIA Grove, Kueue, Godel, Koordinator, HAMi, KAI Scheduler.

Gateway: Envoy AI Gateway, Semantic Router, KGateway, Kong.

Performance testing and benchmarking tools

Community Update: AI Conformance, Kubernetes workgroups and CNCF tags/initiatives.

Large Scale Experts (MoE)

AIConfigurator

Observability

Training on Kubernetes: Kubeflow Trainer V2 and ArgoCD; GPU checkpoint/restore

Serverless, Knative

AI workload isolation

Parallelism

Pre-training

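The repository is taking initial shape.

Some foundational AI model knowledge is still missing.

Training content is also relatively thin.

There is no Chinese version yet.

The landscape is still fairly rough.

However, for AI workload orchestration and management, the existing related projects (excluding higher-level agent content), mainly in the cloud native space, are largely covered.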
