Releases: pacoxu/AI-Infra

v0.0.2

13 Jan 07:58
a3cdabc

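Published the backlog of content on the WeChat official account in one batch, so the backlog is now mostly cleared.

Time to set out again.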

What's Changed

  • Add navigation TOC and Chinese README by @Copilot in #113
  • [WIP] Update Agent Sandbox with gvisor and snapshot for cold startup improvement by @Copilot in #117
  • Add dra-driver-cpu to DRA documentation by @Copilot in #121
  • Add kube-agentic-networking to Agentic Workflow section by @Copilot in #122
  • Add GPU Pod cold start optimization guide by @Copilot in #123
  • Add serverless AI inference platform documentation by @Copilot in #125
  • Add goal achievement chart for Cloud Native AI Infra Architect by @Copilot in #128
  • Add comprehensive LoRA documentation for multi-tenant LLM serving by @Copilot in #134
  • Add model switching documentation: Aegaeon token-level scheduling and vLLM sleep mode by @Copilot in #136
  • Document large-scale Kubernetes cluster technologies (KEP-2340, KEP-4988, DRANET, Spanner, Lustre) by @Copilot in #140
  • Add Gang Scheduling blog posts in English and Chinese by @Copilot in #142
  • Add bilingual blog on topology-aware scheduling: Device Plugin to DRA by @Copilot in #144
  • Add cgroup v2 migration blog post (bilingual) by @Copilot in #146
  • Add bilingual blog: JobSet In-Place Restart (Co-Evolving series) by @Copilot in #148
  • Add Agent Sandbox bilingual blog post (EN/ZH) by @Copilot in #150
  • Add AWS 10K Node EKS Ultra-Scale Clusters Blog Post (EN/ZH) by @Copilot in #152
  • [WIP] Refine goal achievement chart for cloud native AI infrastructure by @Copilot in #153
  • Add AI Native era (2025-2035) focus to RoadMap by @Copilot in #154
  • Add reference link for AI Native platform ideas by @pacoxu in #157
  • Blog: Inference orchestration solutions and convergence trends by @Copilot in #156
  • Add bilingual blog posts for Kubernetes safe upgrade and rollback by @Copilot in #161
  • Add KCD Hangzhou observability optimization blog post by @Copilot in #163
  • Add Ant Group 20K node cluster optimization documentation: 50% memory reduction by @Copilot in #165
  • Blog post on AI code attitudes in communities by @pacoxu in #167
  • Update Grove Mode description in inference orchestration by @pacoxu in #169
  • Add bilingual blog: Kubernetes community operations and AI/ML entry points by @Copilot in #173
  • Document Pod lifecycle enhancements: KEP-5307 Container Restart Rules and KEP-5532 RestartAllContainers by @Copilot in #176
  • Add Chinese translation of GKE 65K nodes blog posts by @Copilot in #180
  • Add bilingual Agones project introduction blog post by @Copilot in #182
  • Add comprehensive multi-tenancy isolation guide for AI infrastructure by @Copilot in #187
  • Add bilingual blog post: From SQL on CPUs to Inference on GPUs by @Copilot in #189
  • Add ByteDance large-scale Kubernetes solutions documentation by @Copilot in #191
  • Add DRANET Chinese blog post combining KubeCon NA 2025 keynote and IEEE LCN paper by @Copilot in #194
  • Add GPU fault detection and self-healing guide for Kubernetes by @Copilot in #196
  • Add vLLM 2025 Retrospective & 2026 Roadmap blog post by @Copilot in #199
  • Add OCI unified distribution blog post (Chinese + English) by @Copilot in #201
  • Revise project updates and learning path details by @pacoxu in #202
  • Sync README.zh-CN.md with README.md - Add missing documentation links and Goal Achievement Chart updates by @Copilot in #204
  • Add Chinese blog post on Ambient Global Compute from KubeCon NA 2025 by @Copilot in #207
  • Add comprehensive AI Agent platforms and frameworks documentation by @Copilot in #208
  • Add KubeCon EU 2026 Chinese blog with curated AI infrastructure sessions by @Copilot in #211
  • Add MLOps documentation: 7-layer architecture for repeatable ML lifecycle by @Copilot in #213

Full Changelog: v0.0.1...v0.0.2

v0.0.1

05 Nov 07:23
ddb38ea

What's Changed

  • Pod Lifecycle (AI): Pod startup speed optimization, cold start, sleep mode, and offloading.
  • DRA updates: NVIDIA GPU Operator and DRA Driver, NRI
  • Workload solutions (P/D disaggregation): LWS, SGLang RBG, AIBrix StormService, Kthena, KServe, Dynamo, vLLM Production Stack, OME.
  • KV Cache comparison: NIXL, LMCache, Mooncake
  • Scheduling: Volcano, NVIDIA Grove, Kueue, Godel, Koordinator, HAMi, KAI Scheduler.
  • Gateway: Envoy AI Gateway, Semantic Router, KGateway, Kong.
  • Performance testing and benchmarking tools
  • Community Update: AI Conformance, Kubernetes workgroups and CNCF tags/initiatives.

More

  • Large Scale Experts (MoE)
  • AIConfigurator
  • Observability
  • Training on Kubernetes: Kubeflow Trainer V2 and Argo CD; GPU checkpoint/restore
  • Serverless, Knative
  • AI workload isolation
  • Parallelism
  • Pre-training

Full Changelog: https://github.com/pacoxu/AI-Infra/commits/v0.0.1

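Taking initial shape

  • Some foundational AI model knowledge is still missing
  • Training content is relatively thin
  • No Chinese version yet
  • The landscape is still rough

However, the following are largely covered:

  • AI workload orchestration and management
  • Existing related projects (excluding higher-level agent content)
  • A primarily cloud-native focus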