Releases: pacoxu/AI-Infra
v0.0.2
Published the backlog of content on the WeChat official account; the backlog is now mostly cleared.
Time to set off again.
What's Changed
- Add navigation TOC and Chinese README by @Copilot in #113
- [WIP] Update Agent Sandbox with gvisor and snapshot for cold startup improvement by @Copilot in #117
- Add dra-driver-cpu to DRA documentation by @Copilot in #121
- Add kube-agentic-networking to Agentic Workflow section by @Copilot in #122
- Add GPU Pod cold start optimization guide by @Copilot in #123
- Add serverless AI inference platform documentation by @Copilot in #125
- Add goal achievement chart for Cloud Native AI Infra Architect by @Copilot in #128
- Add comprehensive LoRA documentation for multi-tenant LLM serving by @Copilot in #134
- Add model switching documentation: Aegaeon token-level scheduling and vLLM sleep mode by @Copilot in #136
- Document large-scale Kubernetes cluster technologies (KEP-2340, KEP-4988, DRANET, Spanner, Lustre) by @Copilot in #140
- Add Gang Scheduling blog posts in English and Chinese by @Copilot in #142
- Add bilingual blog on topology-aware scheduling: Device Plugin to DRA by @Copilot in #144
- Add cgroup v2 migration blog post (bilingual) by @Copilot in #146
- Add bilingual blog: JobSet In-Place Restart (Co-Evolving series) by @Copilot in #148
- Add Agent Sandbox bilingual blog post (EN/ZH) by @Copilot in #150
- Add AWS 10K Node EKS Ultra-Scale Clusters Blog Post (EN/ZH) by @Copilot in #152
- [WIP] Refine goal achievement chart for cloud native AI infrastructure by @Copilot in #153
- Add AI Native era (2025-2035) focus to RoadMap by @Copilot in #154
- Add reference link for AI Native platform ideas by @pacoxu in #157
- Blog: Inference orchestration solutions and convergence trends by @Copilot in #156
- Add bilingual blog posts for Kubernetes safe upgrade and rollback by @Copilot in #161
- Add KCD Hangzhou observability optimization blog post by @Copilot in #163
- Add Ant Group 20K node cluster optimization documentation: 50% memory reduction by @Copilot in #165
- Blog post on AI code attitudes in communities by @pacoxu in #167
- Update Grove Mode description in inference orchestration by @pacoxu in #169
- Add bilingual blog: Kubernetes community operations and AI/ML entry points by @Copilot in #173
- Document Pod lifecycle enhancements: KEP-5307 Container Restart Rules and KEP-5532 RestartAllContainers by @Copilot in #176
- Add Chinese translation of GKE 65K nodes blog posts by @Copilot in #180
- Add bilingual Agones project introduction blog post by @Copilot in #182
- Add comprehensive multi-tenancy isolation guide for AI infrastructure by @Copilot in #187
- Add bilingual blog post: From SQL on CPUs to Inference on GPUs by @Copilot in #189
- Add ByteDance large-scale Kubernetes solutions documentation by @Copilot in #191
- Add DRANET Chinese blog post combining KubeCon NA 2025 keynote and IEEE LCN paper by @Copilot in #194
- Add GPU fault detection and self-healing guide for Kubernetes by @Copilot in #196
- Add vLLM 2025 Retrospective & 2026 Roadmap blog post by @Copilot in #199
- Add OCI unified distribution blog post (Chinese + English) by @Copilot in #201
- Revise project updates and learning path details by @pacoxu in #202
- Sync README.zh-CN.md with README.md - Add missing documentation links and Goal Achievement Chart updates by @Copilot in #204
- Add Chinese blog post on Ambient Global Compute from KubeCon NA 2025 by @Copilot in #207
- Add comprehensive AI Agent platforms and frameworks documentation by @Copilot in #208
- Add KubeCon EU 2026 Chinese blog with curated AI infrastructure sessions by @Copilot in #211
- Add MLOps documentation: 7-layer architecture for repeatable ML lifecycle by @Copilot in #213
Full Changelog: v0.0.1...v0.0.2
v0.0.1
What's Changed
- Pod Lifecycle (AI): Pod startup speed optimization, cold-start, sleep mode, and offloading.
- DRA updates: NVIDIA GPU Operator and DRA Driver, NRI
- Workload solutions (P/D disaggregation): LWS, SGLang RBG, AIBrix StormService, Kthena, KServe, Dynamo, vLLM Production Stack, OME.
- KV Cache comparison: NIXL, LMCache, Mooncake
- Scheduling: Volcano, NVIDIA Grove, Kueue, Godel, Koordinator, HAMi, KAI Scheduler.
- Gateway: Envoy AI Gateway, Semantic Router, KGateway, Kong.
- Performance testing and benchmarking tools
- Community Update: AI Conformance, Kubernetes workgroups and CNCF tags/initiatives.
More
- Large Scale Experts (MoE)
- AIConfigurator
- Observability
- Training on Kubernetes: Kubeflow Trainer V2 and ArgoCD; GPU checkpoint/restore
- Serverless, Knative
- AI workload isolation
- Parallelism
- Pre-training
Full Changelog: https://github.com/pacoxu/AI-Infra/commits/v0.0.1
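Taking initial shape
- Currently missing some foundational AI model knowledge
- Training content is also relatively thin
- No Chinese version yet
- The landscape is still fairly rough
However
- AI workload orchestration and management
- Existing related projects (excluding higher-layer agent content)
- A primarily cloud native focus
are basically covered.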