Releases: AMD-AGI/Primus-Turbo
v0.2.0
What's Changed
- fix(deepep): eliminate compile warning. by @zhenhuang12 in #123
- feat(deep_ep): support num_worst_token and use_default_stream_as_comm_stream for internode. by @zhenhuang12 in #120
- feat(token_dispatcher): add DeepEPTokenDispatcher for MoE. by @zhenhuang12 in #114
- build: support multi-arch compilation (gfx942;gfx950) by @xiaobochen-amd in #124
- feat: attn add is_v3_atomic_fp32 env control by @xiaobochen-amd in #126
- chore: move router to moe dir by @xiaobochen-amd in #125
- [Sync-free MoE] feat: add swiglu, geglu and tokens_per_expert_to_mask api by @RuibinCheung in #122
- [HOTFIX] triton version requirement by @RuibinCheung in #130
- [Sync-free MoE] feat: refine act func by @RuibinCheung in #129
- fix(deepep): fix bug when use expert_capacity_factor by @zhenhuang12 in #127
- chore(docker): update default image to rocm/primus:v25.9_gfx942 by @xiaobochen-amd in #133
- [Aiter] Update aiter to fix pybind11 issue by @GeneDer in #132
- feat: gemm fp8 support cktile backend for both tensorwise and rowwise by @kyle-256 in #131
- [Fix] import activation module by @GeneDer in #137
- feat(deepep): move deep_ep header file to primus_turbo common header dir by @zhenhuang12 in #138
- feat(permute): permute op support to compute tokens_per_expert by @zhenhuang12 in #140
- feat: grouped gemm tensorwise impl update by @kyle-256 in #139
- chore: support jax=0.6.2 & jax cicd by @xiaobochen-amd in #136
- feat: add elementwise(unary/binary/quant/dequant) kernel by @xiaobochen-amd in #135
- chore: remove useless debug code by @kyle-256 in #141
- chore: refactor grouped gemm blockwise python code by @xiaobochen-amd in #142
- feat: add build ext and opt build efficiency by @xiaobochen-amd in #143
- feat: skip patch torch_extension when version >=2.8.0 by @zhenhuang12 in #144
- chore: refactor gemm fp8 api by @xiaobochen-amd in #145
- fix: skip disabled arch files in build by @xiaobochen-amd in #147
- chore: support quant gemm when m%128!=0 by @kyle-256 in #146
- feat: unify fp8 gemm API by @RuibinCheung in #148
- chore: update aiter version. by @xiaobochen-amd in #151
- add public primus-safe link in readme by @wenxie-amd in #152
- fix typo in readme by @wenxie-amd in #153
- fix(deepep): fix internode_combine hang when set num_worst_token > 0 by @zhenhuang12 in #149
- [feat]: CK based block quant by @kyle-256 in #155
- feat(gemm): Add mxfp8 gemm and quantize kernel by @RuibinCheung in #154
- opt: grouped gemm perf when len(group_lens)==1 by @xiaobochen-amd in #158
- feat: add float4x2_e2m1 and float8_e8m0 data types by @xiaobochen-amd in #156
- feat(mxfp8): add k padding in bwd by @RuibinCheung in #160
- chore: Allow GEMM with k % 32 = 0 to participate in computation by @kyle-256 in #161
- feat: jax backend support grouped_gemm by @kyle-256 in #157
Full Changelog: v0.1.1...v0.2.0
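Several entries above (#122, #129) add fused SwiGLU/GeGLU activations for the sync-free MoE path. As a reference for the math only (an illustrative NumPy sketch, not Primus-Turbo's actual kernels or API; the function names are hypothetical), both activations split the input along its last dimension and gate one half with a nonlinearity applied to the other:

```python
import numpy as np

def silu(x):
    # SiLU / swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu(x):
    # Split the last dimension in half: gate half and value half.
    a, b = np.split(x, 2, axis=-1)
    return silu(a) * b

def geglu(x):
    # Same gating structure, with GELU (tanh approximation) as the gate.
    a, b = np.split(x, 2, axis=-1)
    g = 0.5 * a * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (a + 0.044715 * a**3)))
    return g * b
```

Note the output width is half the input width, which is why MoE FFN weights sized for a gated activation carry a doubled hidden dimension.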
v0.1.1
What's Changed
- feat(deepep): optimize gpu-cpu nosync. by @zhenhuang12 in #93
- feat: add std::numeric_limits specializations for fp8 (e4m3/e5m2) by @xiaobochen-amd in #98
- chore: set kPadN=true as default config by @kyle-256 in #99
- chore: fix groupedgemm fp8 tensorwise scale shape & add test cases by @xiaobochen-amd in #102
- feat(deepep): support internode for deepep. by @zhenhuang12 in #94
- chore: update variable_k implement by @kyle-256 in #104
- feat(deepep): improve intranode dispatch/combine performance. by @zhenhuang12 in #103
- chore: update readme & requirements.txt by @xiaobochen-amd in #105
- feat: use blockwise scaling for V in FP8 FA triton kernel by @hann-wang in #100
- feat: add python interfaces to create & destroy hip streams. by @yuankaichen-amd in #96
- chore: update gemm fp8 code. by @xiaobochen-amd in #78
- feat: refactor fp8 quant config by @RuibinCheung in #106
- feat: support rocm 7 by @xiaobochen-amd in #107
- fix: building on cpu nodes should not infer device arch by @GeneDer in #108
- chore: clean ck code by @xiaobochen-amd in #109
- doc: update readme by @xiaobochen-amd in #111
- doc(deepep): add DeepEP doc. by @zhenhuang12 in #110
- chore: clean unused cpp code & update benchmark code by @xiaobochen-amd in #112
- feat: build for arch==native by @xiaobochen-amd in #115
- doc: update primus product matrix. by @xiaobochen-amd in #116
- fix: gfx950 correctness issue by @kyle-256 in #117
- perf: update grouped gemm config on gfx950 to improve performance by @kyle-256 in #118
- build: remove -fgpu-rdc and --hip-link args by @wenxie-amd in #119
New Contributors
- @yuankaichen-amd made their first contribution in #96
- @GeneDer made their first contribution in #108
- @wenxie-amd made their first contribution in #119
Full Changelog: v0.1.0...v0.1.1
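#100 above switches the FP8 flash-attention triton kernel to blockwise scaling for V: one scale per fixed-size block rather than one per tensor, which limits how far a single outlier can crush the dynamic range of its neighbors. The idea can be sketched in plain NumPy (a simulation only: it computes per-block scales and clips to the e4m3 range but does not actually round to fp8; all names here are illustrative, not Primus-Turbo's API):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite float8 e4m3 value

def quantize_blockwise(x, block=128):
    """Per-block scaling along the last axis: one scale per `block`
    elements, chosen so each block's max-abs maps to the fp8 max."""
    n = x.shape[-1]
    pad = (-n) % block  # pad up to a multiple of the block size
    xp = np.pad(x, [(0, 0)] * (x.ndim - 1) + [(0, pad)])
    blocks = xp.reshape(xp.shape[:-1] + (-1, block))
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    scale = np.where(amax > 0, amax / FP8_E4M3_MAX, 1.0)
    q = np.clip(blocks / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale  # a real kernel would cast q to an fp8 dtype here

def dequantize_blockwise(q, scale, n):
    # Undo the scaling and strip the padding back to the original width.
    x = (q * scale).reshape(q.shape[:-2] + (-1,))
    return x[..., :n]
```

Because this sketch skips the fp8 rounding step, dequantization round-trips exactly; real kernels lose precision only to the fp8 mantissa, not to the per-block scale.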
v0.1.0
What's Changed
- init: Project structure and basic operator setup by @xiaobochen-amd in #1
- ci: Add github CI by @xiaobochen-amd in #2
- feat: Add block-wise fp8 gemm func by @xiaobochen-amd in #6
- ut: Numeric accuracy test for special functions and Gemm by @llying-001 in #8
- feat(async-tp): support fused_all_gather_matmul for async-tp. by @zhenhuang12 in #5
- benchmark: move numerical correctness to benchmark by @llying-001 in #9
- feat(attention): Add attention op. support bf16(ck/triton) and blockwise_fp8(Triton) by @kyle-256 in #7
- chore(format): integrate clang-format into pre-commit by @xiaobochen-amd in #10
- feat(gemm): Integrate ck gemm fp8 blockwise by @xiaobochen-amd in #13
- fix(gemm): ck fp8 gemm add meta func by @xiaobochen-amd in #14
- benchmark: add attention benchmark by @kyle-256 in #12
- fix(gemm): transA&transB use template in ck gemm fp8 blockwise by @xiaobochen-amd in #15
- feat: Add fp8 quant & dequant kernel and fp8 AlltoAll autograd function by @RuibinCheung in #17
- feat(ops): add blockwise fp8 grouped gemm by @xiaobochen-amd in #16
- feat(dtype): standard dtype naming and structure on cpp-side and python-side. by @xiaobochen-amd in #18
- feat(cpp): add cpp common code and float8 by @xiaobochen-amd in #19
- feat: Integrate hipblaslt with fp16/bf16 precision by @RuibinCheung in #20
- fix(attention): fix softmax_scale=None issue; rename TurboAttention by @kyle-256 in #22
- fix: fix attn bug when enabling torch.compile by @xiaobochen-amd in #23
- ci: Use K8S by @haishuok0525 in #29
- feat(modules): Add MXLinear module. by @xiaobochen-amd in #25
- feat(moe): Add fused_moe_router function by @ChengYao-amd in #26
- refactor(gemm): refactor gemm ops/func and linear. by @xiaobochen-amd in #30
- feat(benchmark): add gemm_fp8 & grouped gemm benchmark; add flash_attn into attn benchmark by @kyle-256 in #32
- feat(async-tp): support fused_matmul_reduce_scatter(bf16) for async-tp. by @llying-001 in #28
- feat(async_tp): fused_all_gather_matmul support fp8. by @zhenhuang12 in #27
- feat(gemm): integrate hipblaslt fp8 by @RuibinCheung in #34
- refactor(ops): refactor gemm fp8 blockwise api by @xiaobochen-amd in #36
- feat(benchmark): add a torch pretrain e2e demo by @xiaobochen-amd in #38
- feat(attention): support attention context parallel with all2all communication type triton backend by @ChengYao-amd in #37
- feat(async-tp): remove triton-dist dependency. by @zhenhuang12 in #41
- feat(attention): update aiter, use hsa forward kernel when head_dim=128 by @kyle-256 in #43
- feat(rmsnorm): add rmsnorm hip kernel. by @xiaobochen-amd in #42
- feat(jax): add jax frontend & add rmsnorm lax by @xiaobochen-amd in #45
- feat(attention): add attn-cp all2all ck backend by @ChengYao-amd in #44
- chore: clean some benchmark code and modify grouped gemm code by @xiaobochen-amd in #40
- feat(deep_ep): integrate primus_turbo with deepep. by @zhenhuang12 in #48
- opt rmsnorm kernel perf by @xiaobochen-amd in #47
- feat(build): update build & remove aiter submodule by @xiaobochen-amd in #50
- feat(moe): add fused_moe_router_bkwd_triton by @ChengYao-amd in #51
- feat(attention): combine qkv all2all for cp-attn-a2a, refactor cp-att… by @ChengYao-amd in #53
- fix(deep_ep): fix intranode-dispatch bug and memory access fault on poolside 515B. by @zhenhuang12 in #54
- feat(grouped_gemm): add persistent grouped_gemm kernel. by @kyle-256 in #49
- feat(moe): fused router with scatter routing map & probs by @ChengYao-amd in #56
- feat: refine fp8 alltoall by @RuibinCheung in #57
- feat: update grouped gemm code & benchmark by @xiaobochen-amd in #58
- chore: add license header to source files by @xiaobochen-amd in #61
- feat(async-tp): support gemm_rs_overlap via a pipeline method that splits inputs and uses multi-stream copies by @llying-001 in #60
- feat(moe): fused router support arbitrary experts and selected groups by @ChengYao-amd in #62
- docs(readme): update readme.md by @xiaobochen-amd in #39
- fix(async_tp): make memory order safe. by @zhenhuang12 in #63
- fix(perf): fixed grouped_gemm_variable_k_postprocess perf issues by @xiaobochen-amd in #64
- feat(ck): update ck version and update ck grouped gemm code. by @xiaobochen-amd in #67
- refactor(attention): move files in utils dir to attention by @ChengYao-amd in #68
- fix: avoid aiter print warning in gfx950 by @xiaobochen-amd in #69
- feat: whl package name with git commit by @xiaobochen-amd in #70
- feat: add gfx950 arch to the FP8 FA triton kernel by @hann-wang in #52
- chore(attn): rename q_scale, k_scale, do_scale by @hann-wang in #71
- chore: convert transA/transB to trans_a/trans_b by @xiaobochen-amd in #74
- test(attention): numerical accuracy by @llying-001 in #72
- feat: gemm fp8 tensorwise support multi layouts and fp8-formats by @xiaobochen-amd in #76
- feat(deepep): dispatch return num_recv_tokens_per_expert of type tensor by @zhenhuang12 in #75
- feat(docs): add examples docs by @xiaobochen-amd in #77
- feat: grouped gemm support num_cu by @xiaobochen-amd in #79
- feat(attn): remove 192/128 head_size padding & refactor interface to align with FA by @ChengYao-amd in #80
- chore: update setup and gitci by @xiaobochen-amd in #81
- feat(attn): refactor attn-cp ut by @ChengYao-amd in #82
- refactor(asynctp): refactor async-tp test cases. by @zhenhuang12 in #83
- fix: attention bug by @xiaobochen-amd in #84
- feat(grouped_gemm): grouped_gemm support fp8-tensorwise&fp8-rowwise. by @kyle-256 in #66
- chore: update README by @xiaobochen-amd in #85
- feat: add reduce kernel & opt quant tensorwise perf by @xiaobochen-amd in #86
- chore: update example and add codeowners cfg by @xiaobochen-amd in #87
- chore: add arch=native in setup & fix codeowners error path bug. by @xiaobochen-amd in #88
- feat(grouped_gemm): grouped_gemm_fp8 post-process optimization by @kyle-256 in #89
- 3rdparty: update aiter. by @xiaobochen-amd in #90
- docs: add source references and license headers for FlashAttention and TorchAO kernels by @xiaobochen-amd in #92
- chore: update ck & fix bf16 grouped gemm precision bug when n%256!=0 by @kyle-256 in #95
- chore: bump version to v0.1.0 by @xiaobochen-amd in #97
New Contributors
- @xiaobochen-amd made their first contribution in #1
- @llying-001 made their first contribution in #8
- @zhenhuang12 made their first contribution in https://github.com/AMD-AGI/...