What's Changed
- fix(deepep): eliminate compile warning. by @zhenhuang12 in #123
- feat(deep_ep): support num_worst_token and use_default_stream_as_comm_stream for internode. by @zhenhuang12 in #120
- feat(token_dispatcher): add DeepEPTokenDispatcher for MoE. by @zhenhuang12 in #114
- build: support multi-arch compilation (gfx942;gfx950) by @xiaobochen-amd in #124
- feat: attn add is_v3_atomic_fp32 env control by @xiaobochen-amd in #126
- chore: move router to moe dir by @xiaobochen-amd in #125
- [Sync-free MoE] feat: add swiglu, geglu and tokens_per_expert_to_mask api by @RuibinCheung in #122
- [HOTFIX] triton version requirement by @RuibinCheung in #130
- [Sync-free MoE] feat: refine act func by @RuibinCheung in #129
- fix(deepep): fix bug when use expert_capacity_factor by @zhenhuang12 in #127
- chore(docker): update default image to rocm/primus:v25.9_gfx942 by @xiaobochen-amd in #133
- [Aiter] Update aiter to fix pybind11 issue by @GeneDer in #132
- feat: gemm fp8 support cktile backend for both tensorwise and rowwise by @kyle-256 in #131
- [Fix] import activation module by @GeneDer in #137
- feat(deepep): move deep_ep header file to primus_turbo common header dir by @zhenhuang12 in #138
- feat(permute): permute op support to compute tokens_per_expert by @zhenhuang12 in #140
- feat: grouped gemm tensorwise impl update by @kyle-256 in #139
- chore: support jax=0.6.2 & jax cicd by @xiaobochen-amd in #136
- feat: add elementwise(unary/binary/quant/dequant) kernel by @xiaobochen-amd in #135
- chore: remove useless debug code by @kyle-256 in #141
- chore: refactor grouped gemm blockwise python code by @xiaobochen-amd in #142
- feat: add build ext and opt build efficiency by @xiaobochen-amd in #143
- feat: skip patch torch_extension when version >=2.8.0 by @zhenhuang12 in #144
- chore: refactor gemm fp8 api by @xiaobochen-amd in #145
- fix: skip disabled arch files in build by @xiaobochen-amd in #147
- chore: support quant gemm when m%128!=0 by @kyle-256 in #146
- feat: unify fp8 gemm API by @RuibinCheung in #148
- chore: update aiter version. by @xiaobochen-amd in #151
- add public primus-safe link in readme by @wenxie-amd in #152
- fix typo in readme by @wenxie-amd in #153
- fix(deepep): fix internode_combine hang when set num_worst_token > 0 by @zhenhuang12 in #149
- [feat]: CK based block quant by @kyle-256 in #155
- feat(gemm): Add mxfp8 gemm and quantize kernel by @RuibinCheung in #154
- opt: grouped gemm perf when len(group_lens)==1 by @xiaobochen-amd in #158
- feat: add float4x2_e2m1 and float8_m8m0 data type by @xiaobochen-amd in #156
- feat(mxfp8): add k padding in bwd by @RuibinCheung in #160
- chore: Allow GEMM with k % 32 = 0 to participate in computation by @kyle-256 in #161
- feat: jax backend support grouped_gemm by @kyle-256 in #157
Full Changelog: v0.1.1...v0.2.0