Releases: AMD-AGI/Primus-Turbo
v0.2.0
What's Changed
- fix(deepep): eliminate compile warning. by @zhenhuang12 in #123
- feat(deep_ep): support num_worst_token and use_default_stream_as_comm_stream for internode. by @zhenhuang12 in #120
- feat(token_dispatcher): add DeepEPTokenDispatcher for MoE. by @zhenhuang12 in #114
- build: support multi-arch compilation (gfx942;gfx950) by @xiaobochen-amd in #124
- feat: attn add is_v3_atomic_fp32 env control by @xiaobochen-amd in #126
- chore: move router to moe dir by @xiaobochen-amd in #125
- [Sync-free MoE] feat: add swiglu, geglu and tokens_per_expert_to_mask api by @RuibinCheung in #122
- [HOTFIX] triton version requirement by @RuibinCheung in #130
- [Sync-free MoE] feat: refine act func by @RuibinCheung in #129
- fix(deepep): fix bug when use expert_capacity_factor by @zhenhuang12 in #127
- chore(docker): update default image to rocm/primus:v25.9_gfx942 by @xiaobochen-amd in #133
- [Aiter] Update aiter to fix pybind11 issue by @GeneDer in #132
- feat: gemm fp8 support cktile backend for both tensorwise and rowwise by @kyle-256 in #131
- [Fix] import activation module by @GeneDer in #137
- feat(deepep): move deep_ep header file to primus_turbo common header dir by @zhenhuang12 in #138
- feat(permute): permute op support to compute tokens_per_expert by @zhenhuang12 in #140
- feat: grouped gemm tensorwise impl update by @kyle-256 in #139
- chore: support jax=0.6.2 & jax cicd by @xiaobochen-amd in #136
- feat: add elementwise(unary/binary/quant/dequant) kernel by @xiaobochen-amd in #135
- chore: remove useless debug code by @kyle-256 in #141
- chore: refactor grouped gemm blockwise python code by @xiaobochen-amd in #142
- feat: add build ext and opt build efficiency by @xiaobochen-amd in #143
- feat: skip patch torch_extension when version >=2.8.0 by @zhenhuang12 in #144
- chore: refactor gemm fp8 api by @xiaobochen-amd in #145
- fix: skip disabled arch files in build by @xiaobochen-amd in #147
- chore: support quant gemm when m%128!=0 by @kyle-256 in #146
- feat: unify fp8 gemm API by @RuibinCheung in #148
- chore: update aiter version. by @xiaobochen-amd in #151
- add public primus-safe link in readme by @wenxie-amd in #152
- fix typo in readme by @wenxie-amd in #153
- fix(deepep): fix internode_combine hang when set num_worst_token > 0 by @zhenhuang12 in #149
- [feat]: CK based block quant by @kyle-256 in #155
- feat(gemm): Add mxfp8 gemm and quantize kernel by @RuibinCheung in #154
- opt: grouped gemm perf when len(group_lens)==1 by @xiaobochen-amd in #158
- feat: add float4x2_e2m1 and float8_e8m0 data types by @xiaobochen-amd in #156
- feat(mxfp8): add k padding in bwd by @RuibinCheung in #160
- chore: Allow GEMM with k % 32 = 0 to participate in computation by @kyle-256 in #161
- feat: jax backend support grouped_gemm by @kyle-256 in #157
Full Changelog: v0.1.1...v0.2.0
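Several entries above (#122, #129) add fused SwiGLU/GeGLU activations for the sync-free MoE path. As a reference for the math only (an illustrative NumPy sketch, not Primus-Turbo's actual kernels or API; the function names are hypothetical), both activations split the input along its last dimension and gate one half with a nonlinearity applied to the other:

```python
import numpy as np

def silu(x):
    # SiLU / swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu(x):
    # Split the last dimension in half: gate half and value half.
    a, b = np.split(x, 2, axis=-1)
    return silu(a) * b

def geglu(x):
    # Same gating structure, with GELU (tanh approximation) as the gate.
    a, b = np.split(x, 2, axis=-1)
    g = 0.5 * a * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (a + 0.044715 * a**3)))
    return g * b
```

Note the output width is half the input width, which is why MoE FFN weights sized for a gated activation carry a doubled hidden dimension.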
v0.1.1
What's Changed
- feat(deepep): optimize gpu-cpu nosync. by @zhenhuang12 in #93
- feat: add std::numeric_limits specializations for fp8 (e4m3/e5m2) by @xiaobochen-amd in #98
- chore: set kPadN=true as default config by @kyle-256 in #99
- chore: fix groupedgemm fp8 tensorwise scale shape & add test cases by @xiaobochen-amd in #102
- feat(deepep): support internode for deepep. by @zhenhuang12 in #94
- chore: update variable_k implement by @kyle-256 in #104
- feat(deepep): improve intranode dispatch/combine performance. by @zhenhuang12 in #103
- chore: update readme & requirements.txt by @xiaobochen-amd in #105
- feat: use blockwise scaling for V in FP8 FA triton kernel by @hann-wang in #100
- feat: add python interfaces to create & destroy hip streams. by @yuankaichen-amd in #96
- chore: update gemm fp8 code. by @xiaobochen-amd in #78
- feat: refactor fp8 quant config by @RuibinCheung in #106
- feat: support rocm 7 by @xiaobochen-amd in #107
- fix: building on cpu nodes should not infer device arch by @GeneDer in #108
- chore: clean ck code by @xiaobochen-amd in #109
- doc: update readme by @xiaobochen-amd in #111
- doc(deepep): add DeepEP doc. by @zhenhuang12 in #110
- chore: clean unused cpp code & update benchmark code by @xiaobochen-amd in #112
- feat: build for arch==native by @xiaobochen-amd in #115
- doc: update primus product matrix. by @xiaobochen-amd in #116
- fix: gfx950 correctness issue by @kyle-256 in #117
- perf: update grouped gemm config on gfx950 to improve performance by @kyle-256 in #118
- build: remove -fgpu-rdc and --hip-link args by @wenxie-amd in #119
New Contributors
- @yuankaichen-amd made their first contribution in #96
- @GeneDer made their first contribution in #108
- @wenxie-amd made their first contribution in #119
Full Changelog: v0.1.0...v0.1.1
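#100 above switches the FP8 flash-attention triton kernel to blockwise scaling for V: one scale per fixed-size block rather than one per tensor, which limits how far a single outlier can crush the dynamic range of its neighbors. The idea can be sketched in plain NumPy (a simulation only: it computes per-block scales and clips to the e4m3 range but does not actually round to fp8; all names here are illustrative, not Primus-Turbo's API):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite float8 e4m3 value

def quantize_blockwise(x, block=128):
    """Per-block scaling along the last axis: one scale per `block`
    elements, chosen so each block's max-abs maps to the fp8 max."""
    n = x.shape[-1]
    pad = (-n) % block  # pad up to a multiple of the block size
    xp = np.pad(x, [(0, 0)] * (x.ndim - 1) + [(0, pad)])
    blocks = xp.reshape(xp.shape[:-1] + (-1, block))
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    scale = np.where(amax > 0, amax / FP8_E4M3_MAX, 1.0)
    q = np.clip(blocks / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale  # a real kernel would cast q to an fp8 dtype here

def dequantize_blockwise(q, scale, n):
    # Undo the scaling and strip the padding back to the original width.
    x = (q * scale).reshape(q.shape[:-2] + (-1,))
    return x[..., :n]
```

Because this sketch skips the fp8 rounding step, dequantization round-trips exactly; real kernels lose precision only to the fp8 mantissa, not to the per-block scale.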
v0.1.0
What's Changed
- init: Project structure and basic operator setup by @xiaobochen-amd in #1
- ci: Add github CI by @xiaobochen-amd in #2
- feat: Add block-wise fp8 gemm func by @xiaobochen-amd in #6
- ut: Numeric accuracy test for special functions and Gemm by @llying-001 in #8
- feat(async-tp): support fused_all_gather_matmul for async-tp. by @zhenhuang12 in #5
- benchmark: move numerical correctness to benchmark by @llying-001 in #9
- feat(attention): Add attention op. support bf16(ck/triton) and blockwise_fp8(Triton) by @kyle-256 in #7
- chore(format): integrate clang-format into pre-commit by @xiaobochen-amd in #10
- feat(gemm): Integrate ck gemm fp8 blockwise by @xiaobochen-amd in #13
- fix(gemm): ck fp8 gemm add meta func by @xiaobochen-amd in #14
- benchmark: add attention benchmark by @kyle-256 in #12
- fix(gemm): transA&transB use template in ck gemm fp8 blockwise by @xiaobochen-amd in #15
- feat: Add fp8 quant & dequant kernel and fp8 AlltoAll autograd function by @RuibinCheung in #17
- feat(ops): add blockwise fp8 grouped gemm by @xiaobochen-amd in #16
- feat(dtype): standard dtype naming and structure on cpp-side and python-side. by @xiaobochen-amd in #18
- feat(cpp): add cpp common code and float8 by @xiaobochen-amd in #19
- feat: Integrate hipblaslt with fp16/bf16 precision by @RuibinCheung in #20
- fix(attention): fix softmax_scale=None issue; rename TurboAttention by @kyle-256 in #22
- fix: fix attn bug when enabling torch.compile by @xiaobochen-amd in #23
- ci: Use K8S by @haishuok0525 in #29
- feat(modules): Add MXLinear module. by @xiaobochen-amd in #25
- feat(moe): Add fused_moe_router function by @ChengYao-amd in #26
- refactor(gemm): refactor gemm ops/func and linear. by @xiaobochen-amd in #30
- feat(benchmark): add gemm_fp8 & grouped gemm benchmark; add flash_attn into attn benchmark by @kyle-256 in #32
- feat(async-tp): support fused_matmul_reduce_scatter(bf16) for async-tp. by @llying-001 in #28
- feat(async_tp): fused_all_gather_matmul support fp8. by @zhenhuang12 in #27
- feat(gemm): integrate hipblaslt fp8 by @RuibinCheung in #34
- refactor(ops): refactor gemm fp8 blockwise api by @xiaobochen-amd in #36
- feat(benchmark): add a torch pretrain e2e demo by @xiaobochen-amd in #38
- feat(attention): support attention context parallel with all2all communication type triton backend by @ChengYao-amd in #37
- feat(async-tp): remove triton-dist dependency. by @zhenhuang12 in #41
- feat(attention): update aiter, use hsa forward kernel when head_dim=128 by @kyle-256 in #43
- feat(rmsnorm): add rmsnorm hip kernel. by @xiaobochen-amd in #42
- feat(jax): add jax frontend & add rmsnorm lax by @xiaobochen-amd in #45
- feat(attention): add attn-cp all2all ck backend by @ChengYao-amd in #44
- chore: clean some benchmark code and modify grouped gemm code by @xiaobochen-amd in #40
- feat(deep_ep): integrate primus_turbo with deepep. by @zhenhuang12 in #48
- opt rmsnorm kernel perf by @xiaobochen-amd in #47
- feat(build): update build & remove aiter submodule by @xiaobochen-amd in #50
- feat(moe): add fused_moe_router_bkwd_triton by @ChengYao-amd in #51
- feat(attention): combine qkv all2all for cp-attn-a2a, refactor cp-att… by @ChengYao-amd in #53
- fix(deep_ep): fix intranode-dispatch bug and memory access fault on poolside 515B. by @zhenhuang12 in #54
- feat(grouped_gemm): add persistent grouped_gemm kernel. by @kyle-256 in #49
- feat(moe): fused router with scatter routing map & probs by @ChengYao-amd in #56
- feat: refine fp8 alltoall by @RuibinCheung in #57
- feat: update grouped gemm code & benchmark by @xiaobochen-amd in #58
- chore: add license header to source files by @xiaobochen-amd in #61
- feat(async-tp): support gemm_rs_overlap via a pipeline method that splits inputs and uses multi-stream copies by @llying-001 in #60
- feat(moe): fused router support arbitrary experts and selected groups by @ChengYao-amd in #62
- docs(readme): update readme.md by @xiaobochen-amd in #39
- fix(async_tp): make memory order safe. by @zhenhuang12 in #63
- fix(perf): fixed grouped_gemm_variable_k_postprocess perf issues by @xiaobochen-amd in #64
- feat(ck): update ck version and update ck grouped gemm code. by @xiaobochen-amd in #67
- refactor(attention): move files in utils dir to attention by @ChengYao-amd in #68
- fix: avoid aiter print warning in gfx950 by @xiaobochen-amd in #69
- feat: whl package name with git commit by @xiaobochen-amd in #70
- feat: add gfx950 arch to the FP8 FA triton kernel by @hann-wang in #52
- chore(attn): rename q_scale, k_scale, do_scale by @hann-wang in #71
- chore: convert transA/transB to trans_a/trans_b by @xiaobochen-amd in #74
- test(attention): numerical accuracy by @llying-001 in #72
- feat: gemm fp8 tensorwise support multi layouts and fp8-formats by @xiaobochen-amd in #76
- feat(deepep): dispatch return num_recv_tokens_per_expert of type tensor by @zhenhuang12 in #75
- feat(docs): add examples docs by @xiaobochen-amd in #77
- feat: grouped gemm support num_cu by @xiaobochen-amd in #79
- feat(attn): remove 192/128 head_size padding & refactor interface to align with FA by @ChengYao-amd in #80
- chore: update setup and gitci by @xiaobochen-amd in #81
- feat(attn): refactor attn-cp ut by @ChengYao-amd in #82
- refactor(asynctp): refactor async-tp test cases. by @zhenhuang12 in #83
- fix: attention bug by @xiaobochen-amd in #84
- feat(grouped_gemm): grouped_gemm support fp8-tensorwise&fp8-rowwise. by @kyle-256 in #66
- chore: update README by @xiaobochen-amd in #85
- feat: add reduce kernel & opt quant tensorwise perf by @xiaobochen-amd in #86
- chore: update example and add codeowners cfg by @xiaobochen-amd in #87
- chore: add arch=native in setup & fix codeowners error path bug. by @xiaobochen-amd in #88
- feat(grouped_gemm): grouped_gemm_fp8 post-process optimization by @kyle-256 in #89
- 3rdparty: update aiter. by @xiaobochen-amd in #90
- docs: add source references and license headers for FlashAttention and TorchAO kernels by @xiaobochen-amd in #92
- chore: update ck & fix bf16 grouped gemm precision bug when n%256!=0 by @kyle-256 in #95
- chore: bump version to v0.1.0 by @xiaobochen-amd in #97
New Contributors
- @xiaobochen-amd made their first contribution in #1
- @llying-001 made their first contribution in #8
- @zhenhuang12 made their first contribution in https://github.com/AMD-AGI/...