Changes from all commits (759 commits)
f4e46d2
Fix bug.
zikun-li Sep 7, 2024
bacc515
fix: indeterminate output of customAllReduce
chenzhuofu Sep 7, 2024
101c420
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Sep 7, 2024
3a35387
fix: request expected latency
chenzhuofu Sep 8, 2024
9b2245b
feat: add GenerationRequest
chenzhuofu Sep 9, 2024
2112b48
feat: add EmissionMachine to simulate requests arrival
chenzhuofu Sep 9, 2024
86e31c3
chore: minor
chenzhuofu Sep 9, 2024
0997fad
chore: minor
chenzhuofu Sep 9, 2024
ae0b8e3
feat: update load_pending_requests logic
chenzhuofu Sep 9, 2024
132f68f
fix: dead lock in request manager; client wait until server init
chenzhuofu Sep 10, 2024
c57b3ee
feat: client support prompt input with slo_ratio
chenzhuofu Sep 10, 2024
2040cf7
feat: add an prompt processing script
chenzhuofu Sep 10, 2024
03ba37e
style: minor format
chenzhuofu Sep 10, 2024
36fb00e
feat: add slo attainment metric
chenzhuofu Sep 10, 2024
fd6f610
chore: minor
chenzhuofu Sep 10, 2024
6f89252
feat: separate max_tokens_per_batch for SSM and LLM
chenzhuofu Sep 10, 2024
d67d577
chore: remove redundant max_spec_tree_tokens
chenzhuofu Sep 11, 2024
1b5c66e
chore: minor
chenzhuofu Sep 11, 2024
d19cd75
style: format
chenzhuofu Sep 11, 2024
6c20f18
Merge pull request #1494 from flexflow/specscheduler-request-emission
chenzhuofu Sep 12, 2024
6e37125
chore: minor output
chenzhuofu Sep 14, 2024
3c4e50e
Fix bugs in the scheduler.
zikun-li Sep 14, 2024
62ac7ed
feat: add max_tokens_per_prefilling_batch
chenzhuofu Sep 14, 2024
da91d84
feat: support batched prefilling
chenzhuofu Sep 14, 2024
d013079
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Sep 15, 2024
1637ed4
style: format
chenzhuofu Sep 15, 2024
bcb028c
Add a switch for early termination based on slo attainment.
zikun-li Sep 15, 2024
020a210
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
zikun-li Sep 15, 2024
06d332c
fix: memory misalignment
chenzhuofu Sep 15, 2024
cf7b7b9
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Sep 15, 2024
5ddeb11
chore: minor
chenzhuofu Sep 16, 2024
fd6eb7b
Reimplemented add_tokens_to_spec_token_tree.
chenzhuofu Sep 16, 2024
4b4d55c
merge
chenzhuofu Sep 16, 2024
5623fc5
chore: refactor lock
chenzhuofu Sep 16, 2024
f524aac
fix: request per batch
chenzhuofu Sep 17, 2024
d42c6ce
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Sep 17, 2024
0b7a02f
Optimizes CPU performance of the scheduler
chenzhuofu Sep 18, 2024
fa13afa
chore: incr decode add slo attainment
chenzhuofu Sep 18, 2024
86f95dc
Optimized some usage of priority queues.
chenzhuofu Sep 18, 2024
f169812
feat: support slo ratio sampling
chenzhuofu Sep 18, 2024
e1f711b
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Sep 18, 2024
7ae7edd
fix: incr_decode doesn'y have slo attainment metric
chenzhuofu Sep 19, 2024
ff3af26
feat: support early_drop switch
chenzhuofu Sep 19, 2024
1a1dc56
chore: add request_per_second param
chenzhuofu Sep 19, 2024
9f034a4
chore: change early drop logic
chenzhuofu Sep 19, 2024
fe55382
feat: add emission output
chenzhuofu Sep 20, 2024
0420199
Dynamically control tree width to not exceed max_tokens_per_ssm_batch.
chenzhuofu Sep 21, 2024
7c7376a
Simplified the method to add tokens to the token trees.
chenzhuofu Sep 22, 2024
4396fc9
Dynamic max tree depth control
chenzhuofu Sep 24, 2024
eee85fe
feat: update raft dependency (select_k)
chenzhuofu Sep 24, 2024
7caaf72
feat: raft build file
chenzhuofu Sep 24, 2024
2ab10b1
chore: minor
chenzhuofu Sep 24, 2024
57f6378
feat: update argTopk op
chenzhuofu Sep 24, 2024
47be784
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Sep 24, 2024
9fa1f4e
chore: update emission trace
chenzhuofu Sep 24, 2024
0a516c6
feat: add TraceEmissionMachine
chenzhuofu Sep 26, 2024
2071273
Add back old scheduler
chenzhuofu Sep 28, 2024
79f9130
feat: add trace generator
chenzhuofu Oct 1, 2024
f224b5e
fix: initialization issue; read microsecond
chenzhuofu Oct 1, 2024
1fe612b
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Oct 1, 2024
18a70ff
Update nccl (#1507)
goliaro Sep 21, 2024
ebd45d3
speedup docker builds
goliaro Sep 22, 2024
347d9ad
update
goliaro Sep 22, 2024
62925bb
fix: emission time
chenzhuofu Oct 2, 2024
2e5db3c
feat: trace generator add scaling_factor
chenzhuofu Oct 2, 2024
a17ec6e
feat: add old_scheduler option
chenzhuofu Oct 3, 2024
efead4f
feat: cherry-pick https://github.com/flexflow/FlexFlow/commit/9784b5c…
jiazhihao Aug 12, 2024
285696e
update legion version
goliaro Aug 28, 2024
de55a2e
Fix nccl-induced segfault (#1481)
goliaro Aug 31, 2024
71d8a7b
add page_manager and request_manager functions
Bob-Chen222 Oct 3, 2024
0eaca39
add batch_config
Bob-Chen222 Oct 3, 2024
b5fbc8b
Add option to enable old scheduler.
chenzhuofu Oct 4, 2024
a1035f8
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Oct 4, 2024
03eb516
Merge.
chenzhuofu Oct 4, 2024
3fbb364
feat: cherry-pick from https://github.com/flexflow/FlexFlow/pull/1517…
jiazhihao Oct 3, 2024
6482d76
fix: long request support
chenzhuofu Oct 4, 2024
622b8a8
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Oct 4, 2024
a23cddb
fix: memory leakage in file_loader
chenzhuofu Oct 5, 2024
e845953
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Oct 5, 2024
3574e51
feat: support inf slo ratio
chenzhuofu Oct 5, 2024
4accd43
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Oct 5, 2024
272a2e9
chore: minor
chenzhuofu Oct 5, 2024
29f5c69
fix: add logic of batch prefilling, request should be taken back and …
chenzhuofu Oct 6, 2024
dcb61c7
style: minor format
chenzhuofu Oct 6, 2024
1659fde
chore: minor info output
chenzhuofu Oct 6, 2024
a2a5174
chore: use unordered_map in argtopk
chenzhuofu Oct 7, 2024
00a98eb
chore: minor
chenzhuofu Oct 7, 2024
8a28da5
chore: add goodput report
chenzhuofu Oct 7, 2024
1e68324
chore: minor
chenzhuofu Oct 7, 2024
239fe17
chore: replace busy_waiting to condition_variable
chenzhuofu Oct 7, 2024
381a808
feat: make some tasks concurrent
chenzhuofu Oct 8, 2024
151872f
request manager h and request manger cc to be continued
Bob-Chen222 Oct 8, 2024
904364d
Merge remote-tracking branch 'origin/specscheduler' into paged_attent…
chenzhuofu Oct 8, 2024
e3abef8
refactored the interface of block manager but may not be bug free
chenzhuofu Oct 8, 2024
d9ff5ee
chore: add more profiling
chenzhuofu Oct 9, 2024
73dc699
ckpt before build
chenzhuofu Oct 10, 2024
de0b803
some fix
Bob-Chen222 Oct 10, 2024
0e405c1
ready for sanity check
Bob-Chen222 Oct 10, 2024
dec2266
Merge remote-tracking branch 'origin/specscheduler' into paged_attent…
Bob-Chen222 Oct 10, 2024
6b4777e
fix last commit index
Bob-Chen222 Oct 10, 2024
b02d763
chore: minor
Bob-Chen222 Oct 11, 2024
8394f15
fix request id error
Bob-Chen222 Oct 11, 2024
34b3f37
fix: allreduce should handle `elts==0`
Bob-Chen222 Oct 11, 2024
2ec8b5b
fix spec token num
chenzhuofu Oct 11, 2024
b12df8c
fix small error in free_multiple_blocks
chenzhuofu Oct 11, 2024
6298f2a
ckpt single request
Bob-Chen222 Oct 11, 2024
c00ddec
add cleanup
Bob-Chen222 Oct 11, 2024
b1ff323
ckpt before index error in prepare_parameters
Bob-Chen222 Oct 11, 2024
0083c2f
fix: embedding use real batch_size
chenzhuofu Oct 11, 2024
869d326
fix: residualRMSNorm uses real batch size
chenzhuofu Oct 11, 2024
8a3975a
fix token error in prepare_batch_config
Bob-Chen222 Oct 11, 2024
f4e73ea
ckpt, something wrong in the prefilling
Bob-Chen222 Oct 11, 2024
f9d9415
fix: SigmoidSiluMulti uses real batch size
chenzhuofu Oct 12, 2024
6bb79dd
style: format
chenzhuofu Oct 12, 2024
4eeb021
ckpt
Bob-Chen222 Oct 12, 2024
12fafa3
Merge remote-tracking branch 'origin/specscheduler' into paged_attent…
Bob-Chen222 Oct 12, 2024
3ad0ca5
update
Bob-Chen222 Oct 12, 2024
1bafe66
chore: minor
chenzhuofu Oct 12, 2024
03e8c5b
fix: some minor issue
chenzhuofu Oct 13, 2024
efce3e7
fix: reduce cudaGraph memory consumption
chenzhuofu Oct 13, 2024
a224700
feat: add max_output_length
chenzhuofu Oct 13, 2024
5e1cb7c
feat: added upper limit for number of tokens to attain slo
chenzhuofu Oct 14, 2024
e7a8613
chore: minor
chenzhuofu Oct 14, 2024
755d422
feat: modify logic of early stop
chenzhuofu Oct 14, 2024
0185ae1
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Oct 14, 2024
c6b5deb
fix: load request as long as available
chenzhuofu Oct 14, 2024
b1e51b2
fix: bug in early stop
chenzhuofu Oct 14, 2024
ba2504e
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Oct 14, 2024
5975cd2
feat: trace generator sample the prompt
chenzhuofu Oct 14, 2024
8c84538
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Oct 14, 2024
b5b7594
feat: add mean tpot statistic
chenzhuofu Oct 14, 2024
57cfe1b
style: format
chenzhuofu Oct 14, 2024
ad9b240
fix: modify token add toward slo
chenzhuofu Oct 15, 2024
9cf66c1
fix: max_spec_tree_token_num
chenzhuofu Oct 15, 2024
c8d442d
feat: add two naive scheduling policies
chenzhuofu Oct 16, 2024
945dee9
Merge remote-tracking branch 'origin/specscheduler' into paged_attent…
Bob-Chen222 Oct 20, 2024
19e41d6
add some docuementation and delete print
Bob-Chen222 Oct 21, 2024
b1793fb
add additional flag max-kv-cache-size
Bob-Chen222 Oct 21, 2024
26cbf6a
chore: typo
chenzhuofu Oct 21, 2024
3ec91d9
fix tokenizer conversion
Oct 15, 2024
3f0383e
update
Oct 15, 2024
3f590ae
update
Oct 15, 2024
14eb152
update
sfc-gh-goliaro Oct 22, 2024
13615f4
add special tokens
sfc-gh-goliaro Oct 22, 2024
5a0c1ca
Update LLAMA tokenizer (#1524)
sfc-gh-goliaro Sep 29, 2024
f11bcf0
rope
sfc-gh-goliaro Oct 22, 2024
674eed7
fix
sfc-gh-goliaro Oct 22, 2024
2dab7cb
fix
sfc-gh-goliaro Oct 22, 2024
92199d0
linting
sfc-gh-goliaro Oct 22, 2024
3f61102
fix
sfc-gh-goliaro Oct 22, 2024
7f1c4e3
fix
sfc-gh-goliaro Oct 22, 2024
0fe773a
Merge pull request #1530 from flexflow/sd
chenzhuofu Oct 22, 2024
69f41f5
feat: set concurrency_barrier for nccl op
chenzhuofu Oct 23, 2024
fe39c54
chore: unify update_custom_mask calling
chenzhuofu Oct 23, 2024
d5e4d41
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Oct 23, 2024
f882347
style: format
chenzhuofu Oct 23, 2024
4593c79
fix: StreamingLLM custom_mask
chenzhuofu Oct 23, 2024
38c9610
fix: streamingllm execute correctly!
chenzhuofu Oct 24, 2024
e40f47f
fix: interleaving acc rate
chenzhuofu Oct 24, 2024
eac11f0
fix: minor
chenzhuofu Oct 24, 2024
d09259e
fix
jinhongyii Oct 26, 2024
f720144
feat: update weight file naming style
goliaro Feb 20, 2024
1c654af
fix: file_loader
goliaro Feb 20, 2024
0a9cb8f
fix
goliaro Feb 21, 2024
6f13b5b
fix
goliaro Feb 21, 2024
82cb8de
feat: update file_loader to latest ver. on `peft`
chenzhuofu Oct 30, 2024
1989185
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Oct 30, 2024
7667483
fix: alignment issue
chenzhuofu Oct 31, 2024
7b7db5b
fix: use double for latencies
chenzhuofu Nov 2, 2024
da92d53
feat: modified the logic of distributing the budget across requests
chenzhuofu Nov 4, 2024
fd65a90
Merge remote-tracking branch 'origin/specscheduler' into paged_attent…
Bob-Chen222 Nov 4, 2024
832f5cb
fix for merge
Bob-Chen222 Nov 4, 2024
4a7162f
init page manager at request manager init and clean the format
Bob-Chen222 Nov 4, 2024
6b74f93
ckpt
Bob-Chen222 Nov 5, 2024
20cb714
refactor and add kv cache flag via page manager
Bob-Chen222 Nov 5, 2024
311c450
ckpt for performance issue
Bob-Chen222 Nov 5, 2024
a493f2a
first attempt in incr decoding with page attention
Bob-Chen222 Nov 5, 2024
5250a3b
ckpt for nothing
Bob-Chen222 Nov 6, 2024
18d6d45
feat: modify the logic of the scheduler
chenzhuofu Nov 7, 2024
810983e
fix compilation error
Bob-Chen222 Nov 7, 2024
f7656be
all good for spec, now test incr
Bob-Chen222 Nov 7, 2024
8c203ec
typo
Bob-Chen222 Nov 7, 2024
3c158f8
workable incrdecoding!
Bob-Chen222 Nov 7, 2024
3b34a5b
Merge remote-tracking branch 'origin/specscheduler' into paged_attent…
Bob-Chen222 Nov 7, 2024
7d612f7
refactor
Bob-Chen222 Nov 8, 2024
07ec33e
some format
Bob-Chen222 Nov 8, 2024
dad3d0f
Update request_manager.h
Bob-Chen222 Nov 8, 2024
1693455
Update llama.cc
Bob-Chen222 Nov 8, 2024
a17c130
Update spec_infer.cc
Bob-Chen222 Nov 8, 2024
0f16daf
Update trace_generator.cc
Bob-Chen222 Nov 8, 2024
ff7de09
Update tree_inc_multihead_self_attention.cu
Bob-Chen222 Nov 8, 2024
e3815a9
Update tree_inc_multihead_self_attention.cu
Bob-Chen222 Nov 8, 2024
38f6ef8
Update tree_inc_multihead_self_attention.cu
Bob-Chen222 Nov 8, 2024
80ea225
Update page_manager.cc
Bob-Chen222 Nov 8, 2024
5fe3a8a
Update request_manager.cc
Bob-Chen222 Nov 8, 2024
a721926
Update request_manager.cc
Bob-Chen222 Nov 8, 2024
1e7e2ec
Update request_manager.cc
Bob-Chen222 Nov 8, 2024
1792981
Update request_manager.cc
Bob-Chen222 Nov 8, 2024
95023e6
final update
Bob-Chen222 Nov 8, 2024
9ce11b2
feat: load weights in parallel
goliaro Nov 9, 2024
b885c63
fix: compile bug
chenzhuofu Nov 14, 2024
9e062be
feat: upgrade to llama3 rope
chenzhuofu Nov 14, 2024
b798385
Specscheduler evaluation support code (#1541)
goliaro Nov 15, 2024
2990c88
cleanup
goliaro Nov 15, 2024
30efe4d
feat: use custom allreduce for performance
chenzhuofu Nov 16, 2024
76df177
chore: minor
chenzhuofu Nov 16, 2024
6c3bebc
chore: minor
chenzhuofu Nov 16, 2024
13dcb23
Merge pull request #1542 from flexflow/specscheduler_eval
chenzhuofu Nov 16, 2024
48b4153
fix: argtopk memory
chenzhuofu Nov 17, 2024
65f7f52
chore: eliminate inconsistence
goliaro Nov 19, 2024
127ca97
fix: add Legion concurrent_task_barrier to eliminate dead lock in All…
goliaro Nov 19, 2024
a74775b
feat: add SSM_TP
goliaro Nov 19, 2024
54acb6d
chore: minor
goliaro Nov 19, 2024
d845cb2
feat: add flashinfer ResidualRMSNorm
chenzhuofu Nov 21, 2024
1c2875f
feat: improve ResidualRMSNorm
chenzhuofu Nov 23, 2024
1f6dab4
fix: AllReduce minor
chenzhuofu Nov 23, 2024
8fb3917
style: format
chenzhuofu Nov 23, 2024
075d7b2
chore: remove unused
chenzhuofu Nov 23, 2024
4ce7256
chore: remove the concurrent_task_barrier wrapping customAllReduce
chenzhuofu Nov 23, 2024
7a820c1
feat: add device_prop to ff_handle
chenzhuofu Nov 28, 2024
c911faa
feat: add pytorch gemm_cublas
chenzhuofu Nov 28, 2024
d09124c
feat: add pytorch GEMM
chenzhuofu Nov 29, 2024
115a3ff
chore: remove unused
chenzhuofu Nov 29, 2024
1a5803e
feat: add absolute slo constraint
chenzhuofu Dec 4, 2024
7e29665
style: format
chenzhuofu Dec 4, 2024
afaa88f
feat: add seperate server baseline
chenzhuofu Dec 4, 2024
841bee1
fix: update tree depth
chenzhuofu Dec 5, 2024
b0a5918
feat: add a switch for fcfs baseline
chenzhuofu Dec 6, 2024
4c1b2ce
feat: added data structures in request manager to handle preempted re…
chenzhuofu Dec 6, 2024
9fb8885
fix: use num tokens to decode to replace spare latency
chenzhuofu Dec 8, 2024
aa2d36d
feat: support the policy fcfs and smallest time to attain
chenzhuofu Dec 8, 2024
04cf206
chore: scheduling policy minor enhancement
chenzhuofu Dec 9, 2024
522473b
Merge branch 'specscheduler' of github.com:flexflow/FlexFlow into spe…
chenzhuofu Dec 9, 2024
847ec41
Merge branch 'specscheduler' into coutinuous-batching-schedulers
chenzhuofu Dec 9, 2024
3e619d8
Merge pull request #1554 from flexflow/coutinuous-batching-schedulers
chenzhuofu Dec 9, 2024
76decb3
chore: minor
chenzhuofu Dec 9, 2024
b920838
feat: add overhead breakdown
chenzhuofu Dec 9, 2024
17cbc9c
fix: overhead breakdown
chenzhuofu Dec 10, 2024
a21f9fb
style: format
chenzhuofu Dec 10, 2024
bc67e97
:Merge branch 'specscheduler' of https://github.com/flexflow/flexflow…
chenzhuofu Jan 24, 2025
a5b7de6
fix: minor
goliaro Jan 26, 2025
76c23c0
feat: merge misc. from `page_attention_new`
chenzhuofu Jan 26, 2025
9c042f5
fix: merge page_manager, also fix some issues
chenzhuofu Jan 26, 2025
2a751fd
style: format code
chenzhuofu Jan 26, 2025
3ed67e4
fix: minor
chenzhuofu Jan 26, 2025
69b9f72
fix: merge page_manager, also fix some issues
chenzhuofu Jan 26, 2025
e0eca51
style: format code
chenzhuofu Jan 26, 2025
0f13a92
Merge branch 'paged_attention_new' of https://github.com/flexflow/fle…
chenzhuofu Jan 26, 2025
e2d6fc6
chore: remove outdated comments
chenzhuofu Jan 31, 2025
918356d
Merge pull request #82 from flexflow/paged_attention_new
chenzhuofu Jan 31, 2025
7 changes: 7 additions & 0 deletions .gitignore
@@ -6,6 +6,8 @@ python/flexflow/core/flexflow_cffi_header.py
*.pb.h
*.o
*.a
*.nsys-rep
*.nfs*

# Byte-compiled / optimized / DLL files
__pycache__/
@@ -188,3 +190,8 @@ python/flexflow/version.txt

inference_tensors
tests/inference/python_test_configs/*.json

core.*
*.out
sharegpt.json
wildchat.json
8 changes: 7 additions & 1 deletion .gitmodules
@@ -22,4 +22,10 @@
[submodule "deps/tokenizers-cpp"]
path = deps/tokenizers-cpp
url = https://github.com/mlc-ai/tokenizers-cpp.git
fetchRecurseSubmodules = true
fetchRecurseSubmodules = true
[submodule "deps/flashinfer"]
path = deps/flashinfer
url = https://github.com/flashinfer-ai/flashinfer.git
[submodule "deps/raft"]
path = deps/raft
url = https://github.com/rapidsai/raft.git
27 changes: 26 additions & 1 deletion CMakeLists.txt
@@ -4,6 +4,12 @@ project(FlexFlow)

include(ExternalProject)

enable_language(CXX)
enable_language(CUDA)
if (CMAKE_CXX_COMPILER_VERSION VERSION_LESS 8)
message(FATAL_ERROR "Your C++ compiler is too old. Please upgrade to version 8 or higher.")
endif()

# Set policy CMP0074 to eliminate cmake warnings
cmake_policy(SET CMP0074 NEW)
cmake_policy(SET CMP0077 NEW)
@@ -128,6 +134,9 @@ list(APPEND CC_FLAGS
list(APPEND NVCC_FLAGS
-std=c++17)

list(APPEND NVCC_FLAGS
--expt-relaxed-constexpr
--extended-lambda)

add_compile_options(${CC_FLAGS})
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} ${NVCC_FLAGS})
@@ -201,6 +210,12 @@ if(NOT BUILD_LEGION_ONLY)
# optional
include(optional)

set(CMAKE_PREFIX_PATH ${CMAKE_PREFIX_PATH} ${CMAKE_CURRENT_SOURCE_DIR}/deps/raft/cpp/build/install)
find_package(raft)
list(APPEND FLEXFLOW_INCLUDE_DIRS ${CMAKE_CURRENT_SOURCE_DIR}/deps/raft/cpp/include)

list(APPEND FLEXFLOW_INCLUDE_DIRS ${CMAKE_CURRENT_SOURCE_DIR}/deps/flashinfer/include)

if (FF_GPU_BACKEND STREQUAL "cuda")
list(APPEND FF_CC_FLAGS
-DFF_USE_CUDA)
@@ -290,6 +305,12 @@ if(NOT BUILD_LEGION_ONLY)
LIST_DIRECTORIES False
${FLEXFLOW_ROOT}/src/*.cu)

# tensorrt_llm custom allreduce
if(FF_USE_NCCL)
list(APPEND FLEXFLOW_INCLUDE_DIRS ${CMAKE_CURRENT_SOURCE_DIR}/deps/tensorrt_llm)
list(APPEND FLEXFLOW_GPU_SRC ${CMAKE_CURRENT_SOURCE_DIR}/deps/tensorrt_llm/tensorrt_llm/custom_allreduce_kernels.cu)
endif()

add_compile_definitions(FF_USE_CUDA)

if(BUILD_SHARED_LIBS)
@@ -397,6 +418,8 @@ if(NOT BUILD_LEGION_ONLY)
target_link_libraries(flexflow ${LEGION_LIBRARY} ${FLEXFLOW_EXT_LIBRARIES} nlohmann_json::nlohmann_json mpark_variant optional)
endif()

target_link_libraries(flexflow raft::raft)

#library api version, bump from time to time
set(SOVERSION 1)

@@ -425,7 +448,7 @@ if(NOT BUILD_LEGION_ONLY)
# generate the Legion Python bindings library. When building from pip, we need to do this post-install to prevent Legion from overwriting the path to the Legion shared library
add_custom_command(TARGET flexflow
POST_BUILD
COMMAND ${Python_EXECUTABLE} ${CMAKE_CURRENT_SOURCE_DIR}/deps/legion/bindings/python/setup.py build --cmake-build-dir ${Legion_BINARY_DIR}/runtime --prefix ${Legion_BINARY_DIR} --build-lib=${Legion_BINARY_DIR}/bindings/python ${Legion_PYTHON_EXTRA_INSTALL_ARGS}
COMMAND CMAKE_BUILD_DIR=${Legion_BINARY_DIR}/runtime CMAKE_INSTALL_PREFIX=${Legion_BINARY_DIR} ${Python_EXECUTABLE} ${CMAKE_CURRENT_SOURCE_DIR}/deps/legion/bindings/python/setup.py build --build-lib=${Legion_BINARY_DIR}/bindings/python ${Legion_PYTHON_EXTRA_INSTALL_ARGS}
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/deps/legion/bindings/python
)
# create flexflow_python interpreter. When building from pip, we install the FF_HOME/python/flexflow_python script instead.
@@ -557,7 +580,9 @@ if(NOT BUILD_LEGION_ONLY)

if(FF_BUILD_ALL_INFERENCE_EXAMPLES OR FF_BUILD_ALL_EXAMPLES)
add_subdirectory(inference/spec_infer)
add_subdirectory(inference/simplified_infer)
add_subdirectory(inference/incr_decoding)
add_subdirectory(inference/trace_generator)
endif()


7 changes: 5 additions & 2 deletions FlexFlow.mk
@@ -95,9 +95,12 @@ ifneq ($(strip $(FF_USE_PYTHON)), 1)
endif


INC_FLAGS += -I${FF_HOME}/include -I${FF_HOME}/inference -I${FF_HOME}/deps/optional/include -I${FF_HOME}/deps/variant/include -I${FF_HOME}/deps/json/include -I${FF_HOME}/deps/tokenizers-cpp/include -I${FF_HOME}/deps/tokenizers-cpp/sentencepiece/src
INC_FLAGS += -I${FF_HOME}/include -I${FF_HOME}/inference -I${FF_HOME}/deps/optional/include -I${FF_HOME}/deps/variant/include -I${FF_HOME}/deps/json/include -I${FF_HOME}/deps/tokenizers-cpp/include -I${FF_HOME}/deps/tokenizers-cpp/sentencepiece/src \
-I${FF_HOME}/deps/raft/cpp/include -I${FF_HOME}/deps/rmm/include -I${FF_HOME}/deps/spdlog/include \
-I${FF_HOME}/deps/flashinfer/include
CC_FLAGS += -DMAX_TENSOR_DIM=$(MAX_DIM) -DLEGION_MAX_RETURN_SIZE=32768
NVCC_FLAGS += -DMAX_TENSOR_DIM=$(MAX_DIM) -DLEGION_MAX_RETURN_SIZE=32768
NVCC_FLAGS += -DMAX_TENSOR_DIM=$(MAX_DIM) -DLEGION_MAX_RETURN_SIZE=32768 \
--expt-relaxed-constexpr --extended-lambda
HIPCC_FLAGS += -DMAX_TENSOR_DIM=$(MAX_DIM) -DLEGION_MAX_RETURN_SIZE=32768
GASNET_FLAGS +=
# For Point and Rect typedefs
Binary file added benchmarking/average_accepted_tokens.pdf
88 changes: 88 additions & 0 deletions benchmarking/benchmark_incr_dec.sh
@@ -0,0 +1,88 @@
#! /usr/bin/env bash
set -x
set -e

# Cd into directory holding this script
cd "${BASH_SOURCE[0]%/*}/../build"

# export BUILD_TYPE=Debug
# ../config/config.linux
make -j install

model_name=meta-llama/Llama-3.1-70B-Instruct
NGPUS=8
NCPUS=16
FSIZE=36000
ZSIZE=200000
CSIZE=100000

# comment these lines in for debugging
# model_name=meta-llama/Llama-3.1-8B-Instruct
# NGPUS=8
# FSIZE=36000
# ZSIZE=30000
# CSIZE=100000



MAX_SEQ_LEN=7000
tokens_per_batch=1024

batch_sizes=(
8
4
)

request_per_second_values=(
-1
1
2
4
8
)

dataset_name="sharegpt"
dataset_fp="../benchmarking/${dataset_name}.json"
partition_name="all"

export LEGION_BACKTRACE=1

# python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='meta-llama/Llama-3.1-70B-Instruct', allow_patterns='*.safetensors', max_workers=30)"
# python ../inference/utils/download_hf_model.py --half-precision-only $model_name --refresh-cache

for k in "${!request_per_second_values[@]}"; do
for j in "${!batch_sizes[@]}"; do
batch_size=${batch_sizes[$j]}
request_per_second=${request_per_second_values[$k]}

echo "Running dataset ${dataset_fp} with model ${model_name}, batch size ${batch_size}, tokens per batch ${tokens_per_batch}, and request per second ${request_per_second}"
# create model name version where "/" is replaced with "-"
model_name_=$(echo $model_name | tr / -)
if [ $request_per_second -gt 0 ]; then
rate=$request_per_second
else
rate="offline"
fi
log_fp="/usr/FlexFlow/inference/output/incr_dec_llm_${model_name_}_bz_${batch_size}_rate_${rate}_dataset_${dataset_name}.log"
output_fp="/usr/FlexFlow/inference/output/incr_dec_llm_${model_name_}_bz_${batch_size}_rate_${rate}_dataset_${dataset_name}.json"
metrics_fp="/usr/FlexFlow/inference/output/incr_dec_llm_${model_name_}_bz_${batch_size}_rate_${rate}_dataset_${dataset_name}.csv"
rm $metrics_fp $output_fp $log_fp || true

time ./inference/simplified_infer/incr_dec \
-ll:gpu $NGPUS -ll:cpu $NCPUS -ll:util $NCPUS \
-tensor-parallelism-degree $NGPUS \
-ll:fsize $FSIZE -ll:zsize $ZSIZE -ll:csize $CSIZE \
--fusion \
--max-sequence-length $MAX_SEQ_LEN \
--max-requests-per-batch $batch_size \
--max-tokens-per-batch $tokens_per_batch \
--max-output-length 1024 \
--request-per-second ${request_per_second} \
-llm-model $model_name \
-trace ${dataset_fp} \
-trace-output-path ${output_fp} \
-csv-output-path $metrics_fp \
-target-partition ${partition_name} \
2>&1 | tee ${log_fp}
done
done
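The `--request-per-second` flag above feeds the request-arrival simulation this PR introduces (the `EmissionMachine` commits), with a negative rate meaning "offline", i.e. all requests available up front. A minimal sketch of Poisson-process arrival generation under that assumption — the helper name and exact semantics are illustrative, not FlexFlow's actual implementation:

```python
import random

def emission_times(num_requests: int, request_per_second: float,
                   seed: int = 0) -> list[float]:
    """Generate arrival timestamps (seconds) for a Poisson arrival process.

    A non-positive rate models the offline setting: every request is
    available at time zero.
    """
    if request_per_second <= 0:
        return [0.0] * num_requests
    rng = random.Random(seed)
    times, now = [], 0.0
    for _ in range(num_requests):
        # Exponential inter-arrival time with mean 1 / rate.
        now += rng.expovariate(request_per_second)
        times.append(now)
    return times

arrivals = emission_times(5, request_per_second=2.0)
```

Sweeping `request_per_second` over `(-1 1 2 4 8)` as the script does then covers both the offline case and increasing online load.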
109 changes: 109 additions & 0 deletions benchmarking/benchmark_specinfer.sh
@@ -0,0 +1,109 @@
#! /usr/bin/env bash
set -x
set -e

# Cd into directory holding this script
cd "${BASH_SOURCE[0]%/*}/../build"

# export BUILD_TYPE=Debug
# ../config/config.linux
make -j
source ./set_python_envs.sh
# reset

model_name=meta-llama/Llama-3.1-70B-Instruct
NGPUS=8
NCPUS=16
FSIZE=36000
ZSIZE=200000
CSIZE=100000

# comment these lines in for debugging
# model_name=meta-llama/Llama-3.1-8B-Instruct
# NGPUS=8
# FSIZE=36000
# ZSIZE=30000
# CSIZE=100000
######################################

small_model_names=(
Zhuominc/Llama-3-330M
meta-llama/Llama-3.2-1B-Instruct
meta-llama/Llama-3.2-3B-Instruct
meta-llama/Llama-3.1-8B-Instruct
)

MAX_SEQ_LEN=7000
tokens_per_batch=1024
max_tree_depth=8
expansion_degree=3

batch_sizes=(
8
4
)

request_per_second_values=(
-1
1
2
4
8
)

dataset_name="sharegpt"
dataset_fp="../benchmarking/${dataset_name}.json"
partition_name="all"

export LEGION_BACKTRACE=1

# python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='meta-llama/Llama-3.1-70B-Instruct', allow_patterns='*.safetensors', max_workers=30)"
python ../inference/utils/download_hf_model.py --half-precision-only $model_name
for small_model_name in "${small_model_names[@]}"; do
python ../inference/utils/download_hf_model.py --half-precision-only $small_model_name
done

for k in "${!request_per_second_values[@]}"; do
for j in "${!batch_sizes[@]}"; do
for i in "${!small_model_names[@]}"; do
small_model_name=${small_model_names[$i]}
batch_size=${batch_sizes[$j]}
request_per_second=${request_per_second_values[$k]}

echo "Running dataset ${dataset_fp} with model ${model_name}, draft model ${small_model_name}, batch size ${batch_size}, tokens per batch ${tokens_per_batch}, and request per second ${request_per_second}"
# create model name version where "/" is replaced with "-"
model_name_=$(echo $model_name | tr / -)
small_model_name_=$(echo $small_model_name | tr / -)
if [ $request_per_second -gt 0 ]; then
rate=$request_per_second
else
rate="offline"
fi
log_fp="/usr/FlexFlow/inference/output/specinfer_llm_${model_name_}_ssm_${small_model_name_}_bz_${batch_size}_rate_${rate}_dataset_${dataset_name}.log"
output_fp="/usr/FlexFlow/inference/output/specinfer_llm_${model_name_}_ssm_${small_model_name_}_bz_${batch_size}_rate_${rate}_dataset_${dataset_name}.json"
metrics_fp="/usr/FlexFlow/inference/output/specinfer_llm_${model_name_}_ssm_${small_model_name_}_bz_${batch_size}_rate_${rate}_dataset_${dataset_name}.csv"
rm $metrics_fp $output_fp $log_fp || true

time ./inference/suffix_decoding/specinfer \
-ll:gpu $NGPUS -ll:cpu $NCPUS -ll:util $NCPUS \
-tensor-parallelism-degree $NGPUS \
-ssm-tp-degree $NGPUS \
-ll:fsize $FSIZE -ll:zsize $ZSIZE -ll:csize $CSIZE \
--fusion \
--max-sequence-length $MAX_SEQ_LEN \
--max-requests-per-batch $batch_size \
--max-tokens-per-batch $tokens_per_batch \
--max-output-length 1024 \
--max-tree-depth ${max_tree_depth} \
--expansion-degree ${expansion_degree} \
--request-per-second ${request_per_second} \
-llm-model $model_name \
-ssm-model $small_model_name \
-trace ${dataset_fp} \
-trace-output-path ${output_fp} \
-csv-output-path $metrics_fp \
-target-partition ${partition_name} \
2>&1 | tee ${log_fp}
done
done
done
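Several commits above add an SLO-attainment metric (and later a goodput report) to runs like these. A hedged sketch of how such a metric might be computed — treating `slo_ratio` as a per-token latency budget relative to a baseline time-per-output-token (TPOT) is an assumption about the PR's definition, not a statement of it:

```python
def slo_attainment(latencies, output_lengths, slo_ratio, baseline_tpot):
    """Fraction of requests whose mean time-per-output-token stays
    within slo_ratio * baseline_tpot.

    latencies: end-to-end latency per request, in seconds
    output_lengths: number of generated tokens per request
    """
    budget = slo_ratio * baseline_tpot
    attained = sum(
        1 for lat, n in zip(latencies, output_lengths)
        if n > 0 and lat / n <= budget
    )
    return attained / len(latencies)

# Example: baseline TPOT of 50 ms and slo_ratio 1.5 give a 75 ms/token budget;
# the third request averages 120 ms/token and misses it.
rate = slo_attainment([1.0, 2.0, 6.0], [20, 40, 50],
                      slo_ratio=1.5, baseline_tpot=0.05)
```

A goodput-style report would weight the attained requests by their generated tokens rather than counting requests equally.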