in this repository, i'm going to implement increasingly complex llm inference optimizations!
to understand the basics of how llms work, refer to my other repository where i implement llama3 from scratch and explain how llms work one matrix multiplication at a time: https://github.com/naklecha/llama3-from-scratch. that repo has like 15k stars, kinda wild!
(this repo is currently wip)
note: this is not production quality code, and it never will be, it's just for educational purposes. i like single-file codebases with no functions and no classes, they're easy to understand. also, writing code this way is a lot more aesthetic.
- 0.py (15 tokens per second) similar to my llama3-from-scratch repo (logs)
- 1.py (15 tokens per second) same as 0.py but for multiple prompts (logs)
- 2.py (116 tokens per second) uses batched matrix multiplications for the attention computation, all heads at once (sketched below) (logs)
- 3.py (342 tokens per second) generates tokens for multiple prompts in parallel, with improved matrix operations (logs)
- 4.py (7160 tokens per second) adds kv caching and proper batch processing, removing completed prompts from the batch (sketched below) (logs)
.
.
.
one day this will be 50.py & faster than vllm.
- baseline.py (10514 tokens per second, single prompt) uses vllm, 50 prompts at once, 500 tokens generated each
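to make the 2.py jump concrete, here's a minimal sketch of what "all heads at once" means: stack the per-head q/k/v tensors and let a single batched matmul compute every head's attention scores in one shot. the shapes and variable names below are illustrative, not the exact ones in 2.py.

```python
import torch

# illustrative shapes, not the exact ones used in 2.py
n_heads, seq_len, head_dim = 32, 128, 64

# stack the per-head projections into one tensor: [n_heads, seq_len, head_dim]
q = torch.randn(n_heads, seq_len, head_dim)
k = torch.randn(n_heads, seq_len, head_dim)
v = torch.randn(n_heads, seq_len, head_dim)

# one batched matmul computes scores for every head at once
# scores: [n_heads, seq_len, seq_len]
scores = q @ k.transpose(-2, -1) / (head_dim ** 0.5)

# causal mask so each position only attends to itself and earlier positions
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
scores = scores + mask

weights = torch.softmax(scores, dim=-1)
out = weights @ v  # [n_heads, seq_len, head_dim], no python loop over heads
```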
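and a rough sketch of the two ideas behind 4.py: kv caching (only the newest token's k/v get computed each step, past tokens come from the cache) and shrinking the batch as prompts finish. the projections, eos handling, and shapes here are stand-ins, not the real 4.py code.

```python
import torch

batch, n_heads, head_dim = 4, 8, 64
eos_id = 128009  # stand-in eos token id

# cache of keys/values for all previously processed tokens
k_cache = torch.empty(batch, n_heads, 0, head_dim)
v_cache = torch.empty(batch, n_heads, 0, head_dim)

for step in range(3):  # pretend decode loop
    # each step only projects the single newest token (random stand-ins here)
    new_q = torch.randn(batch, n_heads, 1, head_dim)
    new_k = torch.randn(batch, n_heads, 1, head_dim)
    new_v = torch.randn(batch, n_heads, 1, head_dim)

    # append to the cache instead of recomputing k/v for the whole sequence
    k_cache = torch.cat([k_cache, new_k], dim=2)
    v_cache = torch.cat([v_cache, new_v], dim=2)

    # attention for the new token against everything cached so far
    scores = new_q @ k_cache.transpose(-2, -1) / (head_dim ** 0.5)
    out = torch.softmax(scores, dim=-1) @ v_cache  # [batch, n_heads, 1, head_dim]

    # batch shrinking: drop any prompt that just produced eos from every tensor
    next_tokens = torch.randint(0, 128256, (batch,))  # stand-in for sampled tokens
    alive = next_tokens != eos_id
    k_cache, v_cache = k_cache[alive], v_cache[alive]
    batch = int(alive.sum())
```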
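the baseline number comes from a script along these lines (baseline.py has the real one; the prompts and sampling settings below are placeholders):

```python
from vllm import LLM, SamplingParams

# placeholder prompts; the actual baseline uses its own prompt set
prompts = [f"tell me a story about topic {i}" for i in range(50)]  # 50 prompts at once
sampling_params = SamplingParams(max_tokens=500)  # 500 tokens generated each

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```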
note: i'm adding c++ variants of these files to the cpp-implementations folder. so far i've implemented 4.py in c++. (logs)
- hardware right now: a single rtx 4090
- model: llama3.2-1b-instruct
- batch size: 50 prompts at once
- if you don't want to run the code yourself, you can look at the outputs in the outputs folder.