
CharryWu/learn-cuda


I would like to teach myself to become a machine learning infrastructure software engineer, able to land a mid-level or senior offer from one of the hot AI startups. I'm a busy frontend developer at Intuit working a 955 schedule, so I will only have time on weekends and weekday nights to learn. My learning has to be pinpoint and interview-oriented. I'd like you to be my teacher in machine learning and infrastructure (CUDA, vLLM, PyTorch, linear algebra, pre-training, post-training). Lay out a roadmap of items I should learn. I'd also like to build working code samples along the way and post them to my personal blog / toy projects on GitHub to validate my learning. Give me optimized learning strategies so I can digest the important concepts in the shortest amount of time and be able to explore each topic in depth on my own.

This is an ambitious but achievable transition. Since you are already a software engineer (and a busy one), the "standard" path of watching 40-hour lectures on Coursera will fail you. You will lose momentum before you write a single line of CUDA.

To get a mid/senior offer at a hot AI startup (like Anthropic, Databricks, or a newer lab), you need to move from "using" libraries to "building" them.

Here is your Pinpoint, Build-First Roadmap.

Core Learning Strategy: "The Vertical Slice"

Don't learn "Linear Algebra" then "C++" then "CUDA" in isolation. Instead, pick one critical component of LLM infrastructure (e.g., the Attention Mechanism) and learn the entire stack for that specific thing:

  • Math: How matrix multiplication and Softmax work.
  • Kernel: How to write it in CUDA.
  • Framework: How to wrap it in PyTorch.
  • System: How to serve it in vLLM.

This keeps your learning interview-ready at every stage.
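To see what the "Math" layer of this vertical slice looks like, here is a minimal NumPy sketch of scaled dot-product attention with a numerically stable softmax. The shapes and names are illustrative, not taken from any particular codebase:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Once this one screen of math is internalized, every later layer (the CUDA kernel, the PyTorch wrapper, the vLLM scheduler) is just a faster or more scalable way of computing the same three lines.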


Phase 1: The Foundation (C++ & CUDA)

Goal: Understand how GPUs actually work and write your first kernel. Timeframe: 4-6 Weeks (Evenings/Weekends)

1. The Crash Course (Don't over-study)
You need C++, but only enough to write kernels. You don't need to master template metaprogramming yet.

  • Learn: Pointers, memory management (malloc/free), and the difference between Host (CPU) and Device (GPU) memory.
  • Resource: NVIDIA's CUDA C++ Programming Guide (read Chapters 1-3 only).
  • Watch: CUDA Mode lectures on YouTube. This is the gold standard community for ML systems right now.

2. Toy Project 1: "The Matrix Multiplier"
Write a Python script that multiplies two matrices. Then, write a C++/CUDA extension to do it faster.

  • Task A: Implement naive matrix multiplication as a CUDA __global__ kernel.
  • Task B: Optimize it using Shared Memory and Tiling (This is a guaranteed interview question).
  • Deliverable: A GitHub repo cuda-matmul-f32 where you run python benchmark.py and compare your custom kernel against PyTorch's cuBLAS-backed matmul (beating cuBLAS is unrealistic; closing the gap, and crushing your own naive kernel, is the point).
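Before writing the CUDA version of Task B, the tiling logic can be prototyped in plain NumPy. This sketch only mirrors the *structure* of a shared-memory tiled kernel; the tile size and names are illustrative:

```python
import numpy as np

TILE = 2  # tile width; real kernels typically use 16 or 32

def tiled_matmul(A, B):
    """Tile-by-tile matmul mirroring the shared-memory CUDA pattern:
    each (i, j) output tile accumulates products of A/B sub-tiles
    staged from 'global' into 'shared' memory one K-step at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TILE == N % TILE == K % TILE == 0
    C = np.zeros((M, N))
    for i in range(0, M, TILE):          # one thread block per output tile
        for j in range(0, N, TILE):
            acc = np.zeros((TILE, TILE))  # per-block accumulator ("registers")
            for k in range(0, K, TILE):
                a_tile = A[i:i+TILE, k:k+TILE]  # "load tile into shared memory"
                b_tile = B[k:k+TILE, j:j+TILE]
                acc += a_tile @ b_tile          # compute, then "__syncthreads()"
            C[i:i+TILE, j:j+TILE] = acc
    return C

A = np.arange(16.0).reshape(4, 4)
B = np.eye(4)
assert np.allclose(tiled_matmul(A, B), A @ B)
```

The reason tiling wins on a GPU is that each element of A and B is loaded from global memory once per tile instead of once per output element; being able to say that sentence is what the interview question is really testing.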

Phase 2: The Framework (PyTorch Internals)

Goal: Bridge the gap between your kernel and the Python data scientist. Timeframe: 4 Weeks

1. Concepts to Pinpoint

  • Strides: How tensors are stored in memory vs. how they look.
  • Autograd: How PyTorch builds the backward graph.
  • FFI (Foreign Function Interface): How Python talks to C++.
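A quick NumPy session makes the strides concept concrete (PyTorch tensors behave the same way, except their strides are counted in elements rather than bytes):

```python
import numpy as np

x = np.arange(12, dtype=np.int64).reshape(3, 4)
# Row-major layout: stepping one row = 4 elements * 8 bytes, one column = 8 bytes.
print(x.strides)        # (32, 8)

t = x.T
# Transpose is free: same underlying buffer, strides swapped -- no data copied.
print(t.strides)        # (8, 32)
print(np.shares_memory(x, t))  # True

# The transposed view is no longer contiguous, which is why kernels often
# require .contiguous() before a buffer-reinterpreting op like a flat reshape.
print(t.flags['C_CONTIGUOUS'])  # False
```

This "shape vs. memory" distinction is exactly what bites people when they hand a non-contiguous tensor to a custom kernel that assumes row-major layout.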

2. Toy Project 2: "Custom GELU Layer"
Implement the GELU activation function (used in GPT/Llama) from scratch.

  • Task: Write the CUDA kernel for the forward pass AND the backward pass (calculating gradients manually).
  • Integration: Use torch.utils.cpp_extension to load it.
  • Deliverable: A blog post titled "Writing a Custom PyTorch Extension from Scratch". Show the math derivation of the derivative, the CUDA code, and the Python binding.
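It helps to derive and verify the math in Python before porting anything to CUDA. Here is a sketch of exact GELU and its hand-derived gradient, checked against finite differences; this is the same correctness test you would later run against your backward kernel:

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    # Product rule on x * Phi(x):  d/dx = Phi(x) + x * phi(x),
    # where phi is the standard normal PDF.
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return Phi + x * phi

# Sanity-check the hand-derived gradient with central finite differences.
h = 1e-6
for x in (-2.0, -0.5, 0.0, 0.7, 3.0):
    numeric = (gelu(x + h) - gelu(x - h)) / (2 * h)
    assert abs(numeric - gelu_grad(x)) < 1e-5
```

The derivation (product rule plus the normal PDF/CDF identities) is exactly the math you would show in the blog post before presenting the CUDA code.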

Phase 3: The System (Inference & vLLM)

Goal: Understand how to serve these models at scale. This is where the jobs are. Timeframe: 6 Weeks

1. The "Must-Read" Paper

  • Read "Efficient Memory Management for Large Language Model Serving with PagedAttention" (the vLLM paper). It explains the KV-cache paging and batching ideas you will reimplement in Toy Project 3.

2. Toy Project 3: "Mini-Serve"
Don't just use vLLM; build a tiny, terrible version of it.

  • Task: Build a Python web server (FastAPI) that holds a dummy model.
  • Feature A: Implement Continuous Batching. Instead of waiting for Request A to finish, inject Request B into the batch immediately.
  • Feature B: Implement a simplified KV Cache manager.
  • Deliverable: A GitHub project mini-llm-server. The README should compare your "Continuous Batching" throughput vs. a naive FIFO queue.
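The throughput claim in the deliverable can be sanity-checked with a toy simulation first. The request lengths and batch size below are made up, and each "step" stands in for one decode iteration:

```python
def static_batching_steps(lengths, batch_size):
    # Naive FIFO: wait for the entire batch to finish before admitting more,
    # so every batch costs as many steps as its longest request.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    # Continuous batching: after every decode step, finished requests leave
    # and queued requests immediately take the freed slots.
    queue = list(lengths)
    active, steps = [], 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        active = [r - 1 for r in active if r > 1]
    return steps

lengths = [1, 8, 2, 8, 1, 8, 2, 8]   # decode steps each request needs
print(static_batching_steps(lengths, 4))      # 16
print(continuous_batching_steps(lengths, 4))  # 12
```

The gap widens as request lengths get more skewed, which is exactly the comparison your README should plot.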

Phase 4: Distributed Training (The Senior Differentiator)

Goal: Scale beyond one GPU. Timeframe: Ongoing

1. Concepts

  • DDP (Distributed Data Parallel): Replicate model, split data.
  • FSDP (Fully Sharded Data Parallel): Split model and data (vital for huge models).
  • Tensor Parallelism: Split a single matrix multiplication across GPUs.
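A NumPy sketch of the tensor-parallelism idea, with two simulated "devices" each holding a column shard of one weight matrix (the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 6))   # activations, replicated on every "device"
W = rng.standard_normal((6, 8))   # weight matrix to be split

# Column-parallel layout: each device keeps half of W's columns, runs its
# local matmul, and an all-gather concatenates the output shards.
W0, W1 = W[:, :4], W[:, 4:]
shard0 = X @ W0                   # computed on "device 0"
shard1 = X @ W1                   # computed on "device 1"
Y_parallel = np.concatenate([shard0, shard1], axis=1)  # "all-gather"

assert np.allclose(Y_parallel, X @ W)  # identical to the single-GPU result
```

Row-parallel sharding is the mirror image (split W's rows, all-reduce the partial sums); real implementations alternate the two so consecutive layers need only one communication each.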

2. Strategic Learning

  • Resource: Read the PyTorch Distributed Overview.
  • Exercise: Rent 2 cheap GPUs (e.g., on Lambda Labs or RunPod) for a weekend (~$5). Run a simple training script using torchrun (the successor to the deprecated torch.distributed.launch) and print the rank of each process.
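Before spending on GPUs, the gradient-averaging all-reduce that DDP performs can be simulated sequentially. The loss below is a made-up example chosen so the averaged gradient is easy to verify by hand:

```python
import numpy as np

# Simulate 4 DDP ranks: each computes a gradient on its own data shard,
# then an all-reduce averages the gradients so every replica takes the
# same optimizer step.
world_size = 4
rng = np.random.default_rng(0)
data = rng.standard_normal((world_size, 16))  # one data shard per rank
w = np.zeros(16)                              # replicated parameters

local_grads = []
for rank in range(world_size):
    shard = data[rank]
    grad = 2 * (w - shard)   # gradient of ||w - shard||^2 on this rank's shard
    local_grads.append(grad)

# all-reduce(SUM) followed by division by world_size == gradient averaging
avg_grad = sum(local_grads) / world_size

# Matches the full-batch gradient a single process would have computed.
full_grad = 2 * (w - data.mean(axis=0))
assert np.allclose(avg_grad, full_grad)
```

When you move to real hardware, `dist.all_reduce(grad)` over NCCL replaces the `sum(...)` line; the math is unchanged.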

Optimized Learning Schedule (The "955" Routine)

Since you are busy, context switching is your enemy.

  • Monday - Thursday (Nights): Read/Plan only. Don't code. Read the CUDA docs, read the vLLM source code, read the math. 30-45 mins max. This primes your brain.
  • Friday Night: Setup. Get your environment ready. Docker containers up, SSH connections ready.
  • Saturday (Deep Work): Build. 4-6 hour block. This is when you write the CUDA kernel or the Python server.
  • Sunday: Write. Document what you learned. If you don't blog it, you can't show it to a recruiter.

Resume & Interview Strategy

The "Hook" for Recruiters: On your resume, under a "Personal Projects" section (or "ML Infrastructure" section), list your projects explicitly:

Custom Tiled MatMul Kernel (C++, CUDA, PyTorch)
> Implemented a tiled matrix multiplication kernel with shared-memory optimization, achieving a 50x speedup over a naive implementation. Integrated into PyTorch via a C++ extension.

Common Interview Questions to Prep:

  1. System Design: "Design a system to serve Llama-3-70B to 10k concurrent users." (Key: vLLM, PagedAttention, autoscaling).
  2. Low Level: "How do you optimize a dot product on a GPU?" (Key: memory coalescing, warp divergence).
  3. Framework: "Explain how loss.backward() works internally."
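For question 2, interviewers usually expect the parallel tree-reduction pattern. Here it is sketched in plain Python (assuming a power-of-two input length), with comments mapping each step to its CUDA equivalent:

```python
def tree_reduce_dot(a, b):
    # Each "thread" computes one partial product, then pairs of values are
    # summed in log2(n) rounds -- the shared-memory block reduction pattern.
    partial = [x * y for x, y in zip(a, b)]  # one product per thread
    stride = len(partial) // 2
    while stride > 0:                   # log2(n) synchronized rounds
        for i in range(stride):         # only threads i < stride stay active
            partial[i] += partial[i + stride]
        stride //= 2                    # "__syncthreads()" between rounds
    return partial[0]

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
assert tree_reduce_dot(a, b) == sum(x * y for x, y in zip(a, b))  # 70.0
```

The follow-up talking points are exactly the keys listed above: coalesced loads when forming the partial products, and halving the stride (rather than doubling it) to avoid warp divergence and shared-memory bank conflicts.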

Next Step for You

Would you like me to generate the starter code structure for "Toy Project 1: The Matrix Multiplier," including the C++ setup and Python benchmark script, so you can start this weekend?

About

Learn CUDA programming: leetgpu.com
