I would like to self-learn to become a machine learning infrastructure software engineer, so I can get a mid-level or senior offer from one of the hot AI startups. I'm a busy frontend developer at Intuit working a 9-5-5 schedule, so I will only have time on weekends and weekday nights to learn. My learning has to be pinpointed and interview-oriented. I'd like you to be my teacher in machine learning and infrastructure (CUDA, vLLM, PyTorch, linear algebra, pre-training, post-training). Lay out a roadmap of items I should learn. I'd also like to build working code samples along the way and post them to my personal blog / toy projects on GitHub to validate my learning. Give me optimized learning strategies so I can digest the important concepts in the shortest amount of time and be able to explore each topic in depth on my own.
This is an ambitious but achievable transition. Since you are already a software engineer (and a busy one), the "standard" path of watching 40-hour lectures on Coursera will fail you. You will lose momentum before you write a single line of CUDA.
To get a mid/senior offer at a hot AI startup (like Anthropic, Databricks, or newer labs), you need to move away from "using" libraries to "building" them.
Here is your Pinpoint, Build-First Roadmap.
Don't learn "Linear Algebra" then "C++" then "CUDA" in isolation. Instead, pick one critical component of LLM infrastructure (e.g., the Attention Mechanism) and learn the entire stack for that specific thing:
- Math: How matrix multiplication and Softmax work.
- Kernel: How to write it in CUDA.
- Framework: How to wrap it in PyTorch.
- System: How to serve it in vLLM.
This keeps your learning interview-ready at every stage.
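To make the "Math" layer of this stack concrete, here is a minimal NumPy sketch of a numerically stable softmax and scaled dot-product attention (the function names are mine, and this ignores batching, masking, and multi-head logic):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max first: np.exp overflows for large logits otherwise.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ V

Q, K, V = (np.random.randn(4, 8) for _ in range(3))
out = attention(Q, K, V)
assert out.shape == (4, 8)
```

The max-subtraction trick is the kind of detail interviewers probe: softmax is shift-invariant, so subtracting the max changes nothing mathematically but keeps `exp()` in a safe numeric range.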
Goal: Understand how GPUs actually work and write your first kernel. Timeframe: 4-6 Weeks (Evenings/Weekends)
1. The Crash Course (Don't over-study)
You need C++, but only enough to write kernels. You don't need to master template metaprogramming yet.
- Learn: Pointers, memory management (malloc/free), and the difference between Host (CPU) and Device (GPU) memory.
- Resource: The NVIDIA CUDA C++ Programming Guide (read Chapters 1-3 only).
- Watch: The CUDA MODE (now GPU MODE) lecture series on YouTube. This is the gold-standard community for ML systems right now.
2. Toy Project 1: "The Matrix Multiplier"
Write a Python script that multiplies two matrices. Then, write a C++/CUDA extension to do it faster.
- Task A: Implement naive matrix multiplication as a CUDA `__global__` kernel.
- Task B: Optimize it using shared memory and tiling (this is a guaranteed interview question).
- Deliverable: A GitHub repo `cuda-matmul-f32` where you run `python benchmark.py` and see your custom kernel beat standard PyTorch (or come close).
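A plausible skeleton for `benchmark.py`, sketched here with a pure-Python triple loop as the baseline and NumPy's BLAS-backed matmul standing in for the custom CUDA extension (which doesn't exist yet):

```python
import time
import numpy as np

def naive_matmul(A, B):
    # Triple loop -- the slow baseline your CUDA kernel should crush.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

def bench(fn, A, B, reps=3):
    # Average wall-clock time over a few repetitions.
    t0 = time.perf_counter()
    for _ in range(reps):
        C = fn(A, B)
    return (time.perf_counter() - t0) / reps, C

A, B = np.random.randn(64, 64), np.random.randn(64, 64)
t_naive, C1 = bench(naive_matmul, A, B)
t_blas, C2 = bench(lambda a, b: a @ b, A, B)
assert np.allclose(C1, C2)  # correctness check before comparing speed
print(f"naive: {t_naive*1e3:.1f} ms, BLAS: {t_blas*1e3:.3f} ms")
```

Keeping the `allclose` correctness check next to the timing comparison is the habit that matters: a fast kernel that returns wrong numbers is worthless, and interviewers will ask how you validated yours.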
Goal: Bridge the gap between your kernel and the Python data scientist. Timeframe: 4 Weeks
1. Concepts to Pinpoint
- Strides: How tensors are stored in memory vs. how they look.
- Autograd: How PyTorch builds the backward graph.
- FFI (Foreign Function Interface): How Python talks to C++.
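The stride concept is easiest to see in NumPy, which uses the same bytes-based layout model as PyTorch tensors; a quick sketch:

```python
import numpy as np

x = np.arange(12, dtype=np.float32).reshape(3, 4)
# Strides are in BYTES: advancing one row skips 4 floats (16 bytes),
# advancing one column skips 1 float (4 bytes).
assert x.strides == (16, 4)

# A transpose is "free": same underlying buffer, swapped strides, zero copy.
assert x.T.strides == (4, 16)
assert np.shares_memory(x, x.T)

# Flattening the transposed (non-contiguous) view forces a copy -- the same
# reason PyTorch makes you call .contiguous() before .view().
assert not np.shares_memory(x, np.ravel(x.T))
```

Once this clicks, PyTorch behaviors like "view size is not compatible with input tensor's size and stride" stop being mysterious errors and start being predictable consequences of memory layout.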
2. Toy Project 2: "Custom GELU Layer"
Implement the GELU activation function (used in GPT/Llama) from scratch.
- Task: Write the CUDA kernel for the forward pass AND the backward pass (calculating gradients manually).
- Integration: Use `torch.utils.cpp_extension` to load it.
- Deliverable: A blog post titled "Writing a Custom PyTorch Extension from Scratch". Show the math derivation of the derivative, the CUDA code, and the Python binding.
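As a warm-up before the CUDA version, here is the forward/backward math in NumPy, using the tanh approximation of GELU (the variant GPT-2-style models use). Gradient-checking against finite differences like this is exactly how you validate a hand-written backward pass:

```python
import numpy as np

C = np.sqrt(2.0 / np.pi)

def gelu(x):
    # tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 x^3)))
    return 0.5 * x * (1.0 + np.tanh(C * (x + 0.044715 * x**3)))

def gelu_grad(x):
    # d/dx of the expression above, via the product and chain rules.
    inner = C * (x + 0.044715 * x**3)
    t = np.tanh(inner)
    d_inner = C * (1.0 + 3 * 0.044715 * x**2)
    return 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t**2) * d_inner

# Gradient-check against central finite differences before porting to CUDA.
x = np.linspace(-3, 3, 7)
eps = 1e-5
numeric = (gelu(x + eps) - gelu(x - eps)) / (2 * eps)
assert np.allclose(gelu_grad(x), numeric, atol=1e-6)
```

Derive `gelu_grad` by hand for the blog post; the finite-difference check then catches sign errors and dropped chain-rule factors instantly, which is far cheaper than debugging them inside a CUDA kernel.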
Goal: Understand how to serve these models at scale. This is where the jobs are. Timeframe: 6 Weeks
1. The "Must-Read" Paper
- Paper: Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM).
- Why: This paper reinvented LLM serving. You must understand KV Cache, PagedAttention, and Continuous Batching.
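Before reading the vLLM source, it helps to see the core allocator idea in ~20 lines. This toy sketch (class name and block size are mine, and real blocks also hold the actual K/V tensors) shows why paging wins: memory is grabbed one small block at a time as a sequence grows, instead of reserving `max_seq_len` worth of cache up front:

```python
BLOCK = 4  # tokens per block (vLLM uses a similar small fixed size, e.g. 16)

class PagedKVCache:
    """Toy PagedAttention-style allocator: each sequence maps to a list of
    fixed-size blocks, allocated on demand as tokens are generated."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}               # seq_id -> [block ids]
        self.lengths = {}                    # seq_id -> tokens written

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK == 0:  # current block is full (or this is the first token)
            self.block_tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_seq(self, seq_id):
        # Return all blocks to the pool when the request finishes.
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(5):
    cache.append_token("req-A")  # 5 tokens -> 2 blocks of 4
print(cache.block_tables["req-A"])
```

The block table is the "page table" of the paper's OS analogy: logically contiguous token positions map to physically scattered blocks, so fragmentation stops wasting GPU memory.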
2. Toy Project 3: "Mini-Serve"
Don't just use vLLM; build a tiny, terrible version of it.
- Task: Build a Python web server (FastAPI) that holds a dummy model.
- Feature A: Implement Continuous Batching. Instead of waiting for Request A to finish, inject Request B into the batch immediately.
- Feature B: Implement a simplified KV Cache manager.
- Deliverable: A GitHub project `mini-llm-server`. The README should compare your continuous-batching throughput vs. a naive FIFO queue.
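The scheduling idea behind Feature A can be simulated without any model or HTTP layer at all; this sketch (names and step model are mine, with each request needing a fixed number of decode steps) shows the injection logic you would then wrap in FastAPI:

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Toy scheduler: new requests join the running batch the moment a slot
    frees up, instead of waiting for the whole batch to drain."""
    queue = deque(requests)          # pending (name, tokens_remaining) pairs
    running, steps, finished = [], 0, []
    while queue or running:
        while queue and len(running) < max_batch:   # inject immediately
            running.append(list(queue.popleft()))
        steps += 1                                  # one fused decode step
        for r in running:
            r[1] -= 1
        for r in [r for r in running if r[1] == 0]:
            finished.append((r[0], steps))          # record completion step
            running.remove(r)
    return steps, finished

steps, finished = continuous_batching([("A", 5), ("B", 2), ("C", 2)])
print(steps, finished)
```

On this workload the continuous scheduler finishes in 5 steps, while naive static batching (drain [A, B] fully, then run C) takes 7; that gap is the throughput number your README should report.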
Goal: Scale beyond one GPU. Timeframe: Ongoing
1. Concepts
- DDP (Distributed Data Parallel): Replicate model, split data.
- FSDP (Fully Sharded Data Parallel): Split model and data (vital for huge models).
- Tensor Parallelism: Split a single matrix multiplication across GPUs.
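The column-parallel case can be verified in a few lines of NumPy (the "ranks" here are just list entries; a real implementation shards across devices and replaces the concatenation with an all-gather collective):

```python
import numpy as np

def column_parallel_matmul(x, W, world_size=2):
    """Toy tensor parallelism: split W column-wise across fake ranks,
    compute each shard independently, then all-gather (concatenate)."""
    shards = np.split(W, world_size, axis=1)    # each rank holds W[:, slice]
    partials = [x @ shard for shard in shards]  # independent local matmuls
    return np.concatenate(partials, axis=1)     # all-gather along columns

x = np.random.randn(4, 8)
W = np.random.randn(8, 6)
assert np.allclose(column_parallel_matmul(x, W), x @ W)
```

Column splits need a gather of outputs; the complementary row split produces partial sums that need an all-reduce instead. Knowing which collective each split requires is a standard tensor-parallelism interview question.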
2. Strategic Learning
- Resource: Read the PyTorch Distributed Overview.
- Exercise: Rent 2 cheap GPUs (e.g., on Lambda Labs or RunPod) for a weekend (~$5 cost). Run a simple training script using `torchrun` (the successor to `torch.distributed.launch`) and print the rank of each process.
Since you are busy, context switching is your enemy.
- Monday - Thursday (Nights): Read/Plan only. Don't code. Read the CUDA docs, read the vLLM source code, read the math. 30-45 mins max. This primes your brain.
- Friday Night: Setup. Get your environment ready. Docker containers up, SSH connections ready.
- Saturday (Deep Work): Build. 4-6 hour block. This is when you write the CUDA kernel or the Python server.
- Sunday: Write. Document what you learned. If you don't blog it, you can't show it to a recruiter.
The "Hook" for Recruiters: On your resume, under a "Personal Projects" section (or "ML Infrastructure" section), list your projects explicitly:
Custom Tiled MatMul CUDA Kernel (C++, CUDA, PyTorch) > Implemented a tiled matrix multiplication kernel with shared-memory optimization, achieving a ~50x speedup over a naive implementation. Integrated into PyTorch via a C++ extension.
Common Interview Questions to Prep:
- System Design: "Design a system to serve Llama-3-70B to 10k concurrent users." (Key: vLLM, paged attention, autoscaling).
- Low Level: "How do you optimize a dot product on a GPU?" (Key: memory coalescing, warp divergence).
- Framework: "Explain how `loss.backward()` works internally."
Would you like me to generate the starter code structure for "Toy Project 1: The Matrix Multiplier," including the C++ setup and Python benchmark script, so you can start this weekend?