# Workflow Design 1: Grok-1 Inference and Sampling

## Overview

The "Grok-1 Inference and Sampling" workflow provides the machinery to load the Grok-1 model's 314 billion parameters from a checkpoint, initialize the decoder-only transformer architecture with Mixture-of-Experts (MoE) layers and Grouped Query Attention (GQA), set up distributed sharding across GPUs using JAX meshes and PJIT, tokenize prompts with SentencePiece, and generate text autoregressively. Sampling combines temperature-controlled softmax, nucleus (top-p) filtering for diversity control, and top-k logging. The design emphasizes correctness for validation: a generator handles batched multi-request serving by managing KV caches per request slot, padding variable-length prompts, and running efficient single-token decode steps after prefill.

Key inputs: checkpoint in `./checkpoints/ckpt-0/`, `tokenizer.model`, a GPU cluster, and prompts as `Request` objects (prompt str, temperature float, nucleus_p float, rng_seed int, max_len int).
Outputs: generated text strings.
Entry points: `run.py` for a test run, or the `InferenceRunner().run()` generator for streaming requests.
Relevant files: `run.py`, `runners.py`, `model.py`, `checkpoint.py`, `tokenizer.model`.

The workflow orchestrates model loading, compilation of sharded compute functions, prompt processing (prefilling the KV cache while sampling the first token), and iterative single-token generation using cached attention keys/values until the requested maximum length is reached.

## Components

### run.py
- Defines Grok-1 hyperparameters via `LanguageModelConfig` and `TransformerConfig` (e.g., vocab_size=131072, sequence_len=8192, emb_size=6144, num_layers=64, num_q_heads=48, num_kv_heads=8, num_experts=8, num_selected_experts=2, widening_factor=8, key_size=128, shard_activations=True).
- Instantiates `InferenceRunner` with a `ModelRunner`, checkpoint_path="./checkpoints/", tokenizer_path="./tokenizer.model", and mesh configs (local=(1,8) for 8 GPUs, between_hosts=(1,1)).
- Calls `initialize()` and the `run()` generator, then demonstrates sampling via `sample_from_model(gen, prompt, max_len=100, temperature=0.01)` on an example prompt.
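
A minimal usage sketch of this entry point (hedged: `grok_1_model` stands in for the `LanguageModelConfig` defined earlier in `run.py`, and values such as `pad_sizes` and `bs_per_device` are illustrative rather than authoritative):

```python
from runners import InferenceRunner, ModelRunner, sample_from_model

CKPT_PATH = "./checkpoints/"

inference_runner = InferenceRunner(
    pad_sizes=(1024,),                       # prompt-length bucket(s) used for precompilation
    runner=ModelRunner(model=grok_1_model,   # the LanguageModelConfig defined above in run.py
                       bs_per_device=0.125,
                       checkpoint_path=CKPT_PATH),
    name="local",
    load=CKPT_PATH,
    tokenizer_path="./tokenizer.model",
    local_mesh_config=(1, 8),                # 1 data replica x 8-way model parallelism
    between_hosts_config=(1, 1),
)
inference_runner.initialize()
gen = inference_runner.run()                 # generator that owns batch slots and KV memory

prompt = "The answer to life the universe and everything is of course"
print(sample_from_model(gen, prompt, max_len=100, temperature=0.01))
```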

### runners.py
- **ModelRunner**: Configures model dtype (bfloat16), computes batch sizes from bs_per_device * devices * replicas, creates the hybrid JAX mesh, defines/transforms Haiku forward functions for the full pass and a logits-only pass, applies partition rules for sharding, and loads or initializes params via checkpoint restore. Supports quantization and activation sharding.
- **InferenceRunner**: Loads the tokenizer, computes param sharding from shapes, and compiles PJIT functions:
  - `new_memory`: Initialize the KV cache for a given batch/seq_len.
  - `prefill_memory`: For a request slot, encode the prompt, pad it, run the full prompt forward (with fresh memory and length tracking), sample the first generated token, and update the global batch memory/settings/rngs/last_output.
  - `sample_step`: For all active slots, run the forward pass on the last token using the shared memory, sample the next token, and update the memory (donated for efficiency).
- **InferenceRunner.run()** generator: Precompiles with dummy prompts for each pad bucket, manages a fixed set of batch slots (some free), yields to receive requests, fills slots via prefill, loops stepping all active slots, appends tokens per slot on the host, yields the decoded text when a slot finishes, and then deactivates that slot. Concurrency is handled via a free_slots list; a simplified sketch of this loop follows below.
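
A simplified, hedged sketch of that slot-management loop. Helper names such as `_initial_state` and the `token_id` field are hypothetical stand-ins; the real generator also precompiles pad buckets and interleaves request intake with stepping:

```python
# Simplified sketch of the run() slot loop; not the exact source.
def run(self):
    free_slots = list(range(self.batch_size))
    active = {}                                    # slot -> (generated tokens, request)
    params, rngs, memory, settings, last_output = self._initial_state()

    while True:
        while free_slots:                          # accept new requests while capacity remains
            request = yield                        # caller .send()s a Request
            slot = free_slots.pop()
            tokens = self.tokenizer.encode(request.prompt)
            rngs, last_output, memory, settings = self.prefill_memory(
                tokens, len(tokens), request, slot, rngs, memory, settings, last_output)
            active[slot] = ([], request)

        # One batched decode step advances every occupied slot at once.
        rngs, last_output, memory = self.sample_step(params, rngs, last_output, memory, settings)
        for slot in list(active):
            out, request = active[slot]
            out.append(int(last_output.token_id[slot]))   # copy the new token to the host
            if len(out) >= request.max_len:
                yield self.tokenizer.decode(out)          # finished: hand the text back
                free_slots.append(slot)
                del active[slot]
```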

### model.py
- Architecture: Embeddings → 64 Transformer layers (RMSNorm → GQA MultiHeadAttention with RoPE & KV cache → residual → RMSNorm → MoELayer → residual) → output linear to logits.
- **MultiHeadAttention**: GQA (48 query / 8 KV heads, head_dim=128), supports caching via `Memory` (a list of per-layer `KVMemory` with k, v, step).
- **MoELayer**: A router selects the top-2 of 8 experts per token; each expert is a SwiGLU FFN. Dispatch uses shard_map/vmap (validation-focused, not optimized).
- Other: `RotaryEmbedding`, a custom `Linear` with quantization support, and `apply_rules` for sharding specs (P('data'), P('model'), etc.).
- The forward callable built via `make(mesh)` integrates sharding and returns `LanguageModelOutput` (logits, model_state=Memory).
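
Because the 48-query/8-KV-head grouping is the defining feature of the attention module, here is a small, self-contained JAX illustration of the head-sharing idea (toy shapes and random inputs; not the `model.py` implementation, which also applies RoPE and the KV cache):

```python
import jax
import jax.numpy as jnp

# 48 query heads attend over keys/values produced by only 8 KV heads.
B, T, NQ, NKV, D = 1, 16, 48, 8, 128
kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(kq, (B, T, NQ, D))
k = jax.random.normal(kk, (B, T, NKV, D))
v = jax.random.normal(kv, (B, T, NKV, D))

# Each group of 48 // 8 = 6 query heads shares one KV head.
k = jnp.repeat(k, NQ // NKV, axis=2)           # [B, T, 48, D]
v = jnp.repeat(v, NQ // NKV, axis=2)

logits = jnp.einsum("bthd,bshd->bhts", q, k) / jnp.sqrt(D)
causal = jnp.tril(jnp.ones((T, T), dtype=bool))
logits = jnp.where(causal[None, None], logits, -1e30)  # causal mask
attn = jax.nn.softmax(logits, axis=-1)
out = jnp.einsum("bhts,bshd->bthd", attn, v)   # [B, T, 48, D]
print(out.shape)
```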

### checkpoint.py
- `restore()`: Computes shapes, loads the pickled sharded checkpoint files (handling `QuantizedWeight8bit`), copies them to shared memory (/dev/shm) for fast access, syncs across hosts via broadcast, and shards the tensors into JAX arrays matching the specified sharding/mesh. Supports params_only, an init_state fallback, and rename/exclude rules.
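
A rough, hedged illustration of the final placement step only, assuming a single addressable mesh (hypothetical `place_tensor` helper; the real `restore()` additionally handles quantized weights, shared-memory staging, and multi-host synchronization):

```python
import pickle

import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

def place_tensor(path: str, mesh: Mesh, spec: P) -> jax.Array:
    """Load one host-side tensor from a pickled shard and place it on the mesh."""
    with open(path, "rb") as f:
        host_array = np.asarray(pickle.load(f))
    # device_put with a NamedSharding splits the array across devices per the spec.
    return jax.device_put(host_array, NamedSharding(mesh, spec))
```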

### tokenizer.model & Others
- SentencePiece for subword tokenization (pad_token=0, eos_token=2).
- Dependencies: JAX (distributed arrays, pjit, shard_map), Haiku (modules/transform), NumPy/jax.numpy, sentencepiece.
- checkpoints/: Directory for the downloaded weights (torrent or Hugging Face).
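
Basic SentencePiece usage as applied in the workflow, encoding a prompt and decoding generated ids (the prompt string is illustrative):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="./tokenizer.model")
tokens = sp.encode("The answer to life the universe and everything is")  # prompt -> ids
text = sp.decode(tokens)                                                 # ids -> text
print(tokens[:8], text)
```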

## Initialization Sequence

```mermaid
sequenceDiagram
    participant User
    participant RunPy as run.py
    participant IR as InferenceRunner
    participant MR as ModelRunner
    participant Model as model.py
    participant Checkpoint as checkpoint.py
    participant JAX as JAX Runtime
    User->>RunPy: Execute main()
    RunPy->>IR: Create with config, MR, paths, meshes
    IR->>MR: initialize(dummy_data, meshes)
    MR->>Model: model.initialize(), fprop_dtype=bf16
    Note over MR,JAX: Calculate batch sizes, create mesh (data, model axes)
    MR->>MR: hk.transform forward/logits_fn with pjit sharding
    MR->>Checkpoint: load_or_init -> restore(shapes, mesh, sharding)
    Checkpoint->>MR: Sharded params (TrainingState)
    IR->>IR: Load tokenizer, compile pjit funcs (sample_step, prefill_memory, new_memory) with shardings
    IR->>IR: Precompile with dummy prompts for pad_sizes
    RunPy->>IR: gen = run() // generator setup with initial memory, settings, etc.
```

## Inference and Sampling Sequence

```mermaid
sequenceDiagram
    participant Gen as Generator (run())
    participant Req as Request
    participant Tok as Tokenizer
    participant Prefill as prefill_memory
    participant Step as sample_step
    participant LM as LM forward
    participant Samp as sample_token
    participant Mem as KV Memory
    participant Out as Output

    Note over Gen: Initial setup: memory, rngs, settings, last_output

    Gen->>Req: yield (wait for input)
    Req->>Gen: send Request(prompt, temp, p, seed, max_len)
    Gen->>Tok: encode(prompt) -> tokens
    Gen->>Gen: pad tokens, create settings, active=1
    Gen->>Prefill: call prefill_memory(tokens, len, new_settings, slot)
    Prefill->>LM: hk_forward(tokens, new_mem, length, active) // process full prompt
    LM->>Samp: sample_token from logits // first generated token
    Prefill->>Mem: update KV cache with prompt tokens
    Prefill->>Gen: updated rngs, last_output, memory, settings
    loop Autoregressive Sampling (while active and < max_len)
        Gen->>Step: sample_step(params, rngs, last_output, memory, settings)
        Step->>LM: hk_forward(last_token, memory) // decode step
        LM->>Samp: sample_token(logits, settings)
        Step->>Mem: update memory with new KV (donate old)
        Step->>Gen: new rngs, sample_output, memory
        Gen->>Gen: append token to sequence, copy to host
        alt Reached max_len
            Gen->>Out: decode all tokens -> yield text
            Gen->>Gen: deactivate slot, free for new request
        end
    end
```
| 106 | + |
| 107 | +## Sharding and Distributed Execution |
| 108 | + |
| 109 | +- **Mesh Configuration**: `make_mesh(local=(data_replicas, model_par), between_hosts=(data_hosts, model_hosts))` creates hybrid mesh for SPMD parallelism. E.g., local 1x8 shards model across 8 GPUs. |
| 110 | +- **Sharding Specs**: Model params sharded per rules (e.g., embedding over model, attention QKV over data/model/head). Activations optionally sharded. KV Memory sharded over data axis. |
| 111 | +- **PJIT & Compilation**: Functions wrapped in `hk.transform` then `pjit` with explicit in/out shardings, static args, donation for memory efficiency. Precompilation with dummies reduces first-run latency. |
| 112 | +- **Multi-Host**: Checkpoint loading syncs via `multihost_utils`, assumes launched with `jax process_count()` matching topology. |
| 113 | +- **Memory Optimizations**: bfloat16 compute, 8-bit weight quantization (dequant on fly), KV cache management, activation checkpointing/sharding, padding truncation. |
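
A minimal, self-contained illustration of the sharding primitives involved, assuming 8 local devices; the axis names follow the description above, but the sharded matmul is a toy, not the model forward pass:

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 1x8 ("data", "model") mesh, mirroring local_mesh_config=(1, 8).
devices = np.array(jax.devices()).reshape(1, 8)
mesh = Mesh(devices, axis_names=("data", "model"))

x_sharding = NamedSharding(mesh, P("data", None))    # activations split over the data axis
w_sharding = NamedSharding(mesh, P(None, "model"))   # weight columns split over the model axis

@jax.jit
def matmul(x, w):
    return x @ w

x = jax.device_put(np.ones((8, 6144), np.float32), x_sharding)
w = jax.device_put(np.ones((6144, 1024), np.float32), w_sharding)
y = matmul(x, w)                                     # output sharding inferred by the compiler
print(y.sharding)
```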
| 114 | + |
| 115 | +## Sampling Mechanism |
| 116 | + |
| 117 | +- **sample_token**: Scales logits by 1/temp, applies mask (-inf to disallowed), nucleus filter (sort probs, threshold at cumsum >=1-p, mask others to -inf), categorical sample from softmax. Returns token, prob, top-k tokens/probs. |
| 118 | +- **top_p_filter**: Sorts logits descending, soft max to probs, finds minimal set summing to p mass. |
| 119 | +- **Batch Integration**: Settings (temp, p, mask, active) broadcasted/vmap'ed across batch. Active flag skips inactive computations by resetting cache steps. |
| 120 | +- **RNG**: Per-slot PRNG keys split and updated each step. |
| 121 | +- **Defaults**: nucleus_p=1.0 (full dist), temp=0.01 for low randomness in tests, top_k=8 for auxiliary. |
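
A hedged sketch of that path (temperature scaling, nucleus filtering, categorical draw). It mirrors the behaviour described for `sample_token`/`top_p_filter` rather than reproducing the source, and omits the mask and top-k bookkeeping:

```python
import jax
import jax.numpy as jnp

def top_p_filter(logits: jax.Array, top_p: float) -> jax.Array:
    """Keep only the smallest set of top logits whose softmax mass reaches top_p."""
    sorted_logits = jax.lax.sort(logits)                      # ascending sort
    sorted_probs = jax.nn.softmax(sorted_logits)
    # Index of the smallest logit that still belongs to the nucleus.
    threshold_idx = jnp.argmax(jnp.cumsum(sorted_probs, axis=-1) >= 1 - top_p, axis=-1)
    threshold_logit = jnp.take_along_axis(sorted_logits, threshold_idx[..., None], axis=-1)
    return jnp.where(logits >= threshold_logit, logits, -1e10)

def sample_token(rng: jax.Array, logits: jax.Array, temperature: float, nucleus_p: float) -> jax.Array:
    logits = logits / temperature                             # temperature-controlled softmax
    logits = top_p_filter(logits, nucleus_p)
    return jax.random.categorical(rng, logits)                # draw from the filtered softmax

rng = jax.random.PRNGKey(0)
token = sample_token(rng, jnp.array([1.0, 2.0, 0.5, 3.0]), temperature=0.8, nucleus_p=0.9)
print(int(token))
```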
| 122 | + |
| 123 | +## Other Design Aspects |
| 124 | + |
| 125 | +- **KV Caching**: `Memory` dataclass with layers of `KVMemory(k: [batch,heads,seqlen,head_dim], v:..., step:scalar)`. Updated via dynamic_update and pad_to_max_len. Supports variable lengths per slot via length param in forward. |
| 126 | +- **Batching Strategy**: Fixed global batch_size, slots filled on-demand. Pad buckets (e.g., 1024) group similar lengths? Code uses bisect for bucket but pads to bucket size in prefill. |
| 127 | +- **Error/Edge Cases**: Assumes sufficient memory/GPUs; handles long contexts by left-truncation/padding. No built-in EOS handling (relies on max_len or app logic). Quantized weights require custom unpickling. |
| 128 | +- **Performance Notes**: MoE router/experts use JAX vmap/shard_map (serial per-token, inefficient for prod). Focus on correctness/single-host validation. |
| 129 | +- **Extensibility**: Modular Haiku design allows custom configs/modules. Generator interface suits serving multiple prompts concurrently. |
| 130 | +- **Dependencies & Setup**: `requirements.txt` (jax[cuda12_pip], haiku, etc.). Download ckpt via torrent/HF, place in checkpoints/. |
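
The sketch referenced in the KV Caching bullet above: a minimal, self-contained example of appending one decode step's keys/values into a fixed-size cache, using the layout named there (hypothetical `KVMemory`/`append_kv` names, not the `model.py` code):

```python
from typing import NamedTuple

import jax
import jax.numpy as jnp

class KVMemory(NamedTuple):
    k: jax.Array      # [batch, num_kv_heads, seqlen, head_dim]
    v: jax.Array
    step: jax.Array   # [batch] number of tokens already written per sequence

def append_kv(mem: KVMemory, new_k: jax.Array, new_v: jax.Array) -> KVMemory:
    """Insert one new key/value per sequence at that sequence's current step."""
    def write_one(cache, new, step):
        # cache: [heads, seqlen, head_dim], new: [heads, 1, head_dim]
        return jax.lax.dynamic_update_slice_in_dim(cache, new, step, axis=1)
    k = jax.vmap(write_one)(mem.k, new_k[:, :, None, :], mem.step)
    v = jax.vmap(write_one)(mem.v, new_v[:, :, None, :], mem.step)
    return KVMemory(k=k, v=v, step=mem.step + 1)

# Example shapes: batch=2, kv_heads=8, max_seqlen=16, head_dim=4.
mem = KVMemory(k=jnp.zeros((2, 8, 16, 4)), v=jnp.zeros((2, 8, 16, 4)),
               step=jnp.zeros((2,), dtype=jnp.int32))
mem = append_kv(mem, jnp.ones((2, 8, 4)), jnp.ones((2, 8, 4)))
print(mem.step)   # [1 1]
```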

This document captures the high-level design, derived from code analysis.