This course provides a hands-on introduction to Mixture of Experts (MoE) models through four progressive notebooks. Each notebook builds on the previous ones, taking you from theoretical understanding to practical implementation of advanced MoE concepts.
This course is designed for industry ML engineers who wish to gain hands-on experience with Mixture of Experts (MoE) models. It is also well-suited for ML researchers interested in exploring and experimenting with cutting‑edge methods for model compression and performance optimization.
Learners should be comfortable with the following:
- Deep learning architectures — transformers, attention mechanisms, and feed-forward networks
- Training and optimization — gradient descent variants, regularization, and overfitting control
- Practical skills — Python programming experience and intermediate understanding of PyTorch
Those who want to build or refresh these fundamentals can first complete introductory and intermediate material such as the Intro to AI or Optimizing Generative AI on Arm courses.
- Theory to Practice — start with computational benefits, end with advanced optimization
- Building Blocks — each notebook provides necessary components for the next
- Progressive Complexity — simple concepts build to sophisticated implementations
- Python 3.8 or newer
- PyTorch
- Transformers
- A GPU backend to accelerate PyTorch training (e.g., a Google Colab T4 or A100) is strongly recommended
We recommend running these labs on Google Colab with T4 GPUs. All computations can be completed using the GPUs available in Colab.
You are expected to be familiar with configuring software across different operating systems, and you may need to install additional packages depending on your setup. Several commands in the labs reference tools specific to the Colab environment, so you may need to adapt them if you are working on another system, such as your own PC.
We also provide links to saved checkpoints for each training epoch, along with code examples for loading model weights from these checkpoints.
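For orientation, loading a saved checkpoint generally looks like the minimal sketch below; the file path and the `model_state_dict` key are assumptions, so match them to the checkpoint files linked in the labs.

```python
import torch

# Minimal sketch: path and dictionary key are illustrative assumptions
checkpoint = torch.load("checkpoints/epoch_3.pt", map_location="cpu")
state_dict = checkpoint.get("model_state_dict", checkpoint)  # also handles bare state_dicts
model.load_state_dict(state_dict)  # `model` is assumed to be an instance of the lab's model class
model.eval()
```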
Finally, there are options to speed up computation. The dataset created in Part Two is quite large and contributes significantly to training and evaluation time. Using a smaller subset of the dataset can greatly reduce this overhead and may also offer an interesting comparative analysis of model accuracy.
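As an illustration, a Hugging Face dataset can be subsampled in a couple of lines; the split name, subset size, and seed below are arbitrary choices, not values from the labs.

```python
from datasets import load_dataset

full_train = load_dataset("yelp_review_full", split="train")
# Shuffle before selecting so the subset roughly preserves the label distribution
small_train = full_train.shuffle(seed=42).select(range(10_000))
```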
The first notebook explores the computational advantages of MoE architectures compared to traditional dense models, with a focus on theoretical and practical efficiency metrics.
- Conditional Computation: MoE models only activate a subset of parameters for each input
- Expert Specialization: Different experts learn to handle different aspects of the input space
- Computational Efficiency:
  - Theoretical FLOP analysis based on model architecture
  - Impact of expert count and batch size on speedup
  - Memory efficiency through parameter sharing
```python
# Theoretical FLOP calculation
dense_flops = batch_size * seq_len * d_model * d_ff
moe_flops = batch_size * seq_len * (d_model * num_experts + d_ff)
speedup = dense_flops / moe_flops
```

- Architecture Impact:
  - Varying fully connected layer sizes
  - Different expert counts
  - Batch size effects on throughput
- Theoretical vs Actual Speedup:
  - FLOP count comparison
  - Memory bandwidth considerations
  - Practical implementation overheads
- Understanding the computational benefits of MoE
- Measuring inference speed and memory usage
- Analyzing the impact of architectural choices
- Comparing theoretical and practical speedups
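One way to compare the theoretical numbers above with measured behavior is to time real forward passes. The sketch below uses placeholder layer sizes rather than the notebook's configuration.

```python
import time
import torch
import torch.nn as nn

# Placeholder sizes for illustration only
d_model, d_ff, batch_size, seq_len = 128, 512, 32, 64
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
x = torch.randn(batch_size, seq_len, d_model)

with torch.no_grad():
    for _ in range(5):                 # warm-up iterations
        ffn(x)
    start = time.perf_counter()
    for _ in range(50):
        ffn(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()       # finish queued GPU work before reading the clock
    elapsed = time.perf_counter() - start

print(f"mean forward latency: {elapsed / 50 * 1e3:.2f} ms")
```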
- Data Sources:
  - Amazon product reviews
  - Yelp business reviews
  - IMDB movie reviews
- Data Collection: Gathering text from multiple domains
- Preprocessing: Tokenization and formatting
- Domain Labeling: Tracking data source for routing analysis
```python
from datasets import load_dataset  # Hugging Face datasets library

# Combining multiple datasets
amazon_data = load_dataset("amazon_reviews_multi")
yelp_data = load_dataset("yelp_review_full")
imdb_data = load_dataset("imdb")

# Creating unified dataset with domain labels
combined_data = {
    "text": [...],
    "label": [...],
    "domain": ["amazon", "yelp", "imdb", ...]
}
```

```python
# Dataset structure (fields available after preprocessing)
{
    "input_ids": tokenized_text,
    "attention_mask": padding_mask,
    "label": sentiment_class,
    "domain": data_source
}
```

- Creating domain-diverse datasets
- Implementing proper data preprocessing
- Understanding the importance of domain information
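The preprocessing step above can be sketched with a Hugging Face tokenizer; the checkpoint name and sequence length here are assumptions rather than the notebook's exact settings.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint

def preprocess(example, domain):
    # Produces the fields shown in the dataset structure above
    encoded = tokenizer(example["text"], truncation=True,
                        padding="max_length", max_length=128)
    encoded["label"] = example["label"]
    encoded["domain"] = domain  # keep the data source for later routing analysis
    return encoded
```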
We implement and train a transformer-based MoE model for sentiment analysis, focusing on the core architecture and training process.
- Transformer Architecture: Multi-head attention and feed-forward layers
- MoE Implementation: Expert routing and processing
- Training Pipeline: Optimization and evaluation
```python
import torch.nn as nn

class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=128, num_heads=4,
                 num_layers=2, num_experts=4):
        super().__init__()
        # Transformer with MoE layers
        # (TransformerBlock is defined elsewhere in the notebook)
        self.layers = nn.ModuleList([
            TransformerBlock(use_moe=True, num_experts=num_experts)
            for _ in range(num_layers)
        ])
```

- Implementing transformer-based MoE
- Training and evaluating MoE models
- Understanding model architecture choices
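For intuition, the MoE feed-forward layer inside each `TransformerBlock` can be thought of as a gate plus a set of expert FFNs with top-1 routing. The class below is a simplified sketch, not the course's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Top-1 routed mixture of expert FFNs (illustrative sketch)."""
    def __init__(self, d_model=128, d_ff=512, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                         # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.size(-1))        # (num_tokens, d_model)
        gate_probs = F.softmax(self.gate(tokens), dim=-1)
        expert_idx = gate_probs.argmax(dim=-1)    # top-1 expert per token
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():                        # run only experts that received tokens
                out[mask] = gate_probs[mask, i].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```

Scaling each expert's output by its gate probability keeps the router differentiable even though only one expert runs per token.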
In this notebook we move on to advanced MoE concepts, focusing on routing behavior and load balancing optimization.
- Gating Function Analysis
  - Comparing learned vs. random routing
  - Measuring routing effectiveness
- Expert Specialization
  - Computing expert similarity
  - Analyzing expert diversity
- Load Distribution
  - Tracking expert utilization
  - Identifying routing imbalances
- Load Balancing Implementation
  - Adding balancing loss
  - Fine-tuning for balanced routing
```python
import torch
import torch.nn as nn

class MoEFeedForwardBucketedWithLoadBalancing(nn.Module):
    def load_balancing_loss(self, gate_probs, expert_indices):
        # Fraction of tokens routed to each expert (minlength covers unused experts)
        expert_fraction = torch.bincount(expert_indices, minlength=gate_probs.size(-1)).float()
        expert_fraction = expert_fraction / expert_fraction.sum()
        mean_gate_probs = gate_probs.mean(dim=0)  # average gate probability per expert
        # Grows when an expert gets both many tokens and high gate probability
        return torch.sum(expert_fraction * mean_gate_probs)
```

- Understanding routing behavior
- Implementing load balancing
- Optimizing expert utilization
- Maintaining accuracy while improving balance
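The expert-specialization analysis above can be approximated by comparing expert weights directly: low pairwise cosine similarity suggests the experts have diverged. The sketch below assumes an `nn.ModuleList` of experts like the one in Part Three.

```python
import torch
import torch.nn.functional as F

def expert_similarity(experts):
    # Flatten each expert's parameters into one vector: (num_experts, num_params)
    flat = torch.stack([
        torch.cat([p.detach().flatten() for p in expert.parameters()])
        for expert in experts
    ])
    flat = F.normalize(flat, dim=-1)
    return flat @ flat.T  # off-diagonal entries near 0 indicate diverse experts
```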
In our previous attempts, we faced a fundamental challenge: high accuracy and balanced expert utilization seemed to be mutually exclusive goals. Traditional approaches that tried to optimize both objectives simultaneously often led to suboptimal results, with the model either achieving good accuracy at the cost of imbalanced expert utilization, or vice versa.
Drawing inspiration from the Expectation-Maximization (EM) algorithm, we developed a novel alternating optimization strategy that separates the training into two distinct phases:
- Expert Training Phase (E-step):
  - Freeze the routing mechanism (gate)
  - Allow experts to specialize in their assigned tokens
  - Focus solely on maximizing classification accuracy
  - This phase enables experts to develop deep, specialized knowledge
- Gate Training Phase (M-step):
  - Freeze the expert parameters
  - Train the routing mechanism to distribute tokens more evenly
  - Incorporate a sophisticated load balancing loss
  - This phase ensures fair utilization of all experts
- Alternating Training:
  - Experts and gate are trained in alternating epochs
  - Each component gets dedicated time to optimize its specific objective
  - Prevents interference between accuracy and balancing goals
- Enhanced Load Balancing:
  - KL divergence loss for uniform distribution
  - Stricter capacity constraints (0.8-1.2 of uniform)
  - Entropy regularization for gate probabilities
  - Multiple balancing terms work together synergistically
- Sophisticated Loss Function:
  - Classification loss for accuracy
  - Load balancing loss only during gate training
  - Capacity constraints to prevent over/under utilization
  - Entropy regularization to encourage uniform routing
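A minimal sketch of this alternating schedule and the gate-phase loss is shown below; the attribute names (`model.gates`, `model.experts`) and the loss weights are assumptions, not the course's exact code.

```python
import torch
import torch.nn.functional as F

def set_phase(model, phase):
    train_gate = (phase == "gate")
    for p in model.gates.parameters():
        p.requires_grad = train_gate           # M-step: only the router is updated
    for p in model.experts.parameters():
        p.requires_grad = not train_gate       # E-step: only the experts are updated

def gate_phase_loss(logits, labels, gate_probs, kl_weight=0.1, ent_weight=0.01):
    ce = F.cross_entropy(logits, labels)
    mean_probs = gate_probs.mean(dim=0).clamp_min(1e-9)      # average routing distribution
    uniform = torch.full_like(mean_probs, 1.0 / mean_probs.numel())
    kl = (mean_probs * (mean_probs.log() - uniform.log())).sum()   # KL(mean routing || uniform)
    per_token = gate_probs.clamp_min(1e-9)
    entropy = -(per_token * per_token.log()).sum(dim=-1).mean()    # per-token routing entropy
    return ce + kl_weight * kl - ent_weight * entropy              # higher entropy => more uniform routing
```

In training, `set_phase` would be called at the start of each epoch, alternating between `"expert"` and `"gate"`, with the balancing terms applied only during gate epochs.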
The EM-inspired approach has shown remarkable success:
- Classification accuracy comparable to non-MoE models
- Even distribution of tokens across experts
- Stable training without the previous trade-offs
- Clear specialization of experts (as shown by low cosine similarities)
The implementation includes:
- Selective parameter freezing for alternating phases
- Comprehensive monitoring of both accuracy and balancing metrics
- Enhanced routing information tracking
- Sophisticated visualization tools for analysis
This approach represents a significant advancement in MoE training, demonstrating that it's possible to achieve both high accuracy and balanced expert utilization through careful architectural design and training strategy.
- Theory to Practice: Start with computational benefits, end with advanced optimization
- Building Blocks: Each notebook provides necessary components for the next
- Progressive Complexity: Simple concepts build to sophisticated implementations
- Install required packages
- Follow notebooks in sequence
- Experiment with different configurations
- Analyze results and modify approaches
This course provides both theoretical understanding and practical implementation skills for working with MoE models in modern deep learning applications.
The original content was produced by Professor Constantine Caramanis at the University of Texas at Austin, in collaboration with Kieran Hejmadi from Arm Education.
You are free to fork or clone this material. See LICENSE.md for the complete license.
Arm is committed to making the language we use inclusive, meaningful, and respectful. Our goal is to remove and replace non-inclusive language from our vocabulary to reflect our values and represent our global ecosystem.
Arm is working actively with our partners, standards bodies, and the wider ecosystem to adopt a consistent approach to the use of inclusive language and to eradicate and replace offensive terms. We recognise that this will take time. This course may contain references to non-inclusive language; it will be updated with newer terms as those terms are agreed and ratified with the wider community.
Contact us at [email protected] with questions or comments about this course. You can also report non-inclusive and offensive terminology usage in Arm content at [email protected].
