jonathanrenusch/StationHitClassifier

StationHitClassifier

🚀 About

This repository contains the results of a study of Graph Neural Networks (GNNs) for binary hit classification in the muon spectrometer of the ATLAS experiment at CERN (Geneva, Switzerland). The work was carried out as part of a dedicated work package within the Next Generation Trigger Project, an initiative involving multiple LHC experiments and funded by the Eric Schmidt foundation to explore innovative trigger solutions for future collider conditions. Our specific task is to speed up muon tracking within the ATLAS Event Filter so that it operates efficiently under the high pile-up and increased bunch crossing rates expected at the High Luminosity LHC, while maintaining excellent trigger efficiency and event reconstruction quality.

📖 Quick Guide:

This guide provides an overview of the steps involved in generating data, configuring training settings, and exporting trained models to the ONNX format.

Environment Setup
  • Conda Environment: To replicate the environment used for model training on the lxplus cluster, you can find the dependency list in requirements.txt.
  • Environment Creation Scripts: For automated environment creation (especially useful on CERN's HTCondor-managed lxplus cluster), use the provided scripts:
Data Generation
  • Dataset Class: Generate your training, validation, and testing datasets using the dedicated MuonSpDataset class. This class is designed for efficient data handling.
  • Data Generation Example: For a practical demonstration of how to utilize the MuonSpDataset class to create data splits and generate smaller data samples useful for debugging, please refer to the create_GNN_data.py script.
  • Normalization Statistics: Our models offer the flexibility to integrate normalization directly into the model inference process. To enable this, you must first compute normalization statistics on your training data before commencing training. A detailed example of how to generate these statistics can be found in the generate_norm_stats script.
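As an illustration of what normalization statistics typically contain, the following framework-free sketch computes a per-feature mean and standard deviation on training features and applies them. It is not the generate_norm_stats script: the function names, the dictionary layout, and the toy values are assumptions made for this example.

```python
import math


def compute_norm_stats(rows):
    """Return per-feature mean and standard deviation for a list of feature rows."""
    n_rows = len(rows)
    n_features = len(rows[0])
    means = [sum(row[j] for row in rows) / n_rows for j in range(n_features)]
    stds = []
    for j in range(n_features):
        variance = sum((row[j] - means[j]) ** 2 for row in rows) / n_rows
        # Fall back to 1.0 for constant features so the later division is safe.
        stds.append(math.sqrt(variance) or 1.0)
    return {"mean": means, "std": stds}


def normalize(row, stats):
    """Apply the stored statistics to a single feature row."""
    return [(x - m) / s for x, m, s in zip(row, stats["mean"], stats["std"])]


# Toy training features: two hits with two features each (made-up values).
train_rows = [[1.0, 10.0], [3.0, 10.0]]
stats = compute_norm_stats(train_rows)
print(stats)
print(normalize([3.0, 10.0], stats))
```

The statistics are computed on the training split only, so the validation and test data are scaled with the same constants the model saw during training.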
Configuration Files
  • Architecture-Specific Configurations: To train any of the available model architectures, you need to configure a corresponding YAML configuration file. We provide optimized hyperparameter settings for the three most relevant architectures in the configs directory. These pre-defined configurations include carefully tuned parameters such as batch size, optimizer, loss function, number of layers, and node counts, tailored for each specific model.
  • Configuration Parameters: When using a predefined configuration file, you have the option to specify the path to your pre-computed normalization_stats. If you intend to use the model's internal normalization, provide the file path; otherwise, set this variable to null. Additionally, you must define the following path variables:
    • experiment_dir: The path to the working directory where model checkpoints and other experiment-related files will be saved.
    • train_path: The path to your training data.
    • val_path: The path to your validation data.
    • test_path: The path to your test data.
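The path settings listed above can be checked before launching a run. The sketch below assumes the YAML file has already been loaded into a dictionary; check_config is a hypothetical helper, not part of the repository, and the example paths are placeholders.

```python
REQUIRED_PATH_KEYS = ("experiment_dir", "train_path", "val_path", "test_path")


def check_config(cfg):
    """Return a list of problems found in a config dictionary (empty if none)."""
    problems = [
        f"missing required key: {key}" for key in REQUIRED_PATH_KEYS if not cfg.get(key)
    ]
    # normalization_stats is optional: a path enables the model's internal
    # normalization, while None (YAML null) disables it.
    stats = cfg.get("normalization_stats")
    if stats is not None and not isinstance(stats, str):
        problems.append("normalization_stats must be a path string or null")
    return problems


example_cfg = {
    "experiment_dir": "path/to/experiments",
    "train_path": "path/to/train",
    "val_path": "path/to/val",
    "test_path": "path/to/test",
    "normalization_stats": None,
}
print(check_config(example_cfg))
```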
Train, Validate, and Test a Model
  • To train, validate, and test your model, navigate to the StationHitClassifier directory in your terminal and execute the following command:
    python -m Classifier.train_binary_GNN -ct path/to/your/config
    You can append the -d flag for a faster debugging run with a reduced number of batches and with logging and checkpointing disabled.
  • The training script incorporates early stopping and learning rate scheduling using the ReduceLROnPlateau scheduler by default. Most training parameters, including the choice of optimizer and scheduler, can be configured through the specified configuration file. Additional optimizers and schedulers can be implemented within the Lightning module.
  • Multi-GPU training is supported for all models. To enable it, modify the devices argument in your configuration file:
    • Set devices: -1 to utilize all available GPUs.
    • Specify a list of GPU indices (e.g., devices: [0, 1, 3]) to use specific GPUs.
  • Training progress, validation metrics, and test results are automatically logged to Comet ML if the COMET_API_KEY environment variable is set in your runtime environment and use_comet is set to True in your config file. If the environment variable is not set, a PyTorch Lightning CSVLogger instead logs the same plots and metrics locally in the experiment_dir.
  • The logged information includes real-time training and validation metrics, final testing metrics, the complete configuration file used for the run, and relevant performance plots specifically for binary classification tasks.
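The logger selection described above can be summarized as a small decision function. This is a sketch of the stated behavior only: select_logger and its string return values are hypothetical and are not taken from the training script.

```python
import os


def select_logger(cfg, env=None):
    """Return "comet" only if use_comet is True and COMET_API_KEY is set, else "csv"."""
    env = os.environ if env is None else env
    if cfg.get("use_comet", False) and env.get("COMET_API_KEY"):
        return "comet"
    # Fallback: local CSV logging inside experiment_dir.
    return "csv"


print(select_logger({"use_comet": True}, {"COMET_API_KEY": "dummy-key"}))
print(select_logger({"use_comet": True}, {}))
```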
HTCondor Jobs
  • Before submitting a job, ensure you correctly set the following paths in the ConfigRun.sh file:
    • source path/to/your/conda/bin/activate: Path to your Conda installation
    • conda activate path/to/your/conda_env/torch_env/: Path to your specific Conda environment
    • cd path/to/StationHitClassifier: Path to your cloned StationHitClassifier repository
  • ⚠️ Note: As of 2025, lxplus only supports submitting jobs from the /afs filesystem. Submissions from /eos are not supported.
  • By default, the submission script requests one A100 GPU for a duration of one week. You can change the GPU type to H100 or request multiple GPUs by modifying the line request_gpus = 1 in the submit_ConfigRun.sh file. For additional configuration options, refer to the CERN Batch System documentation.
  • To submit a job using a configuration file on lxplus, run:
    condor_submit submit_ConfigRun.sh -append ARGS="-c path/to/your/config/file.yaml"
Hyperparameter Optimization
  • For advanced hyperparameter tuning or to explore potential parameter settings, please refer to the examples provided in the optuna directory. These examples demonstrate how to use the Optuna framework for efficient hyperparameter optimization.
Export to ONNX and Quantization
  • To export a trained model to ONNX, and optionally apply quantization for improved CPU inference, set the checkpoint_path variable in the YAML config file used for training to the path of your Lightning checkpoint. Then run the following command from within the StationHitClassifier directory:
    python -m Classifier.utils.export_quantize_ONNX -ct path/to/your/config -q
    To skip quantization, simply omit the -q flag.
  • The export_quantize_ONNX.py script will:
    • Create an onnx_benchmarking directory next to your checkpoint file.
    • Run inference on the test dataset (as specified in your training config), both before and after quantization.
    • Save output results as PNG images and store both the full-precision and quantized ONNX model files.
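To make the output layout and the quantization step more concrete, here is a minimal sketch. It assumes dynamic weight quantization through ONNX Runtime's quantize_dynamic; whether export_quantize_ONNX.py uses this exact call is not stated here, and the checkpoint and model file names are hypothetical.

```python
from pathlib import Path


def benchmark_dir(checkpoint_path):
    """Directory placed next to the checkpoint file, as described above."""
    return Path(checkpoint_path).parent / "onnx_benchmarking"


def quantize_onnx(fp32_path, int8_path):
    """Dynamically quantize the weights of an exported ONNX model to 8-bit integers."""
    # Imported lazily so the path helper above works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(str(fp32_path), str(int8_path), weight_type=QuantType.QInt8)


out_dir = benchmark_dir("path/to/experiment/checkpoints/best.ckpt")  # hypothetical checkpoint
fp32_model = out_dir / "model.onnx"            # hypothetical file name
int8_model = out_dir / "model_quantized.onnx"  # hypothetical file name
print(out_dir)
```

Calling quantize_onnx(fp32_model, int8_model) after the full-precision export would then write the quantized model alongside the original one.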

🧠 Machine Learning Architectures for Muon Identification

This section outlines three distinct machine learning architectures evaluated for processing muon hits, prioritizing performance, model size, and compatibility with the ONNX C++ API available within the ATLAS software framework (Athena) for integration into the event filter and reconstruction.

EdgeConvolution Model: Balanced Performance and ONNX Compatibility
  • The EdgeConvolution Model, inspired by the Dynamic Graph CNN for Learning on Point Clouds paper, serves as a robust and versatile architecture. While it may not achieve the absolute best performance or the smallest size among the evaluated models, it offers a compelling balance between these factors and, crucially, exhibits compatibility with the ONNX framework and its quantization functionalities. This ONNX compatibility was a primary design consideration, given the intended deployment within the ATLAS trigger chain.
  • The native PyTorch Geometric implementation of EdgeConv relies on operations not yet supported by ONNX. To overcome this limitation, a custom implementation of the EdgeConv layers was developed and can be found in Layers_and_Blocks.py. While this custom implementation incurs an acceptable performance trade-off compared to the baseline, it enables the essential ONNX export and deployment.
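For readers unfamiliar with EdgeConv, the update rule from the Dynamic Graph CNN paper is: each node i receives a message h(x_i, x_j - x_i) from every neighbor j, and the messages are combined with a channel-wise maximum. The toy below shows that rule in plain Python. It is not the implementation in Layers_and_Blocks.py; the edge function is a stand-in for the learned MLP, and the features and neighbor lists are made up.

```python
def edge_conv(features, neighbors, edge_fn):
    """EdgeConv update: x_i' = max over neighbors j of edge_fn(x_i, x_j - x_i)."""
    updated = []
    for i, x_i in enumerate(features):
        messages = []
        for j in neighbors[i]:
            diff = [b - a for a, b in zip(x_i, features[j])]
            messages.append(edge_fn(x_i, diff))
        # Channel-wise max aggregation over all messages arriving at node i.
        updated.append([max(channel) for channel in zip(*messages)])
    return updated


def toy_edge_fn(x_i, diff):
    """Stand-in for the learned MLP: concatenation of x_i and (x_j - x_i)."""
    return x_i + diff


features = [[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]]
neighbors = [[1, 2], [0, 2], [0, 1]]  # fixed neighbor lists, e.g. from k-nearest neighbors
print(edge_conv(features, neighbors, toy_edge_fn))
```

With a fixed number of neighbors per node, the inner loop can be written in a tensor framework as an index gather followed by a max reduction, which is one plausible way to avoid scatter-style operations.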
EdgeConvolution Model Architecture
Stacked Edge Convolution Graph Attention Model: High Offline Performance and Speed
  • This model represents an idealized exploration focused on maximizing offline classification performance (AUC) and speed while maintaining a relatively small model size. It leverages a stacked architecture incorporating elements from both the Dynamic Graph CNN for Learning on Point Clouds and the Graph Attention Networks papers.
  • Unfortunately, the native PyTorch Geometric implementation of this architecture utilizes operations that currently prevent ONNX export, hindering its deployment within the existing ATLAS trigger reconstruction chain. Attempts were made to resolve this compatibility issue through the development of custom layers aimed at replicating the baseline behavior. However, these custom implementations have not yet achieved comparable performance, resulting in a performance reduction of approximately 10-15% compared to the original baseline.
Stacked Edge Convolution Graph Attention Model Architecture
Multi-Headed Stacked Edge Convolution Flash Attention Model: Optimized for AUC and Size
  • Following extensive investigation into message-passing-based architectures, exploration shifted towards pure Transformer-based models. While these architectures appear better suited to capturing global relationships than node-level graph representations in this specific application, this exploration yielded a valuable hybrid architecture.
  • This Multi-Headed Stacked Edge Convolution Flash Attention Model achieves comparable performance to the EdgeConv workhorse model while being significantly smaller (approximately six times smaller). This size reduction is achieved by employing custom-built multi-headed EdgeCNN layers. These layers enable the network to apply message passing selectively to specific sensitivity regions within the Fourier encoding of the input data. Furthermore, the attention mechanism closely mirrors the standard attention mechanisms found in Transformer models, allowing for the utilization of native PyTorch Flash Attention implementations, contributing to efficiency.
Multi-Headed Stacked Edge Convolution Flash Attention Model Architecture

🌈 Fourier Encoding

All architectures in this repository leverage a continuous "Fourier encoding" as a crucial second step after input data normalization. This encoding, while conceptually similar to positional encodings used in Large Language Models (LLMs), distinguishes itself by having no learnable parameters or constraints related to sequence length or training data specifics.

Instead, for each continuous input feature, the Fourier encoding maps it into a higher-dimensional space using sine and cosine functions with exponentially increasing divisors. The underlying intuition is that this transformation enhances the network's sensitivity to minute variations within the continuous input features. Furthermore, this encoding scheme facilitates the use of larger latent spaces in conjunction with residual connections without sacrificing information.
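The mapping described above can be sketched for a single scalar feature as follows. The exact functional form, the scale list, and the base used in the repository are not reproduced here; fourier_encode, scales = [0, 1, 2, 3], and base = 2 are assumptions chosen for illustration.

```python
import math


def fourier_encode(x, scales, base):
    """Map a scalar to [sin(x / base**s), cos(x / base**s)] for every scale s."""
    encoded = []
    for s in scales:
        divisor = base ** s  # divisors grow exponentially with the scale
        encoded.append(math.sin(x / divisor))
        encoded.append(math.cos(x / divisor))
    return encoded


scales = [0, 1, 2, 3]  # hypothetical choice
base = 2               # hypothetical choice
encoded = fourier_encode(0.5, scales, base)
print(len(encoded))
```

Each input feature expands to two values per scale, so a feature vector of length F becomes one of length 2 * F * len(scales), with no parameters to learn.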

Extensive experiments comparing this Fourier encoding against alternatives, including learnable linear embeddings, encodings closely resembling positional encodings with learnable parameters, and hybrid approaches, consistently showed the simple Fourier encoding presented here to perform best.

Applying this encoding consistently yielded a gain of approximately 2.5% in AUC, particularly for architectures employing residual connections.

The Fourier encoding can be mathematically expressed as:
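A form consistent with the description above (sine and cosine terms whose divisors grow exponentially), given as an assumption rather than the repository's exact definition, for a single continuous input feature $x$:

```math
\mathrm{FE}(x) = \Big[\, \sin\!\left(\frac{x}{b^{s}}\right),\; \cos\!\left(\frac{x}{b^{s}}\right) \Big]_{s \in \mathcal{S}}
```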

where:

  • $\mathcal{S}$ represents a list of Fourier scales (integers).
  • $b$ denotes the Fourier base (a single integer).

For a concrete implementation, please refer to the code here.

👩‍🔬 Research Principles

To ensure a fair and rigorous comparison across all model groups, the following research principles were adhered to:

  • Systematic Hyperparameter Optimization: Each model group underwent a comprehensive hyperparameter optimization process using the Optuna framework. Each group was allocated one week of runtime on an NVIDIA A100 GPU for this tuning.
  • Extensive Feature and Graph Studies: Significant effort was dedicated to in-depth investigations of feature selection strategies and graph construction methodologies.
  • Inference Optimization: We explored pruning and quantization techniques to enhance the inference speed of the models.

🏭 Development Principles

This project adheres to the following development principles to ensure maintainability, reproducibility, and flexibility:

  • Configuration-Driven:

    • All critical hyperparameters and training configurations for training, validation, and testing scripts, as well as model classes, are managed through YAML configuration files.
    • A single YAML file encapsulates both training parameters and the model architecture for each training run.
    • This design choice guarantees the reproducibility of machine learning experiments by clearly defining all settings.
  • Modular Training with PyTorch Lightning:

    • Training, testing, and validation pipelines are built using a 100% modular PyTorch Lightning framework.
    • Combined with the configuration files, this enables significant flexibility in utilizing various callbacks, learning rate schedules, and multi-GPU training setups with zero code changes.
  • Optional Experiment Tracking with Comet ML:

    • Comet ML is integrated as a default yet optional platform for logging and monitoring machine learning experiments.
    • Experiment tracking can be easily disabled by not configuring your Comet API key as an environment variable at runtime, providing flexibility for users who prefer alternative tools or local runs.
  • Automated Code Quality and Consistency:

    • The GitHub repository is configured to automatically enforce code formatting upon each commit using the Black code style, enhancing code readability and maintainability.
    • Furthermore, mypy is employed as a type checking hook to improve code reliability and reduce potential errors.
    • autoflake is integrated to automatically remove unused imports, contributing to improved code efficiency.
  • Rigorous Modularization and Execution:

    • A strong emphasis is placed on modularity across all class imports, promoting code organization and reusability.
    • To execute any script within the repository, please use the following consistent invocation method: python -m Classifier.<path_to_script>. This ensures proper module resolution and execution within the project structure.

💭 Challenges during Development

  • Compatibility with ATLAS Particle Reconstruction Software (Athena):

    • A significant hurdle was designing Graph Neural Network (GNN) and attention-based architectures compatible with the ONNX (Open Neural Network Exchange) standard, since we use the PyTorch ONNX export API to export models and run inference in Athena C++ code.
    • Many state-of-the-art GNN architectures, often implemented in PyTorch Geometric, utilize operations not directly supported by ONNX. This discrepancy between cutting-edge research and deployment constraints posed a major challenge.
    • To overcome this, we implemented custom message-passing layers for certain models to ensure ONNX exportability. While these custom layers resulted in a slight performance trade-off compared to the baseline PyTorch Geometric implementations, they still outperformed alternative technologies inherently compatible with ONNX.
  • Pure Binary Classification Performance (AUC and Rejection Rate):

    • Reasonable initial performance in terms of Area Under the ROC Curve (AUC) and rejection rate was achieved with the PyTorch Geometric framework, through hyperparameter optimization and by building on existing research.
    • However, the primary difficulty lay in translating this performance to an ONNX-compatible model. The compatibility issues between PyTorch Geometric and ONNX became the central bottleneck, hindering the direct deployment of high-performing architectures.
  • Inference Speed:

    • Given the intended deployment within the ATLAS experiment's trigger framework, inference speed is a critical requirement alongside classification performance.
    • Considerable effort was invested in techniques such as model pruning, quantization, and the exploration of novel model architectures specifically designed to optimize inference speed within the constraints of the ONNX format and the ATLAS environment.
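The two classification metrics named above, AUC and rejection rate, can be computed from per-hit scores as in the sketch below. The labels and scores are made up, and background rejection is taken here to mean the fraction of background hits falling below a score threshold, which is an assumption about the definition used in this study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random signal hit outscores a random background hit."""
    signal = [s for y, s in zip(labels, scores) if y == 1]
    background = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for s in signal:
        for b in background:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5  # ties count as half
    return wins / (len(signal) * len(background))


def background_rejection(labels, scores, threshold):
    """Fraction of background hits with a score below the threshold."""
    background = [s for y, s in zip(labels, scores) if y == 0]
    return sum(s < threshold for s in background) / len(background)


labels = [1, 1, 1, 0, 0, 0, 0]  # 1 = signal hit, 0 = background hit (toy data)
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
print(roc_auc(labels, scores))
print(background_rejection(labels, scores, 0.5))
```

The pairwise loop is quadratic in the number of hits and is meant for clarity only; a sort-based computation is preferable for realistic dataset sizes.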

🤝 Contributing

🙌 Acknowledgements:
