ml-tau-data

Data processing pipeline for the machine-learned hadronically-decaying tau lepton reconstruction and identification project. Takes EDM4HEP/PodioROOT simulation files and produces flat Parquet ntuples ready for ML training.

Overview

The workflow is managed by Snakemake and consists of four stages:

ntupelize — process each input ROOT file into a per-file Parquet (one SLURM job per file, grouped 20 per job)
merge_and_split — merge all per-file Parquets for each dataset and split into train / test
weights — compute (p, theta) reweighting matrices from the signal train set, then apply them to every split
validation — produce summary plots comparing signal and background distributions

Final outputs land in output_dir (configured in ntupelizer/config/workflow.yaml):

<output_dir>/
  z_train.parquet        # signal train (weighted)
  z_test.parquet         # signal test  (weighted)
  qq_train.parquet       # background train (weighted)
  qq_test.parquet        # background test  (weighted)
  weights/               # weight matrices and bin edges
  validation/            # validation plots

Intermediate per-file Parquets are written to temp_dir and deleted automatically after merging.

Setup

Create a virtual environment and install the package with all dependencies:

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e ".[full]"

Note: Snakemake 7.x is required. If you see AttributeError: module 'pulp' has no attribute 'list_solvers', your pulp version is incompatible. Fix with:
pip install "snakemake>=7,<8" "pulp>=2.7,<3"

Configuration

Edit ntupelizer/config/workflow.yaml before running:

output_dir: /path/to/output          # where final Parquets are written
temp_dir:   /path/to/tmp             # scratch space for per-file Parquets

datasets:
  p8_ee_Z_tautau_ecm91:
    input_dir: /path/to/signal/root/
    file_pattern: "*.root"
    short_name: z
    is_signal: true
    train_frac: 0.70

  p8_ee_Z_qq_ecm91:
    input_dir: /path/to/bkg/root/
    file_pattern: "*.root"
    short_name: qq
    is_signal: false
    train_frac: 0.70

weights:
  produce_plots: true
  add_weights: true

Ntupelizer parameters (collections, branches, lifetime variables) are in ntupelizer/config/ntupelizer_base/new.yaml.

Running the workflow

With SLURM (ntupelize stage runs on the cluster; everything else runs locally):

snakemake --profile ntupelizer/config/slurm

SLURM jobs are submitted to partition main9. Each group of 20 ntupelize jobs shares one sbatch allocation. Logs are written to logs/slurm/.

Locally (all stages on the current machine):

snakemake -j12    # 12 parallel jobs

Repository structure

Snakefile                          # workflow definition (all four stages)
ntupelizer/
  config/
    workflow.yaml                  # dataset paths, output dirs, weight settings
    ntupelizer.yaml                # selects ntupelizer variant (new/old)
    ntupelizer_base/new.yaml       # EDM4HEP/PodioROOT ntupelizer config
    slurm/config.yaml              # Snakemake SLURM profile
  scripts/
    ntupelize.py                   # stage 1 entry point (Hydra)
    merge_files.py                 # stage 2 entry point
    compute_weights.py             # stage 3a
    apply_weights.py               # stage 3b
    validate_ntuples.py            # stage 4
    slurm_status.py                # Snakemake cluster-status helper
  tools/
    ntupelizing.py                 # PodioROOTNtuplelizer / EDM4HEPNtupelizer
    clustering.py                  # reco and gen jet clustering (FastJet)
    matching.py                    # reco↔gen jet matching
    gen_tau_info_matcher.py        # MC tau decay-mode and visible p4 extraction
    particle_filters.py            # reco and MC particle selection
    lifetime.py                    # track impact-parameter / lifetime variables
    tau_decaymode.py               # decay mode classification
    weight_tools.py                # (p, theta) reweighting utilities
    general.py                     # shared helpers and DUMMY_P4_VECTOR
sim/                               # standalone CLD simulation scripts

Container

All heavy processing runs inside an Apptainer container:

/home/software/singularity/pytorch.simg:2025-09-01

The container is invoked automatically by Snakemake. No manual setup is needed.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.github/workflows		.github/workflows
ntupelizer		ntupelizer
sim		sim
.gitignore		.gitignore
.pylintrc		.pylintrc
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ml-tau-data

Overview

Setup

Configuration

Running the workflow

Repository structure

Container

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ml-tau-data

Overview

Setup

Configuration

Running the workflow

Repository structure

Container

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages