perf: float32 output for numba RMSD and distance kernels, widen plot types

FridrichMethod · FridrichMethod · commit 13a6d2bc9997 · 2026-04-11T04:40:50.000-07:00
Complete the fp32-by-default audit following the 120k-frame OOM fix.
Float32 is the package default, but two numba JIT kernels were still
allocating float64 output buffers.  Their accumulators run in double
precision regardless of the output dtype, so the float64 buffer added
no accuracy, and casting it to the user-resolved dtype afterward was
pure waste at large N:

- _backends/_rmsd_matrix._pairwise_rmsd: the O(n_frames^2) result
  buffer was allocated as float64 (115 GB at n=120k) while the QCP
  Newton-Raphson state and cross-covariance accumulators (Sxx etc.)
  stayed in C double anyway.  Now allocates float32 directly, halving
  the output-matrix footprint (saves 58 GB at n=120k) with no
  measurable precision loss -- the float64 scalars inside the prange
  loop still do all the math, only the final result[i, j] = val store
  truncates.  Added a dedicated test to guard the dtype; the printed
  cross-backend agreement is now <= 5e-7 nm (was 1e-6 nm for the old
  float64 kernel).
- _backends/_distances.distances_numba: same issue on the
  (n_frames, n_pairs) output.  The kernel now allocates float32
  natively, halving memory for users who run numba distances on
  large n_frames * n_pairs.
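
The footprint figures and the "only the final store truncates" claim
above can be checked with a minimal sketch.  The helper names
(square_matrix_gb, store_as_float32) and the sample value are
illustrative assumptions, not part of the package:

```python
import numpy as np


def square_matrix_gb(n_frames: int, dtype) -> float:
    """Size in GB (1e9 bytes) of an (n_frames, n_frames) matrix of `dtype`."""
    return n_frames * n_frames * np.dtype(dtype).itemsize / 1e9


def store_as_float32(value: float) -> np.float32:
    """Mimic the kernel's final store: round a double-precision scalar to float32."""
    return np.float32(value)


n = 120_000
gb64 = square_matrix_gb(n, np.float64)  # old float64 result buffer
gb32 = square_matrix_gb(n, np.float32)  # new float32 result buffer
print(f"float64: {gb64:.1f} GB, float32: {gb32:.1f} GB")

# Hypothetical RMSD value (nm) computed in double precision.
val = 0.123456789012345
# Round-to-nearest float32 storage has relative error of at most 2**-24.
rel_err = abs(float(store_as_float32(val)) - val) / val
print(f"relative error after float32 store: {rel_err:.2e}")
```

For values around 1 nm, a relative error bound of 2**-24 (about 6e-8)
is well below the 5e-7 nm cross-backend agreement quoted above.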

Also widened type annotations to match reality:

- RMSD numba kernel signatures: NDArray[np.float64] ->
  NDArray[np.floating].  The _center_and_traces traces buffer
  remains float64 (O(n_frames), 1 MB even at 120k) because the
  QCP subtraction (G_a + G_b - 2*lambda) needs the extra bits.
- plots/contacts.plot_contact_map and contact_frequency_to_matrix:
  float64 -> floating.  The internal n_residues^2 matrix now
  inherits the caller's dtype instead of forcing a float64 upcast.
- _dtype.py module docstring: rewritten to document the final
  fp32-by-default policy and the remaining fp64 holdouts (scalar
  QCP state, histogram2d, deeptime TICA, jax_enable_x64 for
  opt-in).

Tests updated:
- test_ca_distances.py: renamed TestNumbaKernel.test_output_dtype_float64
  -> test_output_dtype_native_float32 with an explanation of why the
  intermediate math stays double while the store is float32.
- test_clustering.py: added test_numba_backend_returns_float32 to
  guard against regression of the rmsd_numba output dtype.

All 570 tests pass.  Cross-backend numerical agreement verified at
n=500, 300 atoms: numba/torch/cupy/jax all within 5e-7 nm of mdtraj.
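
Why a float32-native kernel removes the wasted cast can be sketched
with NumPy alone.  cast_output below is a hypothetical stand-in for
the wrapper's final cast, and the small zero matrix stands in for a
kernel result; neither is the package's actual code:

```python
import numpy as np


def cast_output(result: np.ndarray, resolved) -> np.ndarray:
    """Cast a kernel result to the resolved dtype, avoiding a copy when possible."""
    return result.astype(resolved, copy=False)


# Stand-in for a float32 kernel result.
kernel_out = np.zeros((4, 4), dtype=np.float32)

same = cast_output(kernel_out, np.float32)    # dtype already matches
upcast = cast_output(kernel_out, np.float64)  # explicit float64 opt-in

# With matching dtypes, astype(copy=False) hands back the same buffer.
print(np.shares_memory(same, kernel_out))
# A dtype change always allocates a new buffer.
print(np.shares_memory(upcast, kernel_out))
```

With the old float64 kernels, the default float32 path went through
the second branch and paid for a full extra matrix.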
diff --git a/src/mdpp/_dtype.py b/src/mdpp/_dtype.py
@@ -2,48 +2,70 @@
 
 Default is ``np.float32``, which matches the precision of MD trajectory
 coordinates (mdtraj stores ``traj.xyz`` as float32) and is sufficient
-for all analysis operations in this package.
-
-Float64 appears in the analysis pipeline only where it is genuinely
-necessary or where an external library forces it:
-
-- **Numba JIT kernels** (``_backends/_distances.distances_numba``,
-  ``_backends/_rmsd_matrix._pairwise_rmsd``): compiled kernels output
-  float64 because Numba's ``float()`` cast maps to C ``double``.
-  Numba runs on CPU where float64 is at ~50% of float32 throughput,
-  so the cost is negligible and the extra precision is useful for the
-  QCP Newton-Raphson subtraction ``G_a + G_b - 2*lambda``.  Callers
-  cast the result to the resolved user dtype afterward.
+for all analysis operations in this package.  **Every** compute function
+returns float32 by default; users who want float64 must opt in either
+globally via :func:`set_default_dtype` or per-call via ``dtype=np.float64``.
+
+Design rules for new compute code
+---------------------------------
+
+1. The public function's last keyword argument is
+   ``dtype: DtypeArg = None``.
+2. Call ``resolved = resolve_dtype(dtype)`` at the top.
+3. Pass ``resolved`` through to every downstream buffer allocation and
+   cast outputs via ``np.asarray(result, dtype=resolved)`` /
+   ``result.astype(resolved, copy=False)`` so same-dtype returns do
+   not duplicate memory.
+4. **Backend kernels** (numba/torch/jax/cupy) return their native dtype
+   (``NDArray[np.floating]``) and should prefer float32 output unless
+   external precision is required.  The public wrapper's ``copy=False``
+   cast becomes a no-op when the kernel already returns the resolved
+   dtype, which is essential at large N where each redundant N^2 copy
+   can cost tens of GB.
+
+Where float64 still appears (and why)
+-------------------------------------
+
+These are the only places fp64 remains in the compute pipeline; each is
+either an O(1)-to-O(n) scalar buffer (not an OOM risk) or forced by an
+external library:
+
+- **QCP Newton-Raphson scalars** in ``_backends/_rmsd_matrix._pairwise_rmsd``
+  and the ``traces`` buffer in ``_center_and_traces``: accumulators
+  (``Sxx`` etc.) and the ``(G_a + G_b - 2*lambda)`` subtraction run in
+  double precision because Numba's ``0.0`` literal maps to C
+  ``double``.  Only the final ``result[i, j] = val`` store truncates
+  to float32 so the O(N^2) output matrix is half the memory of the
+  old float64 output (58 GB saved at n=120k).  The ``traces`` buffer
+  is O(n_frames) so the fp64 cost is negligible.
 - **GPU backends** (``_backends/_distances`` and
   ``_backends/_rmsd_matrix`` ``torch``/``jax``/``cupy`` variants):
-  compute **internally in float32** because consumer and workstation
-  NVIDIA GPUs run float64 at 1/36 -- 1/64 the throughput of float32.
-  Since 2026-04-11 these backends also **return native float32**
-  (the ``RMSDMatrixBackendFn`` / ``DistanceBackendFn`` Protocols were
-  widened from ``NDArray[np.float64]`` to ``NDArray[np.floating]``
-  so backends can report their natural dtype).  The public
-  ``compute_*`` wrappers then cast with ``astype(resolved, copy=False)``
-  so when the resolved dtype is also float32 (the package default)
-  **no additional copy is made** -- critical for large N where
-  every redundant copy of the ``(n_frames, n_frames)`` RMSD matrix
-  costs tens of GB (57 GB at n=120k).  Float32 QCP agrees with the
+  compute internally in float32 because consumer and workstation
+  NVIDIA GPUs run float64 at 1/36 -- 1/64 the throughput of float32,
+  and return native float32 directly.  Float32 QCP agrees with the
   float64 numba reference to ~1e-6 nm on realistic trajectories.
 - **Deeptime TICA** (``decomposition.compute_tica``): deeptime upcasts
-  to float64 internally for covariance estimation -- no explicit cast
-  is needed from our side.
+  to float64 internally for covariance estimation -- external to us.
+  The output is cast back to the resolved dtype by the wrapper.
 - **``np.histogram2d``** (``fes.compute_fes_2d``): returns float64
-  probability density regardless of input dtype (edges follow the
-  input dtype); the downstream log and energy arithmetic therefore
-  runs in float64 naturally.
+  probability density regardless of input dtype; the downstream log
+  and energy arithmetic therefore runs in float64 naturally.  Output
+  is O(bins^2), tiny.
 - **``np.mean`` on boolean arrays** (contacts, h-bonds): NumPy defaults
-  to float64 for boolean reductions.
+  to float64 for boolean reductions.  Output is O(n), tiny.
 - **``jax.config.update("jax_enable_x64", True)``** in
   ``_backends/_imports.require_jax``: enables float64 support in JAX
-  so ``jnp.float64`` arrays can round-trip through the JIT.  The
-  actual JAX compute still runs in float32 on GPU.
+  so ``jnp.float64`` arrays can round-trip through the JIT when the
+  user explicitly opts in.  The actual JAX compute still runs in
+  float32 on GPU by default.
+
+Opting into float64
+-------------------
 
 Use ``set_default_dtype(np.float64)`` to switch globally, or pass
-``dtype=np.float64`` to individual functions.
+``dtype=np.float64`` to individual functions.  Be aware that float64
+doubles the memory of every O(N^2) or O(N*M) intermediate, which will
+OOM at trajectory sizes above ~40k frames on a 128 GB host.
 """
 
 from __future__ import annotations
diff --git a/src/mdpp/analysis/_backends/_distances.py b/src/mdpp/analysis/_backends/_distances.py
@@ -115,9 +115,12 @@ def distances_numba(
             periodic boundary conditions.
 
     Returns:
-        Distances of shape ``(n_frames, n_pairs)`` in float64 (numba's
-        ``float()`` cast maps to C ``double``; the wrapper casts
-        ``copy=False`` to the user-resolved dtype).
+        Distances of shape ``(n_frames, n_pairs)`` in **float32**.
+        Intermediate math still promotes to C ``double`` via
+        ``float()`` so precision matches mdtraj's float32 output;
+        only the final store truncates to float32.  Half the
+        memory of the old float64 output (critical at large
+        ``n_frames * n_pairs``).
 
     Raises:
         ValueError: If any pair index is out of range.
@@ -127,10 +130,10 @@ def distances_numba(
     @njit(parallel=True, cache=True)
     def _kernel(
         xyz: NDArray[np.float32], pairs: NDArray[np.int_]
-    ) -> NDArray[np.float64]:  # pragma: no cover - JIT-compiled
+    ) -> NDArray[np.floating]:  # pragma: no cover - JIT-compiled
         n_frames = xyz.shape[0]
         n_pairs = pairs.shape[0]
-        out = np.empty((n_frames, n_pairs), dtype=np.float64)
+        out = np.empty((n_frames, n_pairs), dtype=np.float32)
         for f in prange(n_frames):
             for k in range(n_pairs):
                 i = pairs[k, 0]
diff --git a/src/mdpp/analysis/_backends/_rmsd_matrix.py b/src/mdpp/analysis/_backends/_rmsd_matrix.py
@@ -122,9 +122,16 @@ def __call__(
 
 @njit(cache=True)
 def _center_and_traces(
-    xyz: NDArray[np.float64],
-) -> NDArray[np.float64]:  # pragma: no cover - JIT
-    """Center each frame in-place and return per-frame sum-of-squares."""
+    xyz: NDArray[np.floating],
+) -> NDArray[np.floating]:  # pragma: no cover - JIT
+    """Center each frame in-place and return per-frame sum-of-squares.
+
+    ``traces`` is allocated in float64 so the QCP Newton-Raphson
+    subtraction ``G_a + G_b - 2*lambda`` preserves the few extra
+    significant bits that float32 would lose when ``lambda`` is
+    close to ``(G_a + G_b) / 2``.  This buffer is ``O(n_frames)`` so
+    the fp64 cost is negligible even at 120k frames (1 MB).
+    """
     n_frames = xyz.shape[0]
     n_atoms = xyz.shape[1]
     traces = np.empty(n_frames, dtype=np.float64)
@@ -149,11 +156,11 @@ def _center_and_traces(
 
 @njit(parallel=True, cache=True)
 def _pairwise_rmsd(
-    xyz: NDArray[np.float64],
-    traces: NDArray[np.float64],
+    xyz: NDArray[np.floating],
+    traces: NDArray[np.floating],
     pair_i: NDArray[np.int64],
     pair_j: NDArray[np.int64],
-) -> NDArray[np.float64]:  # pragma: no cover - JIT
+) -> NDArray[np.floating]:  # pragma: no cover - JIT
     """Compute symmetric pairwise RMSD matrix with QCP superposition.
 
     Uses the Quaternion Characteristic Polynomial method (Theobald 2005)
@@ -170,11 +177,22 @@ def _pairwise_rmsd(
     and caps CPU utilisation at 60-80%.  A single ``prange`` over the
     flat pair list gives every thread an equal slab of work, pushing
     utilisation close to 100%.
+
+    **Dtype policy.**  The accumulators (``Sxx`` etc.) and the QCP
+    Newton-Raphson state are all ``float64`` scalars (numba's
+    ``0.0`` literal maps to a C ``double``), so the quartic solve
+    preserves full double precision regardless of the input dtype.
+    Only the final store ``result[i, j] = val`` truncates to
+    ``float32``, which halves the O(N^2) output-matrix footprint
+    (58 GB saved at n=120k) while keeping the QCP precision that
+    the float64 accumulation provides.  The ``traces`` buffer is
+    also float64 for the same reason -- see
+    :func:`_center_and_traces`.
     """
     n_frames = xyz.shape[0]
     n_atoms = xyz.shape[1]
     n_pairs = pair_i.shape[0]
-    result = np.zeros((n_frames, n_frames))
+    result = np.zeros((n_frames, n_frames), dtype=np.float32)
     for p in prange(n_pairs):
         i = pair_i[p]
         j = pair_j[p]
diff --git a/src/mdpp/analysis/clustering.py b/src/mdpp/analysis/clustering.py
@@ -79,15 +79,16 @@ def compute_rmsd_matrix(
         ImportError: If the requested backend package is not installed.
 
     Memory note:
-        The GPU backends return their native ``float32`` buffer and
-        this wrapper casts with ``copy=False``, so when the resolved
-        dtype is float32 (the package default) there is **no second
-        copy** of the ``(n_frames, n_frames)`` matrix.  For a
-        120k-frame trajectory this saves ~115 GB of peak RAM versus
-        the old "cast to float64 for the Protocol contract, then
-        cast back" path.  Using ``backend="numba"`` or
-        ``dtype=np.float64`` still forces a copy because the numba
-        kernel is float64 native.
+        Every backend returns its native ``float32`` output matrix
+        (the numba kernel uses float64 accumulators internally but
+        stores float32 in the result buffer; GPU kernels compute in
+        float32 end-to-end).  This wrapper casts with ``copy=False``
+        so when the resolved dtype is float32 (the package default)
+        there is **no second copy** of the ``(n_frames, n_frames)``
+        matrix.  For a 120k-frame trajectory this saves ~115 GB of
+        peak RAM versus the old "cast to float64 for the Protocol
+        contract, then cast back" path.  Passing ``dtype=np.float64``
+        still forces a one-time upcast.
     """
     resolved = resolve_dtype(dtype)
     atom_indices = select_atom_indices(traj.topology, atom_selection)
diff --git a/src/mdpp/plots/contacts.py b/src/mdpp/plots/contacts.py
@@ -10,7 +10,7 @@
 
 
 def plot_contact_map(
-    frequency: NDArray[np.float64],
+    frequency: NDArray[np.floating],
     *,
     residue_ids: NDArray[np.int_] | None = None,
     ax: Axes | None = None,
@@ -59,10 +59,10 @@ def plot_contact_map(
 
 
 def contact_frequency_to_matrix(
-    frequency: NDArray[np.float64],
+    frequency: NDArray[np.floating],
     residue_pairs: NDArray[np.int_],
     n_residues: int,
-) -> NDArray[np.float64]:
+) -> NDArray[np.floating]:
     """Convert per-pair contact frequencies to a symmetric matrix.
 
     Args:
@@ -71,9 +71,10 @@ def contact_frequency_to_matrix(
         n_residues: Total number of residues for the output matrix.
 
     Returns:
-        Symmetric matrix of shape ``(n_residues, n_residues)``.
+        Symmetric matrix of shape ``(n_residues, n_residues)`` in the
+        same floating dtype as ``frequency`` (float32 by default).
     """
-    matrix = np.zeros((n_residues, n_residues), dtype=np.float64)
+    matrix = np.zeros((n_residues, n_residues), dtype=frequency.dtype)
     for pair_index in range(residue_pairs.shape[0]):
         i, j = residue_pairs[pair_index]
         matrix[i, j] = frequency[pair_index]
diff --git a/tests/analysis/test_ca_distances.py b/tests/analysis/test_ca_distances.py
@@ -203,11 +203,19 @@ def test_multi_frame(self) -> None:
         assert result[0, 0] == pytest.approx(1.0, abs=1e-6)
         assert result[1, 0] == pytest.approx(2.0, abs=1e-6)
 
-    def test_output_dtype_float64(self) -> None:
-        """Numba kernel returns float64 natively (``float()`` -> C double)."""
+    def test_output_dtype_native_float32(self) -> None:
+        """Numba kernel stores float32 output (intermediate math still C double).
+
+        Numba's ``float()`` cast maps to C ``double`` so the per-pair
+        ``dx*dx + dy*dy + dz*dz`` accumulation runs in double precision;
+        only the final ``out[f, k] = np.sqrt(...)`` store truncates to
+        float32.  This halves the ``(n_frames, n_pairs)`` output
+        footprint (critical at large N*M) without losing precision
+        relative to mdtraj's float32 coordinates.
+        """
         xyz = np.zeros((2, 2, 3), dtype=np.float32)
         result = distances_numba(_make_traj(xyz), _PAIR_01)
-        assert result.dtype == np.float64
+        assert result.dtype == np.float32
 
     def test_out_of_range_pair_raises(self) -> None:
         xyz = np.zeros((2, 3, 3), dtype=np.float32)
diff --git a/tests/analysis/test_clustering.py b/tests/analysis/test_clustering.py
@@ -353,6 +353,19 @@ def test_mdtraj_backend_returns_float32(self, tiny_traj: md.Trajectory) -> None:
         result = compute_rmsd_matrix(tiny_traj, atom_selection="all", backend="mdtraj")
         assert result.rmsd_matrix_nm.dtype == np.float32
 
+    def test_numba_backend_returns_float32(self, tiny_traj: md.Trajectory) -> None:
+        """``rmsd_numba`` now stores float32 in the result buffer.
+
+        The QCP accumulators and Newton-Raphson state are still
+        float64 inside the JIT kernel (numba's ``0.0`` literal is a C
+        ``double``), so precision is preserved; only the final
+        ``result[i, j] = val`` store truncates to float32.  At
+        n=120k this halves the output-matrix footprint from 115 GB
+        to 57 GB.
+        """
+        result = compute_rmsd_matrix(tiny_traj, atom_selection="all", backend="numba")
+        assert result.rmsd_matrix_nm.dtype == np.float32
+
     def test_wrapper_does_not_copy_when_dtype_matches(
         self,
         monkeypatch: pytest.MonkeyPatch,