erayfirat · google-labs-jules · Jan 14, 2026
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,4 @@
 .gitignore
 /venv
 /.pytest_cache
+__pycache__/
diff --git a/.jules/bolt.md b/.jules/bolt.md
@@ -1,3 +1,7 @@
 ## 2024-05-23 - [Regex Pre-compilation in Loops]
 **Learning:** Pre-compiling regular expressions (`re.compile`) at the module level provides a significant performance boost (measured ~1.8x speedup) when the regex is used inside a tight loop or a pandas `apply` function, compared to compiling it repeatedly or implicitly inside the loop. Vectorized string operations in Pandas are usually faster, but in complex logic cases (multiple prioritized regex groups + fallback logic), a simple pre-compiled regex with `apply` can sometimes be cleaner and sufficiently fast, or even faster if the vectorized approach requires multiple passes or expensive intermediate structures.
 **Action:** Always check for regex usage in loops or `apply` calls. If found, refactor to use module-level pre-compiled patterns. When considering vectorization, benchmark against the optimized loop version, as the overhead of complex vectorization might outweigh the benefits for moderate dataset sizes.
+
+## 2026-01-14 - [Pandas NaN vs None Truthiness]
+**Learning:** In this codebase, boolean checks such as `if row['col']:` rely on missing values being `None`, which is falsy. When a DataFrame is built from a list of dicts, keys absent from a row are filled with `NaN` (a float), and `bool(float('nan'))` is `True`. Missing values must therefore be set to `None` explicitly (in object-dtype columns) to preserve this logic when optimizing DataFrame construction.
+**Action:** When optimizing DataFrame creation from dicts, check whether downstream logic uses boolean checks on the resulting columns. If it does, set missing values to `None` explicitly instead of relying on the default `NaN`.
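
Review note: the NaN-versus-None behavior described in the entry above can be reproduced with a minimal sketch. The `matched` list, `placeholder` dict, and two-column layout are hypothetical stand-ins chosen for illustration, not the project's actual data; the sketch assumes list-valued cells, so the columns keep object dtype.

```python
import pandas as pd

# Hypothetical stand-in for the matched-spectra Series: one match, one miss (NaN).
matched = [{"mz_array": [100.0, 200.0], "intensity_array": [1.0, 2.0]}, float("nan")]
columns = ["mz_array", "intensity_array"]

# Empty dict for the miss: pandas fills the absent keys with NaN.
nan_df = pd.DataFrame([m if isinstance(m, dict) else {} for m in matched])

# None-filled placeholder for the miss: the cells stay None in object columns.
placeholder = dict.fromkeys(columns)  # {'mz_array': None, 'intensity_array': None}
none_df = pd.DataFrame(
    [m if isinstance(m, dict) else placeholder for m in matched],
    columns=columns,
)

nan_cell = nan_df.loc[1, "mz_array"]
none_cell = none_df.loc[1, "mz_array"]

print(type(nan_cell).__name__, bool(nan_cell))    # float True
print(type(none_cell).__name__, bool(none_cell))  # NoneType False
```

A caveat worth checking against real data: pandas type inference may still turn `None` into `NaN` in a column whose other values are all numeric (for example, `pd.Series([1.0, None])` has dtype `float64`), so the `None` placeholder is only reliable for columns that end up with object dtype.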
diff --git a/processing.py b/processing.py
@@ -100,14 +100,13 @@ def map_psms_to_spectra(spectra: List[Dict], psm_df: pd.DataFrame) -> pd.DataFra
     # Original: Multiple apply calls (4x iteration over full dataset)
 
     # Convert matched Series to list, replacing NaNs with empty dicts for DataFrame construction
-    specs_list = [x if isinstance(x, dict) else {} for x in matched_spec_series]
-    specs_df = pd.DataFrame(specs_list)
-    specs_df.index = psm_df.index  # Align index with original DataFrame
+    # Pre-fill None for missing matches to ensure Falsy behavior and avoid NaN (Truthy) issues
+    empty_spec = {'title': None, 'mz_array': None, 'intensity_array': None, 'pepmass': None}
+    specs_list = [x if isinstance(x, dict) else empty_spec for x in matched_spec_series.tolist()]
 
-    # Ensure required columns exist (if no spectra matched or mock data missing keys)
-    for col in ['title', 'mz_array', 'intensity_array', 'pepmass']:
-        if col not in specs_df.columns:
-            specs_df[col] = None
+    # Passing explicit columns skips column inference and guarantees the expected structure
+    specs_df = pd.DataFrame(specs_list, columns=['title', 'mz_array', 'intensity_array', 'pepmass'])
+    specs_df.index = psm_df.index  # Align index with original DataFrame
 
     mappings = pd.DataFrame({
         'psm_index': psm_df.index,