2 changes: 1 addition & 1 deletion .gitignore
@@ -1,3 +1,3 @@
 .gitignore
 /venv
-/.pytest_cache
+/.pytest_cache__pycache__/
4 changes: 4 additions & 0 deletions .jules/bolt.md
@@ -1,3 +1,7 @@
## 2024-05-23 - [Regex Pre-compilation in Loops]
**Learning:** Pre-compiling regular expressions (`re.compile`) at the module level provides a significant performance boost (measured ~1.8x speedup) when the regex is used inside a tight loop or a pandas `apply` function, compared to compiling it repeatedly or implicitly inside the loop. Vectorized string operations in Pandas are usually faster, but in complex logic cases (multiple prioritized regex groups + fallback logic), a simple pre-compiled regex with `apply` can sometimes be cleaner and sufficiently fast, or even faster if the vectorized approach requires multiple passes or expensive intermediate structures.
**Action:** Always check for regex usage in loops or `apply` calls. If found, refactor to use module-level pre-compiled patterns. When considering vectorization, benchmark against the optimized loop version, as the overhead of complex vectorization might outweigh the benefits for moderate dataset sizes.
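As a minimal sketch of the pattern described in this entry (the regex, function names, and sample titles are hypothetical and are not taken from this codebase), the difference comes down to where the pattern is compiled:

```python
import re

# Hypothetical pattern for illustration; compiled once, at import time.
SCAN_PATTERN = re.compile(r"scan=(\d+)")


def extract_scan(title: str) -> str:
    """Return the scan number found in `title`, or '' if there is none."""
    match = SCAN_PATTERN.search(title)
    return match.group(1) if match else ""


def extract_scan_uncompiled(title: str) -> str:
    """Same logic, but the pattern string is passed to `re` on every call,
    so each call pays for a lookup in the module's internal pattern cache."""
    match = re.search(r"scan=(\d+)", title)
    return match.group(1) if match else ""


titles = ["run1 scan=12", "run1 scan=305", "no scan here"]
scans = [extract_scan(t) for t in titles]
```

Both functions return the same values; only the per-call overhead differs. The size of the gain depends on the pattern and the data, so it is worth timing both variants (for example with `timeit`) on representative input before refactoring.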

## 2026-01-14 - [Pandas NaN vs None Truthiness]
**Learning:** In this codebase, boolean checks like `if row['col']:` rely on `None` being Falsy. When a DataFrame is built from a list of dicts, keys missing from a record default to `NaN` (a float), which is Truthy. To preserve this logic when optimizing DataFrame construction, missing values must be filled explicitly with `None` (object).
**Action:** When optimizing DataFrame creation from dicts, if downstream logic uses boolean checks on columns, ensure missing values are explicitly set to `None` rather than relying on default `NaN`.
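The behaviour behind this entry can be reproduced in a few lines. This is only a sketch under assumptions: the records and the two-column schema below are hypothetical stand-ins, not the real spectra structure.

```python
import pandas as pd

# Hypothetical records: the second entry stands in for a PSM with no matched spectrum.
COLUMNS = ["title", "mz_array"]
matched = [{"title": "spec_1", "mz_array": [100.1, 200.2]}, None]

# Schema inference: the empty dict leaves its cells as NaN (a float, so truthy).
inferred = pd.DataFrame([m if isinstance(m, dict) else {} for m in matched])
nan_is_truthy = bool(inferred.loc[1, "title"])

# Pre-filled placeholder plus explicit columns: the cells hold None (falsy).
empty = dict.fromkeys(COLUMNS)  # {'title': None, 'mz_array': None}
prefilled = pd.DataFrame(
    [m if isinstance(m, dict) else empty for m in matched],
    columns=COLUMNS,
)
none_is_falsy = not prefilled.loc[1, "title"]
```

Both columns here hold Python objects (strings and lists), which is why `None` survives construction. In a purely numeric column pandas may still coerce `None` to `NaN`, so where the column dtype is not known in advance, an explicit `pd.isna(value)` check is likely the safer guard than relying on truthiness.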
13 changes: 6 additions & 7 deletions processing.py
@@ -100,14 +100,13 @@ def map_psms_to_spectra(spectra: List[Dict], psm_df: pd.DataFrame) -> pd.DataFra
     # Original: Multiple apply calls (4x iteration over full dataset)

     # Convert matched Series to list, replacing NaNs with empty dicts for DataFrame construction
-    specs_list = [x if isinstance(x, dict) else {} for x in matched_spec_series]
-    specs_df = pd.DataFrame(specs_list)
-    specs_df.index = psm_df.index  # Align index with original DataFrame
+    # Pre-fill None for missing matches to ensure Falsy behavior and avoid NaN (Truthy) issues
+    empty_spec = {'title': None, 'mz_array': None, 'intensity_array': None, 'pepmass': None}
+    specs_list = [x if isinstance(x, dict) else empty_spec for x in matched_spec_series.tolist()]

-    # Ensure required columns exist (if no spectra matched or mock data missing keys)
-    for col in ['title', 'mz_array', 'intensity_array', 'pepmass']:
-        if col not in specs_df.columns:
-            specs_df[col] = None
+    # Explicit columns skips schema inference and guarantees structure
+    specs_df = pd.DataFrame(specs_list, columns=['title', 'mz_array', 'intensity_array', 'pepmass'])
+    specs_df.index = psm_df.index  # Align index with original DataFrame

     mappings = pd.DataFrame({
         'psm_index': psm_df.index,