2 changes: 1 addition & 1 deletion .gitignore
@@ -1,3 +1,3 @@
.gitignore
/venv
/.pytest_cache
__pycache__/
4 changes: 4 additions & 0 deletions .jules/bolt.md
@@ -1,3 +1,7 @@
## 2024-05-23 - [Regex Pre-compilation in Loops]
**Learning:** Pre-compiling regular expressions with `re.compile` at module level gives a significant speedup (measured ~1.8x) when the pattern is used inside a tight loop or a pandas `apply`, compared to compiling it repeatedly or implicitly inside the loop. Vectorized Pandas string operations are usually faster still, but for complex logic (multiple prioritized regex groups plus fallback handling) a pre-compiled regex with `apply` can be cleaner and sufficiently fast, or even faster when vectorization would require multiple passes or expensive intermediate structures.
**Action:** Always check for regex usage in loops or `apply` calls. If found, refactor to use module-level pre-compiled patterns. When considering vectorization, benchmark against the optimized loop version, as the overhead of complex vectorization might outweigh the benefits for moderate dataset sizes.
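As a minimal sketch of the module-level pattern (the `ms_run[...]` spectra-reference format is borrowed from the mzTab sample elsewhere in this PR; the helper names are illustrative, not part of the codebase):

```python
import re

# Compiled once at import time; reused by every call below.
_SPECTRA_REF_RE = re.compile(r"ms_run\[(\d+)\]:index=(\d+)")

def parse_spectra_ref(ref):
    """Extract (run, index) from a spectra_ref string, or None if it doesn't match."""
    m = _SPECTRA_REF_RE.search(ref)
    return (int(m.group(1)), int(m.group(2))) if m else None

def parse_spectra_ref_slow(ref):
    # Equivalent logic, but the pattern is resolved on every call; in a
    # tight loop or DataFrame.apply this is the form to refactor away from.
    m = re.search(r"ms_run\[(\d+)\]:index=(\d+)", ref)
    return (int(m.group(1)), int(m.group(2))) if m else None
```

Note that `re` does cache compiled patterns internally, but the per-call cache lookup and argument handling still cost noticeably in hot paths, which is roughly what the ~1.8x measurement reflects.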

## 2024-05-23 - [Streaming IO for Large Files]
**Learning:** `pyteomics` parsers (specifically `mgf` and `mztab`) are compatible with `io.TextIOWrapper`, allowing for streaming file processing. This avoids the memory overhead of reading and decoding entire files into memory (`read().decode()`) before parsing, which is critical for large proteomics datasets.
**Action:** When handling file uploads or large text files, prefer wrapping the binary stream with `io.TextIOWrapper` instead of reading the full content into a string buffer.
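A minimal sketch of the difference, using `io.BytesIO` to stand in for an uploaded binary stream (no pyteomics required):

```python
import io

raw_bytes = b"BEGIN IONS\nTITLE=spec1\nEND IONS\n"

# Eager: the whole payload is decoded into one string before parsing starts,
# so memory scales with file size.
eager_text = io.BytesIO(raw_bytes).read().decode("utf-8")

# Streaming: TextIOWrapper decodes incrementally as the parser consumes lines,
# keeping peak memory roughly constant regardless of file size.
stream = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="utf-8")
first_line = stream.readline()
```

Any line-oriented parser that accepts a text file object can consume the wrapper directly in place of the decoded string buffer.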
Binary file added __pycache__/data_loading.cpython-312.pyc
Binary file not shown.
Binary file added __pycache__/processing.cpython-312.pyc
Binary file not shown.
8 changes: 4 additions & 4 deletions app.py
@@ -31,10 +31,10 @@ def run_streamlit_app():

    # Process files only when both are uploaded
    if mgf_file and mztab_file:
-        # Decode uploaded file contents (Streamlit files are bytes by default)
-        # Use StringIO to create file-like objects for pyteomics parsers
-        spectra = load_mgf(io.StringIO(mgf_file.read().decode('utf-8')))
-        psm_df = load_mztab(io.StringIO(mztab_file.read().decode('utf-8')))
+        # Use TextIOWrapper to stream decoded text without reading the entire file into memory
+        # This significantly reduces memory usage for large files compared to read().decode()
+        spectra = load_mgf(io.TextIOWrapper(mgf_file, encoding='utf-8'))
+        psm_df = load_mztab(io.TextIOWrapper(mztab_file, encoding='utf-8'))

        # Create mappings between PSMs and spectra
        mapped = map_psms_to_spectra(spectra, psm_df)
Binary file added tests/__pycache__/__init__.cpython-312.pyc
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
44 changes: 44 additions & 0 deletions tests/test_streaming_io.py
@@ -0,0 +1,44 @@

import pytest
import io
import pandas as pd
from data_loading import load_mgf, load_mztab

def test_streaming_io_compatibility():
    """
    Test that data loading functions can handle io.TextIOWrapper (streaming).
    This ensures we don't regress on the memory optimization where we avoid
    reading the entire file into memory before parsing.
    """
    # Sample MGF content (bytes)
    mgf_content = b"""BEGIN IONS
TITLE=spec1
PEPMASS=450.25
1.0 10.0
END IONS
"""
    # Sample mzTab content (bytes)
    mztab_content = b"""MTD\tmzTab-version\t1.0.0
MTD\tmzTab-mode\tSummary
PSH\tsequence\tPSM_ID\tspectra_ref
PSM\tPEP1\t1\tms_run[1]:index=0
"""

    # Simulate a Streamlit UploadedFile (a binary stream) for MGF
    mgf_file = io.BytesIO(mgf_content)
    # Wrap in TextIOWrapper as the app does
    mgf_wrapper = io.TextIOWrapper(mgf_file, encoding='utf-8')

    spectra = load_mgf(mgf_wrapper)
    assert len(spectra) == 1
    assert spectra[0]['title'] == 'spec1'

    # Simulate a Streamlit UploadedFile for mzTab
    mztab_file = io.BytesIO(mztab_content)
    # Wrap in TextIOWrapper
    mztab_wrapper = io.TextIOWrapper(mztab_file, encoding='utf-8')

    psm_df = load_mztab(mztab_wrapper)
    assert isinstance(psm_df, pd.DataFrame)
    assert len(psm_df) == 1
    assert psm_df.iloc[0]['sequence'] == 'PEP1'