
Conversation

@briangreunke (Contributor) commented Sep 7, 2025

Central Lazy Loading System

To make the lazy loading process more DRY, I've attempted to centralize the lazy imports for scorers.

Key Changes:

  • Created a central lazy loading system for modules and attrs
  • Created a centralized "scorer" lazy loading system and dependencies
  • Updated the similarity scorers as PoCs

Notes

  • Missing imports are handled gracefully in the scorers, optionally via the catch kwarg on Scorer (see the sketch below)
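
For reference, a minimal sketch of what LazyImport/LazyAttr might look like. This is an editor's illustration, not the actual contents of dreadnode/lazy/core.py; the signatures are guesses based on how the proxies are constructed in the diff quoted later (module name, attribute name, extras group, optional package_name):

import importlib
import typing as t


class LazyImport:
    """Proxy that defers importing a module until first attribute access."""

    def __init__(self, module_name: str, extra: str, *, package_name: str | None = None) -> None:
        self._module_name = module_name
        self._extra = extra  # dreadnode extras group, e.g. "text" or "llm"
        self._package_name = package_name or module_name
        self._module: t.Any = None

    def _load(self) -> t.Any:
        if self._module is None:
            try:
                self._module = importlib.import_module(self._module_name)
            except ImportError as e:
                raise ImportError(
                    f"'{self._package_name}' is not installed - "
                    f"try `pip install dreadnode[{self._extra}]`"
                ) from e
        return self._module

    def __getattr__(self, name: str) -> t.Any:
        # Only reached for attributes not set in __init__, i.e. the real module's members
        return getattr(self._load(), name)


class LazyAttr(LazyImport):
    """Proxy for a single attribute (class or function) of a lazily imported module."""

    def __init__(self, module_name: str, attr_name: str, extra: str, *, package_name: str | None = None) -> None:
        super().__init__(module_name, extra, package_name=package_name)
        self._attr_name = attr_name

    def __call__(self, *args: t.Any, **kwargs: t.Any) -> t.Any:
        # Supports LazyAttr targets that are callables, e.g. TfidfVectorizer(...) or cosine_similarity(...)
        return getattr(self._load(), self._attr_name)(*args, **kwargs)

    def __getattr__(self, name: str) -> t.Any:
        # Supports attribute access on the target, e.g. fuzz.ratio(...)
        return getattr(getattr(self._load(), self._attr_name), name)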

Generated Summary:

  • Introduced lazy loading mechanism for modules within the dreadnode package.
  • Created LazyImport and LazyAttr classes that facilitate delayed imports and attribute access.
  • Replaced direct imports in similarity.py with lazy imports to handle potential ImportErrors gracefully.
  • Added lazy loading for various libraries such as rapidfuzz, nltk, sentence-transformers, and scikit-learn.
  • Modified scorer evaluation methods to use the new lazy loading structure, improving performance and error handling.
  • Updated NLTK tokenization to check for required datasets (e.g., 'punkt') during evaluation (see the sketch after this summary)
  • Adjusted pyproject.toml to include new lazy module paths and modified linting rules accordingly.
  • Overall, these changes enhance modularity and reduce upfront dependency load, which may improve application startup times.

This summary was generated with ❤️ by rigging
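
As a rough illustration of the punkt check mentioned in the summary (a hypothetical helper; the PR's actual code may differ):

def _ensure_nltk_data(resource: str = "tokenizers/punkt") -> None:
    """Check that an NLTK dataset is available, downloading it on first use."""
    # nltk here may be the LazyImport proxy from dreadnode.lazy.scorers
    try:
        nltk.data.find(resource)
    except LookupError:
        nltk.download("punkt", quiet=True)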

@dreadnode-renovate-bot (bot) added the area/python label (Changes to Python package configuration and dependencies) on Sep 7, 2025
@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces a centralized lazy loading system to eliminate duplication in import handling across scorer modules. The system provides consistent error handling for optional dependencies and consolidates lazy imports for third-party packages used in similarity scoring.

  • Created a core lazy loading framework with LazyImport and LazyAttr classes
  • Centralized scorer-specific lazy imports in a dedicated module
  • Refactored similarity scorers to use the new centralized system instead of individual import error handling
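
In practice, a scorer module then imports the proxies instead of the concrete packages. The lines below are illustrative, inferred from the diff quoted later in this review; the alias matches the sklearn_cosine_similarity name used in evaluate:

# dreadnode/scorers/similarity.py (illustrative)
from dreadnode.lazy.scorers import TfidfVectorizer, fuzz
from dreadnode.lazy.scorers import cosine_similarity as sklearn_cosine_similarity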

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.

  • pyproject.toml: Added lazy module to mypy configuration and disabled import-not-found errors
  • dreadnode/lazy/core.py: Core lazy loading implementation with error handling classes
  • dreadnode/lazy/scorers.py: Centralized lazy imports for all scorer dependencies
  • dreadnode/scorers/similarity.py: Refactored to use centralized lazy imports, removing duplicate error handling code


Comment on lines 273 to 277
def evaluate(data: t.Any, *, reference: str = reference) -> Metric:
    vectorizer = TfidfVectorizer(stop_words="english")
    candidate_text = str(data)
    tfidf_matrix = vectorizer.fit_transform([candidate_text, reference])
    sim = sklearn_cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
Copilot AI commented Sep 7, 2025

Creating a new TfidfVectorizer instance on every evaluation call is inefficient. The vectorizer should be created once outside the evaluate function or cached to avoid repeated initialization overhead.

Suggested change

Before:

def evaluate(data: t.Any, *, reference: str = reference) -> Metric:
    vectorizer = TfidfVectorizer(stop_words="english")
    candidate_text = str(data)
    tfidf_matrix = vectorizer.fit_transform([candidate_text, reference])
    sim = sklearn_cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]

After:

# Fit the vectorizer once on the reference text
vectorizer = TfidfVectorizer(stop_words="english")
reference_vec = vectorizer.fit_transform([reference])

def evaluate(data: t.Any, *, reference: str = reference) -> Metric:
    candidate_text = str(data)
    candidate_vec = vectorizer.transform([candidate_text])
    sim = sklearn_cosine_similarity(candidate_vec, reference_vec)[0][0]


Comment on lines +5 to +42
if t.TYPE_CHECKING:
    import litellm as litellm  # type: ignore[import-not-found]
    import nltk as nltk  # type: ignore[import-not-found]
    from nltk.tokenize import (  # type: ignore[import-not-found]
        word_tokenize as word_tokenize,
    )
    from nltk.translate.bleu_score import (  # type: ignore[import-not-found]
        sentence_bleu as sentence_bleu,
    )
    from rapidfuzz import distance as distance  # type: ignore[import-not-found]
    from rapidfuzz import fuzz as fuzz  # type: ignore[import-not-found]
    from rapidfuzz import utils as utils  # type: ignore[import-not-found]
    from sentence_transformers import (  # type: ignore[import-not-found]
        SentenceTransformer as SentenceTransformer,
    )
    from sentence_transformers import (  # type: ignore[import-not-found]
        util as util,
    )
    from sklearn.feature_extraction.text import (  # type: ignore[import-not-found]
        TfidfVectorizer as TfidfVectorizer,
    )
    from sklearn.metrics.pairwise import (  # type: ignore[import-not-found]
        cosine_similarity as cosine_similarity,
    )
else:
    fuzz = LazyAttr("rapidfuzz", "fuzz", "text")
    utils = LazyAttr("rapidfuzz", "utils", "text")
    distance = LazyAttr("rapidfuzz", "distance", "text")
    litellm = LazyImport("litellm", "llm")
    util = LazyAttr("sentence_transformers", "util", "text", package_name="sentence-transformers")
    TfidfVectorizer = LazyAttr(
        "sklearn.feature_extraction.text", "TfidfVectorizer", "text", package_name="scikit-learn"
    )
    SentenceTransformer = LazyAttr(
        "sentence_transformers", "SentenceTransformer", "text", package_name="sentence-transformers"
    )
    cosine_similarity = LazyAttr(
        "sklearn.metrics.pairwise", "cosine_similarity", "text", package_name="scikit-learn"
    )
Contributor

Thanks Brian for tackling this. importlib.* is a good pattern for this, but there are a couple of issues I can see here:

  1. Dual maintenance problem: imports must be maintained in two places.

Example:

if TYPE_CHECKING:
    from sklearn.metrics.pairwise import cosine_similarity as cosine_similarity  # Place 1
    from rapidfuzz import fuzz as fuzz  # Place 1 
    ....
else:
    cosine_similarity = LazyAttr("sklearn.metrics.pairwise", "cosine_similarity", "text")  # Place 2 
    fuzz = LazyAttr("rapidfuzz", "fuzz", "text")  # Place 2 
    ...

For instance, if we wanted to add a new sklearn import, we would need to update both blocks. The same goes for any method or API change: we would need to sync two places, and if they fall out of sync we might encounter runtime errors.
  2. Hidden small perf overhead in calls: every attribute access goes through a __getattr__ call, for example

for text in large_dataset:
    score = fuzz.ratio(ref, text)  # __getattr__ overhead x 10,000 calls if the dataset has that many examples

However, the calls are minimal, not expensive.

Proposed solution: use a similar pattern to what you have now, pandas style: ref link
Something like below:

# dreadnode/utils/imports.py
import importlib
from typing import Any


def import_optional_dependency(
    name: str,
    extra: str = "",
    package_name: str | None = None,
) -> Any:
    """Import an optional dependency with a helpful error message."""
    try:
        return importlib.import_module(name)
    except ImportError:
        pkg = package_name or name
        raise ImportError(
            f"Missing dependency '{name}'. "
            f"Install with: pip install {pkg} or dreadnode[{extra}]"
        ) from None
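
For illustration, a hypothetical failure when rapidfuzz is not installed would surface as (the message comes straight from the f-string above):

>>> import_optional_dependency("rapidfuzz.fuzz", extra="text", package_name="rapidfuzz")
ImportError: Missing dependency 'rapidfuzz.fuzz'. Install with: pip install rapidfuzz or dreadnode[text]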

Then in scorers/similarity.py

def similarity_with_rapidfuzz(
    reference: str, 
    method: str = "ratio"
) -> Scorer:
    """RapidFuzz similarity scorer."""
    
    def evaluate(data: Any) -> Metric:
        # Import exactly when needed - no globals, no dual maintenance
        fuzz = import_optional_dependency("rapidfuzz.fuzz", extra="text", package_name="rapidfuzz")
        
        candidate_text = str(data)
        score = getattr(fuzz, method)(reference, candidate_text)
        return Metric(value=score / 100.0)
    
    return Scorer(evaluate, name=f"rapidfuzz_{method}")

def similarity_with_sentence_transformers(
    reference: str,
    model_name: str = "all-MiniLM-L6-v2"
) -> Scorer:
    """Sentence Transformers similarity scorer."""

    def evaluate(data: Any) -> Metric:
        # Complex package - import both modules cleanly
        st = import_optional_dependency("sentence_transformers", extra="text", package_name="sentence-transformers")
        torch = import_optional_dependency("torch", extra="text")

        model = st.SentenceTransformer(model_name)
        # (sketch, filling in the body elided in the original comment)
        embeddings = model.encode([reference, str(data)], convert_to_tensor=True)
        similarity_score = float(torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0))
        return Metric(value=similarity_score)

    return Scorer(evaluate, name="sentence_transformers")
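
One optional refinement (a hedged editor's sketch, not part of the proposal above): importlib.import_module is essentially a sys.modules lookup after the first call, but if even that shows up in profiles, the helper can be memoized:

import functools

@functools.lru_cache(maxsize=None)
def import_optional_dependency_cached(name: str, extra: str = "", package_name: str | None = None) -> Any:
    # Resolved once per (name, extra, package_name) combination, then served from the cache
    return import_optional_dependency(name, extra, package_name)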

I think with this we could also minimize the TYPE_CHECKING/else blocks and the perf overhead.

briangreunke (Contributor, Author)

Thank you. Will try it out!

briangreunke (Contributor, Author)

I was trying to find an existing tidy example. Good find w/ pandas

Labels
area/python Changes to Python package configuration and dependencies