feat: testing centralized lazy imports #168
base: main
Conversation
Pull Request Overview
This PR introduces a centralized lazy loading system to eliminate duplication in import handling across scorer modules. The system provides consistent error handling for optional dependencies and consolidates lazy imports for third-party packages used in similarity scoring.
- Created a core lazy loading framework with `LazyImport` and `LazyAttr` classes (a sketch of the pattern follows this list)
- Centralized scorer-specific lazy imports in a dedicated module
- Refactored similarity scorers to use the new centralized system instead of individual import error handling
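For readers without the diff open, here is a minimal sketch of what the `LazyImport`/`LazyAttr` pattern can look like. It is inferred from the call sites in `dreadnode/lazy/scorers.py` shown further down, not copied from `dreadnode/lazy/core.py`, so the actual implementation may differ:

```python
import importlib
import typing as t


class LazyImport:
    """Proxy that defers importing a module until first attribute access."""

    def __init__(self, module_name: str, extra: str, *, package_name: str | None = None) -> None:
        self._module_name = module_name
        self._extra = extra  # optional-dependency group, e.g. "text" or "llm"
        self._package_name = package_name or module_name
        self._module: t.Any = None

    def _load(self) -> t.Any:
        if self._module is None:
            try:
                self._module = importlib.import_module(self._module_name)
            except ImportError as e:
                raise ImportError(
                    f"Missing optional dependency '{self._package_name}'. "
                    f"Install it with: pip install 'dreadnode[{self._extra}]'"
                ) from e
        return self._module

    def __getattr__(self, name: str) -> t.Any:
        return getattr(self._load(), name)


class LazyAttr(LazyImport):
    """Proxy for a single attribute (class, function, or submodule) of a module."""

    def __init__(
        self, module_name: str, attr_name: str, extra: str, *, package_name: str | None = None
    ) -> None:
        super().__init__(module_name, extra, package_name=package_name)
        self._attr_name = attr_name

    def _target(self) -> t.Any:
        return getattr(self._load(), self._attr_name)

    def __call__(self, *args: t.Any, **kwargs: t.Any) -> t.Any:
        # Supports proxied callables, e.g. TfidfVectorizer(stop_words="english")
        return self._target()(*args, **kwargs)

    def __getattr__(self, name: str) -> t.Any:
        # Supports proxied modules, e.g. fuzz.ratio(...)
        return getattr(self._target(), name)
```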
Reviewed Changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
| --- | --- |
| pyproject.toml | Added lazy module to mypy configuration and disabled import-not-found errors |
| dreadnode/lazy/core.py | Core lazy loading implementation with error-handling classes |
| dreadnode/lazy/scorers.py | Centralized lazy imports for all scorer dependencies |
| dreadnode/scorers/similarity.py | Refactored to use centralized lazy imports, removing duplicate error-handling code |
```python
def evaluate(data: t.Any, *, reference: str = reference) -> Metric:
    vectorizer = TfidfVectorizer(stop_words="english")
    candidate_text = str(data)
    tfidf_matrix = vectorizer.fit_transform([candidate_text, reference])
    sim = sklearn_cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
```
Creating a new TfidfVectorizer instance on every evaluation call is inefficient. The vectorizer should be created once outside the evaluate function or cached to avoid repeated initialization overhead.
Suggested change:

```python
# Fit the vectorizer once on the reference text
vectorizer = TfidfVectorizer(stop_words="english")
reference_vec = vectorizer.fit_transform([reference])

def evaluate(data: t.Any, *, reference: str = reference) -> Metric:
    candidate_text = str(data)
    candidate_vec = vectorizer.transform([candidate_text])
    sim = sklearn_cosine_similarity(candidate_vec, reference_vec)[0][0]
```
```python
if t.TYPE_CHECKING:
    import litellm as litellm  # type: ignore[import-not-found]
    import nltk as nltk  # type: ignore[import-not-found]
    from nltk.tokenize import (  # type: ignore[import-not-found]
        word_tokenize as word_tokenize,
    )
    from nltk.translate.bleu_score import (  # type: ignore[import-not-found]
        sentence_bleu as sentence_bleu,
    )
    from rapidfuzz import distance as distance  # type: ignore[import-not-found]
    from rapidfuzz import fuzz as fuzz  # type: ignore[import-not-found]
    from rapidfuzz import utils as utils  # type: ignore[import-not-found]
    from sentence_transformers import (  # type: ignore[import-not-found]
        SentenceTransformer as SentenceTransformer,
    )
    from sentence_transformers import (  # type: ignore[import-not-found]
        util as util,
    )
    from sklearn.feature_extraction.text import (  # type: ignore[import-not-found]
        TfidfVectorizer as TfidfVectorizer,
    )
    from sklearn.metrics.pairwise import (  # type: ignore[import-not-found]
        cosine_similarity as cosine_similarity,
    )
else:
    fuzz = LazyAttr("rapidfuzz", "fuzz", "text")
    utils = LazyAttr("rapidfuzz", "utils", "text")
    distance = LazyAttr("rapidfuzz", "distance", "text")
    litellm = LazyImport("litellm", "llm")
    util = LazyAttr("sentence_transformers", "util", "text", package_name="sentence-transformers")
    TfidfVectorizer = LazyAttr(
        "sklearn.feature_extraction.text", "TfidfVectorizer", "text", package_name="scikit-learn"
    )
    SentenceTransformer = LazyAttr(
        "sentence_transformers", "SentenceTransformer", "text", package_name="sentence-transformers"
    )
    cosine_similarity = LazyAttr(
        "sklearn.metrics.pairwise", "cosine_similarity", "text", package_name="scikit-learn"
    )
```
Thanks Brian for tackling this. `importlib.*` is a good pattern for this, but there are a couple of issues I can see here:

1. Dual maintenance problem: every import must be maintained in two places. Example:

```python
if TYPE_CHECKING:
    from sklearn.metrics.pairwise import cosine_similarity as cosine_similarity  # Place 1
    from rapidfuzz import fuzz as fuzz  # Place 1
    ...
else:
    cosine_similarity = LazyAttr("sklearn.metrics.pairwise", "cosine_similarity", "text")  # Place 2
    fuzz = LazyAttr("rapidfuzz", "fuzz", "text")  # Place 2
    ...
```

For instance, if we wanted to add a new sklearn import, we would need to update both blocks. The same applies whenever a method or API changes: both places have to stay in sync, and if they drift apart we might encounter runtime errors.
2. Hidden, small perf overhead on calls: every attribute access goes through a `__getattr__` call. For example:

```python
for text in large_dataset:
    score = fuzz.ratio(ref, text)  # __getattr__ overhead * 10,000 calls if the dataset has that many examples
```

That said, the calls are minimal, not expensive.
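As a rough way to sanity-check that claim, a micro-benchmark along these lines (a hypothetical sketch, not part of the original comment; assumes rapidfuzz is installed) compares direct attribute access against a `__getattr__` proxy:

```python
import timeit

from rapidfuzz import fuzz  # direct import for the baseline


class Proxy:
    """Minimal stand-in for a LazyAttr-style proxy."""

    def __getattr__(self, name):
        return getattr(fuzz, name)


proxy = Proxy()

direct = timeit.timeit(lambda: fuzz.ratio("reference", "candidate"), number=10_000)
lazy = timeit.timeit(lambda: proxy.ratio("reference", "candidate"), number=10_000)
print(f"direct: {direct:.4f}s, proxy: {lazy:.4f}s")
```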
Proposed solution: use a pattern similar to what you have now, pandas-style: ref link

Something like this:
```python
# dreadnode/utils/imports.py
import importlib
import typing as t


def import_optional_dependency(
    name: str,
    extra: str = "",
    package_name: str | None = None,
) -> t.Any:
    """Import an optional dependency with a helpful error message."""
    try:
        return importlib.import_module(name)
    except ImportError:
        pkg = package_name or name
        raise ImportError(
            f"Missing dependency '{name}'. "
            f"Install with: pip install {pkg} or dreadnode[{extra}]"
        ) from None
```
Then, in `scorers/similarity.py`:

```python
def similarity_with_rapidfuzz(
    reference: str,
    method: str = "ratio",
) -> Scorer:
    """RapidFuzz similarity scorer."""

    def evaluate(data: t.Any) -> Metric:
        # Import exactly when needed - no globals, no dual maintenance
        fuzz = import_optional_dependency("rapidfuzz.fuzz", extra="text", package_name="rapidfuzz")
        candidate_text = str(data)
        score = getattr(fuzz, method)(reference, candidate_text)
        return Metric(value=score / 100.0)

    return Scorer(evaluate, name=f"rapidfuzz_{method}")


def similarity_with_sentence_transformers(
    reference: str,
    model_name: str = "all-MiniLM-L6-v2",
) -> Scorer:
    """Sentence Transformers similarity scorer."""

    def evaluate(data: t.Any) -> Metric:
        # Complex package - import both modules cleanly
        st = import_optional_dependency(
            "sentence_transformers", extra="text", package_name="sentence-transformers"
        )
        torch = import_optional_dependency("torch", extra="text")

        model = st.SentenceTransformer(model_name)
        # ... rest of implementation (computes similarity_score)
        return Metric(value=similarity_score)

    return Scorer(evaluate, name="sentence_transformers")
```
I think with this we could also minimize the TYPE_CHECKING/else blocks and the perf overhead.
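Worth noting as an aside: repeated calls to `importlib.import_module` are cheap because the module is cached in `sys.modules` after the first import, so the per-call cost is roughly a dict lookup. If even that matters in hot loops, the helper could be memoized, as in this hypothetical `cached_optional_dependency` sketch:

```python
import functools

# Hypothetical refinement: memoize the helper so hot loops skip even the
# sys.modules lookup. Assumes import_optional_dependency as defined above.
@functools.lru_cache(maxsize=None)
def cached_optional_dependency(name: str, extra: str = "", package_name: str | None = None):
    return import_optional_dependency(name, extra=extra, package_name=package_name)
```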
Thank you. Will try it out!
I was trying to find an existing tidy example. Good find w/ pandas
Central Lazy Loading System
In order to make the lazy loading process more DRY, I've attempted to centralize the lazy imports for scorers.
Key Changes:
- Refactored the `similarity` scorers as PoCs

Notes:
- `catch` kwarg in the `Scorer`

Generated Summary:
- Introduced a lazy loading system in the `dreadnode` package.
- Added `LazyImport` and `LazyAttr` classes that facilitate delayed imports and attribute access.
- Updated `similarity.py` with lazy imports to handle potential ImportErrors gracefully.
- Centralized lazy imports for `rapidfuzz`, `nltk`, `sentence-transformers`, and `scikit-learn`.
- Updated `pyproject.toml` to include new lazy module paths and modified linting rules accordingly.

This summary was generated with ❤️ by rigging