Performance optimizations: eliminate O(n²) operations, fix correlation correctness, modernize pandas usage #9
Conversation
EthanThePhoenix38 left a comment:
viewed ok
Remember to run the functional, unit, E2E, and PyPI tests; update the PyPI, security, and custom issue documents; and deploy the release to PyPI.

All required tests must pass at 100%, and the compliance badges must be updated. There should be no mention of security in the commits. Add functions where needed if any are missing. No duplicates; respect the modularity and the object classes.

Add Javadoc-style comments to all documents; semantic versioning is mandatory. Update the release everywhere.
Pull request overview
This PR focuses on performance optimizations and code quality improvements across multiple statistical analysis modules. The changes eliminate O(n²) operations, fix a critical correctness issue in correlation computation, and modernize pandas API usage throughout the codebase.
Key changes include:
- Replacing repeated DataFrame filtering with efficient groupby operations in variance analysis (see the sketch after this list)
- Fixing correlation chunking that produced incorrect results for large datasets
- Optimizing FFT operations for real-valued signals (2x speed improvement)
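A minimal sketch of the filtering-to-groupby change, using illustrative column names (`group`, `value`) rather than the module's actual signature:

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "b", "a", "c"], "value": [1.0, 2.0, 3.0, 4.0]})

# Before: one full boolean scan of the frame per group -> O(n * n_groups)
slow = [df[df["group"] == g]["value"].to_numpy() for g in df["group"].unique()]

# After: a single groupby pass partitions the frame once -> O(n)
grouped = df.groupby("group")
fast = [grouped.get_group(g)["value"].to_numpy() for g in df["group"].unique()]
```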
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| VarianceModule.py | Optimizes group data extraction using groupby to eliminate O(n²) filtering; adds NaN handling for Friedman test |
| TimeSeriesModule.py | Switches from fft to rfft for real signals (2x faster); updates deprecated .values to .to_numpy() |
| ProbabilistesModule.py | Increases parallel threshold from 1000 to 2000 elements to reduce overhead on medium datasets |
| FrequenceModule.py | Eliminates redundant value_counts() call by normalizing cached results; replaces assertion with RuntimeError |
| FactorielleModule.py | Replaces dict(zip(...)) pattern with more readable dict comprehensions |
| CorrelationModule.py | Removes incorrect column chunking that broke correlation correctness; simplifies to use pandas' optimized algorithms |
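To illustrate the CorrelationModule row, a small sketch (with assumed random data) of why per-chunk correlation is incomplete: columns in different chunks are never paired, so those entries are simply missing:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))

full = df.corr()  # all 4x4 column pairs
chunks = [df[["a", "b"]].corr(), df[["c", "d"]].corr()]  # a-c, a-d, b-c, b-d never computed
print(full.loc["a", "c"])  # available only from the full computation
```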
Comments suppressed due to low confidence (2)
- `py_stats_toolkit/stats/correlation/CorrelationModule.py:82`: The comment says "calcul parallèle des paires" (parallel computation of pairs), but the code is not actually performing parallel computation; it is using numpy vectorization. The comment should be updated to reflect the actual optimization (vectorized filtering) rather than misleading readers about parallel processing.
- `py_stats_toolkit/stats/variance/VarianceModule.py:72`: The groupby optimization creates the groupby object on every iteration of the list comprehension. This is inefficient and defeats the purpose of the optimization. Consider creating the groupby object once before the list comprehension and using a dictionary to access groups, similar to the approach used in `_kruskal_wallis` at lines 94-97.
```python
def __init__(self):
    """Initialize variance module."""
    super().__init__()
```
```python
- 'rolling_std': Rolling standard deviation
- 'trend': Linear trend coefficient
- 'summary': Statistical summary
"""
```
Copilot AI commented on Dec 17, 2025:
The rfftfreq function needs a sampling rate parameter (d) to produce correct frequency values. Without it, it defaults to d=1.0, which assumes unit sampling rate. If the series has a specific sampling rate or time step, the frequency and period calculations will be incorrect. The same issue exists in the original code with fftfreq, but should be addressed when using rfft. Consider whether the time series has a meaningful sampling rate that should be passed here.
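For reference, a minimal sketch of how `d` rescales the frequencies `rfftfreq` returns:

```python
import numpy as np

n = 8
print(np.fft.rfftfreq(n))         # d defaults to 1.0: [0. 0.125 0.25 0.375 0.5]
print(np.fft.rfftfreq(n, d=0.5))  # 0.5 s sampling step -> Hz: [0. 0.25 0.5 0.75 1.]
```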
Fixed in commit ae2fe0d. Added sampling rate parameter (d) to rfftfreq by detecting it from the series index (DatetimeIndex/TimedeltaIndex freq) or explicit timestamps. Also fixed deprecation warning by using pd.Timedelta() instead of .delta.
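A minimal sketch of the deprecation fix mentioned here, assuming a fixed-frequency DatetimeIndex (whose `freq` is a Tick offset, which `pd.Timedelta` accepts):

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=4, freq="h")

# Tick.delta is deprecated in pandas 2.x; construct a Timedelta instead
sampling_interval = pd.Timedelta(idx.freq).total_seconds()
print(sampling_interval)  # 3600.0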
```python
# rfft is more efficient for real-valued data
# Compute FFT only on the positive frequencies to save computation
fft = np.fft.rfft(series.to_numpy())
freqs = np.fft.rfftfreq(len(series))
```
Copilot AI commented on Dec 17, 2025:
The same sampling rate issue exists here. The rfftfreq function should receive the sampling rate parameter (d) to produce correct frequency values. Without it, the detected period may be incorrect if the time series has a specific sampling rate.
Suggested change:

```diff
-# rfft is more efficient for real-valued data
-# Compute FFT only on the positive frequencies to save computation
-fft = np.fft.rfft(series.to_numpy())
-freqs = np.fft.rfftfreq(len(series))
+# Determine the sampling interval (d) for the frequencies
+sampling_interval = 1.0
+index = series.index
+if isinstance(index, (pd.DatetimeIndex, pd.TimedeltaIndex)):
+    if index.freq is not None:
+        # Use the declared frequency when available
+        sampling_interval = index.freq.delta.total_seconds()
+    elif len(index) > 1:
+        # Otherwise, compute the interval from the first two points
+        delta = index[1] - index[0]
+        sampling_interval = delta.total_seconds()
+elif getattr(self, "timestamps", None) is not None and len(self.timestamps) > 1:
+    # If explicit timestamps are provided, use them
+    delta = self.timestamps[1] - self.timestamps[0]
+    if hasattr(delta, "total_seconds"):
+        sampling_interval = delta.total_seconds()
+    else:
+        sampling_interval = float(delta)
+# rfft is more efficient for real-valued data
+# Compute FFT only on the positive frequencies to save computation
+fft = np.fft.rfft(series.to_numpy())
+freqs = np.fft.rfftfreq(len(series), d=sampling_interval)
```
Fixed in commit ae2fe0d. Applied the same sampling rate detection logic to get_seasonality() method using pd.Timedelta() to avoid deprecation warnings.
```python
# Compute the frequencies
freq = series.value_counts(normalize=normalize)
cum_freq = freq.cumsum()
# Build the results DataFrame
self.result = pd.DataFrame({"Fréquence": freq, "Fréquence Cumulée": cum_freq})
if normalize:
    self.result.columns = ["Fréquence Relative", "Fréquence Relative Cumulée"]
```
Copilot AI commented on Dec 17, 2025:
If process() was called with normalize=True, then self.result will have columns "Fréquence Relative" and "Fréquence Relative Cumulée", but not "Fréquence". This will cause the check at line 90 to pass (column exists), but then the code at line 92 will fail because the "Fréquence" column doesn't exist. The logic should handle both cases properly or store the original data to enable conversion.
Suggested change:

```diff
-# Compute the frequencies
-freq = series.value_counts(normalize=normalize)
-cum_freq = freq.cumsum()
-# Build the results DataFrame
-self.result = pd.DataFrame({"Fréquence": freq, "Fréquence Cumulée": cum_freq})
-if normalize:
-    self.result.columns = ["Fréquence Relative", "Fréquence Relative Cumulée"]
+# Compute absolute frequencies (always stored in self.result)
+freq = series.value_counts(normalize=False)
+cum_freq = freq.cumsum()
+# Build the absolute results DataFrame
+self.result = pd.DataFrame({"Fréquence": freq, "Fréquence Cumulée": cum_freq})
+if normalize:
+    # Derive relative frequencies from the absolute frequencies
+    rel_freq = self.result["Fréquence"] / self.result["Fréquence"].sum()
+    rel_cum_freq = rel_freq.cumsum()
+    return pd.DataFrame(
+        {
+            "Fréquence Relative": rel_freq,
+            "Fréquence Relative Cumulée": rel_cum_freq,
+        },
+        index=self.result.index,
+    )
```
Fixed in commit ae2fe0d. Modified process() to always store absolute frequencies in self.result, regardless of the normalize parameter. When normalize=True, it now returns relative frequencies as a separate DataFrame while keeping absolute frequencies internally.
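A hypothetical usage sketch of the behaviour described in this fix; the instantiation and `process()` signature are assumed from the snippets in this thread, not a verified API:

```python
import pandas as pd

module = FrequenceModule()  # hypothetical instantiation
series = pd.Series(["a", "b", "a", "a", "c"])

relative = module.process(series, normalize=True)
print(relative["Fréquence Relative"])  # proportions summing to 1.0
print(module.result["Fréquence"])      # absolute counts remain available
```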
```python
if len(series) > 1:
    x = np.arange(len(series))
    slope, intercept = np.polyfit(x, series.to_numpy(), 1)
    stats["Pente"] = slope
```
Copilot AI commented on Dec 17, 2025:
The autocorrelation calculation on line 134 is redundant - the variable 'acf' is computed but never used in the function. This unused computation should be removed to improve performance and code clarity.
Fixed in commit ae2fe0d. Removed the unused acf variable calculation from get_seasonality() method.
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
# Get unique groups to maintain consistent ordering
groups = data[group_col].unique()
# Use groupby with get_group for efficient extraction while preserving order
group_data = [
    data.groupby(group_col).get_group(g)[value_col].to_numpy() for g in groups
]
```
Skip NaN groups before ANOVA extraction
When group_col contains missing values, groups = data[group_col].unique() includes NaN but data.groupby(group_col).get_group(g) drops NaN groups, so the list comprehension raises a KeyError before reaching f_oneway. The previous implementation used boolean filtering and simply ignored NaN-labelled rows, so ANOVA now crashes on datasets with any missing group labels. Consider dropping NaNs or filtering them out of groups before calling get_group.
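A minimal sketch of the fix this comment suggests, assuming scipy's `f_oneway` as in the module:

```python
import pandas as pd
from scipy.stats import f_oneway

def anova_groups(data: pd.DataFrame, group_col: str, value_col: str):
    # Drop rows whose group label is NaN so get_group never sees a missing key
    valid = data.dropna(subset=[group_col])
    groups = valid[group_col].unique()
    grouped = valid.groupby(group_col)  # build the groupby once, not per group
    group_data = [grouped.get_group(g)[value_col].to_numpy() for g in groups]
    return f_oneway(*group_data)
```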
```python
groups = data[group_col].unique()
group_data_dict = {
    name: group[value_col].to_numpy() for name, group in data.groupby(group_col)
}
group_data = [group_data_dict[g] for g in groups]
```
Kruskal-Wallis fails on NaN group labels
The Kruskal branch builds groups from data[group_col].unique() (which keeps NaN) but group_data_dict is populated via data.groupby(group_col), which excludes NaN groups by default. The subsequent group_data = [group_data_dict[g] for g in groups] therefore raises a KeyError whenever the grouping column has missing values, whereas the previous boolean filtering ignored those rows. This makes Kruskal unusable on data with any NaN group labels; drop or filter NaNs before indexing.
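The same guard sketched for the Kruskal branch, keeping the dict pattern the module already uses:

```python
import pandas as pd
from scipy.stats import kruskal

def kruskal_groups(data: pd.DataFrame, group_col: str, value_col: str):
    valid = data.dropna(subset=[group_col])  # filter NaN labels before indexing
    groups = valid[group_col].unique()
    group_data_dict = {
        name: grp[value_col].to_numpy() for name, grp in valid.groupby(group_col)
    }
    return kruskal(*[group_data_dict[g] for g in groups])
```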
@copilot apply changes based on the comments in this thread |
All code review feedback has been addressed in commits 25715b3, ae2fe0d, and dfe8ecb. All 12 tests are passing.
Please retry analysis of this Pull-Request directly on SonarQube Cloud |


Performance optimizations: eliminate O(n²) operations, fix correlation correctness, modernize pandas usage
Recent Changes (addressing code review feedback)

TimeSeriesModule.py:
- Added the sampling rate parameter (`d`) to `rfftfreq` for correct frequency values in both `process()` and `get_seasonality()` methods
- Removed the unused `acf` variable calculation in `get_seasonality()`
- Used `pd.Timedelta()` instead of `.delta` to avoid deprecation warnings

FrequenceModule.py:
- Absolute frequencies are always stored in `self.result`, even when `normalize=True`
- When `normalize=True`, relative frequencies are returned as a separate DataFrame
- `get_frequence_relative()` can always access the absolute "Fréquence" column

Documentation:

Housekeeping:

Original Performance Improvements

VarianceModule - O(n²) DataFrame filtering
- Replaced repeated `data[data[col] == value]` filtering inside nested loops with a single `groupby()` call

CorrelationModule - Incorrect chunking
- Removed the column chunking that broke correlation correctness; simplified to use pandas' optimized algorithms

Other Optimizations
- Eliminated a redundant `value_counts()` call by normalizing cached results
- Replaced `dict(zip([...], ...))` with dict comprehensions

Code Quality
- Replaced `.values` with `.to_numpy()` throughout
- Replaced assertions with `RuntimeError` for production-safe error checking

Testing
✅ All 12 existing tests pass (excluding the broken test_refactored_modules.py, which has unrelated import issues)