Merged
Binary file removed .coverage
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@

# Python cache
__pycache__/
*.pyc
.coverage
.pytest_cache/
109 changes: 109 additions & 0 deletions CODE_REVIEW_RESPONSE.md
@@ -0,0 +1,109 @@
# Code Review Response

This document summarizes the changes made in response to the code review feedback.

## Review Comments Addressed

### 1. TimeSeriesModule.py - Line 82: Missing Sampling Rate Parameter

**Issue**: The `rfftfreq` function needs a sampling rate parameter (d) to produce correct frequency values. Without it, frequencies are calculated assuming unit sampling rate, which is incorrect for time series with specific time intervals.

**Fix Applied** (Commit: ae2fe0d):
- Added automatic detection of sampling interval from series index
- For DatetimeIndex/TimedeltaIndex: extracts frequency from index.freq or calculates from first two points
- For explicit timestamps: uses timestamp deltas
- Falls back to default of 1.0 for non-temporal data
- Fixed deprecation warning by using `pd.Timedelta()` instead of `.delta`

**Code Added**:
```python
# Determine sampling interval for correct frequency calculation
sampling_interval = 1.0
if isinstance(series.index, (pd.DatetimeIndex, pd.TimedeltaIndex)):
if hasattr(series.index, "freq") and series.index.freq is not None:
sampling_interval = pd.Timedelta(series.index.freq).total_seconds()
elif len(series.index) > 1:
delta = series.index[1] - series.index[0]
sampling_interval = delta.total_seconds()
# ... then uses: freqs = np.fft.rfftfreq(len(series), d=sampling_interval)
```
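A quick standalone check of what the `d` argument changes (plain numpy, independent of the module):

```python
import numpy as np

# 8 samples taken once per day (86400 s); without d, rfftfreq assumes unit spacing
n, day = 8, 86400.0
freqs_unit = np.fft.rfftfreq(n)         # cycles per sample
freqs_real = np.fft.rfftfreq(n, d=day)  # cycles per second

# The first non-zero bin is 1/(n*d): one cycle over the full 8-day window
```

With the default `d=1.0`, the first bin reads 0.125 cycles per sample; with the real interval it reads one cycle per eight days, which is what downstream period estimates need.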

### 2. TimeSeriesModule.py - Lines 139-142: Same Sampling Rate Issue in get_seasonality()

**Issue**: The same sampling rate parameter issue exists in the `get_seasonality()` method.

**Fix Applied** (Commit: ae2fe0d):
- Applied identical sampling interval detection logic
- Used `pd.Timedelta()` to avoid deprecation warnings
- Ensures period detection is accurate for time series with explicit time scales
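To illustrate why the interval matters here, below is a minimal standalone sketch of peak-based period detection. It assumes `get_seasonality()` picks the dominant peak of the magnitude spectrum; the function name `dominant_period` is illustrative, not the module's API.

```python
import numpy as np

def dominant_period(values, sampling_interval=1.0):
    """Period (in index units) of the strongest non-DC frequency component."""
    spectrum = np.abs(np.fft.rfft(values - np.mean(values)))
    freqs = np.fft.rfftfreq(len(values), d=sampling_interval)
    peak = np.argmax(spectrum[1:]) + 1  # skip the DC bin at index 0
    return 1.0 / freqs[peak]

# Hourly samples (3600 s apart) of a signal with a 24-hour cycle
t = np.arange(240)
signal = np.sin(2 * np.pi * t / 24)
period_seconds = dominant_period(signal, sampling_interval=3600.0)  # ≈ 86400
```

With the default interval of 1.0 the same call would report a period of 24 samples; passing the true 3600-second interval converts that into the correct 86400-second (daily) period.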

### 3. TimeSeriesModule.py - Line 134: Unused Variable

**Issue**: The `acf` variable (autocorrelation) is calculated but never used in the `get_seasonality()` method.

**Fix Applied** (Commit: ae2fe0d):
- Removed the unused line: `acf = pd.Series(series).autocorr()`
- Improves performance by eliminating unnecessary computation

### 4. FrequenceModule.py - Lines 92-100: Logic Issue with normalize=True

**Issue**: When `process()` is called with `normalize=True`, the result DataFrame has "Fréquence Relative" columns instead of "Fréquence". This causes `get_frequence_relative()` to fail because it expects the "Fréquence" column to exist.

**Fix Applied** (Commit: ae2fe0d):
- Modified `process()` to always compute and store absolute frequencies in `self.result`
- When `normalize=True`, returns relative frequencies as a separate DataFrame
- Internal `self.result` always has "Fréquence" column, ensuring `get_frequence_relative()` works correctly

**Updated Logic**:
```python
# Always store absolute frequencies
freq = series.value_counts(normalize=False)
cum_freq = freq.cumsum()
self.result = pd.DataFrame({"Fréquence": freq, "Fréquence Cumulée": cum_freq})

if normalize:
# Return relative frequencies separately
rel_freq = self.result["Fréquence"] / self.result["Fréquence"].sum()
rel_cum_freq = rel_freq.cumsum()
return pd.DataFrame({
"Fréquence Relative": rel_freq,
"Fréquence Relative Cumulée": rel_cum_freq,
}, index=self.result.index)
```
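The fixed behavior can be demonstrated with a standalone stand-in that mirrors the logic above; `FrequenceSketch` and its methods are illustrative names, not the real FrequenceModule API.

```python
import pandas as pd

class FrequenceSketch:
    # Minimal stand-in mirroring the fixed logic (not the real module)
    def process(self, series, normalize=False):
        freq = series.value_counts(normalize=False)
        self.result = pd.DataFrame(
            {"Fréquence": freq, "Fréquence Cumulée": freq.cumsum()}
        )
        if normalize:
            rel_freq = self.result["Fréquence"] / self.result["Fréquence"].sum()
            return pd.DataFrame({
                "Fréquence Relative": rel_freq,
                "Fréquence Relative Cumulée": rel_freq.cumsum(),
            }, index=self.result.index)
        return self.result

    def get_frequence_relative(self):
        # Works after normalize=True too: self.result kept "Fréquence"
        return self.result["Fréquence"] / self.result["Fréquence"].sum()

m = FrequenceSketch()
rel = m.process(pd.Series(["a", "a", "b"]), normalize=True)
```

Before the fix, calling `get_frequence_relative()` after `process(..., normalize=True)` raised a `KeyError` because `self.result` only held "Fréquence Relative" columns; now the absolute column is always present.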

## Additional Improvements

### Housekeeping
- Removed accidentally committed cache files (`__pycache__`, `.coverage`)
- Updated `.gitignore` to prevent future commits of cache files

## Testing

All 12 existing tests pass:
```
tests/test_basic_imports.py::TestBasicImports::test_matplotlib PASSED
tests/test_basic_imports.py::TestBasicImports::test_numpy PASSED
tests/test_basic_imports.py::TestBasicImports::test_pandas PASSED
tests/test_basic_imports.py::TestBasicImports::test_sklearn PASSED
tests/test_correlation.py::TestCorrelationAnalysis::test_analyze_dataframe PASSED
tests/test_correlation.py::TestCorrelationAnalysis::test_analyze_univariate PASSED
tests/test_descriptives.py::TestDescriptiveStatistics::test_analyze_dataframe PASSED
tests/test_descriptives.py::TestDescriptiveStatistics::test_analyze_list PASSED
tests/test_regression_module.py::TestRegressionModule::test_linear_regression_fit PASSED
tests/test_regression_module.py::TestRegressionModule::test_linear_regression_predict PASSED
tests/test_regression_module.py::TestRegressionModule::test_regression_coefficients PASSED
tests/test_regression_module.py::TestRegressionModule::test_regression_metrics PASSED
```

## Commits Made

1. **25715b3**: Apply code review feedback: fix sampling rate, remove unused var, fix normalize logic
2. **ae2fe0d**: Remove cache files and update gitignore

## Impact

These fixes ensure:
- ✅ Correct frequency and period calculations for time series with real-world time scales
- ✅ Consistent API behavior for FrequenceModule regardless of normalize parameter
- ✅ Better code quality with no unused variables
- ✅ Cleaner repository without cache files
- ✅ Modern pandas API usage (pd.Timedelta instead of deprecated .delta)
68 changes: 68 additions & 0 deletions py_stats_toolkit/stats/correlation/CorrelationModule.py

```python
"""
=====================================================================
tags : module, stats
=====================================================================
"""
from typing import List, Tuple, Union

import numpy as np
import pandas as pd
from scipy import stats

from ...utils.parallel import ParallelProcessor
from ..core.AbstractClassBase import StatisticalModule


class CorrelationModule(StatisticalModule):
    """Correlation analysis module."""

    def __init__(self, n_jobs: int = -1):
        super().__init__()
        self.method = None
        self.parallel_processor = ParallelProcessor(n_jobs=n_jobs)

    def process(self, data: Union[pd.DataFrame, pd.Series], method: str = "pearson", **kwargs):
        """
        Computes the correlation between variables.

        Args:
            data: Input data (pandas DataFrame)
            method: Correlation method ('pearson', 'spearman', 'kendall')
            **kwargs: Additional arguments

        Returns:
            Correlation matrix
        """
        self.validate_data(data)
        self.method = method

        if not isinstance(data, pd.DataFrame):
            raise TypeError("Data must be a pandas DataFrame")

        # Compute the correlation matrix directly: pandas/numpy already use
        # optimized algorithms. Chunking the computation would produce
        # incorrect results because correlation requires all data points to
        # compute proper covariance and variance statistics.
        self.result = data.corr(method=method)
        return self.result

    def get_correlation_matrix(self):
        """Returns the correlation matrix."""
        return self.result

    def get_correlation_pairs(self, threshold: float = 0.5) -> List[Tuple[str, str, float]]:
        """
        Returns the pairs of variables whose correlation exceeds the threshold.

        Args:
            threshold: Correlation threshold

        Returns:
            List of (var1, var2, correlation) tuples
        """
        if self.result is None:
            raise ValueError("No analysis performed. Call process() first.")

        # numpy-based extraction of the unique (upper-triangle) pairs
        corr_matrix = self.result.to_numpy()
        n = len(self.result.columns)

        i, j = np.triu_indices(n, k=1)
        corr_values = corr_matrix[i, j]

        # Keep the pairs whose absolute correlation meets the threshold
        mask_indices = np.where(np.abs(corr_values) >= threshold)[0]

        return [
            (self.result.columns[i[idx]], self.result.columns[j[idx]], corr_values[idx])
            for idx in mask_indices
        ]
```
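The upper-triangle pair-extraction approach used by `get_correlation_pairs` (triangle indices plus a threshold mask) can be exercised standalone:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with x
    "z": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with x
})
corr = df.corr(method="pearson")

# Unique pairs from the upper triangle (k=1 excludes the diagonal)
i, j = np.triu_indices(len(corr.columns), k=1)
vals = corr.to_numpy()[i, j]
keep = np.where(np.abs(vals) >= 0.5)[0]
pairs = [(corr.columns[i[k]], corr.columns[j[k]], vals[k]) for k in keep]
# pairs: (x, y, 1.0), (x, z, -1.0), (y, z, -1.0)
```

Using `np.abs` in the mask is what lets strong negative correlations, like (x, z) here, pass the threshold alongside positive ones.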