Create xi_multipoles.py to compress original auto-cf to its multipoles#140
Conversation
Codecov Report
Attention: Patch coverage is 0% — none of the 68 added lines are covered by tests (Hits unchanged at 1454, Misses +68 in the diff below).
Additional details and impacted files

```
@@            Coverage Diff             @@
##           master     #140      +/-   ##
==========================================
- Coverage   39.89%   39.15%   -0.74%
==========================================
  Files          30       31       +1
  Lines        3645     3713      +68
  Branches      678      687       +9
==========================================
  Hits         1454     1454
- Misses       2047     2115      +68
  Partials      144      144
```

☔ View full report in Codecov by Sentry.
Pull request overview
This PR introduces a new module xi_multipoles.py that provides functionality to compress correlation function data into multipole representations using Legendre polynomial decomposition.
Key Changes
- Adds `XiMultipoles` class for fitting correlation function data to a Legendre multipole basis
- Implements automatic overfitting detection with a chi-squared criterion
- Provides Monte Carlo error estimation capabilities
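For context, the compression this PR implements can be sketched in isolation: at a fixed separation r, the correlation function xi(r, mu) is expanded as a sum of Legendre polynomials in mu, and the multipole coefficients are recovered by least squares. This is a standalone illustration of the idea, not the PR's actual code; the toy values are made up.

```python
import numpy as np
from numpy.polynomial import legendre as npleg

# Idea behind the module: xi(r, mu) = sum_ell xi_ell(r) * P_ell(mu),
# so at fixed r the mu-dependence can be compressed into a few
# Legendre coefficients via least squares.
mu = np.linspace(-0.999, 0.999, 41)

# Toy "data" at one r bin: a monopole plus a quadrupole P_2(mu) term.
xi0, xi2 = 1e-2, 4e-3
xi_data = xi0 + xi2 * npleg.legval(mu, [0, 0, 1])   # [0, 0, 1] selects P_2

# Design matrix whose columns are P_0..P_{nl-1} evaluated at each mu.
nl = 4
A = npleg.legvander(mu, nl - 1)                     # shape (len(mu), nl)
coeffs, *_ = np.linalg.lstsq(A, xi_data, rcond=None)
print(np.round(coeffs, 5))                          # recovers [0.01, 0, 0.004, 0]
```

In the PR, the coefficients additionally carry an r-dependence through interpolation over a grid of knots, but the mu-direction fit is this Legendre least-squares problem.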
Comments suppressed due to low confidence (1)
vega/xi_multipoles.py:57
- Variable idx is not used.
```python
idx = ((rr - self.rmin) / self.dr).astype(int)
```
```python
        self.xi_knots_fit = np.zeros((nl, nr))
        self.std_xi_mc = np.zeros((nl, nr))

    def __call__(self, r, mu, xi_knots=None):
```
Missing docstring for the __call__ method. The method should include documentation explaining parameters (r, mu, xi_knots) and return value.
Suggested change:

```python
    def __call__(self, r, mu, xi_knots=None):
        """
        Evaluate the multipole expansion of the correlation function at given (r, mu) values.

        Parameters
        ----------
        r : array_like
            Radial distances at which to evaluate the correlation function.
        mu : array_like
            Cosine of the angle to the line of sight for each r.
        xi_knots : array_like, optional
            Precomputed multipole coefficients to use for evaluation. If None, uses self.xi_knots_fit.

        Returns
        -------
        results : ndarray
            The evaluated correlation function at the specified (r, mu) values.
        """
```
```python
        return results

    def fit(self, rdata, mudata, xi_data, cov_data, mumax=0.999, nmc=0, seed=0, return_mcs=False):
```
Missing docstring for the fit method. The method should include documentation for all parameters (rdata, mudata, xi_data, cov_data, mumax, nmc, seed, return_mcs) and return value, following the NumPy/SciPy docstring convention used elsewhere in the codebase.
Suggested change:

```python
    def fit(self, rdata, mudata, xi_data, cov_data, mumax=0.999, nmc=0, seed=0, return_mcs=False):
        """
        Fit the multipole model to the input data using least squares and optionally Monte Carlo sampling.

        Parameters
        ----------
        rdata : array_like
            Array of radial distances of the data points.
        mudata : array_like
            Array of cosine of angles (mu) for the data points.
        xi_data : array_like
            Array of measured correlation function values at (rdata, mudata).
        cov_data : array_like
            Covariance matrix of the measured correlation function values.
        mumax : float, optional
            Maximum value of mu to include in the fit (default is 0.999).
        nmc : int, optional
            Number of Monte Carlo samples to draw for error estimation (default is 0).
        seed : int, optional
            Random seed for Monte Carlo sampling (default is 0).
        return_mcs : bool, optional
            If True, return the Monte Carlo samples of the fitted multipoles (default is False).

        Returns
        -------
        chi2 : float
            The chi-squared value of the fit.
        xi_mcs : ndarray, optional
            If `return_mcs` is True, returns the Monte Carlo samples of the fitted multipoles
            as an array of shape (nmc, nl, nr).
        """
```
```python
            if d > 0:
                n2 = min(self.nr - 1, n1 + 1)
            else:
                n2 = max(0, n1 - 1)

            for ell in range(self.nl):
                matrix_A[i, n1 + ell * self.nr] += (1.0 - d) * self.leg_polys[ell](mumu[i])
                matrix_A[i, n2 + ell * self.nr] += d * self.leg_polys[ell](mumu[i])
```
Logic error in interpolation weights: when `d <= 0`, the code sets `n2 = max(0, n1 - 1)` and then applies weight `d` to `n2`. For `d == 0` the weight `1.0 - d` correctly gives full weight to `n1`, and the zero weight on `n2` is harmless; but if `d` were ever negative (which the index calculation should prevent, though nothing enforces it), both weights would be wrong. Consider handling `d == 0` separately, asserting `d >= 0`, or using `abs(d)` for the second weight.
```python
            for ell in range(self.nl):
                matrix_A[i, n1 + ell * self.nr] += (1.0 - d) * self.leg_polys[ell](mumu[i])
                matrix_A[i, n2 + ell * self.nr] += d * self.leg_polys[ell](mumu[i])
```
[nitpick] Performance concern: Legendre polynomials are evaluated inside a loop for each data point. Since self.leg_polys[ell] is a callable that computes the polynomial, calling it repeatedly for the same mu values (lines 69-70) can be inefficient. Consider pre-computing the Legendre polynomial values for all unique mumu values before the loop, especially if there are repeated values or if the dataset is large.
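The vectorization the nitpick suggests can be sketched standalone. This assumes `self.leg_polys[ell]` is a callable Legendre polynomial of degree `ell` (as the snippet above suggests); if so, the whole basis can be evaluated for every mu value in a single `legvander` call instead of one callable invocation per (point, multipole) pair:

```python
import numpy as np
from numpy.polynomial import legendre as npleg

nl = 6
mumu = np.linspace(-0.99, 0.99, 1000)

# Loop version (as in the review snippet): one callable evaluation
# per (point, ell) pair.
leg_polys = [npleg.Legendre.basis(ell) for ell in range(nl)]
slow = np.array([[leg_polys[ell](m) for ell in range(nl)] for m in mumu])

# Vectorized version: evaluate P_0..P_{nl-1} at all mu in one call.
fast = npleg.legvander(mumu, nl - 1)   # shape (len(mumu), nl)

assert np.allclose(slow, fast)
```

The `(1.0 - d)` and `d` interpolation weights could then scale rows of this precomputed matrix rather than re-evaluating the polynomials inside the loop.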
```python
class XiMultipoles():
    @classmethod
    def getOverfit(
        cls, rdata, mudata, xi_data, cov_data, mumax=0.999, nlmin=6, nlmax=20, chi_c=0.8
    ):
        """ Find the nl that overfits the data such that chi2 < chi_c.
        """
```
[nitpick] Magic number without explanation: The value 0.8 is used as a default for chi_c (chi-squared criterion) without documentation. Consider documenting what this threshold represents (e.g., reduced chi-squared target) or defining it as a named constant.
Suggested change:

```python
# Default reduced chi-squared target for overfitting criterion
DEFAULT_CHI_SQUARED_CRITERION = 0.8


class XiMultipoles():
    @classmethod
    def getOverfit(
        cls, rdata, mudata, xi_data, cov_data, mumax=0.999, nlmin=6, nlmax=20,
        chi_c=DEFAULT_CHI_SQUARED_CRITERION
    ):
        """
        Find the nl that overfits the data such that chi2 < chi_c.

        Parameters
        ----------
        chi_c : float
            Reduced chi-squared target for overfitting criterion (default: 0.8).
        """
```
```python
        return xi_multipoles

    def __init__(self, r1=42.0, r2=178.0, nr=35, nl=6):
```
Missing docstring for the __init__ method. Following the pattern in other classes in this codebase, the __init__ method should include a docstring documenting all parameters (r1, r2, nr, nl).
Suggested change:

```python
    def __init__(self, r1=42.0, r2=178.0, nr=35, nl=6):
        """
        Initialize the XiMultipoles object.

        Parameters
        ----------
        r1 : float, optional
            Minimum value of the separation grid (default: 42.0).
        r2 : float, optional
            Maximum value of the separation grid (default: 178.0).
        nr : int, optional
            Number of separation grid points (default: 35).
        nl : int, optional
            Number of Legendre multipoles (default: 6).
        """
```
```python
        Nell = self.nr * self.nl
        Nr = rr.size
        matrix_A = np.zeros((Nr, Nell))
        idx = ((rr - self.rmin) / self.dr).astype(int)
```
Potential bug: variable idx is computed but never used. Line 57 calculates idx = ((rr - self.rmin) / self.dr).astype(int) but this value is not referenced anywhere in the subsequent code. The actual indexing uses a different calculation starting at line 60 with variable d. Either this variable should be removed, or there's a logic error in how indices are calculated.
Suggested change (remove the unused line):

```diff
-        idx = ((rr - self.rmin) / self.dr).astype(int)
```
```python
class XiMultipoles():
    @classmethod
    def getOverfit(
        cls, rdata, mudata, xi_data, cov_data, mumax=0.999, nlmin=6, nlmax=20, chi_c=0.8
    ):
```
[nitpick] Magic number without explanation: The value 0.999 is used as a default for mumax without documentation. Consider either documenting why this specific value is chosen in the method docstring, or defining it as a named constant with an explanatory comment (e.g., DEFAULT_MU_MAX = 0.999 # Exclude extreme mu values close to 1).
| """ | ||
| for nl in range(nlmin, nlmax): | ||
| xi_multipoles = cls(nl=nl) | ||
| chi2 = xi_multipoles.fit(rdata, mudata, xi_data, cov_data, mumax=mumax) | ||
| if chi2 < chi_c: | ||
| print("Converged for nl=", nl) | ||
| break | ||
|
|
Missing error handling: if the loop completes without finding an nl that satisfies chi2 < chi_c, the method silently returns the xi_multipoles from the last iteration, i.e. a model that failed the criterion. This could hand back a poorly fitted model without warning. Consider raising an exception or logging a warning if convergence is not achieved within the range [nlmin, nlmax).
| """ | |
| for nl in range(nlmin, nlmax): | |
| xi_multipoles = cls(nl=nl) | |
| chi2 = xi_multipoles.fit(rdata, mudata, xi_data, cov_data, mumax=mumax) | |
| if chi2 < chi_c: | |
| print("Converged for nl=", nl) | |
| break | |
| """ | |
| converged = False | |
| for nl in range(nlmin, nlmax): | |
| xi_multipoles = cls(nl=nl) | |
| chi2 = xi_multipoles.fit(rdata, mudata, xi_data, cov_data, mumax=mumax) | |
| if chi2 < chi_c: | |
| print("Converged for nl=", nl) | |
| converged = True | |
| break | |
| if not converged: | |
| raise RuntimeError( | |
| f"getOverfit: No nl in range [{nlmin}, {nlmax}) found with chi2 < {chi_c}. " | |
| f"Last chi2={chi2} for nl={nl}." | |
| ) |
```python
            np.zeros(rdata.size), cov=cov_data, size=nmc
        )[:, w]
```
Potential memory issue: The Monte Carlo simulation generates nmc samples of size rdata.size, which could be memory-intensive for large datasets. The random samples are generated with full size rdata.size and then filtered with [:, w], which is inefficient. Consider generating the random samples only for the filtered data points to reduce memory usage: rng.multivariate_normal(np.zeros(w.sum()), cov=cov_in, size=nmc).
Suggested change:

```python
            np.zeros(w.sum()), cov=cov_in, size=nmc
        )
```
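The suggestion relies on a standard property of the multivariate normal: the marginal distribution over a subset of components is again multivariate normal, with the covariance sliced to that subset. A standalone sketch, where `w` and `cov_in` are placeholders mirroring the review's names:

```python
import numpy as np

n, nmc = 50, 200
cov_data = np.diag(np.linspace(0.5, 2.0, n))   # toy covariance
w = (np.arange(n) % 2 == 0)                    # selection mask, standing in for mu < mumax
cov_in = cov_data[np.ix_(w, w)]                # covariance restricted to kept points

rng = np.random.default_rng(42)

# Wasteful: draw all n components, then discard the masked ones.
full = rng.multivariate_normal(np.zeros(n), cov=cov_data, size=nmc)[:, w]

# Cheaper: draw only the w.sum() kept components with the sliced
# covariance. Statistically equivalent by the marginalization property,
# but the sampler works with a (w.sum(), w.sum()) matrix instead of (n, n).
cut = rng.multivariate_normal(np.zeros(w.sum()), cov=cov_in, size=nmc)

print(full.shape, cut.shape)                   # both (200, 25)
```

The individual draws differ (fewer random numbers are consumed), but their distribution is the same, which is all the Monte Carlo error estimate needs.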
No description provided.