diff --git a/paper/paper.md b/paper/paper.md
index f605a5a..f38c6f5 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -39,7 +39,9 @@
 bioinformatics [@Ma:2007], medicine [@Kim:2012], econometrics [@Athey:2017], chemistry
 [@Gu:2018], and materials science [@Leong:2019]. Several generalizations of the Lasso
 [@Yuan:2006; @Friedman:2010; @Simon:2013; @Wang:2019] and Best Subset Selection
 [@Bertsimas:2016-a; @Bertsimas:2016-b] have been developed to effectively exploit
-additional structure in linear regression.
+additional structure in linear regression. The `sparse-lm` Python package provides
+a flexible, comprehensive, and user-friendly implementation of sparse linear regression
+models.
 
 # Statement of need
@@ -58,14 +60,14 @@ solving larger problems that would otherwise be unsolvable within reasonable tim
 A handful of pre-existing Python libraries implement a subset of sparse linear
 regression models that are also `scikit-learn` compatible. `celer` [@Massias:2018]
 and `groupyr` [@Richie-Halford:2021] include efficient implementations of the Lasso and
-Group Lasso, among other linear models. `group-lasso` [@Moe:2020] is another
+Group Lasso. `group-lasso` [@Moe:2020] is another
 `scikit-learn` compatible implementation of the Group Lasso. `skglm` [@Bertrand:2022]
 includes several implementations of sparse linear models based on regularization using
 combinations of $\ell_p$ ($p\in\{1/2,2/3,1,2\}$) norms and pseudo-norms. `abess`
 [@Zhu:2022] includes an implementation of Best Subset Selection and $\ell_0$
 pseudo-norm regularization.
 
-The pre-existing packages mentioned include highly performant implementations of the
+The aforementioned packages include highly performant versions of the
 specific models they implement. However, none of these packages implement the full
 range of sparse linear models available in `sparse-lm`, nor do they support the
 flexibility to modify the optimization objective and choose among many open-source and commercially
@@ -111,7 +113,7 @@
 The second method to obtain structured sparsity is by introducing linear constraints
 into the regression objective. Introducing linear constraints is straight-forward in
 mixed integer quadratic programming (MIQP) formulations of the Best Subset Selection
 [@Bertsimas:2016-a; @Bertsimas:2016-b]. The general MIQP formulation of Best Subset
-Selection with group and hierarchical structure can be expressed as follows,
+Selection with grouped covariates and hierarchical constraints can be expressed as follows,
 \begin{align}
 \beta^* = \underset{\beta}{\text{argmin}}\;
@@ -132,8 +134,8 @@ corresponding slack variable $z_{\mathbf{g}} = 1$. $M$ is a fixed parameter that
 estimated from the data [@Bertsimas:2016-a]. The second inequality constraint
 introduces general sparsity by ensuring that at most $k$ coefficients are nonzero.
 If $G$ includes only singleton groups of covariates then the MIQP formulation is equivalent
 to the Best Subset Selection problem; otherwise it is a generalization that enables
-groups-level sparsity structure. The final inequality constraint can be used to
+group-level sparsity structure. The last inequality constraint can be used to
 introduce hierarchical structure into the model. Finally, we have also included an
 $\ell_2$ regularization term controlled by the hyperparameter $\lambda$, which is useful
 when dealing with poorly conditioned design matrices.
@@ -158,7 +160,7 @@ in similar fashion to any of the available models in the `sklearn.linear_model`
 
 ## Implemented regression models
 
 The table below shows the regression models that are implemented in `sparse-lm` as well
-as available implementations in other Python packages. $\checkmark$ indicates that the
+as available implementations in other Python packages. A checkmark ($\checkmark$) indicates that the
 model selected is implemented in the package located in the corresponding column.
 
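A note on the MIQP hunks above (`@@ -111,7 +113,7 @@` and `@@ -132,8 +134,8 @@`): the big-M slack variables $z_{\mathbf{g}}$, the group-cardinality bound $k$, the hierarchy constraint, and the $\ell_2$ term described there can be sanity-checked with a minimal `cvxpy` sketch. The grouping, the hierarchy pair, and the values of $k$, $M$, and $\lambda$ below are illustrative assumptions, not sparse-lm's actual implementation.

```python
# Minimal cvxpy sketch of the MIQP formulation described in the diff above.
# Groups, the hierarchy pair, and all hyperparameter values are illustrative.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 6
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, -2.0, 0.0, 0.0, 1.0, 0.0]) + 0.1 * rng.normal(size=n)

groups = [[0, 1], [2, 3], [4, 5]]  # G: disjoint groups of covariates
k = 2       # at most k groups may be active
M = 10.0    # big-M bound; in practice estimated from the data [@Bertsimas:2016-a]
lam = 1e-3  # lambda: l2 (ridge) regularization strength

beta = cp.Variable(p)
z = cp.Variable(len(groups), boolean=True)  # one slack variable z_g per group

constraints = []
for g, idx in enumerate(groups):
    # Big-M constraints: coefficients in group g can be nonzero only if z_g = 1.
    constraints += [beta[idx] <= M * z[g], beta[idx] >= -M * z[g]]
constraints.append(cp.sum(z) <= k)  # group-level sparsity: at most k active groups
constraints.append(z[1] <= z[0])    # example hierarchy: group 1 active only if group 0 is

objective = cp.Minimize(cp.sum_squares(X @ beta - y) + lam * cp.sum_squares(beta))
problem = cp.Problem(objective, constraints)
problem.solve()  # requires a mixed-integer-capable solver, e.g. SCIP or Gurobi
print(beta.value.round(3), z.value)
```

Tying every coefficient in a group to a single boolean $z_{\mathbf{g}}$ is what produces group-level rather than coefficient-level sparsity; with singleton groups the sketch reduces to plain Best Subset Selection, as the hunk states.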
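The final hunk refers to fitting the models "in similar fashion to any of the available models in the `sklearn.linear_model`" module. A minimal usage sketch of that workflow follows; the `sparselm.model` import path, the `GroupLasso` class name, and its `groups` and `alpha` arguments are assumptions based on the paper's description, not a verified reproduction of the package API.

```python
# Hedged usage sketch: estimator name, import path, and argument formats
# are assumptions; only the scikit-learn-style workflow is taken from the paper.
import numpy as np
from sklearn.model_selection import cross_val_score

from sparselm.model import GroupLasso  # assumed import path

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 6))
true_beta = np.array([2.0, 2.0, 0.0, 0.0, -1.5, -1.5])  # groups 0 and 2 active
y = X @ true_beta + 0.1 * rng.normal(size=100)

# Assumed format: one group id per covariate, i.e. three groups of two.
model = GroupLasso(groups=[0, 0, 1, 1, 2, 2], alpha=0.5)
model.fit(X, y)                      # standard scikit-learn estimator API
print(model.coef_)                   # fitted coefficients, sparse at the group level
print(cross_val_score(model, X, y))  # interoperates with sklearn model selection tools
```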