Add landscape analysis #7

Draft: wants to merge 1 commit into base: main
76 changes: 76 additions & 0 deletions 2024-landscape.md
@@ -0,0 +1,76 @@
---
site:
hide_toc: true
hide_footer_links: true
---

# 2024 Landscape Analysis

Python is widely adopted in data science, and its use for statistics is expanding rapidly, particularly in education and applied research.
The statistical ecosystem in Python is currently anchored by four major libraries:

- [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html), which provides a comprehensive suite of probability distributions, summary statistics, and basic statistical tests;
- [statsmodels](https://www.statsmodels.org/), which offers tools for econometrics, classical statistics, and statistical modeling—including linear and generalized linear models, time series analysis, and hypothesis testing;
- [scikit-learn](https://scikit-learn.org/), which is best known for machine learning but also supports some statistical modeling, offering a consistent API for predictive analytics and data preprocessing; and
- [seaborn](https://seaborn.pydata.org/), a library built on top of matplotlib that excels at creating informative and attractive statistical graphics, making it easier to visualize distributions, relationships, and trends in data.

These core libraries are generally well-tested, reliable, and uphold high software engineering standards, making them trusted foundations for research and application.
Libraries like scikit-learn are especially valued for their clean, consistent interfaces and their integration with the broader Python data stack, which streamlines workflows and enhances usability for both new and experienced users.

While there are many smaller, specialized packages available, the ecosystem remains dominated by these large, general-purpose libraries. This concentration of resources ensures stability and quality but can also limit the visibility and adoption of innovative or niche statistical tools.
As Python’s role in statistics continues to grow, fostering a more diverse and accessible ecosystem will be key to meeting the evolving needs of educators, researchers, and practitioners.

# Relationship to Other Languages

R remains the gold standard for statistics, with statistics-focused branding, a more cohesive ecosystem, and more teaching resources.
R's [tidyverse](https://www.tidyverse.org/) and [RStudio](https://posit.co/products/open-source/rstudio/) provide a smoother user experience for statistics, and CRAN offers a vast repository of statistical packages.

:::{table} Python vs. R for Statistics
:label: table
:align: center

| Aspect | Python (Scientific Python) | R (CRAN, tidyverse) |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------- |
| Core Libraries | [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html), [statsmodels](https://www.statsmodels.org/), [scikit-learn](https://scikit-learn.org/) | [base R](https://www.r-project.org/), [tidyverse](https://www.tidyverse.org/), many CRAN packages |
| User Experience | Fragmented, less cohesive | Cohesive, tidyverse pipelines, RStudio |
| Teaching Resources | Improving, but less abundant | Extensive, beginner-friendly |
| Community | Large, less connected in statistics | Strong, statistics-focused, welcoming |
| Package Development | High barriers, less modularity | Easy, many small packages, dev tools |
| Interoperability | Needs improvement (data structures, APIs) | Strong within tidyverse, RStudio |
| Branding | Data science/machine learning focus | Statistics-focused |

:::

**Interoperability**: While some users switch between Python and R in their workflows, true interoperability is limited.
Most projects stick to a single language; those that combine the two typically split the work by stage, for example using R for data manipulation and Python for modeling, or vice versa.

**Other Platforms**: Tools like GraphPad Prism remain popular among practicing scientists for basic statistical analyses, indicating that neither Python nor R fully dominates all applied domains.

# Weaknesses and Needs

Despite Python's strengths, several challenges remain.

- **Fragmentation**: The ecosystem is fragmented, with major libraries (e.g., statsmodels vs. scikit-learn) adopting incompatible APIs and workflows, leading to confusion for users and students.
- **User Experience**: There is no central landing place or unified entry point for statistics in Python, unlike R's [tidyverse](https://www.tidyverse.org/) or RStudio, making it harder for newcomers to get started.
- **Interoperability**: Data structures (such as those from [pandas](https://pandas.pydata.org/) and [NumPy](https://numpy.org/)) do not always work seamlessly across libraries: users often need explicit conversions, and the type a function returns can be hard to predict, in contrast to R's tidyverse pipelines.
- **Teaching Resources**: Python lacks the abundance of user-friendly, statistics-focused tutorials and case studies found in the R community.
- **Contributor Barriers**: Contributing to core libraries can be difficult due to high standards and lack of modularity.
Small, specialized packages exist but are less visible and less widely used than in R.
- **Statistical Methods Coverage**: Some advanced or niche statistical methods are missing or hard to find, especially compared to R's vast [CRAN](https://cran.r-project.org/) repository.
- **Community and Culture**: The Python statistics community is less cohesive and connected than R's, which benefits from a strong identity and established events.

# Conclusion

Python's statistics ecosystem is powerful but fragmented, with significant opportunities for improvement in usability, interoperability, teaching resources, and community cohesion.
While R remains the default for statistics, Python is gaining ground, especially as data science and machine learning continue to grow in influence.
Stronger integration, better documentation, and a more unified vision could help Python become a true peer to R in the statistics domain.
In particular, Python needs:

- A unified, user-friendly interface for statistics, possibly modeled after R's tidyverse.
- Improved interoperability between core data structures and libraries.
- More accessible teaching resources and case studies focused on statistics.
- Lower barriers for contributors and greater visibility for specialized statistical packages.
- Stronger community identity and central organization for statistics in Python.

The Statistical Python project seeks to address these needs by fostering collaboration, sharing best practices, and building a sustainable, inclusive community.
As a domain stack within the [Scientific Python project](https://scientific-python.org/), and with support from an NSF POSE Phase I grant, we are committed to making Python a premier platform for statistical computing, education, and research.
2 changes: 0 additions & 2 deletions about.md
@@ -10,6 +10,4 @@ The Statistical Python project was launched with support from a [grant from the
We are now completing Phase I, which has centered on scoping activities to inform the transition into a sustainable open-source ecosystem.
During this phase, we conducted interviews with stakeholders across the statistical and scientific Python communities, engaged with related domain-stack OSEs to learn from their experiences, and organized a workshop to gather input on community needs and technical directions.

<!--
Based on our [2024 Landscape Analysis](2024-landscape), we ...
-->