diff --git a/2024-landscape.md b/2024-landscape.md new file mode 100644 index 0000000..f405976 --- /dev/null +++ b/2024-landscape.md @@ -0,0 +1,76 @@ +--- +site: + hide_toc: true + hide_footer_links: true +--- + +# 2024 Landscape Analysis + +Python is widely adopted in data science, and its use for statistics is expanding rapidly, particularly in education and applied research. +The statistical ecosystem in Python is currently anchored by four major libraries: + +- [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html), which provides a comprehensive suite of probability distributions, summary statistics, and basic statistical tests; +- [statsmodels](https://www.statsmodels.org/), which offers tools for econometrics, classical statistics, and statistical modeling—including linear and generalized linear models, time series analysis, and hypothesis testing; +- [scikit-learn](https://scikit-learn.org/), which is best known for machine learning but also supports some statistical modeling, offering a consistent API for predictive analytics and data preprocessing; and +- [seaborn](https://seaborn.pydata.org/), a library built on top of matplotlib that excels at creating informative and attractive statistical graphics, making it easier to visualize distributions, relationships, and trends in data. + +These core libraries are generally well-tested, reliable, and uphold high software engineering standards, making them trusted foundations for research and application. +Libraries like scikit-learn are especially valued for their clean, consistent interfaces and their integration with the broader Python data stack, which streamlines workflows and enhances usability for both new and experienced users. + +While there are many smaller, specialized packages available, the ecosystem remains dominated by these large, general-purpose libraries. This concentration of resources ensures stability and quality but can also limit the visibility and adoption of innovative or niche statistical tools. +As Python’s role in statistics continues to grow, fostering a more diverse and accessible ecosystem will be key to meeting the evolving needs of educators, researchers, and practitioners. + +# Relationship to Other Languages + +R remains the gold standard for statistics, with better branding, a more cohesive ecosystem, and more teaching resources. +R's [tidyverse](https://www.tidyverse.org/) and [RStudio](https://posit.co/products/open-source/rstudio/) provide a smoother user experience for statistics, and CRAN offers a vast repository of statistical packages. + +:::{table} Python vs. R for Statistics +:label: table +:align: center + +| Aspect | Python (Scientific Python) | R (CRAN, tidyverse) | +| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------- | +| Core Libraries | [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html), [statsmodels](https://www.statsmodels.org/), [scikit-learn](https://scikit-learn.org/) | [base R](https://www.r-project.org/), [tidyverse](https://www.tidyverse.org/), many CRAN packages | +| User Experience | Fragmented, less cohesive | Cohesive, tidyverse pipelines, RStudio | +| Teaching Resources | Improving, but less abundant | Extensive, beginner-friendly | +| Community | Large, less connected in statistics | Strong, statistics-focused, welcoming | +| Package Development | High barriers, less modularity | Easy, many small packages, dev tools | +| Interoperability | Needs improvement (data structures, APIs) | Strong within tidyverse, RStudio | +| Branding | Data science/machine learning focus | Statistics-focused | + +::: + +**Interoperability**: While some users switch between Python and R in their workflows, true interoperability is limited. +Most projects use one language at a time, often leveraging R for data manipulation and Python for modeling or vice versa. + +**Other Platforms**: Tools like GraphPad Prism remain popular among practicing scientists for basic statistical analyses, indicating that neither Python nor R fully dominates all applied domains. + +# Weaknesses and Needs + +Despite Python's strengths, several challenges remain. + +- **Fragmentation**: The ecosystem is fragmented, with major libraries (e.g., statsmodels vs. scikit-learn) adopting incompatible APIs and workflows, leading to confusion for users and students. +- **User Experience**: There is no central landing place or unified entry point for statistics in Python, unlike R's [tidyverse](https://www.tidyverse.org/) or RStudio, making it harder for newcomers to get started. +- **Interoperability**: Data structures (such as those from [pandas](https://pandas.pydata.org/) and [NumPy](https://numpy.org/)) do not always work seamlessly across libraries, requiring conversions and leading to unpredictable function outputs compared to R's tidyverse pipelines. +- **Teaching Resources**: Python lacks the abundance of user-friendly, statistics-focused tutorials and case studies found in the R community. +- **Contributor Barriers**: Contributing to core libraries can be difficult due to high standards and lack of modularity. + Small, specialized packages exist but are less visible and less widely used than in R. +- **Statistical Methods Coverage**: Some advanced or niche statistical methods are missing or hard to find, especially compared to R's vast [CRAN](https://cran.r-project.org/) repository. +- **Community and Culture**: The Python statistics community is less cohesive and connected than R's, which benefits from a strong identity and established events. + +# Conclusion + +Python's statistics ecosystem is powerful but fragmented, with significant opportunities for improvement in usability, interoperability, teaching resources, and community cohesion. +While R remains the default for statistics, Python is gaining ground, especially as data science and machine learning continue to grow in influence. +Stronger integration, better documentation, and a more unified vision could help Python become a true peer to R in the statistics domain. +In particular, Python needs: + +- A unified, user-friendly interface for statistics, possibly modeled after R's tidyverse. +- Improved interoperability between core data structures and libraries. +- More accessible teaching resources and case studies focused on statistics. +- Lower barriers for contributors and greater visibility for specialized statistical packages. +- Stronger community identity and central organization for statistics in Python. + +The Statistical Python project seeks to address these needs by fostering collaboration, sharing best practices, and building a sustainable, inclusive community. +As a domain stack within the [Scientific Python project](https://scientific-python.org/), and with support from the NSF POSE Phase I grant, we are committed to making Python a premier platform for statistical computing, education, and research. diff --git a/about.md b/about.md index 24d8ac4..9cf9b14 100644 --- a/about.md +++ b/about.md @@ -10,6 +10,4 @@ The Statistical Python project was launched with support from a [grant from the We are now completing Phase I, which has centered on scoping activities to inform the transition into a sustainable open-source ecosystem. During this phase, we conducted interviews with stakeholders across the statistical and scientific Python communities, engaged with related domain-stack OSEs to learn from their experiences, and organized a workshop to gather input on community needs and technical directions. -