It’s hard, although possible, to build reliable systems out of unreliable components.[1] Dask is a largely community-driven open source project, and its components evolve at different rates. Not all parts of Dask are equally mature; even the components we cover in this book have different levels of support and development. While Dask’s core parts are well maintained and tested, some parts lack the same level of maintenance.
Still, there are already dozens of popular libraries built specifically for Dask, and the open source Dask community is growing around them. This gives us some confidence that many of these libraries are here to stay. The table "Libraries frequently used with Dask" shows a non-exhaustive list of foundational libraries in use and their relation to the core Dask project. It is meant as a road map for users, not an endorsement of individual projects. Though we haven’t attempted to cover all the projects shown here, we offer evaluations of some of them throughout the book.
| Category | Subcategory | Libraries |
|---|---|---|
| Dask project | | |
| Data structures: Extend functionality, specific scientific data handling, or deployment hardware options of Dask built-in data structures | Functionalities and convenience | |
| | Hardware | |
| Deployment: Extend deployment options for use with Dask distributed | Containers | |
| | Cloud | |
| | GPU | |
| | HPC | |
| ML and analytics: Extend ML libraries and computation with Dask | | |
It’s essential to understand the state of the components that you are considering using. If you need to use a less maintained or developed part of Dask, defensive programming, including thorough code testing, will become even more critical. Working on less-established parts of the Dask ecosystem can also be an exciting opportunity to become more involved and contribute fixes or documentation.
Note

This is not to say that closed source software does not suffer from the same challenges (e.g., untested components), but we are in a better place to evaluate and make informed choices with open source software.
Of course, not all of our projects need to be maintainable, but as the saying goes, "Nothing is more permanent than a temporary fix." If something is truly a one-time-use project, you can likely skip most of the analysis here and try out the libraries to see if they work for you.
Dask is under rapid development, and any static table of which components are production-ready would be out of date by the time it was read. So instead of sharing our views on which components of Dask are currently well developed, this chapter aims to give you the tools to evaluate the libraries you may be considering. In this chapter, we separate metrics that you can measure concretely from the fuzzier qualitative metrics. Perhaps counterintuitively, we believe that the "fuzzier" qualitative metrics are a better framework for evaluating components and projects.
Along the way, we’ll look at some projects and how they are measured, but please keep in mind that these specific observations may be out of date by the time you read this, and you should do your own evaluation with the tools provided here.
Tip

While we focus on the Dask ecosystem in this chapter, you can apply most of these techniques throughout software tool selection.
We start by focusing on qualitative tools since we believe these tools are the best for determining the suitability of a particular library for your project.
Some projects prioritize benchmarks or performance numbers, while others prioritize correctness and clarity, and still others prioritize completeness. A project’s README or home page is often a good indicator of what the project prioritizes. Early in its life, Apache Spark’s home page focused on performance benchmarks, whereas now it shows an ecosystem of tools, leaning more toward completeness. The Dask Kubernetes GitHub README shows a collection of badges indicating the state of the code and not much else, revealing a strong developer focus.
While there are many arguments for and against focusing on benchmarks, correctness should almost never be sacrificed.[2] This does not mean that libraries will never have bugs; rather, projects should take reports of correctness issues seriously and treat them with higher priority than others. An excellent way to see whether a project values correctness is to look for reports of correctness and observe how the core developers respond.
Many Dask ecosystem projects use GitHub’s built-in issue tracker, but if you don’t see any activity, check the README and developer guides to see if the project uses a different issue tracker; for example, many ASF projects use JIRA. Looking into how folks respond to issues gives you a good idea of which issues they consider important. You don’t need to look at all of them; a small sample of 10, covering open and unfixed issues as well as closed and fixed ones, is often enough.
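As a rough sketch of how you might pull such a sample, the following uses GitHub’s public REST API (unauthenticated, and therefore rate limited); the repository name is only an example:

```python
import requests

# Sample the most recent issues (open or closed) for a repository.
repo = "dask/dask-kubernetes"  # example repo; swap in the project you are evaluating
resp = requests.get(
    f"https://api.github.com/repos/{repo}/issues",
    params={"state": "all", "per_page": 10},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()

for issue in resp.json():
    if "pull_request" in issue:  # this endpoint also returns pull requests
        continue
    print(issue["state"], issue["created_at"], f'{issue["comments"]} comments', issue["title"])
```

How many comments an issue attracts, and how quickly it gets closed, is a rough proxy for how seriously reports are taken.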
As one of the unofficial ASF sayings goes, "Community over code."[3] The Apache Way website describes this as meaning "the most successful long-lived projects value a broad and collaborative community over the details of the code itself." This saying matches our experience, in which we’ve found that technical improvements are easier to copy from other projects, but the community is much harder to move. Measuring community is challenging, and it can be tempting to look at the number of developers or users, but we think it’s essential to go beyond that.
Finding the community associated with a particular project can be tricky. Take your time to look around at issue trackers, source code, forums (like Discourse), and mailing lists. For example, Dask’s Discourse group is highly active. Some projects use IRC, Slack, or Discord for their "interactive" communication, and in our opinion, some of the best ones put in the effort to make those conversations appear in search indexes. Sometimes parts of the community may exist on external social media sites, and these pose a unique set of challenges to community standards.
There are multiple types of communities for open source software projects. The user community is the people who are using the software to build things. The developer community is the group working on improving the library. Some projects have large intersections between these communities, but often the user community is much larger than the developer community. We are biased toward evaluating the developer community, but it’s important to ensure both are healthy. Software projects without enough developers will move slowly, and projects without users are frequently challenging to use by anyone except the developers.
In many situations, a large community with enough jerks (or a lead jerk) can be a much less enjoyable environment than a small community of nice folks. You are less likely to be productive if you are not enjoying your work. Sadly, figuring out if someone is a jerk or if a community has jerks in it is a complex problem. If people are generally rude on the mailing list or in the issue tracker, this can be a sign that the community is not as welcoming to new members.[4]
Note

Some projects, including one of Holden’s projects, have attempted to quantify some of these metrics using sentiment analysis combined with random sampling, but this is a time-consuming process you can probably skip in most cases.
Even with the nicest people, it can matter which institutions the contributors are associated with. If, for example, the top contributors are all grad students in the same research lab or work at the same company, the risk that the software is abandoned increases. This is not to say that single-company or even single-person open source projects are bad,[5] but you should adjust your expectations to match.
Note

If you are concerned that a project does not meet the level of maturity you need and you have a budget, this can be an excellent opportunity to support critical open source projects. Reach out to maintainers and see what they need; sometimes it’s as simple as writing them a check for new hardware or hiring them to provide training for your company.
Beyond whether people are nice in a community, it can be a positive sign if folks are using the project similarly to how you are considering using it. If, for example, you are the first person to apply Dask DataFrames to a new domain, even though Dask DataFrames themselves are very mature, you are more likely to find missing components than if other folks in the same area of application are already using Dask.
When it comes to Dask libraries, there are a number of Dask-specific best practices to look for. In general, libraries should not do too much work on the client node; as much work as possible should be delegated to the workers. The documentation will sometimes gloss over which parts happen where, and in our experience the fastest way to tell is simply to run the example code and see which tasks get scheduled on the workers. Relatedly, libraries should bring back only the smallest bits of data possible. These best practices differ slightly from those for writing your own Dask code, since with your own code you know your data size beforehand and can determine when local compute is the best path forward.
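A minimal sketch of that check, assuming a local cluster is enough for inspection: run the library’s example under Dask’s get_task_stream context manager and see where the tasks actually ran (the workload here is just Dask’s built-in demo time series).

```python
import dask.datasets
from dask.distributed import Client, get_task_stream

client = Client()  # small local cluster, just for inspection

with get_task_stream(plot=False) as ts:
    # Replace this block with the example code from the library you are evaluating.
    df = dask.datasets.timeseries()
    df.groupby("name").x.mean().compute()

# Each record notes which worker ran which task; very little here while the
# client process stays busy suggests the work is happening locally instead.
workers = {task["worker"] for task in ts.data}
print(f"{len(ts.data)} tasks ran across {len(workers)} worker(s)")
```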
If a project pins a dependency at a specific version, it is important that the pinned version does not conflict with the other packages you want to use and, even more importantly, that none of the pinned dependencies are insecure. What constitutes "up to date" is a matter of opinion. If you are the kind of developer who likes using the latest version of everything, you’ll probably be happiest with libraries that (mostly) specify minimum but not maximum versions. However, this can be misleading since, especially in the Python ecosystem, many libraries do not use semantic versioning (including Dask, which uses CalVer), and just because a project does not exclude a new version does not mean it will actually work with it.
Note

Some folks would call this quantitative, but in a CalVer-focused ecosystem, we believe this is more qualitative.
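One quick way to inspect the version constraints an installed package declares, using only the standard library (Dask here is just an example package name):

```python
from importlib.metadata import requires

# List the declared requirements of an installed distribution and flag exact
# pins (==), which are the most likely to conflict with your other packages.
for req in requires("dask") or []:
    marker = "  <-- exact pin" if "==" in req else ""
    print(req + marker)
```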
A good check, when considering adding a new library to an existing environment, is to try to run the new library’s test suite in the virtual environment that you plan to use it in (or in an equivalently configured one).
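A rough sketch of doing that from the target interpreter, assuming the project uses pytest and you have cloned it locally (the path below is a placeholder):

```python
import subprocess
import sys

# Run the candidate project's tests with the same interpreter (and therefore
# the same installed packages) you plan to deploy with.
result = subprocess.run(
    [sys.executable, "-m", "pytest", "path/to/candidate-project/tests", "-q"],
    capture_output=True,
    text=True,
)
print(result.stdout[-2000:])  # tail of the pytest summary
print("tests passed" if result.returncode == 0 else "tests failed")
```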
While not every tool needs a book (although we do hope you find books useful), very few libraries are truly self-explanatory. On the low end, for simple libraries, a few examples or well-written tests can serve as a stand-in for proper documentation. Complete documentation is a good sign of overall project maturity. Not all documentation is created equal, and as the saying goes, documentation is normally out of date as soon as it is finished (if not before). A good exercise to do, before you dive all the way into a new library, is to open up the documentation and try to run the examples. If the getting-started examples don’t work (and you can’t figure out how to fix them), you will likely be in for a rough ride.
Sometimes good documentation exists but is separate from the project (e.g., in books), and some research may be required. If you find a project has good but not self-evident documentation, consider trying to improve the visibility of the documentation.
If you find the library is promising but not all the way there, it’s important to be able to contribute your improvements back to the library. This is good for the community, and besides, if you can’t upstream your improvements, upgrading to new versions will be more challenging.[6] Many projects nowadays have contribution guides that can give you an idea of how they like to work, but nothing beats a real test contribution. A great place to start with a project is fixing its documentation with the eyes of a newcomer, especially those getting-started examples from the previous section. Documentation often becomes out of sync in fast-moving projects, and if you find it difficult to get your documentation changes accepted, that is a strong indicator of how challenging it will be to contribute more complicated improvements later.
Something to pay attention to is what the issue-reporting experience is like. Since almost no software is completely free of bugs, you may well encounter an issue. Whether or not you have the energy or skills to fix the bug yourself, sharing your experience is vital so it can be fixed. Sharing the bug can also help the next person who encounters the same challenge feel less alone, even if the issue remains unresolved.
Note

Pay attention to your experience when trying to report an issue. Most large projects with active communities will have some guidance to help you submit your issue and ensure it’s not duplicating a previous issue. If that guidance is lacking, reporting an issue may be harder, and it can be a sign of a smaller community.
If you don’t have time to make your own test contribution, you can always take a look at a project’s pull requests (or equivalent) and see if the responses seem positive or antagonistic.
Not all changes to libraries necessarily need to go upstream. If a library is appropriately structured, you can add additional functionality without changing the underlying code. Part of what makes Dask so powerful is its extensibility; for example, support for user-defined functions and aggregations lets you adapt Dask to many use cases without modifying its internals.
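As a small illustration of that extensibility, the following sketch defines a user-defined aggregation with dd.Aggregation and applies it through groupby; the data and names here are made up.

```python
import pandas as pd
import dask.dataframe as dd

# A custom aggregation: a per-partition partial sum combined across partitions.
# No change to Dask itself is required.
custom_sum = dd.Aggregation(
    name="custom_sum",
    chunk=lambda grouped: grouped.sum(),  # partial result within each partition
    agg=lambda partials: partials.sum(),  # combine the partials across partitions
)

pdf = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf.groupby("group").agg(custom_sum).compute())
```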
As software developers and data scientists, we often try to use quantitative metrics to make our decisions. Quantitative metrics for software, in both open source and closed source, are an area of active research, so we won’t be able to cover all of them here. A large challenge with quantitative metrics for open source projects is that, especially once money gets involved, the metrics can be influenced. We instead recommend focusing on qualitative factors, which, while more difficult to measure, are also more difficult to game.
Here we cover a few metrics that folks commonly attempt to use; there are many other frameworks for evaluating open source projects, including the OSSM, OpenSSF Security Metrics, and more. Some of these frameworks ostensibly produce automated scores (like the OpenSSF), but in our experience, not only are the collected metrics gameable, they are often collected incorrectly.[7]
Frequent releases can be a sign of a healthy library. If a project has not been released for a long time, you are more likely to run into conflicts with other libraries. For libraries built on top of tools like Dask, one detail to check is how many months (or days) it takes for a new version of the library to be released on top of the latest version of Dask. Some libraries do not do traditional releases, but rather suggest installing directly from the source repo. This is often a sign of a project earlier in the development phase, which can be more challenging to take on as a dependency.[8]
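One way to eyeball release cadence is PyPI’s public JSON metadata API; here is a quick sketch (the package name is only an example):

```python
import requests

# Fetch release metadata and print the five most recent release dates.
pkg = "dask-kubernetes"  # swap in the library you are evaluating
data = requests.get(f"https://pypi.org/pypi/{pkg}/json", timeout=30).json()

releases = []
for version, files in data["releases"].items():
    if files:  # some versions have no uploaded files
        releases.append((files[0]["upload_time"], version))

for upload_time, version in sorted(releases, reverse=True)[:5]:
    print(upload_time, version)
```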
Release history is one of the easiest metrics to game, as all it requires is the developers making a release. Some development styles will automatically create releases after every successful check-in, which in our opinion is an anti-pattern,[9] as you often want some additional level of human testing or checking before a full release.
Another popular metric people consider is commit frequency or volume. This metric is far from perfect, as frequency and volume can vary widely depending on coding style, which has little correlation with software quality. For example, developers who tend to squash commits will have a lower commit volume, whereas developers who primarily use rebases will have a higher one.
On the flip side, the complete lack of recent commits can be a sign that a project has become abandoned, and if you decide to use it, you will end up having to maintain a fork.
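If you have the repository cloned, a quick way to gauge recent activity (with the caveats above in mind) is to count commits over a time window with git; a rough sketch:

```python
import subprocess

# Count commits from the last year in a local clone; the path is a placeholder.
repo_path = "path/to/cloned-project"
count = subprocess.run(
    ["git", "-C", repo_path, "rev-list", "--count", "--since=1 year ago", "HEAD"],
    capture_output=True,
    text=True,
    check=True,
).stdout.strip()
print(f"commits in the last year: {count}")
```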
One of the simplest metrics to check is if people are using a package, which you can see by looking at the installs. You can check PyPI package install stats on the PyPI Stats website (see Dask Kubernetes install stats from PyPI Stats) or on Google's BigQuery, and conda installs using the condastats library.
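For a quick look at PyPI download counts, here is a small sketch against the public pypistats.org API (the package name is only an example):

```python
import requests

# Recent download counts (last day/week/month) as reported by pypistats.org.
pkg = "dask-kubernetes"
resp = requests.get(f"https://pypistats.org/api/packages/{pkg}/recent", timeout=30)
resp.raise_for_status()
print(resp.json()["data"])
```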
Unfortunately, installation counts are a noisy metric, as PyPI downloads can come from anything from CI pipelines to even someone spinning up a new cluster with the library installed but never used. Not only is this metric unintentionally noisy, but the same techniques can also be used to increase the numbers artificially.
Instead of depending heavily on the number of package installs, we like to see if we can find examples of people using the libraries—such as by searching for imports on GitHub or Sourcegraph. For example, we can try to get an approximate number of people using Streamz or cuDF with Dask by searching `(file:requirements.txt OR file:setup.py) cudf AND dask` and `(file:requirements.txt OR file:setup.py) streamz AND dask` with Sourcegraph, which yields 72 and 33 results, respectively. This captures only a few, but when we compare this to the same query for Dask (which yields 500+), it suggests that Streamz has lower usage than cuDF in the Dask ecosystem.
Looking for examples of people using a library has its limitations, especially with data processing. Since data and machine learning pipelines are not as frequently open sourced, finding examples can be harder for libraries used for those purposes.
Another proxy for usage you can look at is the frequency of issues or mailing list posts. If a project is hosted on something like GitHub, stars can also be an interesting way of measuring usage—but since people can now buy GitHub stars just like Instagram likes (as shown in Someone selling GitHub stars), you shouldn’t weigh this metric too heavily.[10]
Even setting aside people purchasing stars, what constitutes a project worth starring varies from person to person. Some projects will, while not purchasing stars, ask many individuals to star their projects, which can quickly inflate this metric.[11]
Software testing is second nature to many software engineers, but sometimes projects are created hastily without tests. If a project lacks tests, or its tests are mostly failing, it’s much harder to have confidence in how the project will behave. Even in the most professional of projects, corners sometimes get cut when it comes to testing, and adding more tests to a project can be a great way to ensure that it continues to function in the ways you need it to. A good question to ask is whether the tests cover the parts that are important to you. If a project does have relevant tests, the next natural question is whether they are being used. If it’s too difficult to run the tests, human nature often takes over and they don’t get run, so a good step is to see if you can run the project’s tests yourself.
Note

Test coverage numbers can be especially informative, but unfortunately, for projects built on top of systems like Dask,[12] getting accurate test coverage information is a challenge. Instead, a more qualitative approach is often needed here. In single-machine systems, test coverage can be an excellent automatically computed quantitative metric.
We believe that most good libraries will have some form of continuous integration (CI) or automated testing that runs on proposed changes (i.e., when a pull request is created). You can check whether a GitHub project has continuous integration by looking at its pull requests tab. CI can be very helpful for reducing bugs overall, especially regressions.[13] Historically, use of CI was somewhat a matter of project preference, but with the creation of free tools, including GitHub Actions, many multi-person software projects now have some form of CI. This is a common software engineering practice, and we consider it essential for libraries that we depend on.
Static typing is frequently considered a programming best practice, though there are some detractors. While the arguments for and against static types inside data pipelines are complex, we believe some typing at the library level is something one should expect.
When building data (or other) applications on Dask, you will likely need many different tools from the ecosystem. The ecosystem evolves at different rates, with some parts requiring more investment by you before you can use them effectively. Choosing the right tools, and transitively the right people, is key to whether your project will succeed and, in our experience, to how enjoyable your work will be. It’s important to remember that these decisions are not set in stone, but changing a library tends to get harder the longer you’ve been using it in your project. In this chapter, you’ve learned how to evaluate the different components of the ecosystem for project maturity. You can use this knowledge to decide when to use a library versus writing the functionality you need yourself.
[2] For example, an issue with set_index in Dask-on-Ray caused rows to disappear; this issue took about a month to fix, which in our opinion is quite reasonable given the challenges in reproducing it. Sometimes correctness fixes, like security fixes, can result in slower processing; for example, MongoDB’s defaults are very fast but can lose data.