Appendix A: Debugging Dask

Depending on your existing debugging techniques, moving to a distributed system may require learning some new ones. While you can use debuggers in remote mode, doing so often requires more setup work. In many situations you can instead run Dask locally and use your existing debugging tools, although—take it from us—a surprising number of difficult-to-debug errors don’t show up in local mode. Dask also offers a special hybrid option: its distributed scheduler can run on a single machine, keeping your code reachable by local tools while still exercising the distributed machinery. Some errors happen outside of Python entirely, such as container out-of-memory (OOM) errors, segmentation faults, and other native errors, which makes them more difficult to debug.

Note

Some of this advice is common across distributed systems, including Ray and Apache Spark. As such, some elements of this appendix are shared with High Performance Spark, second edition, and Scaling Python with Ray.

Using Debuggers

There are a few different options for using debuggers in Dask. PyCharm and pdb both support connecting to remote debugger processes, but figuring out where your task is running and then setting up the remote debugger can be a challenge. For details on PyCharm remote debugging, see the JetBrains article "Remote Debugging with PyCharm". One option is to use epdb and run import epdb; epdb.serve() inside of an actor. The easiest option, although not a perfect one, is to have Dask re-run a failed task locally by calling client.recreate_error_locally on the future that failed.
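As a rough sketch of this last approach (using a local cluster purely for illustration, with a made-up failing function called inverse), the flow looks roughly like this:

[source,python]
----
import pdb

from dask.distributed import Client

client = Client()  # a local cluster for illustration; point this at your real scheduler


def inverse(x):
    return 1 / x  # raises ZeroDivisionError when x == 0


future = client.submit(inverse, 0)

try:
    # Re-runs the failing task in this process and re-raises its exception,
    # so your usual local debugging tools apply.
    client.recreate_error_locally(future)
except ZeroDivisionError:
    pdb.post_mortem()  # drop into the debugger at the point of failure
----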

General Debugging Tips with Dask

You likely have your own standard debugging techniques for working with Python code, and these are not meant to replace them. Some general techniques that are helpful with Dask include the following:

  • Break up failing functions into smaller functions; smaller functions make it easier to isolate the problem.

  • Be careful about referencing variables from outside of a function, which can result in unintended scope capture, serializing more data and objects than intended.

  • Sample data and try to reproduce locally (local debugging is often easier).

  • Use mypy for type checking. While we haven’t included types in many of our examples for space, in production code liberal type usage can catch tricky errors.

  • Having difficulty tracking down where a task is getting scheduled? Dask actors can’t move, so use an actor to keep all invocations on one machine for debugging.

  • If an issue appears regardless of parallelization, debugging your code with Dask’s local single-threaded (synchronous) scheduler can make it easier to understand what’s going on; see the sketch after this list.
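As a minimal sketch of that last tip (the toy DataFrame here is just for illustration), you can force Dask’s single-threaded synchronous scheduler for a block of code so that breakpoints, pdb, and print statements behave the way they do in ordinary single-process Python:

[source,python]
----
import dask
import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(pd.DataFrame({"x": [1, 2, 3, 4]}), npartitions=2)

# The "synchronous" scheduler runs every task in the current thread,
# so a debugger can step straight into your functions.
with dask.config.set(scheduler="synchronous"):
    result = (df.x * 2).sum().compute()

print(result)
----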

With these tips you will (often) find yourself in a familiar enough environment to use your traditional debugging tools, but some types of errors are a little bit more complicated.

Native Errors

Native errors and core dumps can be challenging to debug for the same reasons as container errors. Since these types of errors often cause the container to exit, accessing the debugging information becomes challenging. Depending on your deployment, there may be a centralized log aggregator that collects all of the logs from the containers, although sometimes these can miss the final few parts of the log (which you likely care about the most). A quick workaround is to add a sleep on failure to the launch script (e.g., [dasklaunchcommand] || sleep 100000) so that you can connect to the container and use native debugging tools.

However, accessing the internals of a container can be easier said than done. In many production environments, you may not be able to get remote access (e.g., kubectl exec on Kubernetes) for security reasons. If that is the case, you can (sometimes) add a shutdown script to your container specification that copies the core files to a location that persists after the container shuts down (e.g., S3, HDFS, or NFS). Your cluster administrator may also have recommended tools to help you debug (or, if not, they may be able to help you create a recommended path for your organization).

Some Notes on Official Advice for Handling Bad Records

Dask’s official debugging guide recommends removing failed futures manually. When you are loading data that can be processed in smaller chunks rather than entire partitions at a time, it is better to return tuples of successful and failed records, since removing entire partitions is not conducive to determining the root cause. This technique is demonstrated in Example 1, “Alternative approach for handling bad data.”

Example 1. Alternative approach for handling bad data
link:./examples/dask/Dask-Debugging-EXs.py[role=include]
Note

“Bad records” here are not only records that fail to load or parse; they can also be records that cause your code to fail. By following this pattern, you can extract the problematic records for deeper investigation and use what you learn to improve your code.
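If you don’t have the linked example in front of you, a minimal sketch of the pattern (the record format and parse_record logic here are hypothetical) might look like the following:

[source,python]
----
import dask.bag as db


def parse_record(raw):
    # Hypothetical per-record work; substitute your own parsing/processing.
    return int(raw)


def process_with_failures(partition):
    """Return (successes, failures) instead of dropping the whole partition."""
    successes, failures = [], []
    for raw in partition:
        try:
            successes.append(parse_record(raw))
        except Exception as error:
            failures.append((raw, repr(error)))
    return successes, failures


bag = db.from_sequence(["1", "2", "oops", "4"], npartitions=2)
# Wrap the tuple in a list so each partition yields a single (successes, failures) element.
results = bag.map_partitions(lambda part: [process_with_failures(part)]).compute()

good = [record for successes, _ in results for record in successes]
bad = [record for _, failures in results for record in failures]
print(good)  # [1, 2, 4]
print(bad)   # [('oops', "ValueError(...)")]
----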

Dask Diagnostics

Dask has built-in diagnostic tools for both its distributed and local schedulers, and the local diagnostics cover pretty much every part of debugging. These diagnostics can be especially helpful for debugging situations in which you see a slow degradation of performance over time.
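For the local schedulers, a minimal sketch of collecting these diagnostics with the dask.diagnostics module (the array workload is arbitrary, and visualize requires Bokeh to be installed) might look like this:

[source,python]
----
import dask.array as da
from dask.diagnostics import Profiler, ResourceProfiler, visualize

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Profiler records which tasks ran and how long they took;
# ResourceProfiler samples CPU and memory usage over time.
with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
    (x + x.T).sum().compute()

# Renders an interactive Bokeh plot of the collected profiles.
visualize([prof, rprof])
----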

Note

It’s easy to accidentally use Dask’s local distributed backend when creating a Dask client, so if you don’t see the diagnostics you expect, make sure you are explicit about which backend you are running on.
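As an illustration (the scheduler address below is a placeholder, not a real endpoint), creating a client with no arguments silently starts a local cluster, so being explicit avoids surprises:

[source,python]
----
from dask.distributed import Client, LocalCluster

# Client() with no arguments spins up a LocalCluster behind the scenes,
# which is easy to do by accident.
local_client = Client(LocalCluster(n_workers=2))
print(local_client.dashboard_link)  # confirm which cluster you are actually on

# Being explicit about the scheduler address makes the target unambiguous.
# remote_client = Client("tcp://scheduler.example.com:8786")  # placeholder address
----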

Conclusion

Getting started with your debugging tools takes a bit more work in Dask, and when possible, Dask’s local mode offers a great alternative to remote debugging. Not all errors are created equal, and some, like segmentation faults in native code, are especially challenging to debug. Good luck finding the bug(s); we believe in you.