# Minor change in dask docs from dask maintainer #1568

`docs/hub/datasets-dask.md` (33 additions, 0 deletions)
@@ -71,6 +71,13 @@ def dummy_count_words(texts):
    return pd.Series([len(text.split(" ")) for text in texts])
```

or an equivalent function that uses pandas string methods, which is faster:

```python
def dummy_count_words(texts):
    return texts.str.count(" ") + 1
```

In pandas you can use this function on a text column:

```python
@@ -116,3 +123,29 @@ This is useful when you want to manipulate a subset of the columns or for analyt
# for the filtering and computation and skip the other columns.
df.token_count.mean().compute()
```

## Client

Most features in `dask` are optimized for running on a cluster or with a local `Client` that launches the parallel computations:

```python
import dask.dataframe as dd
from dask.distributed import Client

if __name__ == "__main__":  # needed because the Client creates new worker processes
client = Client()
df = dd.read_parquet(...)
...
```

For local usage, the `Client` uses a Dask `LocalCluster` with multiprocessing by default. You can configure the `LocalCluster` multiprocessing manually:

```python
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=8, threads_per_worker=8)  # 8 worker processes, each running 8 threads
client = Client(cluster)
```

Note that if you use the default threaded scheduler locally without a `Client`, computations on a DataFrame can become slower after certain operations (more details [here](https://github.com/dask/dask-expr/issues/1181)).
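
As a minimal local sketch, creating the `Client` before computing sidesteps this; the dataset path below is a placeholder for illustration only, not a real dataset:

```python
import dask.dataframe as dd
from dask.distributed import Client

if __name__ == "__main__":  # workers run in separate processes, so guard the entry point
    client = Client()  # computations now use the local distributed scheduler
    # Placeholder path: replace with the Parquet files of your own dataset on the Hub
    df = dd.read_parquet("hf://datasets/username/my_dataset/**/*.parquet")
    print(df.head())  # triggers a small computation on the distributed scheduler
```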

Find more information on setting up a local or cloud cluster in the [Deploying Dask documentation](https://docs.dask.org/en/latest/deploying.html).