diff --git a/README.md b/README.md
index 64da36e65..55b5f8155 100644
--- a/README.md
+++ b/README.md
@@ -1,30 +1,47 @@
-# hub-docs
+# Hub Documentation
 
-This repository regroups documentation and information that is hosted on the Hugging Face website.
+Welcome to the documentation repository for Hugging Face Hub. This repository contains the documentation and information hosted on the Hugging Face website.
 
-You can access the Hugging Face Hub documentation in the `docs` folder at [hf.co/docs/hub](https://hf.co/docs/hub).
+## Accessing Documentation
 
-For some related components, check out the [Hugging Face Hub JS repository](https://github.com/huggingface/huggingface.js)
-- Utilities to interact with the Hub: [huggingface/huggingface.js/packages/hub](https://github.com/huggingface/huggingface.js/tree/main/packages/hub)
-- Hub Widgets: [huggingface/huggingface.js/packages/widgets](https://github.com/huggingface/huggingface.js/tree/main/packages/widgets)
-- Hub Tasks (as visible on the page [hf.co/tasks](https://hf.co/tasks)): [huggingface/huggingface.js/packages/tasks](https://github.com/huggingface/huggingface.js/tree/main/packages/tasks)
+You can find the Hugging Face Hub documentation in the `docs` folder. For direct access, visit: [hf.co/docs/hub](https://huggingface.co/docs/hub/index).
 
-### How to contribute to the docs
+## Related Components
 
-Just add/edit the Markdown files, commit them, and create a PR.
-Then the CI bot will build the preview page and provide a url for you to look at the result!
+Explore these related components for more utilities and features:
 
-For simple edits, you don't need a local build environment.
+- **Hugging Face Hub JS Repository:**
+  - [Utilities to Interact with the Hub](https://github.com/huggingface/huggingface.js/tree/main/packages/hub): Contains utilities to interact with the Hugging Face Hub.
+  - [Hub Widgets](https://github.com/huggingface/huggingface.js/tree/main/packages/widgets): Includes widgets for the Hub.
+  - [Hub Tasks](https://github.com/huggingface/huggingface.js/tree/main/packages/tasks): Provides information on tasks visible on [hf.co/tasks](https://huggingface.co/tasks).
 
-### Previewing locally
+## Contributing to the Documentation
 
-```bash
-# install doc-builder (if not done already)
-pip install hf-doc-builder
+To contribute:
 
-# you may also need to install some extra dependencies
-pip install black watchdog
+1. **Edit/Add Markdown Files:** Make changes directly to the Markdown files in this repository.
+2. **Commit Changes:** Commit your changes and create a Pull Request (PR).
+3. **CI Bot Preview:** After creating a PR, the CI bot will build a preview of your changes. You will receive a URL to review the result.
 
-# run `doc-builder preview` cmd
-doc-builder preview hub {YOUR_PATH}/hub-docs/docs/hub/ --not_python_module
-```
+For straightforward edits, you do not need a local build environment.
+
+## Previewing Documentation Locally
+
+To preview the documentation changes on your local machine, follow these steps:
+
+1. **Install Doc-Builder:**
+   ```bash
+   pip install hf-doc-builder
+   ```
+
+2. **Install Additional Dependencies (if needed):**
+   ```bash
+   pip install black watchdog
+   ```
+
+3. **Run the Preview Command:**
+   ```bash
+   doc-builder preview hub {YOUR_PATH}/hub-docs/docs/hub/ --not_python_module
+   ```
+
+Replace `{YOUR_PATH}` with the path to the cloned repository on your local machine.
diff --git a/docs/hub/datasets-dask.md b/docs/hub/datasets-dask.md
index 7c97214a3..00f284c03 100644
--- a/docs/hub/datasets-dask.md
+++ b/docs/hub/datasets-dask.md
@@ -1,47 +1,62 @@
-# Dask
+# Dask Integration with Hugging Face
 
-[Dask](https://github.com/dask/dask) is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.
-Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:
+[Dask](https://github.com/dask/dask) is a powerful parallel and distributed computing library that scales the existing Python and PyData ecosystem. By leveraging [fsspec](https://filesystem-spec.readthedocs.io/en/latest/), Dask can seamlessly interact with remote data sources, including the Hugging Face Hub. This allows you to read and write datasets directly from the Hub using Hugging Face paths (`hf://`).
 
-First you need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:
+## Prerequisites
 
-```
-huggingface-cli login
-```
+Before you can use Hugging Face paths with Dask, you need to:
 
-Then you can [Create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using:
+1. **Login to your Hugging Face account:**
+   Authenticate your session by logging in using the Hugging Face CLI:
+   ```bash
+   huggingface-cli login
+   ```
 
-```python
-from huggingface_hub import HfApi
+2. **Create a dataset repository:**
+   You can create a new dataset repository on the Hugging Face Hub using the `HfApi` class:
+   ```python
+   from huggingface_hub import HfApi
 
-HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
-```
+   HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
+   ```
+
+## Writing Data to the Hub
 
-Finally, you can use [Hugging Face paths](/docs/huggingface_hub/guides/hf_file_system#integrations) in Dask:
+Once your environment is set up, you can easily write Dask DataFrames to the Hugging Face Hub. For instance, to store your dataset in Parquet format:
 
 ```python
 import dask.dataframe as dd
 
+# Writing the entire dataset to a single location
 df.to_parquet("hf://datasets/username/my_dataset")
 
-# or write in separate directories if the dataset has train/validation/test splits
+# Writing data to separate directories for train/validation/test splits
 df_train.to_parquet("hf://datasets/username/my_dataset/train")
 df_valid.to_parquet("hf://datasets/username/my_dataset/validation")
-df_test .to_parquet("hf://datasets/username/my_dataset/test")
+df_test.to_parquet("hf://datasets/username/my_dataset/test")
 ```
 
-This creates a dataset repository `username/my_dataset` containing your Dask dataset in Parquet format.
-You can reload it later:
+This will create a dataset repository `username/my_dataset` containing your data in Parquet format, which can be accessed later.
+
+## Reading Data from the Hub
+
+You can reload your dataset from the Hugging Face Hub just as easily:
 
 ```python
 import dask.dataframe as dd
 
+# Reading the entire dataset
 df = dd.read_parquet("hf://datasets/username/my_dataset")
 
-# or read from separate directories if the dataset has train/validation/test splits
+# Reading data from separate directories for train/validation/test splits
 df_train = dd.read_parquet("hf://datasets/username/my_dataset/train")
 df_valid = dd.read_parquet("hf://datasets/username/my_dataset/validation")
-df_test = dd.read_parquet("hf://datasets/username/my_dataset/test")
+df_test = dd.read_parquet("hf://datasets/username/my_dataset/test")
 ```
 
-For more information on the Hugging Face paths and how they are implemented, please refer to the [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system).
+This allows you to seamlessly integrate your Dask workflows with datasets stored on the Hugging Face Hub.
+
+## Further Information
+
+For more detailed information on using Hugging Face paths and their implementation, refer to the [Hugging Face File System documentation](https://huggingface.co/docs/huggingface_hub/en/guides/hf_file_system).
+