Commit 16b20c9

lhoestq and davanstrien authored
More pandas docs (#1563)
* More pandas docs
* add audio
* minor changes
* Apply suggestions from code review

Co-authored-by: Daniel van Strien <[email protected]>

---------

Co-authored-by: Daniel van Strien <[email protected]>
1 parent 830ddff commit 16b20c9

1 file changed

+183
-11
lines changed

docs/hub/datasets-pandas.md

Lines changed: 183 additions & 11 deletions
@@ -1,9 +1,44 @@
11
# Pandas
22

33
[Pandas](https://github.com/pandas-dev/pandas) is a widely used Python data analysis toolkit.
4-
Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:
4+
Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub.
55

6-
First you need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:
6+
## Load a DataFrame
7+
8+
You can load data from local files or from remote storage like Hugging Face Datasets. Pandas supports many formats including CSV, JSON and Parquet:
9+
10+
```python
11+
>>> import pandas as pd
12+
>>> df = pd.read_csv("path/to/data.csv")
13+
```
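
As a minimal sketch of the formats mentioned above, the following round-trips a small DataFrame through CSV and JSON Lines in a temporary directory. The file names and sample rows are made up for illustration. `pd.read_parquet` follows the same pattern but needs a Parquet engine such as `pyarrow` installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up sample data, for illustration only
df = pd.DataFrame({"text": ["great movie", "terrible plot"], "label": [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.csv"
    json_path = Path(tmp) / "data.jsonl"

    # Write the same data in two formats
    df.to_csv(csv_path, index=False)
    df.to_json(json_path, orient="records", lines=True)

    # Read both files back into DataFrames
    df_csv = pd.read_csv(csv_path)
    df_json = pd.read_json(json_path, lines=True)
```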
14+
15+
To load a file from Hugging Face, the path needs to start with `hf://`. For example, the path to the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset repository is `hf://datasets/stanfordnlp/imdb`. The dataset on Hugging Face contains multiple Parquet files. The Parquet file format is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Here is how to load the file `plain_text/train-00000-of-00001.parquet`:
16+
17+
```python
18+
>>> import pandas as pd
19+
>>> df = pd.read_parquet("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
20+
>>> df
21+
text label
22+
0 I rented I AM CURIOUS-YELLOW from my video sto... 0
23+
1 "I Am Curious: Yellow" is a risible and preten... 0
24+
2 If only to avoid making this type of film in t... 0
25+
3 This film was probably inspired by Godard's Ma... 0
26+
4 Oh, brother...after hearing about this ridicul... 0
27+
... ... ...
28+
24995 A hit at the time but now better categorised a... 1
29+
24996 I love this movie like no other. Another time ... 1
30+
24997 This film and it's sequel Barry Mckenzie holds... 1
31+
24998 'The Adventures Of Barry McKenzie' started lif... 1
32+
24999 The story centers around Barry McKenzie who mu... 1
33+
```
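
The path used above is the `hf://datasets/` prefix, followed by the repository id and the file path inside the repository. A small sketch of that composition, where `hf_dataset_path` is a hypothetical helper and not part of pandas or `huggingface_hub`:

```python
def hf_dataset_path(repo_id: str, path_in_repo: str) -> str:
    """Build an hf:// path for a file in a dataset repository (hypothetical helper)."""
    return f"hf://datasets/{repo_id}/{path_in_repo}"


path = hf_dataset_path("stanfordnlp/imdb", "plain_text/train-00000-of-00001.parquet")
print(path)
```

The resulting string can be passed to `pd.read_parquet` as in the example above.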
34+
35+
For more information on the Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system).
36+
37+
## Save a DataFrame
38+
39+
You can save a pandas DataFrame using `to_csv`, `to_json` or `to_parquet` to a local file or to Hugging Face directly.
40+
41+
To save the DataFrame on Hugging Face, you first need to [log in with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:
742

843
```
944
huggingface-cli login
@@ -22,26 +57,163 @@ Finally, you can use [Hugging Face paths](/docs/huggingface_hub/guides/hf_file_s
2257
```python
2358
import pandas as pd
2459

25-
df.to_parquet("hf://datasets/username/my_dataset/data.parquet")
60+
df.to_parquet("hf://datasets/username/my_dataset/imdb.parquet")
2661

2762
# or write in separate files if the dataset has train/validation/test splits
2863
df_train.to_parquet("hf://datasets/username/my_dataset/train.parquet")
2964
df_valid.to_parquet("hf://datasets/username/my_dataset/validation.parquet")
3065
df_test .to_parquet("hf://datasets/username/my_dataset/test.parquet")
3166
```
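
The snippet above assumes that `df_train`, `df_valid` and `df_test` already exist. One possible way to derive them from a single DataFrame is sketched below. The 80/10/10 ratio, the random seed and the sample rows are assumptions for illustration, not something the Hub requires.

```python
import pandas as pd

# Made-up data standing in for a real dataset
df = pd.DataFrame({
    "text": [f"example {i}" for i in range(10)],
    "label": [i % 2 for i in range(10)],
})

# Shuffle once, then slice into 80% train, 10% validation and the rest as test
shuffled = df.sample(frac=1.0, random_state=0).reset_index(drop=True)
n_train = int(0.8 * len(shuffled))
n_valid = int(0.1 * len(shuffled))

df_train = shuffled.iloc[:n_train]
df_valid = shuffled.iloc[n_train:n_train + n_valid]
df_test = shuffled.iloc[n_train + n_valid:]
```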
3267

33-
This creates a dataset repository `username/my_dataset` containing your Pandas dataset in Parquet format.
34-
You can reload it later:
68+
## Use Images
69+
70+
You can load a folder with a metadata file containing a field for the names or paths to the images, structured like this:
71+
72+
```
73+
Example 1: Example 2:
74+
folder/ folder/
75+
├── metadata.csv ├── metadata.csv
76+
├── img000.png └── images
77+
├── img001.png ├── img000.png
78+
... ...
79+
└── imgNNN.png └── imgNNN.png
80+
```
81+
82+
You can iterate over the image paths like this:
83+
84+
```python
85+
import pandas as pd
86+
87+
folder_path = "path/to/folder/"
88+
df = pd.read_csv(folder_path + "metadata.csv")
89+
for image_path in (folder_path + df["file_name"]):
90+
...
91+
```
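
For the "Example 2" layout, a minimal sketch is shown below. It assumes that `file_name` holds paths relative to the folder containing `metadata.csv` (for example `images/img000.png`); the placeholder files and the `caption` column are made up for illustration. It checks that every listed file exists before you go further.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "images").mkdir()
    # Empty placeholder files standing in for real images
    for name in ["img000.png", "img001.png"]:
        (folder / "images" / name).touch()

    # Assumed convention: file_name is relative to the folder holding metadata.csv
    pd.DataFrame({
        "file_name": ["images/img000.png", "images/img001.png"],
        "caption": ["first", "second"],
    }).to_csv(folder / "metadata.csv", index=False)

    df = pd.read_csv(folder / "metadata.csv")
    image_paths = [folder / file_name for file_name in df["file_name"]]
    # Collect any metadata entries that do not point to an existing file
    missing = [str(p) for p in image_paths if not p.exists()]
```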
92+
93+
Since the dataset is in a supported structure (a `metadata.csv` file with a `file_name` field), you can save it to Hugging Face, and the Dataset Viewer shows both the metadata and the images.
94+
95+
```python
96+
from huggingface_hub import HfApi
97+
api = HfApi()
98+
99+
api.upload_folder(
100+
folder_path=folder_path,
101+
repo_id="username/my_image_dataset",
102+
repo_type="dataset",
103+
)
104+
```
105+
106+
### Image methods and Parquet
107+
108+
Using [pandas-image-methods](https://github.com/lhoestq/pandas-image-methods) you enable `PIL.Image` methods on an image column. It also enables saving the dataset as one single Parquet file containing both the images and the metadata:
109+
110+
```python
111+
import pandas as pd
112+
from pandas_image_methods import PILMethods
113+
114+
pd.api.extensions.register_series_accessor("pil")(PILMethods)
115+
116+
df["image"] = (folder_path + df["file_name"]).pil.open()
117+
df.to_parquet("data.parquet")
118+
```
119+
120+
All the `PIL.Image` methods are available, e.g.
121+
122+
```python
123+
df["image"] = df["image"].pil.rotate(90)
124+
```
125+
126+
## Use Audios
127+
128+
You can load a folder with a metadata file containing a field for the names or paths to the audios, structured like this:
129+
130+
```
131+
Example 1: Example 2:
132+
folder/ folder/
133+
├── metadata.csv ├── metadata.csv
134+
├── rec000.wav └── audios
135+
├── rec001.wav ├── rec000.wav
136+
... ...
137+
└── recNNN.wav └── recNNN.wav
138+
```
139+
140+
You can iterate over the audio paths like this:
141+
142+
```python
143+
import pandas as pd
144+
145+
folder_path = "path/to/folder/"
146+
df = pd.read_csv(folder_path + "metadata.csv")
147+
for audio_path in (folder_path + df["file_name"]):
148+
...
149+
```
150+
151+
Since the dataset is in a supported structure (a `metadata.csv` file with a `file_name` field), you can save it to Hugging Face, and the Hub Dataset Viewer shows both the metadata and audio.
152+
153+
```python
154+
from huggingface_hub import HfApi
155+
api = HfApi()
156+
157+
api.upload_folder(
158+
folder_path=folder_path,
159+
repo_id="username/my_audio_dataset",
160+
repo_type="dataset",
161+
)
162+
```
163+
164+
### Audio methods and Parquet
165+
166+
Using [pandas-audio-methods](https://github.com/lhoestq/pandas-audio-methods) you enable `soundfile` methods on an audio column. It also enables saving the dataset as one single Parquet file containing both the audios and the metadata:
35167

36168
```python
37169
import pandas as pd
170+
from pandas_audio_methods import SFMethods
38171

39-
df = pd.read_parquet("hf://datasets/username/my_dataset/data.parquet")
172+
pd.api.extensions.register_series_accessor("sf")(SFMethods)
40173

41-
# or read from separate files if the dataset has train/validation/test splits
42-
df_train = pd.read_parquet("hf://datasets/username/my_dataset/train.parquet")
43-
df_valid = pd.read_parquet("hf://datasets/username/my_dataset/validation.parquet")
44-
df_test = pd.read_parquet("hf://datasets/username/my_dataset/test.parquet")
174+
df["audio"] = (folder_path + df["file_name"]).sf.open()
175+
df.to_parquet("data.parquet")
45176
```
46177

47-
To have more information on the Hugging Face paths and how they are implemented, please refer to the [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system).
178+
This makes it easy to use with `librosa` e.g. for resampling:
179+
180+
```python
181+
df["audio"] = [librosa.load(audio, sr=16_000) for audio in df["audio"]]
182+
df["audio"] = df["audio"].sf.write()
183+
```
184+
185+
## Use Transformers
186+
187+
You can use `transformers` pipelines on pandas DataFrames to classify, generate text, images, etc.
188+
This section shows a few examples with `tqdm` for progress bars.
189+
190+
<Tip>
191+
192+
Pipelines don't accept a `tqdm` object as input, but you can use a Python generator instead, in the form `x for x in tqdm(...)`.
193+
194+
</Tip>
195+
196+
### Text Classification
197+
198+
```python
199+
from transformers import pipeline
200+
from tqdm import tqdm
201+
202+
pipe = pipeline("text-classification", model="clapAI/modernBERT-base-multilingual-sentiment")
203+
204+
# Compute labels
205+
df["label"] = [y["label"] for y in pipe(x for x in tqdm(df["text"]))]
206+
# Compute labels and scores
207+
df[["label", "score"]] = [(y["label"], y["score"]) for y in pipe(x for x in tqdm(df["text"]))]
208+
```
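
An alternative way to attach several output fields at once is to build a DataFrame from the list of result dictionaries and join it. The sketch below uses stubbed results in place of a real pipeline call, so the labels and scores are made up; it assumes each result is a dictionary with `label` and `score` keys, as in the example above.

```python
import pandas as pd

df = pd.DataFrame({"text": ["I loved it", "Not for me"]})

# Stubbed outputs standing in for a pipeline call
# (one dict with "label" and "score" per input); the values are made up
results = [
    {"label": "positive", "score": 0.98},
    {"label": "negative", "score": 0.91},
]

# Build a frame aligned on df's index, then join the new columns
predictions = pd.DataFrame(results, index=df.index)
df = df.join(predictions)
```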
209+
210+
### Text Generation
211+
212+
```python
213+
from transformers import pipeline
214+
from tqdm import tqdm
215+
216+
pipe = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")
217+
prompt = "What is the main topic of this sentence ? REPLY IN LESS THAN 3 WORDS. Sentence: '{}'"
218+
df["output"] = [y["generated_text"][1]["content"] for y in pipe([{"role": "user", "content": prompt.format(x)}] for x in tqdm(df["text"]))]
219+
```
