Fix canonical dataset names (#3033)
* imdb

* ett

* atomic

* some more

* some more

* some more

* more

* more

* more

* more

* more

* mnist

* last ones
severo authored Aug 21, 2024
1 parent e8e0edf commit 83c2b1d
Showing 18 changed files with 225 additions and 216 deletions.
14 changes: 7 additions & 7 deletions docs/source/clickhouse.md
@@ -97,17 +97,17 @@ Remember to set `enable_url_encoding` to 0 and `max_http_get_redirects` to 1 to
SET max_http_get_redirects = 1, enable_url_encoding = 0
```

-Let's create a function to return a list of Parquet files from the [`blog_authorship_corpus`](https://huggingface.co/datasets/blog_authorship_corpus):
+Let's create a function to return a list of Parquet files from the [`barilan/blog_authorship_corpus`](https://huggingface.co/datasets/barilan/blog_authorship_corpus):

```bash
CREATE OR REPLACE FUNCTION hugging_paths AS dataset -> (
SELECT arrayMap(x -> (x.1), JSONExtract(json, 'parquet_files', 'Array(Tuple(url String))'))
FROM url('https://datasets-server.huggingface.co/parquet?dataset=' || dataset, 'JSONAsString')
);

-SELECT hugging_paths('blog_authorship_corpus') AS paths
+SELECT hugging_paths('barilan/blog_authorship_corpus') AS paths

-['https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet','https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet','https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/validation/0000.parquet']
+['https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet','https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet','https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/validation/0000.parquet']
```

You can make this even easier by creating another function that calls `hugging_paths` and outputs all the files based on the dataset name:
@@ -118,16 +118,16 @@ CREATE OR REPLACE FUNCTION hf AS dataset -> (
SELECT multiIf(length(urls) = 0, '', length(urls) = 1, urls[1], 'https://huggingface.co/datasets/{' || arrayStringConcat(arrayMap(x -> replaceRegexpOne(replaceOne(x, 'https://huggingface.co/datasets/', ''), '\\.parquet$', ''), urls), ',') || '}.parquet')
);

-SELECT hf('blog_authorship_corpus') AS pattern
+SELECT hf('barilan/blog_authorship_corpus') AS pattern

-['https://huggingface.co/datasets/{blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-train-00000-of-00002,blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-train-00001-of-00002,blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-validation}.parquet']
+['https://huggingface.co/datasets/{blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/barilan/blog_authorship_corpus/blog_authorship_corpus-train-00000-of-00002,barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-train-00001-of-00002,barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-validation}.parquet']
```

Now use the `hf` function to query any dataset by passing the dataset name:

```bash
SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length
-FROM url(hf('blog_authorship_corpus'))
+FROM url(hf('barilan/blog_authorship_corpus'))
GROUP BY horoscope
ORDER BY avg_blog_length
DESC LIMIT(5)
@@ -140,4 +140,4 @@
│ Sagittarius │ 52753 │ 1055.7120732470191 │
│ Capricorn │ 52207 │ 1055.4147719654452 │
└─────────────┴───────┴────────────────────┘
```
6 changes: 3 additions & 3 deletions docs/source/cudf.md
@@ -8,7 +8,7 @@ To read from a single Parquet file, use the [`read_parquet`](https://docs.rapids
import cudf

df = (
-cudf.read_parquet("https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet")
+cudf.read_parquet("https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet")
.groupby('horoscope')['text']
.apply(lambda x: x.str.len().mean())
.sort_values(ascending=False)
@@ -25,6 +25,6 @@ import dask.dataframe as dd
dask.config.set({"dataframe.backend": "cudf"})

df = (
-dd.read_parquet("https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/*.parquet")
+dd.read_parquet("https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/*.parquet")
)
```
4 changes: 2 additions & 2 deletions docs/source/data_types.md
@@ -4,10 +4,10 @@ Datasets supported by the dataset viewer have a tabular format, meaning a data p

There are several different data `Features` for representing different data formats such as [`Audio`](https://huggingface.co/docs/datasets/v2.5.2/en/package_reference/main_classes#datasets.Audio) and [`Image`](https://huggingface.co/docs/datasets/v2.5.2/en/package_reference/main_classes#datasets.Image) for speech and image data respectively. Knowing a dataset feature gives you a better understanding of the data type you're working with, and how you can preprocess it.

-For example, the `/first-rows` endpoint for the [Rotten Tomatoes](https://huggingface.co/datasets/rotten_tomatoes) dataset returns the following:
+For example, the `/first-rows` endpoint for the [Rotten Tomatoes](https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes) dataset returns the following:

```json
-{"dataset": "rotten_tomatoes",
+{"dataset": "cornell-movie-review-data/rotten_tomatoes",
"config": "default",
"split": "train",
"features": [{"feature_idx": 0,
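For reference, here is a minimal sketch of fetching this response yourself. It assumes the public `/first-rows` endpoint shown above, the `requests` library, and that each entry in `features` carries `name` and `type` keys alongside `feature_idx`:

```py
import requests

# Hedged sketch: query /first-rows for the train split of Rotten Tomatoes.
response = requests.get(
    "https://datasets-server.huggingface.co/first-rows",
    params={
        "dataset": "cornell-movie-review-data/rotten_tomatoes",
        "config": "default",
        "split": "train",
    },
)
data = response.json()

# Each feature entry describes one column and its data type.
for feature in data["features"]:
    print(feature["feature_idx"], feature.get("name"), feature.get("type"))
```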
4 changes: 2 additions & 2 deletions docs/source/duckdb.md
@@ -7,7 +7,7 @@
```py
import duckdb

-url = "https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet"
+url = "https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet"

con = duckdb.connect()
con.execute("INSTALL httpfs;")
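# Hedged continuation sketch (assumed, not part of this diff): load httpfs and
# query the Parquet file directly over HTTPS, grouping by the dataset's horoscope column.
con.execute("LOAD httpfs;")
df = con.execute(
    f"SELECT horoscope, count(*) AS n FROM '{url}' GROUP BY horoscope ORDER BY n DESC LIMIT 5"
).df()
print(df)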
@@ -22,7 +22,7 @@ var con = db.connect();
con.exec('INSTALL httpfs');
con.exec('LOAD httpfs');

-const url = "https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet"
+const url = "https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet"
```
</js>
</inferencesnippet>
2 changes: 1 addition & 1 deletion docs/source/first_rows.md
@@ -145,7 +145,7 @@ For some datasets, the response size from `/first-rows` may exceed 1MB, in which

In some cases, if even the first few rows generate a response that exceeds 1MB, some of the columns are truncated and converted to a string. You'll see these listed in the `truncated_cells` field.

-For example, the [`ett`](https://datasets-server.huggingface.co/first-rows?dataset=ett&config=m2&split=test) dataset only returns 10 rows, and the `target` and `feat_dynamic_real` columns are truncated:
+For example, the [`ETDataset/ett`](https://datasets-server.huggingface.co/first-rows?dataset=ETDataset/ett&config=m2&split=test) dataset only returns 10 rows, and the `target` and `feat_dynamic_real` columns are truncated:

```json
...
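A minimal sketch of spotting those truncated columns programmatically, assuming the `requests` library and the standard `rows` / `row_idx` / `truncated_cells` fields of the `/first-rows` response:

```py
import requests

# Hedged sketch: fetch the first rows of the ETDataset/ett m2 test split.
data = requests.get(
    "https://datasets-server.huggingface.co/first-rows",
    params={"dataset": "ETDataset/ett", "config": "m2", "split": "test"},
).json()

# List which cells were truncated in each returned row.
for row in data["rows"]:
    if row["truncated_cells"]:
        print(row["row_idx"], row["truncated_cells"])
```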
4 changes: 2 additions & 2 deletions docs/source/mlcroissant.md
@@ -8,11 +8,11 @@

</Tip>

-Let's start by parsing the Croissant metadata for the [`blog_authorship_corpus`](https://huggingface.co/datasets/blog_authorship_corpus) dataset. Be sure to first install `mlcroissant[parquet]` and `GitPython` to be able to load Parquet files over the git+https protocol.
+Let's start by parsing the Croissant metadata for the [`barilan/blog_authorship_corpus`](https://huggingface.co/datasets/barilan/blog_authorship_corpus) dataset. Be sure to first install `mlcroissant[parquet]` and `GitPython` to be able to load Parquet files over the git+https protocol.

```py
from mlcroissant import Dataset
-ds = Dataset(jsonld="https://huggingface.co/api/datasets/blog_authorship_corpus/croissant")
+ds = Dataset(jsonld="https://huggingface.co/api/datasets/barilan/blog_authorship_corpus/croissant")
```

To read from the first subset (called RecordSet in Croissant's vocabulary), use the [`records`](https://github.com/mlcommons/croissant/blob/cd64e12c733cf8bf48f2f85c951c1c67b1c94f5a/python/mlcroissant/mlcroissant/_src/datasets.py#L86) function, which returns an iterator of dicts.
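For instance, a short sketch of pulling a few records; the record set name below is assumed for illustration (the actual names can be listed from `ds.metadata.record_sets`):

```py
from itertools import islice

# Hedged sketch: iterate over the first three records of an assumed record set.
for record in islice(ds.records(record_set="blog_authorship_corpus"), 3):
    print(record)
```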