Fix canonical dataset names (#3033)
* imdb

* ett

* atomic

* some more

* some more

* some more

* more

* more

* more

* more

* more

* mnist

* last ones
severo authored Aug 21, 2024
1 parent e8e0edf commit 83c2b1d
Showing 18 changed files with 225 additions and 216 deletions.
14 changes: 7 additions & 7 deletions docs/source/clickhouse.md
@@ -97,17 +97,17 @@ Remember to set `enable_url_encoding` to 0 and `max_http_get_redirects` to 1 to
SET max_http_get_redirects = 1, enable_url_encoding = 0
```

-Let's create a function to return a list of Parquet files from the [`blog_authorship_corpus`](https://huggingface.co/datasets/blog_authorship_corpus):
+Let's create a function to return a list of Parquet files from the [`barilan/blog_authorship_corpus`](https://huggingface.co/datasets/barilan/blog_authorship_corpus):

```bash
CREATE OR REPLACE FUNCTION hugging_paths AS dataset -> (
SELECT arrayMap(x -> (x.1), JSONExtract(json, 'parquet_files', 'Array(Tuple(url String))'))
FROM url('https://datasets-server.huggingface.co/parquet?dataset=' || dataset, 'JSONAsString')
);

-SELECT hugging_paths('blog_authorship_corpus') AS paths
+SELECT hugging_paths('barilan/blog_authorship_corpus') AS paths

-['https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet','https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet','https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/validation/0000.parquet']
+['https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet','https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet','https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/validation/0000.parquet']
```

You can make this even easier by creating another function that calls `hugging_paths` and outputs all the files based on the dataset name:
@@ -118,16 +118,16 @@ CREATE OR REPLACE FUNCTION hf AS dataset -> (
SELECT multiIf(length(urls) = 0, '', length(urls) = 1, urls[1], 'https://huggingface.co/datasets/{' || arrayStringConcat(arrayMap(x -> replaceRegexpOne(replaceOne(x, 'https://huggingface.co/datasets/', ''), '\\.parquet$', ''), urls), ',') || '}.parquet')
);

-SELECT hf('blog_authorship_corpus') AS pattern
+SELECT hf('barilan/blog_authorship_corpus') AS pattern

-['https://huggingface.co/datasets/{blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-train-00000-of-00002,blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-train-00001-of-00002,blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-validation}.parquet']
+['https://huggingface.co/datasets/{blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/barilan/blog_authorship_corpus/blog_authorship_corpus-train-00000-of-00002,barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-train-00001-of-00002,barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-validation}.parquet']
```

Now use the `hf` function to query any dataset by passing the dataset name:

```bash
SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length
-FROM url(hf('blog_authorship_corpus'))
+FROM url(hf('barilan/blog_authorship_corpus'))
GROUP BY horoscope
ORDER BY avg_blog_length
DESC LIMIT(5)
@@ -140,4 +140,4 @@
│ Sagittarius │ 52753 │ 1055.7120732470191 │
│ Capricorn │ 52207 │ 1055.4147719654452 │
└─────────────┴───────┴────────────────────┘
```
6 changes: 3 additions & 3 deletions docs/source/cudf.md
@@ -8,7 +8,7 @@ To read from a single Parquet file, use the [`read_parquet`](https://docs.rapids
import cudf

df = (
-cudf.read_parquet("https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet")
+cudf.read_parquet("https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet")
.groupby('horoscope')['text']
.apply(lambda x: x.str.len().mean())
.sort_values(ascending=False)
@@ -25,6 +25,6 @@ import dask.dataframe as dd
dask.config.set({"dataframe.backend": "cudf"})

df = (
-dd.read_parquet("https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/*.parquet")
+dd.read_parquet("https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/*.parquet")
)
```
4 changes: 2 additions & 2 deletions docs/source/data_types.md
@@ -4,10 +4,10 @@ Datasets supported by the dataset viewer have a tabular format, meaning a data p

There are several different data `Features` for representing different data formats such as [`Audio`](https://huggingface.co/docs/datasets/v2.5.2/en/package_reference/main_classes#datasets.Audio) and [`Image`](https://huggingface.co/docs/datasets/v2.5.2/en/package_reference/main_classes#datasets.Image) for speech and image data respectively. Knowing a dataset feature gives you a better understanding of the data type you're working with, and how you can preprocess it.

-For example, the `/first-rows` endpoint for the [Rotten Tomatoes](https://huggingface.co/datasets/rotten_tomatoes) dataset returns the following:
+For example, the `/first-rows` endpoint for the [Rotten Tomatoes](https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes) dataset returns the following:

```json
-{"dataset": "rotten_tomatoes",
+{"dataset": "cornell-movie-review-data/rotten_tomatoes",
"config": "default",
"split": "train",
"features": [{"feature_idx": 0,
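For reference, here is a minimal sketch of fetching this response yourself. It assumes the public `/first-rows` endpoint shown above, the `requests` library, and that each entry in `features` carries `name` and `type` keys alongside `feature_idx`:

```py
import requests

# Hedged sketch: query /first-rows for the train split of Rotten Tomatoes.
response = requests.get(
    "https://datasets-server.huggingface.co/first-rows",
    params={
        "dataset": "cornell-movie-review-data/rotten_tomatoes",
        "config": "default",
        "split": "train",
    },
)
data = response.json()

# Each feature entry describes one column and its data type.
for feature in data["features"]:
    print(feature["feature_idx"], feature.get("name"), feature.get("type"))
```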
4 changes: 2 additions & 2 deletions docs/source/duckdb.md
@@ -7,7 +7,7 @@
```py
import duckdb

-url = "https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet"
+url = "https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet"

con = duckdb.connect()
con.execute("INSTALL httpfs;")
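# Hedged continuation sketch (assumed, not part of this diff): load httpfs and
# query the Parquet file directly over HTTPS, grouping by the dataset's horoscope column.
con.execute("LOAD httpfs;")
df = con.execute(
    f"SELECT horoscope, count(*) AS n FROM '{url}' GROUP BY horoscope ORDER BY n DESC LIMIT 5"
).df()
print(df)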
@@ -22,7 +22,7 @@ var con = db.connect();
con.exec('INSTALL httpfs');
con.exec('LOAD httpfs');

-const url = "https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet"
+const url = "https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet"
```
</js>
</inferencesnippet>
2 changes: 1 addition & 1 deletion docs/source/first_rows.md
@@ -145,7 +145,7 @@ For some datasets, the response size from `/first-rows` may exceed 1MB, in which

In some cases, if even the first few rows generate a response that exceeds 1MB, some of the columns are truncated and converted to a string. You'll see these listed in the `truncated_cells` field.

-For example, the [`ett`](https://datasets-server.huggingface.co/first-rows?dataset=ett&config=m2&split=test) dataset only returns 10 rows, and the `target` and `feat_dynamic_real` columns are truncated:
+For example, the [`ETDataset/ett`](https://datasets-server.huggingface.co/first-rows?dataset=ETDataset/ett&config=m2&split=test) dataset only returns 10 rows, and the `target` and `feat_dynamic_real` columns are truncated:

```json
...
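A minimal sketch of spotting those truncated columns programmatically, assuming the `requests` library and the standard `rows` / `row_idx` / `truncated_cells` fields of the `/first-rows` response:

```py
import requests

# Hedged sketch: fetch the first rows of the ETDataset/ett m2 test split.
data = requests.get(
    "https://datasets-server.huggingface.co/first-rows",
    params={"dataset": "ETDataset/ett", "config": "m2", "split": "test"},
).json()

# List which cells were truncated in each returned row.
for row in data["rows"]:
    if row["truncated_cells"]:
        print(row["row_idx"], row["truncated_cells"])
```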
4 changes: 2 additions & 2 deletions docs/source/mlcroissant.md
@@ -8,11 +8,11 @@

</Tip>

-Let's start by parsing the Croissant metadata for the [`blog_authorship_corpus`](https://huggingface.co/datasets/blog_authorship_corpus) dataset. Be sure to first install `mlcroissant[parquet]` and `GitPython` to be able to load Parquet files over the git+https protocol.
+Let's start by parsing the Croissant metadata for the [`barilan/blog_authorship_corpus`](https://huggingface.co/datasets/barilan/blog_authorship_corpus) dataset. Be sure to first install `mlcroissant[parquet]` and `GitPython` to be able to load Parquet files over the git+https protocol.

```py
from mlcroissant import Dataset
-ds = Dataset(jsonld="https://huggingface.co/api/datasets/blog_authorship_corpus/croissant")
+ds = Dataset(jsonld="https://huggingface.co/api/datasets/barilan/blog_authorship_corpus/croissant")
```

To read from the first subset (called RecordSet in Croissant's vocabulary), use the [`records`](https://github.com/mlcommons/croissant/blob/cd64e12c733cf8bf48f2f85c951c1c67b1c94f5a/python/mlcroissant/mlcroissant/_src/datasets.py#L86) function, which returns an iterator of dicts.
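For instance, a short sketch of pulling a few records; the record set name below is assumed for illustration (the actual names can be listed from `ds.metadata.record_sets`):

```py
from itertools import islice

# Hedged sketch: iterate over the first three records of an assumed record set.
for record in islice(ds.records(record_set="blog_authorship_corpus"), 3):
    print(record)
```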