@@ -31,17 +31,14 @@ frequently.
 Gaggle tries to help simplify this process by hiding the complexity and letting you work with datasets directly inside
 an analytical database like DuckDB that can handle fast queries.
 
-> [!NOTE]
-> Gaggle is similar of [Hugging Face extension]() for DuckDB.
-> Although, Kaggle datasets are not directly accessible from a remote file system, like Hugging Face datasets.
-
 ### Features
 
-- Has a simple API (just a handful of SQL functions)
-- Allows you search, download, update, and delete Kaggle datasets directly from DuckDB
-- Supports datasets made of CSV, JSON, and Parquet files
+- Has a simple API to interact with Kaggle datasets from DuckDB
+- Allows you to search, download, and read datasets from Kaggle
+- Supports datasets that contain CSV, Parquet, JSON, and XLSX files
 - Configurable and has built-in caching support
-- Thread-safe and memory-efficient
+- Thread-safe, fast, and has a low memory footprint
+- Supports dataset versioning and update checks
 
 See the [ROADMAP.md](ROADMAP.md) for planned features and the [docs](docs) folder for detailed documentation.
 
@@ -94,35 +91,34 @@ make release
 #### Trying Gaggle
 
 ```sql
--- Load the Gaggle extension
-load 'build/release/extension/gaggle/gaggle.duckdb_extension';
+-- Load the Gaggle extension (only needed if you built from source)
+-- load 'build/release/extension/gaggle/gaggle.duckdb_extension';
 
--- Set your Kaggle credentials (or use `~/.kaggle/kaggle.json`)
+-- Manually set your Kaggle credentials (or use `~/.kaggle/kaggle.json`)
 select gaggle_set_credentials('your-username', 'your-api-key');
 
 -- Get extension version
 select gaggle_version();
 
--- Prime cache by downloading the dataset locally (optional, but improves first-time performance)
-select gaggle_download('habedi/flickr-8k-dataset-clean');
-
 -- List files in the downloaded dataset
+-- (Note that if the dataset is not downloaded yet, it will be downloaded and cached first)
 select *
 from gaggle_ls('habedi/flickr-8k-dataset-clean') limit 5;
 
--- Read a Parquet file from local cache using a prepared statement (no subquery in function args)
+-- Read a Parquet file from local cache using a prepared statement
+-- (Note that DuckDB doesn't support subqueries in function arguments, so we use a prepared statement)
 prepare rp as select * from read_parquet(?) limit 10;
 execute rp(gaggle_file_paths('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet'));
 
--- Use the replacement scan to read directly via kaggle: URL
+-- Alternatively, we can use a replacement scan to read directly via the `kaggle:` prefix
 select count(*)
 from 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';
 
 -- Or glob Parquet files in a dataset directory
 select count(*)
 from 'kaggle:habedi/flickr-8k-dataset-clean/*.parquet';
 
--- Optionally, check cache info
+-- Optionally, we can check cache info
 select gaggle_cache_info();
 
 -- Enforce cache size limit manually (automatic with soft limit by default)