Commit cd96d2d (1 parent: e590e4e)

WIP

File tree: 2 files changed, +15 −19 lines

README.md

Lines changed: 13 additions & 17 deletions
@@ -31,17 +31,14 @@ frequently.
 Gaggle tries to help simplify this process by hiding the complexity and letting you work with datasets directly inside
 an analytical database like DuckDB that can handle fast queries.
 
-> [!NOTE]
-> Gaggle is similar of [Hugging Face extension]() for DuckDB.
-> Although, Kaggle datasets are not directly accessible from a remote file system, like Hugging Face datasets.
-
 ### Features
 
-- Has a simple API (just a handful of SQL functions)
-- Allows you search, download, update, and delete Kaggle datasets directly from DuckDB
-- Supports datasets made of CSV, JSON, and Parquet files
+- Has a simple API to interact with Kaggle datasets from DuckDB
+- Allows you to search, download, and read datasets from Kaggle
+- Supports datasets that contain CSV, Parquet, JSON, and XLSX files
 - Configurable and has built-in caching support
-- Thread-safe and memory-efficient
+- Thread-safe, fast, and has a low memory footprint
+- Supports dataset versioning and update checks
 
 See the [ROADMAP.md](ROADMAP.md) for planned features and the [docs](docs) folder for detailed documentation.
 

@@ -94,35 +91,34 @@ make release
 #### Trying Gaggle
 
 ```sql
--- Load the Gaggle extension
-load 'build/release/extension/gaggle/gaggle.duckdb_extension';
+-- Load the Gaggle extension (only needed if you built from source)
+--load 'build/release/extension/gaggle/gaggle.duckdb_extension';
 
--- Set your Kaggle credentials (or use `~/.kaggle/kaggle.json`)
+-- Manually set your Kaggle credentials (or use `~/.kaggle/kaggle.json`)
 select gaggle_set_credentials('your-username', 'your-api-key');
 
 -- Get extension version
 select gaggle_version();
 
--- Prime cache by downloading the dataset locally (optional, but improves first-time performance)
-select gaggle_download('habedi/flickr-8k-dataset-clean');
-
 -- List files in the downloaded dataset
+-- (Note that if the dataset is not downloaded yet, it will be downloaded and cached first)
 select *
 from gaggle_ls('habedi/flickr-8k-dataset-clean') limit 5;
 
--- Read a Parquet file from local cache using a prepared statement (no subquery in function args)
+-- Read a Parquet file from local cache using a prepared statement
+-- (Note that DuckDB doesn't support subqueries in function arguments, so we use a prepared statement)
 prepare rp as select * from read_parquet(?) limit 10;
 execute rp(gaggle_file_paths('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet'));
 
--- Use the replacement scan to read directly via kaggle: URL
+-- Alternatively, we can use a replacement scan to read directly via the `kaggle:` prefix
 select count(*)
 from 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';
 
 -- Or glob Parquet files in a dataset directory
 select count(*)
 from 'kaggle:habedi/flickr-8k-dataset-clean/*.parquet';
 
--- Optionally, check cache info
+-- Optionally, check the cache info
 select gaggle_cache_info();
 
 -- Enforce cache size limit manually (automatic with soft limit by default)

ROADMAP.md

Lines changed: 2 additions & 2 deletions
@@ -39,7 +39,7 @@ It outlines features to be implemented and their current status.
 * [x] CSV and TSV file reading.
 * [x] Parquet file reading.
 * [x] JSON file reading.
-* [ ] Excel and XLSX file reading.
+* [ ] Excel (XLSX) file reading.
 * **Querying Datasets**
 * [x] Replacement scan for `kaggle:` URLs.
 * [ ] Virtual table support for lazy loading.
@@ -49,7 +49,7 @@ It outlines features to be implemented and their current status.
 * **Concurrency Control**
 * [x] Thread-safe credential storage.
 * [x] Thread-safe cache access.
-* [x] Concurrent dataset downloads (with per-dataset serialization to prevent race conditions).
+* [x] Concurrent dataset downloads.
 * **Network Optimization**
 * [x] Configurable HTTP timeouts.
 * [x] Retry logic with backoff for failed requests.
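The last roadmap item, retry logic with backoff, is a standard pattern. A minimal sketch of the idea, not Gaggle's actual implementation (the function names here are made up for illustration):

```python
import time


def with_backoff(fn, retries=3, base_delay=0.01):
    """Call fn(), retrying failures with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x, ... base_delay


# Usage: a fake request that fails twice, then succeeds.
calls = {"n": 0}


def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = with_backoff(flaky_request)
print(result, calls["n"])  # prints: ok 3
```

Real HTTP clients typically add jitter to the delay and retry only on retryable errors (timeouts, 5xx), but the loop structure is the same.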
