Commit e8987c8

Improve the API

1 parent 7ad421c commit e8987c8

File tree

9 files changed: +321 -122 lines changed

README.md

Lines changed: 24 additions & 42 deletions

@@ -108,56 +108,38 @@ select gaggle_list_files('habedi/flickr-8k-dataset-clean') as files;
 
 select gaggle_info('habedi/flickr-8k-dataset-clean') as metadata;
 
--- Read a CSV file directly from local path after download
-select *
-from read_csv_auto('/path/to/downloaded/dataset/file.csv') limit 10;
+-- Read a Parquet file from the local cache using a prepared statement
+-- (no subquery in table function arguments)
+prepare rp as select * from read_parquet(?) limit 10;
+execute rp(gaggle_get_file_path('habedi/flickr-8k-dataset-clean','flickr8k.parquet'));
+
+-- Use the replacement scan to read directly via a kaggle: URL
+select count(*) from 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';
+-- Or glob Parquet files in a dataset directory
+select count(*) from 'kaggle:habedi/flickr-8k-dataset-clean/*.parquet';
 ```
 
-[![Simple Demo 1](https://asciinema.org/a/745806.svg)](https://asciinema.org/a/745806)
-
-#### API Functions
-
-| Function | Description |
-|-----------------------------------------|------------------------------------------|
-| `gaggle_set_credentials(username, key)` | Set Kaggle API credentials |
-| `gaggle_search(query, page, page_size)` | Search for datasets on Kaggle |
-| `gaggle_download(dataset_path)` | Download a dataset and return local path |
-| `gaggle_list_files(dataset_path)` | List files in a dataset (JSON array) |
-| `gaggle_info(dataset_path)` | Get dataset metadata (JSON object) |
-| `gaggle_get_version()` | Get extension version info |
-| `gaggle_clear_cache()` | Clear the local dataset cache |
-| `gaggle_get_cache_info()` | Get cache statistics |
-
-#### Configuration
+```sql
+load 'build/release/extension/gaggle/gaggle.duckdb_extension';
+select gaggle_set_credentials('your-username','your-api-key');
 
-Gaggle can be configured via environment variables:
+-- Prime the cache
+select gaggle_download('habedi/flickr-8k-dataset-clean');
 
-- `KAGGLE_USERNAME` - Your Kaggle username
-- `KAGGLE_KEY` - Your Kaggle API key
-- `GAGGLE_CACHE_DIR` - Directory for caching datasets (default: system cache dir)
-- `GAGGLE_VERBOSE` - Enable verbose logging (default: false)
-- `GAGGLE_HTTP_TIMEOUT` - HTTP timeout in seconds (default: 30)
+-- Replacement scan over a single Parquet file
+select count(*) from 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';
 
-Alternatively, create `~/.kaggle/kaggle.json`:
+-- If the file is nested or the name differs, try a glob:
+select count(*) from 'kaggle:habedi/flickr-8k-dataset-clean/*.parquet';
+-- Or even broader:
+-- select count(*) from 'kaggle:habedi/flickr-8k-dataset-clean/*flickr8k*.parquet';
 
-```json
-{
-  "username": "your-username",
-  "key": "your-api-key"
-}
+-- Direct read via file path without a subquery in the table function
+prepare rp as select * from read_parquet(?) limit 10;
+execute rp(gaggle_get_file_path('habedi/flickr-8k-dataset-clean','flickr8k.parquet'));
 ```
 
-##### JSON Parsing
-
-> [!TIP]
-> Gaggle returns JSON data for search results, file lists, and metadata.
-> For advanced JSON parsing, you can optionally load the JSON DuckDB extension:
-> ```sql
-> install json;
-> load json;
-> select * from json_each(gaggle_search('covid-19', 1, 10));
-> ```
-> If the JSON extension is not available, you can still access the raw JSON strings and work with them directly.
+[![Simple Demo 1](https://asciinema.org/a/745806.svg)](https://asciinema.org/a/745806)
 
 ---
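A side note on the prepared-statement pattern this hunk introduces: one `prepare` serves any file resolved from the local cache. A minimal sketch along those lines, where the second filename is purely hypothetical:

```sql
-- One prepared statement, reusable for any cached Parquet file.
prepare read_any as select * from read_parquet(?) limit 10;

execute read_any(gaggle_get_file_path('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet'));
-- A second, hypothetical file in the same dataset would read identically:
-- execute read_any(gaggle_get_file_path('habedi/flickr-8k-dataset-clean', 'another_file.parquet'));
```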

ROADMAP.md

Lines changed: 2 additions & 2 deletions

@@ -54,7 +54,7 @@ It outlines features to be implemented and their current status.
 * **Network Optimization**
   * [x] Configurable HTTP timeouts.
   * [ ] Connection pooling for Kaggle API requests.
-  * [ ] Retry logic with exponential backoff.
+  * [x] Retry logic with exponential backoff.
 * **Caching Strategy**
   * [ ] Incremental cache updates.
   * [ ] Background cache synchronization.
@@ -67,7 +67,7 @@ It outlines features to be implemented and their current status.
   * [x] Clear error messages for `NULL` inputs.
   * [ ] Detailed error codes for programmatic error handling.
 * **Resilience**
-  * [ ] Automatic retry on network failures.
+  * [x] Automatic retry on network failures.
   * [ ] Graceful degradation when Kaggle API is unavailable.
   * [ ] Local-only mode for cached datasets.
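For context on the two newly checked items, here is a hedged sketch of how the retry behavior would be driven by the `GAGGLE_HTTP_RETRY_*` variables documented in docs/README.md below; the doubling schedule is an assumption based on the phrase "exponential backoff":

```sql
-- Illustrative configuration only; set these in the shell before starting DuckDB:
--   export GAGGLE_HTTP_RETRY_ATTEMPTS=3   -- retry up to 3 times on HTTP errors
--   export GAGGLE_HTTP_RETRY_DELAY=500    -- first retry after ~500 ms
-- With exponential backoff the waits would then be roughly 500, 1000, 2000 ms.

-- A transient network failure here would be retried automatically
-- instead of failing on the first attempt:
select gaggle_download('habedi/flickr-8k-dataset-clean');
```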

docs/README.md

Lines changed: 34 additions & 8 deletions

@@ -6,14 +6,41 @@ The table below includes the information about all SQL functions exposed by Gagg
 |---|:--------------------------------------------------------|:-----------------|:------------------------------------------------------------------------------------|
 | 1 | `gaggle_set_credentials(username VARCHAR, key VARCHAR)` | `BOOLEAN` | Sets Kaggle API credentials for the session. Returns `true` on success. |
 | 2 | `gaggle_search(query VARCHAR, page INTEGER, page_size INTEGER)` | `VARCHAR (JSON)` | Searches Kaggle for datasets matching the query and returns results as JSON. |
-| 3 | `gaggle_list_files(dataset_path VARCHAR)` | `VARCHAR (JSON)` | Lists all files in a Kaggle dataset (format: 'owner/dataset-name'). |
-| 4 | `gaggle_download(dataset_path VARCHAR)` | `VARCHAR` | Downloads a Kaggle dataset and returns the local cache directory path. |
-| 5 | `gaggle_info(dataset_path VARCHAR)` | `VARCHAR (JSON)` | Returns metadata for a Kaggle dataset including size, description, and update info. |
+| 3 | `gaggle_list_files(dataset_path VARCHAR)` | `VARCHAR (JSON)` | Lists all files in a Kaggle dataset (format: 'owner/dataset-name'). |
+| 4 | `gaggle_download(dataset_path VARCHAR)` | `VARCHAR` | Downloads a Kaggle dataset and returns the local cache directory path. |
+| 5 | `gaggle_info(dataset_path VARCHAR)` | `VARCHAR (JSON)` | Returns metadata for a Kaggle dataset including size, description, and update info. |
+| 6 | `gaggle_get_version()` | `VARCHAR (JSON)` | Returns version information for the Gaggle extension. |
+| 7 | `gaggle_last_error()` | `VARCHAR` | Returns the last error message recorded by the extension (empty string if none). |
+| 8 | `gaggle_clear_cache()` | `BOOLEAN` | Clears the local cache directory used by Gaggle. |
+| 9 | `gaggle_get_cache_info()` | `VARCHAR (JSON)` | Returns cache statistics including size and location. |
+| 10 | `gaggle_json_each(json VARCHAR)` | `VARCHAR` | Returns newline-delimited JSON records from a JSON array string. |
+| 11 | `gaggle_get_file_path(dataset_path VARCHAR, filename VARCHAR)` | `VARCHAR` | Resolves and returns the local file path for a file inside a downloaded dataset. |
 
 > [!NOTE]
 > Kaggle credentials can be provided via environment variables (`KAGGLE_USERNAME`, `KAGGLE_KEY`),
 > a `~/.kaggle/kaggle.json` file, or using the `gaggle_set_credentials()` function.
 
+### Configuration
+
+Gaggle can be configured via environment variables:
+
+- `KAGGLE_USERNAME` - Your Kaggle username
+- `KAGGLE_KEY` - Your Kaggle API key
+- `GAGGLE_CACHE_DIR` - Directory for caching datasets (default: system cache dir)
+- `GAGGLE_VERBOSE` - Enable verbose logging (default: false)
+- `GAGGLE_HTTP_TIMEOUT` - HTTP timeout in seconds (default: 30)
+- `GAGGLE_HTTP_RETRY_ATTEMPTS` - Number of retry attempts on HTTP errors (default: 0)
+- `GAGGLE_HTTP_RETRY_DELAY` - Initial retry delay in milliseconds (default: 1000)
+
+Alternatively, create `~/.kaggle/kaggle.json`:
+
+```json
+{
+  "username": "your-username",
+  "key": "your-api-key"
+}
+```
+
 ---
 
 ### Usage Examples
@@ -62,15 +89,14 @@ from read_csv('~/.gaggle_cache/datasets/username/dataset-name/file.csv');
 
 ```sql
 -- Load the extension
-LOAD
-'build/release/extension/gaggle/gaggle.duckdb_extension';
+LOAD 'build/release/extension/gaggle/gaggle.duckdb_extension';
 
 -- Search for a dataset
 select gaggle_search('iris', 1, 10);
 
--- Download and read the dataset
-select *
-from read_csv((select gaggle_download('uciml/iris') || '/iris.csv'));
+-- Download and read the dataset (avoid subqueries in table function args)
+prepare rp as select * from read_csv(?) limit 10;
+execute rp(gaggle_download('uciml/iris') || '/iris.csv');
 ```
 
 ---
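One pairing worth spelling out from the table: the new `gaggle_json_each` (row 10) gives a JSON-extension-free way to break up the arrays that `gaggle_list_files` and `gaggle_search` return. A minimal sketch, assuming only the newline-delimited output the table describes:

```sql
load 'build/release/extension/gaggle/gaggle.duckdb_extension';

-- gaggle_list_files returns a JSON array string; gaggle_json_each
-- re-emits it as newline-delimited JSON, one record per line.
select gaggle_json_each(gaggle_list_files('habedi/flickr-8k-dataset-clean')) as ndjson;

-- Splitting on newlines then yields one row per file record,
-- with no dependency on the core json extension:
select unnest(string_split(
    gaggle_json_each(gaggle_list_files('habedi/flickr-8k-dataset-clean')),
    chr(10)
)) as file_record;
```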

docs/examples/e2_advanced_features.sql

Lines changed: 15 additions & 7 deletions

@@ -9,23 +9,31 @@ load 'build/release/extension/gaggle/gaggle.duckdb_extension';
 select gaggle_set_credentials('your-username', 'your-api-key') as credentials_set;
 
 -- Get path to specific file
-select gaggle_get_file_path('owid/covid-latest-data', 'owid-covid-latest.csv') as file_path;
+select gaggle_get_file_path('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet') as file_path;
 
--- Use the file path with DuckDB's read_csv_auto
-select * from read_csv_auto(
-  (select gaggle_get_file_path('owid/covid-latest-data', 'owid-covid-latest.csv'))
-) limit 10;
+-- Use the file path with DuckDB's read_parquet.
+-- DuckDB table functions cannot contain subqueries as arguments,
+-- so use PREPARE/EXECUTE to pass the computed path as a parameter instead.
+prepare rp as select * from read_parquet(?) limit 10;
+execute rp(gaggle_get_file_path('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet'));
 
 -- section 2: List and process multiple files
 select '## List and process dataset files';
 with files as (
-  select gaggle_list_files('owid/covid-latest-data') as files_json
+  select gaggle_list_files('habedi/flickr-8k-dataset-clean') as files_json
 )
 select files_json from files;
 
+-- section 2b: Use replacement scan for direct reads via kaggle: URLs
+select '## Replacement scan - direct reads via kaggle:';
+-- Single file read
+select count(*) from 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';
+-- Glob pattern over Parquet files
+select count(*) from 'kaggle:habedi/flickr-8k-dataset-clean/*.parquet';
+
 -- section 3: Download and verify cache
 select '## Verify dataset is cached';
-select gaggle_download('owid/covid-latest-data') as cached_path;
+select gaggle_download('habedi/flickr-8k-dataset-clean') as cached_path;
 select gaggle_get_cache_info() as cache_status;
 
 -- section 4: Clear cache if needed
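A design note on section 2b: because `kaggle:` URLs resolve through DuckDB's replacement scan, they act like any other table source and compose with ordinary SQL. A hedged sketch, with the materialization step added here for illustration only:

```sql
-- The kaggle: source composes with CTAS, filters, and aggregates
-- exactly as a local file path would.
create table flickr_local as
select * from 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';

select count(*) as rows_materialized from flickr_local;
```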

docs/examples/gaggle_usage.sql

Lines changed: 26 additions & 8 deletions

@@ -1,8 +1,26 @@
 -- Gaggle - Kaggle Dataset Extension for DuckDB
 -- Example Usage
 
--- load the extension
+.echo on
+
+-- Optional: configure retry/backoff via environment
+-- export GAGGLE_HTTP_RETRY_ATTEMPTS=3
+-- export GAGGLE_HTTP_RETRY_DELAY=250
+
+-- Load the extension and set credentials
 load 'build/release/extension/gaggle/gaggle.duckdb_extension';
+select gaggle_set_credentials('your-username', 'your-api-key') as credentials_set;
+
+-- Download a dataset and read a file via local path
+select gaggle_download('habedi/flickr-8k-dataset-clean') as flickr_path;
+select * from read_parquet((select gaggle_download('habedi/flickr-8k-dataset-clean') || '/flickr8k.parquet')) limit 5;
+
+-- Read directly via kaggle: URL using replacement scan
+select count(*) from 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';
+-- Glob Parquet files in a dataset directory
+select count(*) from 'kaggle:habedi/flickr-8k-dataset-clean/*.parquet';
+
+.echo off
 
 -- set kaggle credentials (or use kaggle_username and kaggle_key env vars, or ~/.kaggle/kaggle.json)
 select gaggle_set_credentials('your-username', 'your-api-key');
@@ -11,21 +29,21 @@ select gaggle_set_credentials('your-username', 'your-api-key');
 select gaggle_get_version();
 
 -- search for datasets
-select * from json_each(gaggle_search('covid-19', 1, 10));
+select * from json_each(gaggle_search('flickr', 1, 10));
 
 -- download a dataset
-select gaggle_download('owid/covid-latest-data');
+select gaggle_download('habedi/flickr-8k-dataset-clean');
 
 -- list files in a dataset
-select * from json_each(gaggle_list_files('owid/covid-latest-data'));
+select * from json_each(gaggle_list_files('habedi/flickr-8k-dataset-clean'));
 
 -- get dataset metadata
-select * from json_each(gaggle_info('owid/covid-latest-data'));
+select * from json_each(gaggle_info('habedi/flickr-8k-dataset-clean'));
 
 -- read a csv file from kaggle dataset directly
--- option 1: using read_csv with the file path
-select * from read_csv_auto(
-  (select gaggle_download('owid/covid-latest-data') || '/owid-covid-latest.csv')
+-- option 1: using read_parquet with the file path
+select * from read_parquet(
+  (select gaggle_download('habedi/flickr-8k-dataset-clean') || '/flickr8k.parquet')
 ) limit 10;
 
 -- clear cache
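A natural companion to this script is `gaggle_last_error()` from the docs/README.md table above. A small sketch; the empty-string-on-success behavior comes from that table, the rest is illustrative:

```sql
-- Check whether the previous Gaggle call recorded an error.
select gaggle_download('habedi/flickr-8k-dataset-clean') as path;

select case
           when gaggle_last_error() = '' then 'ok'
           else 'failed: ' || gaggle_last_error()
       end as download_status;
```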
