Commit 005c04d

Add more Sqllogictest-style tests

1 parent be82e81 commit 005c04d

20 files changed (+226 −40 lines)

README.md

Lines changed: 7 additions & 7 deletions
@@ -12,34 +12,34 @@
 [![Docs](https://img.shields.io/badge/docs-read-blue?style=flat&labelColor=282c34&logo=read-the-docs)](https://github.com/CogitatorTech/gaggle/tree/main/docs)
 [![License](https://img.shields.io/badge/license-MIT%2FApache--2.0-007ec6?style=flat&labelColor=282c34&logo=open-source-initiative)](https://github.com/CogitatorTech/gaggle)
-Kaggle Datasets for DuckDB
+Access and Query Kaggle Datasets from DuckDB

 </div>

 ---

 Gaggle is a DuckDB extension that allows you to work with Kaggle datasets directly in SQL queries, as if
 they were DuckDB tables.
-It is written in Rust and uses the Kaggle API to search, download, and manage datasets.
+It is written in Rust and uses the Kaggle API to search, download, and manage Kaggle datasets.

 Kaggle hosts a large collection of very useful datasets for data science and machine learning.
 Accessing these datasets typically involves manually downloading a dataset (as a ZIP file),
 extracting it, loading the files in the dataset into your data science environment, and managing storage and dataset
 updates, etc.
-This workflow can be become complex, especially when working with multiple datasets or when datasets are updated
+This workflow can quickly become complex, especially when working with multiple datasets or when datasets are updated
 frequently.
 Gaggle tries to help simplify this process by hiding the complexity and letting you work with datasets directly inside
 an analytical database like DuckDB that can handle fast queries.
 In essence, Gaggle makes DuckDB into a SQL-enabled frontend for Kaggle datasets.

 ### Features

-- Has a simple API to interact with Kaggle datasets from DuckDB
+- Provides a simple API to interact with Kaggle datasets from DuckDB
 - Allows you to search, download, and read datasets from Kaggle
-- Supports datasets that contain CSV, Parquet, JSON, and XLSX files (XLSX requires DuckDB's Excel reader to be available in your DuckDB build)
-- Configurable and has built-in caching support
+- Supports datasets that contain CSV, Parquet, JSON, and XLSX files
+- Configurable and has built-in caching of downloaded datasets
 - Thread-safe, fast, and has a low memory footprint
-- Supports dataset versioning and update checks
+- Supports dataset updates and versioning

 See the [ROADMAP.md](ROADMAP.md) for planned features and the [docs](docs) folder for detailed documentation.

ROADMAP.md

Lines changed: 2 additions & 2 deletions
@@ -39,7 +39,7 @@ It outlines features to be implemented and their current status.
 * [x] CSV and TSV file reading.
 * [x] Parquet file reading.
 * [x] JSON file reading.
-* [ ] Excel (XLSX) file reading. (Available when DuckDB is built with the Excel reader; replacement scan routes `.xlsx` to `read_excel`.)
+* [x] Excel (XLSX) file reading.
 * **Querying Datasets**
 * [x] Replacement scan for `kaggle:` URLs.
 * [ ] Virtual table support for lazy loading.
@@ -66,8 +66,8 @@ It outlines features to be implemented and their current status.
 * [x] Detailed error codes for programmatic error handling.
 * **Resilience**
 * [x] Automatic retry on network failures.
-* [ ] Graceful degradation when Kaggle API is unavailable.
 * [x] Local-only mode for cached datasets (via `GAGGLE_OFFLINE`).
+* [ ] Graceful degradation when Kaggle API is unavailable.

 ### 6. Documentation and Distribution

docs/CONFIGURATION.md

Lines changed: 16 additions & 11 deletions
@@ -10,7 +10,7 @@ Gaggle supports configuration via environment variables to customize its behavio

 - **Description**: Directory path for caching downloaded Kaggle datasets
 - **Type**: String (path)
-- **Default**: `$XDG_CACHE_HOME/gaggle_cache` (typically `~/.cache/gaggle_cache`)
+- **Default**: `$XDG_CACHE_HOME/gaggle` (typically `~/.cache/gaggle`)
 - **Example**:
   ```bash
   export GAGGLE_CACHE_DIR="/var/cache/gaggle"
@@ -144,9 +144,10 @@ These settings control the wait behavior when a download is already in progress.
 - **Type**: Boolean (`1`, `true`, `yes`, `on` to enable)
 - **Default**: `false`
 - **Effects**:
-  - gaggle_download(...) fails if the dataset isn’t cached.
-  - Version checks use cached `.downloaded` metadata when available; otherwise return "unknown".
-  - Search and metadata calls will still attempt network; consider avoiding them in offline mode.
+  - `gaggle_download(...)` fails if the dataset isn’t cached.
+  - `gaggle_version_info` reports `latest_version` as "unknown" if no cache metadata exists.
+  - `gaggle_is_current` and other version checks use cached `.downloaded` metadata when available.
+  - `gaggle_search` and `gaggle_info` also fail fast in offline mode (no network attempts).
 - **Example**:
   ```bash
   export GAGGLE_OFFLINE=1
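The truthy values listed above (`1`, `true`, `yes`, `on`) can be made concrete with a small sketch. This is illustrative only: the real parsing happens inside the extension's Rust core, and the `env_flag` helper name is hypothetical.

```python
import os

# Hypothetical helper mirroring the documented truthy values
# (`1`, `true`, `yes`, `on`); the actual parsing lives in the
# extension's Rust core, so this is only an illustration.
TRUTHY = {"1", "true", "yes", "on"}

def env_flag(name: str, default: bool = False) -> bool:
    """Return True when the environment variable holds a truthy value."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY

os.environ["GAGGLE_OFFLINE"] = "YES"
print(env_flag("GAGGLE_OFFLINE"))  # matching is case-insensitive here
```

Anything outside the truthy set (including `0`, `false`, or an unset variable) is treated as disabled in this sketch.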
@@ -187,7 +188,7 @@
 export GAGGLE_HTTP_TIMEOUT=120 # 2 minutes
 export GAGGLE_HTTP_RETRY_ATTEMPTS=5 # Retry up to 5 times
 export GAGGLE_HTTP_RETRY_DELAY_MS=2000 # 2 second initial delay
 export GAGGLE_HTTP_RETRY_MAX_DELAY_MS=30000 # Cap backoff at 30s
-export GAGGLE_LOG_LEVEL=WARN # Production logging (planned)
+export GAGGLE_LOG_LEVEL=WARN # Production logging

 ## Set Kaggle credentials
 export KAGGLE_USERNAME="your-username"
@@ -202,7 +203,7 @@ export KAGGLE_KEY="your-api-key"
 ```bash
 ## Development setup with verbose logging
 export GAGGLE_CACHE_DIR="./dev-cache"
-export GAGGLE_LOG_LEVEL=DEBUG ## Detailed debug logs (planned)
+export GAGGLE_LOG_LEVEL=DEBUG ## Detailed debug logs
 export GAGGLE_HTTP_TIMEOUT=10 ## Shorter timeout for dev
 export GAGGLE_HTTP_RETRY_ATTEMPTS=1 ## Fail fast in development
 export GAGGLE_HTTP_RETRY_DELAY_MS=250 ## Quick retry
@@ -230,10 +231,11 @@ export GAGGLE_HTTP_RETRY_MAX_DELAY_MS=60000 ## Cap at 60s
 export GAGGLE_OFFLINE=1

 # Attempt to download a dataset (will fail if not cached)
-gaggle download username/dataset-name
+SELECT gaggle_download('username/dataset-name');

-# Querying metadata or searching will still attempt network access
-gaggle info username/dataset-name
+# Querying metadata or searching will fail fast in offline mode
+SELECT gaggle_info('username/dataset-name');
+SELECT gaggle_search('keyword', 1, 10);
 ```

 ### Configuration Verification
@@ -253,16 +255,19 @@ SELECT gaggle_search('housing', 1, 10);

 -- Get dataset metadata
 SELECT gaggle_info('username/dataset-name');
+
+-- Retrieve last error string (or NULL if none)
+SELECT gaggle_last_error();
 ```

 ### Retry Policy Details

 Gaggle implements retries with exponential backoff for HTTP requests. The number of attempts, initial delay, and
 maximum delay can be tuned with the environment variables above.
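The backoff schedule implied by these settings can be sketched numerically. This is an assumption-laden illustration (delay doubles per attempt and is capped at the configured maximum); the actual retry loop lives in the extension's Rust core.

```python
# Sketch of an exponential-backoff schedule, assuming the delay
# doubles on each attempt and is capped at max_ms. Parameter names
# mirror the GAGGLE_HTTP_RETRY_* settings; the real implementation
# is in the Rust core, so treat this as illustrative only.

def retry_delays_ms(attempts: int, initial_ms: int, max_ms: int) -> list[int]:
    """Delay before each retry: initial_ms * 2^i, capped at max_ms."""
    return [min(initial_ms * (2 ** i), max_ms) for i in range(attempts)]

# e.g. ATTEMPTS=5, DELAY_MS=2000, MAX_DELAY_MS=30000
print(retry_delays_ms(5, 2000, 30000))  # → [2000, 4000, 8000, 16000, 30000]
```

Note how the cap kicks in on the last attempt (32000 ms would exceed the 30000 ms maximum).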

-### Logging Levels (planned)
+### Logging Levels

-Detailed logging control via `GAGGLE_LOG_LEVEL` is planned but not yet implemented.
+Detailed logging control via `GAGGLE_LOG_LEVEL` is implemented.

 ### Notes

docs/README.md

Lines changed: 2 additions & 2 deletions
@@ -16,7 +16,7 @@ The table below includes the information about all SQL functions exposed by Gagg
 | 10 | `gaggle_update_dataset(dataset_path VARCHAR)` | `VARCHAR` | Forces update to latest version (ignores cache). Returns local path to freshly downloaded dataset. |
 | 11 | `gaggle_version_info(dataset_path VARCHAR)` | `VARCHAR (JSON)` | Returns version info: `cached_version`, `latest_version`, `is_current`, `is_cached`. |
 | 12 | `gaggle_json_each(json VARCHAR)` | `VARCHAR` | Expands a JSON object/array into newline-delimited JSON rows with fields: `key`, `value`, `type`, `path`. |
-| 13 | `gaggle_file_paths(dataset_path VARCHAR, filename VARCHAR)` | `VARCHAR` | Resolves a specific file's local path inside a downloaded dataset. |
+| 13 | `gaggle_file_path(dataset_path VARCHAR, filename VARCHAR)` | `VARCHAR` | Resolves a specific file's local path inside a downloaded dataset. |

 > [!NOTE]
 > Dataset paths must be in the form `owner/dataset` where `owner` is the username and `dataset` is the dataset name on
@@ -220,6 +220,6 @@ Gaggle is made up of two main components:
    - C-compatible FFI surface

 2. **C++ DuckDB Bindings (`gaggle/bindings/`)** that:
-   - Defines the custom SQL functions (for example: `gaggle_ls`, `gaggle_file_paths`, `gaggle_search`)
+   - Defines the custom SQL functions (for example: `gaggle_ls`, `gaggle_file_path`, `gaggle_search`)
    - Integrates with DuckDB’s extension system and replacement scans (`'kaggle:...'`)
    - Marshals values between DuckDB vectors and the Rust FFI

docs/examples/README.md

Lines changed: 4 additions & 0 deletions
@@ -23,3 +23,7 @@ Each file is self‑contained and can be executed in the DuckDB shell (or via `d
 ```bash
 make examples
 ```
+
+> [!NOTE]
+> Some operations (like search and download) need network access unless `GAGGLE_OFFLINE=1`.
+> When offline, these operations fail fast if the dataset is not already cached locally.

docs/examples/e1_core_functionality.sql

Lines changed: 4 additions & 4 deletions
@@ -26,22 +26,22 @@ limit 5;

 -- Section 4: download a dataset
 select '## Download a dataset';
-select gaggle_download('owid/covid-latest-data') as download_path;
+select gaggle_download('uciml/iris') as download_path;

 -- Section 5: list files (JSON)
 select '## list files (json)';
 select to_json(
     list(struct_pack(name := name, size := size, path := path))
 ) as files_json
-from gaggle_ls('owid/covid-latest-data');
+from gaggle_ls('uciml/iris');

 -- Section 5b: list files (table)
 select '## list files (table)';
-select * from gaggle_ls('owid/covid-latest-data') limit 5;
+select * from gaggle_ls('uciml/iris') limit 5;

 -- Section 6: get dataset metadata
 select '## get dataset metadata';
-select gaggle_info('owid/covid-latest-data') as dataset_metadata;
+select gaggle_info('uciml/iris') as dataset_metadata;

 -- Section 7: get cache information
 select '## Get cache information';

docs/examples/e2_advanced_features.sql

Lines changed: 3 additions & 3 deletions
@@ -6,11 +6,11 @@ load 'build/release/extension/gaggle/gaggle.duckdb_extension';
 select gaggle_set_credentials('your-username', 'your-api-key') as credentials_set;

 -- Get path to specific file
-select gaggle_file_paths('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet') as file_path;
+select gaggle_file_path('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet') as file_path;

 -- Use the file path with DuckDB's read_parquet via prepared statement (no subqueries in args)
 prepare rp as select * from read_parquet(?) limit 10;
-execute rp(gaggle_file_paths('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet'));
+execute rp(gaggle_file_path('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet'));

 -- Section 2: list and process multiple files
 select '## list and process dataset files (json and table)';
@@ -35,7 +35,7 @@ select gaggle_cache_info() as cache_status;

 -- Section 4: purge cache if needed
 select '## Purge cache (optional)';
--- select gaggle_purge_cache() as cache_purged;
+-- select gaggle_clear_cache() as cache_cleared;

 -- Section 5: Dataset versioning
 select '## Check dataset versions';

docs/examples/e4_json_utils.sql

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+.echo on
+
+-- Example 4: JSON helper utilities
+-- Shows how to expand JSON using gaggle_json_each
+
+select '## Load extension';
+load 'build/release/extension/gaggle/gaggle.duckdb_extension';
+
+select '## Expand JSON values into newline-delimited JSON rows';
+select gaggle_json_each('{"a":1,"b":[true,{"c":"x"}],"d":null}') as rows;
+
+-- Combine with DuckDB JSON functions
+with x as (
+    select gaggle_json_each('{"a":1,"b":[true,{"c":"x"}],"d":null}') as row
+)
+select
+    json_type(row) as value_type,
+    json_extract(row, '$') as raw,
+    json_extract_string(row, '$.key') as key,
+    json_extract(row, '$.value') as value
+from x;
+
+.echo off
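To make the shape of `gaggle_json_each` output concrete, here is a rough Python analogue, assuming the documented `key`/`value`/`type`/`path` fields. The real function is implemented inside the extension, so exact output details (ordering, path syntax) are assumptions here.

```python
import json

# Illustrative-only analogue of gaggle_json_each: expand the top level
# of a JSON object or array into newline-delimited JSON rows with
# key/value/type/path fields, as described in the docs. This is NOT
# the extension's implementation; field details are assumptions.

def json_type_name(v) -> str:
    """Map a Python value back to a JSON type name."""
    if v is None:
        return "null"
    if isinstance(v, bool):
        return "boolean"
    if isinstance(v, (int, float)):
        return "number"
    if isinstance(v, str):
        return "string"
    if isinstance(v, list):
        return "array"
    return "object"

def json_each(text: str) -> str:
    """Return one JSON row per top-level member, newline-delimited."""
    doc = json.loads(text)
    items = enumerate(doc) if isinstance(doc, list) else doc.items()
    rows = []
    for key, value in items:
        rows.append(json.dumps({
            "key": str(key),
            "value": value,
            "type": json_type_name(value),
            "path": f"$.{key}" if isinstance(doc, dict) else f"$[{key}]",
        }))
    return "\n".join(rows)

print(json_each('{"a": 1, "b": [true], "d": null}'))
```

Each output line is itself valid JSON, which is what lets the example above post-process rows with DuckDB's `json_extract` family.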

docs/examples/e5_cache_ops.sql

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+.echo on
+
+-- Example 5: Cache operations & housekeeping
+-- Demonstrates gaggle_version, gaggle_cache_info, gaggle_clear_cache, gaggle_enforce_cache_limit
+
+select '## Load extension';
+load 'build/release/extension/gaggle/gaggle.duckdb_extension';
+
+select '## Extension version';
+select gaggle_version() as version;
+
+select '## Cache info (path, size, limit)';
+select gaggle_cache_info() as cache_info_json;
+
+select '## Clear cache (optional)';
+-- Uncomment to clear the local dataset cache
+-- select gaggle_clear_cache() as cache_cleared;
+
+select '## Enforce cache size limit (LRU eviction)';
+-- This triggers cleanup based on the configured limit; safe to run repeatedly
+select gaggle_enforce_cache_limit() as enforced;
+
+.echo off
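For intuition, the kind of cleanup `gaggle_enforce_cache_limit` performs can be sketched as file-level LRU eviction: delete the least-recently-used entries until the total size fits the limit. This is an assumption-laden illustration, not the extension's actual Rust implementation.

```python
import pathlib

# Sketch of size-bounded LRU eviction: walk the cache directory,
# sort files by access time (oldest first), and delete until the
# total size is within the limit. Illustrative only; the real
# eviction logic lives in the extension's Rust core.

def enforce_cache_limit(cache_dir: str, limit_bytes: int) -> list[str]:
    """Evict least-recently-used files until the cache fits limit_bytes."""
    entries = []
    for p in pathlib.Path(cache_dir).rglob("*"):
        if p.is_file():
            st = p.stat()
            entries.append((st.st_atime, st.st_size, p))
    total = sum(size for _, size, _ in entries)
    evicted = []
    for _, size, path in sorted(entries):  # oldest access time first
        if total <= limit_bytes:
            break
        path.unlink()
        total -= size
        evicted.append(path.name)
    return evicted
```

Sorting by access time first means files untouched the longest are deleted first, which is the usual LRU policy for a size-bounded cache; running it when the cache is already under the limit evicts nothing, matching the "safe to run repeatedly" comment in the example.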

gaggle/bindings/gaggle_extension.cpp

Lines changed: 22 additions & 4 deletions
@@ -436,6 +436,22 @@ static void GetFilePath(DataChunk &args, ExpressionState &state,
   gaggle_free(file_path_c);
 }

+/**
+ * @brief Implements the `gaggle_last_error()` SQL function.
+ * Returns the last error message string or NULL if no error is set.
+ */
+static void GetLastError(DataChunk &args, ExpressionState &state, Vector &result) {
+  result.SetVectorType(VectorType::CONSTANT_VECTOR);
+  const char *err = gaggle_last_error();
+  if (!err) {
+    ConstantVector::SetNull(result, true);
+    return;
+  }
+  ConstantVector::GetData<string_t>(result)[0] =
+      StringVector::AddString(result, err);
+  ConstantVector::SetNull(result, false);
+}
+
 /**
  * @brief Table function to read a Kaggle dataset file as a table
  */
@@ -660,6 +676,9 @@ static void GaggleLsFunction(ClientContext &context, TableFunctionInput &data_p,
  * @brief Registers all the Gaggle functions with DuckDB.
  */
 static void LoadInternal(ExtensionLoader &loader) {
+  // Initialize Rust logging once per process
+  gaggle_init_logging();
+
   // Scalar functions (public)
   loader.RegisterFunction(ScalarFunction(
       "gaggle_set_credentials", {LogicalType::VARCHAR, LogicalType::VARCHAR},
@@ -691,6 +710,8 @@ static void LoadInternal(ExtensionLoader &loader) {
       "gaggle_json_each", {LogicalType::VARCHAR}, LogicalType::VARCHAR, JsonEach));
   loader.RegisterFunction(ScalarFunction(
       "gaggle_file_path", {LogicalType::VARCHAR, LogicalType::VARCHAR}, LogicalType::VARCHAR, GetFilePath));
+  loader.RegisterFunction(ScalarFunction(
+      "gaggle_last_error", {}, LogicalType::VARCHAR, GetLastError));

   // Table function: gaggle_ls(dataset_path) -> name,size,path
   TableFunction ls_fun("gaggle_ls", {LogicalType::VARCHAR}, GaggleLsFunction,
@@ -710,11 +731,8 @@ DUCKDB_CPP_EXTENSION_ENTRY(gaggle, loader) { LoadInternal(loader); }
 std::string GaggleExtension::Name() { return "gaggle"; }
 std::string GaggleExtension::Version() const {
   // Return a static-safe version string to avoid calling into FFI at load time
-  return std::string("v0.1.0");
+  return std::string("0.1.0");
 }

 } // namespace duckdb

-extern "C" {
-DUCKDB_CPP_EXTENSION_ENTRY(gaggle, loader) { duckdb::LoadInternal(loader); }
-}
