
Commit 9a14502

Improve project's organization
1 parent 056b59f commit 9a14502

File tree: 10 files changed (+313, -459 lines)


README.md

Lines changed: 76 additions & 93 deletions
@@ -18,89 +18,26 @@ Kaggle Datasets for DuckDB
 
 ---
 
-Gaggle is a DuckDB extension that allows you to read and write Kaggle datasets directly in SQL queries,
-treating them as if they were regular DuckDB tables. It is developed in Rust and provides seamless integration
-with the Kaggle API for dataset discovery, download, and management.
+Gaggle is a DuckDB extension that allows you to read and write Kaggle datasets directly in SQL queries, as if
+they were DuckDB tables.
+It is written in Rust and uses the official Kaggle API to search, download, and manage datasets.

-### Motivation
-
-Data scientists and analysts often need to work with datasets from Kaggle. The traditional workflow involves:
-1. Manually downloading datasets from Kaggle's website
-2. Extracting ZIP files
-3. Loading CSV/Parquet files into your analysis environment
-4. Managing storage and updates
-
-Gaggle simplifies this workflow by integrating Kaggle datasets directly into DuckDB. You can:
-- Query Kaggle datasets as if they were local tables
-- Search and discover datasets without leaving SQL
-- Automatically cache datasets locally for fast access
-- Manage dataset versions and updates
+Kaggle hosts a large collection of useful datasets for data science and machine learning work.
+Accessing these datasets typically involves multiple manual steps: downloading a dataset as a ZIP file,
+extracting it, loading it into your analysis environment, and managing storage and dataset updates.
+This workflow can be simplified by bringing the datasets directly into DuckDB, and this is where Gaggle comes in.
+It provides a few simple SQL functions that let you read and write Kaggle datasets as if they were DuckDB tables.
 
 ### Features
 
-- **Direct SQL access** to Kaggle datasets with `SELECT * FROM 'kaggle:owner/dataset/file.csv'`
-- **Search datasets** using `gaggle_search('query')`
-- **Download and cache** datasets automatically
-- **List files** in datasets before loading
-- **Get metadata** about datasets including size, description, and update info
-- **Credential management** via environment variables, config file, or SQL
-- **Automatic caching** for fast repeated access
+- Has a simple API (just a few SQL functions)
+- Allows you to search, download, update, and delete Kaggle datasets directly from DuckDB
+- Supports datasets made of CSV, JSON, and Parquet files
+- Configurable, with built-in caching support
 - Thread-safe and memory-efficient
 
-### Quick Start
-
-```sql
--- Set your Kaggle credentials (or use ~/.kaggle/kaggle.json)
-SELECT gaggle_set_credentials('your-username', 'your-api-key');
-
--- Search for datasets
-SELECT * FROM json_each(gaggle_search('covid-19', 1, 10));
-
--- Read a Kaggle dataset directly
-SELECT * FROM 'kaggle:owid/covid-latest-data/owid-covid-latest.csv' LIMIT 10;
-
--- Download and get local path
-SELECT gaggle_download('owid/covid-latest-data');
-
--- List files in a dataset
-SELECT * FROM json_each(gaggle_list_files('owid/covid-latest-data'));
-
--- Get dataset metadata
-SELECT * FROM json_each(gaggle_info('owid/covid-latest-data'));
-```
-
-### API Functions
-
-| Function | Description |
-|----------|-------------|
-| `gaggle_set_credentials(username, key)` | Set Kaggle API credentials |
-| `gaggle_search(query, page, page_size)` | Search for datasets on Kaggle |
-| `gaggle_download(dataset_path)` | Download a dataset and return local path |
-| `gaggle_list_files(dataset_path)` | List files in a dataset (JSON array) |
-| `gaggle_info(dataset_path)` | Get dataset metadata (JSON object) |
-| `gaggle_get_version()` | Get extension version info |
-| `gaggle_clear_cache()` | Clear the local dataset cache |
-| `gaggle_get_cache_info()` | Get cache statistics |
-
-### Configuration
-
-Gaggle can be configured via environment variables:
-
-- `KAGGLE_USERNAME` - Your Kaggle username
-- `KAGGLE_KEY` - Your Kaggle API key
-- `GAGGLE_CACHE_DIR` - Directory for caching datasets (default: system cache dir)
-- `GAGGLE_VERBOSE` - Enable verbose logging (default: false)
-- `GAGGLE_HTTP_TIMEOUT` - HTTP timeout in seconds (default: 30)
-
-Alternatively, create `~/.kaggle/kaggle.json`:
-```json
-{
-  "username": "your-username",
-  "key": "your-api-key"
-}
-```
-
-See the [ROADMAP.md](ROADMAP.md) for planned features and the [docs](docs/) folder for detailed documentation.
+See the [ROADMAP.md](ROADMAP.md) for planned features and the [docs](docs) folder for detailed documentation.
 
 > [!IMPORTANT]
 > Gaggle is in early development, so bugs and breaking changes are expected.
@@ -148,33 +85,79 @@ make release
 > You can download the pre-built binaries from the [releases page](https://github.com/CogitatorTech/gaggle/releases) for
 > your platform.
 
-
 #### Trying Gaggle
 
 ```sql
--- 0. Install and load Gaggle
--- Skip this step if you built from source and ran `./build/release/duckdb`
-install gaggle from community;
-load gaggle;
+-- Load the Gaggle extension
+load 'build/release/extension/gaggle/gaggle.duckdb_extension';
 
--- 1. Load a simple linear model from a remote URL
-select gaggle_load_model('linear_model',
-    'https://github.com/CogitatorTech/gaggle/raw/refs/heads/main/test/models/linear.onnx');
+-- Set your Kaggle credentials (or use `~/.kaggle/kaggle.json`)
+select gaggle_set_credentials('your-username', 'your-api-key');
 
--- 2. Run a prediction using a very simple linear model
--- Model: y = 2*x1 - 1*x2 + 0.5*x3 + 0.25
-select gaggle_predict('linear_model', 1.0, 2.0, 3.0);
--- Expected output: 1.75
+-- Get extension version
+select gaggle_get_version();
 
--- 3. Unload the model when we're done with it
-select gaggle_unload_model('linear_model');
+-- Download and get local path
+select gaggle_download('habedi/flickr-8k-dataset-clean');
 
--- 4. Check the Gaggle version
-select gaggle_get_version();
+-- Get raw JSON results from search
+select gaggle_search('flickr', 1, 10) as search_results;
+
+select gaggle_list_files('habedi/flickr-8k-dataset-clean') as files;
+
+select gaggle_info('habedi/flickr-8k-dataset-clean') as metadata;
+
+-- Read a CSV file directly from local path after download
+select *
+from read_csv_auto('/path/to/downloaded/dataset/file.csv') limit 10;
 ```
 
 [![Simple Demo 1](https://asciinema.org/a/745806.svg)](https://asciinema.org/a/745806)
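As a follow-up to the snippet above, the downloaded file can also be materialized as a regular DuckDB table. This is a small sketch, not part of the commit; it reuses the same placeholder path as the example above (substitute the path returned by `gaggle_download`):

```sql
-- A sketch, not from this commit: persist the downloaded CSV as a table.
-- '/path/to/downloaded/dataset/file.csv' is the placeholder path from the
-- snippet above; replace it with the path gaggle_download() returned.
create table flickr8k as
select * from read_csv_auto('/path/to/downloaded/dataset/file.csv');

-- Query it like any other DuckDB table.
select count(*) from flickr8k;
```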
 
+#### API Functions
+
+| Function                                 | Description                               |
+|------------------------------------------|-------------------------------------------|
+| `gaggle_set_credentials(username, key)`  | Set Kaggle API credentials                |
+| `gaggle_search(query, page, page_size)`  | Search for datasets on Kaggle             |
+| `gaggle_download(dataset_path)`          | Download a dataset and return local path  |
+| `gaggle_list_files(dataset_path)`        | List files in a dataset (JSON array)      |
+| `gaggle_info(dataset_path)`              | Get dataset metadata (JSON object)        |
+| `gaggle_get_version()`                   | Get extension version info                |
+| `gaggle_clear_cache()`                   | Clear the local dataset cache             |
+| `gaggle_get_cache_info()`                | Get cache statistics                      |
+
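The cache functions at the bottom of the table are not exercised by the demo above. A minimal sketch of how they might be combined in a session; the shape of their return values is not documented in this commit, so the aliases are illustrative only:

```sql
-- Illustrative only: inspect the cache, clear it, then inspect it again.
-- The return format of these functions is not specified in this commit.
select gaggle_get_cache_info() as cache_info_before;

select gaggle_clear_cache();

select gaggle_get_cache_info() as cache_info_after;
```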
+#### Configuration
+
+Gaggle can be configured via environment variables:
+
+- `KAGGLE_USERNAME` - Your Kaggle username
+- `KAGGLE_KEY` - Your Kaggle API key
+- `GAGGLE_CACHE_DIR` - Directory for caching datasets (default: system cache dir)
+- `GAGGLE_VERBOSE` - Enable verbose logging (default: false)
+- `GAGGLE_HTTP_TIMEOUT` - HTTP timeout in seconds (default: 30)
+
+Alternatively, create `~/.kaggle/kaggle.json`:
+
+```json
+{
+  "username": "your-username",
+  "key": "your-api-key"
+}
+```
+
+##### JSON Parsing
+
+> [!TIP]
+> Gaggle returns JSON data for search results, file lists, and metadata.
+> For advanced JSON parsing, you can optionally load the JSON DuckDB extension:
+> ```sql
+> install json;
+> load json;
+> select * from json_each(gaggle_search('covid-19', 1, 10));
+> ```
+> If the JSON extension is not available, you can still access the raw JSON strings and work with them directly.
+
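Building on that tip, individual fields can be pulled out of each result row once the JSON extension is loaded. A sketch, assuming each search result object carries a `title` key (a hypothetical field name, since the result schema is not shown in this commit):

```sql
install json;
load json;

-- 'title' is a hypothetical key; adjust it to the actual search-result schema.
select json_extract_string(value, '$.title') as title
from json_each(gaggle_search('flickr', 1, 10));
```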
 ---
 
 ### Documentation

ROADMAP.md

Lines changed: 88 additions & 57 deletions
@@ -1,70 +1,101 @@
 ## Feature Roadmap
 
-This document includes the roadmap for the Infera DuckDB extension.
+This document includes the roadmap for the Gaggle DuckDB extension.
 It outlines features to be implemented and their current status.
 
 > [!IMPORTANT]
 > This roadmap is a work in progress and is subject to change.
 
-### 1. Inference Interface
-
-* **Input Data Types**
-    * [x] `FLOAT` features from table columns.
-    * [x] Type casting from `INTEGER`, `BIGINT`, and `DOUBLE` columns.
-    * [x] Type casting from `DECIMAL` columns.
-    * [x] `BLOB` input for tensor data.
-    * [ ] `STRUCT` or `MAP` input for named features.
-* **Output Data Types**
-    * [x] Single `FLOAT` scalar output.
-    * [x] Multiple `FLOAT` outputs as a `VARCHAR` containing JSON.
-    * [x] Multiple `FLOAT` outputs as a `LIST[FLOAT]`.
-* **Batch Processing**
-    * [x] Inference on batches for models with dynamic dimensions.
-    * [ ] Automatic batch splitting for models with a fixed batch size.
-
-### 2. Model Management API
-
-* **Model Loading**
-    * [x] Load models from local file paths.
-    * [x] Load models from URLs with local caching.
-    * [x] Load all `.onnx` models from a directory.
-* **Model Lifecycle**
-    * [x] Unload models from memory.
-    * [x] List loaded models.
-    * [x] Get model metadata as a JSON object.
-    * [x] Check if a model is currently loaded.
-    * [x] Cache eviction policies for remote models.
-
-### 3. Performance and Concurrency
+### 1. Kaggle API Integration
+
+* **Authentication**
+    * [x] Set Kaggle API credentials programmatically.
+    * [x] Support environment variables (`KAGGLE_USERNAME`, `KAGGLE_KEY`).
+    * [x] Support the `~/.kaggle/kaggle.json` file.
+    * [ ] Support OAuth2 authentication.
+* **Dataset Operations**
+    * [x] Search for datasets by query.
+    * [x] Download datasets from Kaggle.
+    * [x] List files in a dataset.
+    * [x] Get dataset metadata.
+    * [ ] Upload datasets to Kaggle.
+    * [ ] Delete datasets from Kaggle.
+
+### 2. Caching and Storage
+
+* **Cache Management**
+    * [x] Automatic caching of downloaded datasets.
+    * [x] Clear cache functionality.
+    * [x] Get cache information (size, location).
+    * [ ] Configurable cache size limits.
+    * [ ] Cache expiration policies.
+    * [ ] Partial file downloads and resumable transfers.
+* **Storage**
+    * [x] Store datasets in a configurable directory.
+    * [ ] Support for cloud storage backends (S3, GCS, Azure).
+
+### 3. Data Integration
+
+* **File Format Support**
+    * [x] CSV file reading integration.
+    * [ ] JSON file reading.
+    * [x] Parquet file reading (via DuckDB).
+    * [ ] Excel/XLSX file reading.
+    * [ ] TSV/TXT file reading.
+* **Direct Query Integration** (see the sketch after this section)
+    * [ ] Direct SQL queries on remote datasets without full download.
+    * [ ] Streaming data from Kaggle without caching.
+    * [ ] Virtual table support for lazy loading.
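The README feature list removed earlier in this commit advertised direct SQL access through a `kaggle:` URI, which hints at what the unimplemented "Direct Query Integration" items above could look like. A sketch of that possible future interface, reusing the example dataset path from the old Quick Start (illustrative only, not supported in this commit):

```sql
-- Not yet implemented: a possible direct-query syntax borrowed from the
-- 'kaggle:owner/dataset/file.csv' style in the pre-commit README.
select *
from 'kaggle:owid/covid-latest-data/owid-covid-latest.csv'
limit 10;
```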
+
+### 4. Performance and Concurrency
 
 * **Concurrency Control**
-    * [x] Thread-safe model store for concurrent queries.
-* **Data Transfer**
-    * [ ] Process `BLOB` columns in a single FFI call.
-    * [ ] Zero-copy data transfer between DuckDB and Rust.
-* **Hardware Support**
-    * [ ] GPU support for inference via an alternative backend.
-
-### 4. Backend and Format Support
-
-* **ONNX Standard**
-    * [x] Support for ONNX operators via the `tract` engine.
-    * [ ] Support for models with named inputs and outputs.
-* **Alternative Backends**
-    * [ ] An optional build using the ONNX Runtime backend.
-* **Other Formats**
-    * [ ] Support for other model formats like TorchScript or TensorFlow Lite.
-
-### 5. Miscellaneous
-
-* **SQL Functions**
-    * [x] Consistent function names for the public API.
-* **Error Handling**
-    * [x] Error messages for missing models, incorrect argument counts, and NULL inputs.
-    * [ ] Error messages with more specific details.
+    * [x] Thread-safe credential storage.
+    * [x] Thread-safe cache access.
+    * [ ] Concurrent dataset downloads.
+* **Network Optimization**
+    * [ ] Connection pooling for Kaggle API requests.
+    * [ ] Retry logic with exponential backoff.
+    * [x] Configurable HTTP timeouts.
+* **Caching Strategy**
+    * [ ] Incremental cache updates.
+    * [ ] Background cache synchronization.
+
+### 5. Error Handling and Resilience
+
+* **Error Messages**
+    * [x] Clear error messages for invalid credentials.
+    * [x] Clear error messages for missing datasets.
+    * [x] Clear error messages for NULL inputs.
+    * [ ] Detailed error codes for programmatic error handling.
+* **Resilience**
+    * [ ] Automatic retry on network failures.
+    * [ ] Graceful degradation when the Kaggle API is unavailable.
+    * [ ] Local-only mode for cached datasets.
+
+### 6. Documentation and Distribution
+
 * **Documentation**
-    * [x] `README.md` file with API reference and examples.
-    * [ ] An official DuckDB extension documentation page.
+    * [x] API reference in README.md.
+    * [x] Usage examples in docs/examples/.
+    * [ ] Tutorial documentation.
+    * [ ] FAQ section.
+    * [ ] Troubleshooting guide.
+* **Testing**
+    * [x] Unit tests for all modules.
+    * [x] SQL integration tests.
+    * [ ] End-to-end integration tests.
+    * [ ] Performance benchmarks.
 * **Distribution**
     * [ ] Pre-compiled extension binaries for Linux, macOS, and Windows.
     * [ ] Submission to the DuckDB Community Extensions repository.
+    * [ ] Docker image with Gaggle pre-installed.
+
+### 7. Future Enhancements
+
+* **Advanced Features**
+    * [ ] Dataset versioning and history tracking.
+    * [ ] Collaborative dataset sharing.
+    * [ ] Integration with other data sources (HuggingFace, GitHub, etc.).
+    * [ ] Data transformation and preprocessing functions.
+    * [ ] Dataset quality metrics and validation.
