
Commit 9a14502

Improve project's organization
1 parent 056b59f commit 9a14502

File tree: 10 files changed (+313, -459 lines)


README.md

Lines changed: 76 additions & 93 deletions
@@ -18,89 +18,26 @@ Kaggle Datasets for DuckDB
 
 ---
 
-Gaggle is a DuckDB extension that allows you to read and write Kaggle datasets directly in SQL queries,
-treating them as if they were regular DuckDB tables. It is developed in Rust and provides seamless integration
-with the Kaggle API for dataset discovery, download, and management.
+Gaggle is a DuckDB extension that allows you to read and write Kaggle datasets directly in SQL queries, as if
+they were DuckDB tables.
+It is written in Rust and uses the official Kaggle API to search, download, and manage datasets.

-### Motivation
-
-Data scientists and analysts often need to work with datasets from Kaggle. The traditional workflow involves:
-1. Manually downloading datasets from Kaggle's website
-2. Extracting ZIP files
-3. Loading CSV/Parquet files into your analysis environment
-4. Managing storage and updates
-
-Gaggle simplifies this workflow by integrating Kaggle datasets directly into DuckDB. You can:
-- Query Kaggle datasets as if they were local tables
-- Search and discover datasets without leaving SQL
-- Automatically cache datasets locally for fast access
-- Manage dataset versions and updates
+Kaggle hosts a large collection of useful datasets for data science and machine learning work.
+Accessing these datasets typically involves multiple manual steps: downloading a dataset as a ZIP file,
+extracting it, loading it into your analysis environment, and managing storage and dataset updates.
+This workflow can be simplified by bringing the datasets directly into DuckDB, and this is where Gaggle comes in.
+It provides a few simple SQL functions that let you read and write Kaggle datasets as if they were DuckDB tables.
 
 ### Features
 
-- **Direct SQL access** to Kaggle datasets with `SELECT * FROM 'kaggle:owner/dataset/file.csv'`
-- **Search datasets** using `gaggle_search('query')`
-- **Download and cache** datasets automatically
-- **List files** in datasets before loading
-- **Get metadata** about datasets including size, description, and update info
-- **Credential management** via environment variables, config file, or SQL
-- **Automatic caching** for fast repeated access
+- Has a simple API (just a few SQL functions)
+- Allows you to search, download, update, and delete Kaggle datasets directly from DuckDB
+- Supports datasets made of CSV, JSON, and Parquet files
+- Configurable, with built-in caching support
 - Thread-safe and memory-efficient
 
-### Quick Start
-
-```sql
--- Set your Kaggle credentials (or use ~/.kaggle/kaggle.json)
-SELECT gaggle_set_credentials('your-username', 'your-api-key');
-
--- Search for datasets
-SELECT * FROM json_each(gaggle_search('covid-19', 1, 10));
-
--- Read a Kaggle dataset directly
-SELECT * FROM 'kaggle:owid/covid-latest-data/owid-covid-latest.csv' LIMIT 10;
-
--- Download and get local path
-SELECT gaggle_download('owid/covid-latest-data');
-
--- List files in a dataset
-SELECT * FROM json_each(gaggle_list_files('owid/covid-latest-data'));
-
--- Get dataset metadata
-SELECT * FROM json_each(gaggle_info('owid/covid-latest-data'));
-```
-
-### API Functions
-
-| Function | Description |
-|----------|-------------|
-| `gaggle_set_credentials(username, key)` | Set Kaggle API credentials |
-| `gaggle_search(query, page, page_size)` | Search for datasets on Kaggle |
-| `gaggle_download(dataset_path)` | Download a dataset and return local path |
-| `gaggle_list_files(dataset_path)` | List files in a dataset (JSON array) |
-| `gaggle_info(dataset_path)` | Get dataset metadata (JSON object) |
-| `gaggle_get_version()` | Get extension version info |
-| `gaggle_clear_cache()` | Clear the local dataset cache |
-| `gaggle_get_cache_info()` | Get cache statistics |
-
-### Configuration
-
-Gaggle can be configured via environment variables:
-
-- `KAGGLE_USERNAME` - Your Kaggle username
-- `KAGGLE_KEY` - Your Kaggle API key
-- `GAGGLE_CACHE_DIR` - Directory for caching datasets (default: system cache dir)
-- `GAGGLE_VERBOSE` - Enable verbose logging (default: false)
-- `GAGGLE_HTTP_TIMEOUT` - HTTP timeout in seconds (default: 30)
-
-Alternatively, create `~/.kaggle/kaggle.json`:
-```json
-{
-  "username": "your-username",
-  "key": "your-api-key"
-}
-```
-
-See the [ROADMAP.md](ROADMAP.md) for planned features and the [docs](docs/) folder for detailed documentation.
+See the [ROADMAP.md](ROADMAP.md) for planned features and the [docs](docs) folder for detailed documentation.
 
 > [!IMPORTANT]
 > Gaggle is in early development, so bugs and breaking changes are expected.
@@ -148,33 +85,79 @@ make release
 > You can download the pre-built binaries from the [releases page](https://github.com/CogitatorTech/gaggle/releases) for
 > your platform.
 
-
 #### Trying Gaggle
 
 ```sql
--- 0. Install and load Gaggle
--- Skip this step if you built from source and ran `./build/release/duckdb`
-install gaggle from community;
-load gaggle;
+-- Load the Gaggle extension
+load 'build/release/extension/gaggle/gaggle.duckdb_extension';
 
--- 1. Load a simple linear model from a remote URL
-select gaggle_load_model('linear_model',
-    'https://github.com/CogitatorTech/gaggle/raw/refs/heads/main/test/models/linear.onnx');
+-- Set your Kaggle credentials (or use `~/.kaggle/kaggle.json`)
+select gaggle_set_credentials('your-username', 'your-api-key');
 
--- 2. Run a prediction using a very simple linear model
--- Model: y = 2*x1 - 1*x2 + 0.5*x3 + 0.25
-select gaggle_predict('linear_model', 1.0, 2.0, 3.0);
--- Expected output: 1.75
+-- Get extension version
+select gaggle_get_version();
 
--- 3. Unload the model when we're done with it
-select gaggle_unload_model('linear_model');
+-- Download and get local path
+select gaggle_download('habedi/flickr-8k-dataset-clean');
 
--- 4. Check the Gaggle version
-select gaggle_get_version();
+-- Get raw JSON results from search
+select gaggle_search('flickr', 1, 10) as search_results;
+
+select gaggle_list_files('habedi/flickr-8k-dataset-clean') as files;
+
+select gaggle_info('habedi/flickr-8k-dataset-clean') as metadata;
+
+-- Read a CSV file directly from local path after download
+select *
+from read_csv_auto('/path/to/downloaded/dataset/file.csv') limit 10;
 ```
 
 [![Simple Demo 1](https://asciinema.org/a/745806.svg)](https://asciinema.org/a/745806)
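As a follow-up to the snippet above, the downloaded file can also be materialized as a regular DuckDB table. This is a small sketch, not part of the commit; it reuses the same placeholder path as the example above (substitute the path returned by `gaggle_download`):

```sql
-- A sketch, not from this commit: persist the downloaded CSV as a table.
-- '/path/to/downloaded/dataset/file.csv' is the placeholder path from the
-- snippet above; replace it with the path gaggle_download() returned.
create table flickr8k as
select * from read_csv_auto('/path/to/downloaded/dataset/file.csv');

-- Query it like any other DuckDB table.
select count(*) from flickr8k;
```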
 
+#### API Functions
+
+| Function                                 | Description                               |
+|------------------------------------------|-------------------------------------------|
+| `gaggle_set_credentials(username, key)`  | Set Kaggle API credentials                |
+| `gaggle_search(query, page, page_size)`  | Search for datasets on Kaggle             |
+| `gaggle_download(dataset_path)`          | Download a dataset and return local path  |
+| `gaggle_list_files(dataset_path)`        | List files in a dataset (JSON array)      |
+| `gaggle_info(dataset_path)`              | Get dataset metadata (JSON object)        |
+| `gaggle_get_version()`                   | Get extension version info                |
+| `gaggle_clear_cache()`                   | Clear the local dataset cache             |
+| `gaggle_get_cache_info()`                | Get cache statistics                      |
+
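The cache functions at the bottom of the table are not exercised by the demo above. A minimal sketch of how they might be combined in a session; the shape of their return values is not documented in this commit, so the aliases are illustrative only:

```sql
-- Illustrative only: inspect the cache, clear it, then inspect it again.
-- The return format of these functions is not specified in this commit.
select gaggle_get_cache_info() as cache_info_before;

select gaggle_clear_cache();

select gaggle_get_cache_info() as cache_info_after;
```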
+#### Configuration
+
+Gaggle can be configured via environment variables:
+
+- `KAGGLE_USERNAME` - Your Kaggle username
+- `KAGGLE_KEY` - Your Kaggle API key
+- `GAGGLE_CACHE_DIR` - Directory for caching datasets (default: system cache dir)
+- `GAGGLE_VERBOSE` - Enable verbose logging (default: false)
+- `GAGGLE_HTTP_TIMEOUT` - HTTP timeout in seconds (default: 30)
+
+Alternatively, create `~/.kaggle/kaggle.json`:
+
+```json
+{
+  "username": "your-username",
+  "key": "your-api-key"
+}
+```
+
+##### JSON Parsing
+
+> [!TIP]
+> Gaggle returns JSON data for search results, file lists, and metadata.
+> For advanced JSON parsing, you can optionally load the JSON DuckDB extension:
+> ```sql
+> install json;
+> load json;
+> select * from json_each(gaggle_search('covid-19', 1, 10));
+> ```
+> If the JSON extension is not available, you can still access the raw JSON strings and work with them directly.
+
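Building on that tip, individual fields can be pulled out of each result row once the JSON extension is loaded. A sketch, assuming each search result object carries a `title` key (a hypothetical field name, since the result schema is not shown in this commit):

```sql
install json;
load json;

-- 'title' is a hypothetical key; adjust it to the actual search-result schema.
select json_extract_string(value, '$.title') as title
from json_each(gaggle_search('flickr', 1, 10));
```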
 ---
 
 ### Documentation

ROADMAP.md

Lines changed: 88 additions & 57 deletions
@@ -1,70 +1,101 @@
 ## Feature Roadmap
 
-This document includes the roadmap for the Infera DuckDB extension.
+This document includes the roadmap for the Gaggle DuckDB extension.
 It outlines features to be implemented and their current status.
 
 > [!IMPORTANT]
 > This roadmap is a work in progress and is subject to change.
 
-### 1. Inference Interface
-
-* **Input Data Types**
-    * [x] `FLOAT` features from table columns.
-    * [x] Type casting from `INTEGER`, `BIGINT`, and `DOUBLE` columns.
-    * [x] Type casting from `DECIMAL` columns.
-    * [x] `BLOB` input for tensor data.
-    * [ ] `STRUCT` or `MAP` input for named features.
-* **Output Data Types**
-    * [x] Single `FLOAT` scalar output.
-    * [x] Multiple `FLOAT` outputs as a `VARCHAR` containing JSON.
-    * [x] Multiple `FLOAT` outputs as a `LIST[FLOAT]`.
-* **Batch Processing**
-    * [x] Inference on batches for models with dynamic dimensions.
-    * [ ] Automatic batch splitting for models with a fixed batch size.
-
-### 2. Model Management API
-
-* **Model Loading**
-    * [x] Load models from local file paths.
-    * [x] Load models from URLs with local caching.
-    * [x] Load all `.onnx` models from a directory.
-* **Model Lifecycle**
-    * [x] Unload models from memory.
-    * [x] List loaded models.
-    * [x] Get model metadata as a JSON object.
-    * [x] Check if a model is currently loaded.
-    * [x] Cache eviction policies for remote models.
-
-### 3. Performance and Concurrency
+### 1. Kaggle API Integration
+
+* **Authentication**
+    * [x] Set Kaggle API credentials programmatically.
+    * [x] Support environment variables (`KAGGLE_USERNAME`, `KAGGLE_KEY`).
+    * [x] Support the `~/.kaggle/kaggle.json` file.
+    * [ ] Support OAuth2 authentication.
+* **Dataset Operations**
+    * [x] Search for datasets by query.
+    * [x] Download datasets from Kaggle.
+    * [x] List files in a dataset.
+    * [x] Get dataset metadata.
+    * [ ] Upload datasets to Kaggle.
+    * [ ] Delete datasets from Kaggle.
+
+### 2. Caching and Storage
+
+* **Cache Management**
+    * [x] Automatic caching of downloaded datasets.
+    * [x] Clear cache functionality.
+    * [x] Get cache information (size, location).
+    * [ ] Configurable cache size limits.
+    * [ ] Cache expiration policies.
+    * [ ] Partial file downloads and resumable transfers.
+* **Storage**
+    * [x] Store datasets in a configurable directory.
+    * [ ] Support for cloud storage backends (S3, GCS, Azure).
+
+### 3. Data Integration
+
+* **File Format Support**
+    * [x] CSV file reading integration.
+    * [ ] JSON file reading.
+    * [x] Parquet file reading (via DuckDB).
+    * [ ] Excel/XLSX file reading.
+    * [ ] TSV/TXT file reading.
+* **Direct Query Integration** (see the sketch after this section)
+    * [ ] Direct SQL queries on remote datasets without full download.
+    * [ ] Streaming data from Kaggle without caching.
+    * [ ] Virtual table support for lazy loading.
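The README feature list removed earlier in this commit advertised direct SQL access through a `kaggle:` URI, which hints at what the unimplemented "Direct Query Integration" items above could look like. A sketch of that possible future interface, reusing the example dataset path from the old Quick Start (illustrative only, not supported in this commit):

```sql
-- Not yet implemented: a possible direct-query syntax borrowed from the
-- 'kaggle:owner/dataset/file.csv' style in the pre-commit README.
select *
from 'kaggle:owid/covid-latest-data/owid-covid-latest.csv'
limit 10;
```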
+
+### 4. Performance and Concurrency
 
 * **Concurrency Control**
-    * [x] Thread-safe model store for concurrent queries.
-* **Data Transfer**
-    * [ ] Process `BLOB` columns in a single FFI call.
-    * [ ] Zero-copy data transfer between DuckDB and Rust.
-* **Hardware Support**
-    * [ ] GPU support for inference via an alternative backend.
-
-### 4. Backend and Format Support
-
-* **ONNX Standard**
-    * [x] Support for ONNX operators via the `tract` engine.
-    * [ ] Support for models with named inputs and outputs.
-* **Alternative Backends**
-    * [ ] An optional build using the ONNX Runtime backend.
-* **Other Formats**
-    * [ ] Support for other model formats like TorchScript or TensorFlow Lite.
-
-### 5. Miscellaneous
-
-* **SQL Functions**
-    * [x] Consistent function names for the public API.
-* **Error Handling**
-    * [x] Error messages for missing models, incorrect argument counts, and NULL inputs.
-    * [ ] Error messages with more specific details.
+    * [x] Thread-safe credential storage.
+    * [x] Thread-safe cache access.
+    * [ ] Concurrent dataset downloads.
+* **Network Optimization**
+    * [ ] Connection pooling for Kaggle API requests.
+    * [ ] Retry logic with exponential backoff.
+    * [x] Configurable HTTP timeouts.
+* **Caching Strategy**
+    * [ ] Incremental cache updates.
+    * [ ] Background cache synchronization.
+
+### 5. Error Handling and Resilience
+
+* **Error Messages**
+    * [x] Clear error messages for invalid credentials.
+    * [x] Clear error messages for missing datasets.
+    * [x] Clear error messages for NULL inputs.
+    * [ ] Detailed error codes for programmatic error handling.
+* **Resilience**
+    * [ ] Automatic retry on network failures.
+    * [ ] Graceful degradation when the Kaggle API is unavailable.
+    * [ ] Local-only mode for cached datasets.
+
+### 6. Documentation and Distribution
+
 * **Documentation**
-    * [x] `README.md` file with API reference and examples.
-    * [ ] An official DuckDB extension documentation page.
+    * [x] API reference in README.md.
+    * [x] Usage examples in docs/examples/.
+    * [ ] Tutorial documentation.
+    * [ ] FAQ section.
+    * [ ] Troubleshooting guide.
+* **Testing**
+    * [x] Unit tests for all modules.
+    * [x] SQL integration tests.
+    * [ ] End-to-end integration tests.
+    * [ ] Performance benchmarks.
 * **Distribution**
     * [ ] Pre-compiled extension binaries for Linux, macOS, and Windows.
     * [ ] Submission to the DuckDB Community Extensions repository.
+    * [ ] Docker image with Gaggle pre-installed.
+
+### 7. Future Enhancements
+
+* **Advanced Features**
+    * [ ] Dataset versioning and history tracking.
+    * [ ] Collaborative dataset sharing.
+    * [ ] Integration with other data sources (HuggingFace, GitHub, etc.).
+    * [ ] Data transformation and preprocessing functions.
+    * [ ] Dataset quality metrics and validation.
