@@ -18,89 +18,26 @@ Kaggle Datasets for DuckDB
1818
1919---
2020
21- Gaggle is a DuckDB extension that allows you to read and write Kaggle datasets directly in SQL queries,
22- treating them as if they were regular DuckDB tables. It is developed in Rust and provides seamless integration
23- with the Kaggle API for dataset discovery, download, and management.
21+ Gaggle is a DuckDB extension that allows you to read and write Kaggle datasets directly in SQL queries, as if
22+ they were DuckDB tables.
23+ It is written in Rust and uses the official Kaggle API to search, download, and manage datasets.
2424
25- ### Motivation
26-
27- Data scientists and analysts often need to work with datasets from Kaggle. The traditional workflow involves:
28- 1. Manually downloading datasets from Kaggle's website
29- 2. Extracting ZIP files
30- 3. Loading CSV/Parquet files into your analysis environment
31- 4. Managing storage and updates
32-
33- Gaggle simplifies this workflow by integrating Kaggle datasets directly into DuckDB. You can:
34- - Query Kaggle datasets as if they were local tables
35- - Search and discover datasets without leaving SQL
36- - Automatically cache datasets locally for fast access
37- - Manage dataset versions and updates
25+ Kaggle hosts a large collection of useful datasets for data science and machine learning work.
26+ Accessing these datasets typically involves several manual steps: downloading a dataset (usually as a ZIP file),
27+ extracting it, loading it into your analysis environment, and managing storage and dataset updates.
28+ This workflow can be simplified and optimized by bringing the datasets directly into DuckDB.
29+ This is where Gaggle comes in.
30+ It provides a few simple SQL functions that let you read and write Kaggle datasets as if they were DuckDB tables.
3831
3932### Features
4033
41- - **Direct SQL access** to Kaggle datasets with `SELECT * FROM 'kaggle:owner/dataset/file.csv'`
42- - **Search datasets** using `gaggle_search('query')`
43- - **Download and cache** datasets automatically
44- - **List files** in datasets before loading
45- - **Get metadata** about datasets including size, description, and update info
46- - **Credential management** via environment variables, config file, or SQL
47- - **Automatic caching** for fast repeated access
34+ - Has a simple API (just a few SQL functions)
35+ - Allows you to search, download, update, and delete Kaggle datasets directly from DuckDB
36+ - Supports datasets made of CSV, JSON, and Parquet files
37+ - Configurable and has built-in caching support
4838- Thread-safe and memory-efficient
4939
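For a rough sketch of how these pieces fit together in a session (the dataset path is the one used later in this README, and the CSV file path is a placeholder, not a documented file in that dataset):

```sql
-- Minimal end-to-end sketch: search, download, then query a file locally.
select gaggle_search('flickr', 1, 5) as results;
select gaggle_download('habedi/flickr-8k-dataset-clean') as local_path;
-- Placeholder path; use the actual location reported by gaggle_download.
select * from read_csv_auto('/path/to/downloaded/dataset/file.csv') limit 5;
```
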
50- ### Quick Start
51-
52- ``` sql
53- -- Set your Kaggle credentials (or use ~/.kaggle/kaggle.json)
54- SELECT gaggle_set_credentials('your-username', 'your-api-key');
55-
56- -- Search for datasets
57- SELECT * FROM json_each(gaggle_search('covid-19', 1, 10));
58-
59- -- Read a Kaggle dataset directly
60- SELECT * FROM 'kaggle:owid/covid-latest-data/owid-covid-latest.csv' LIMIT 10;
61-
62- -- Download and get local path
63- SELECT gaggle_download('owid/covid-latest-data');
64-
65- -- List files in a dataset
66- SELECT * FROM json_each(gaggle_list_files('owid/covid-latest-data'));
67-
68- -- Get dataset metadata
69- SELECT * FROM json_each(gaggle_info('owid/covid-latest-data'));
70- ```
71-
72- ### API Functions
73-
74- | Function | Description |
75- | ----------| -------------|
76- | `gaggle_set_credentials(username, key)` | Set Kaggle API credentials |
77- | `gaggle_search(query, page, page_size)` | Search for datasets on Kaggle |
78- | `gaggle_download(dataset_path)` | Download a dataset and return local path |
79- | `gaggle_list_files(dataset_path)` | List files in a dataset (JSON array) |
80- | `gaggle_info(dataset_path)` | Get dataset metadata (JSON object) |
81- | `gaggle_get_version()` | Get extension version info |
82- | `gaggle_clear_cache()` | Clear the local dataset cache |
83- | `gaggle_get_cache_info()` | Get cache statistics |
84-
85- ### Configuration
86-
87- Gaggle can be configured via environment variables:
88-
89- - `KAGGLE_USERNAME` - Your Kaggle username
90- - `KAGGLE_KEY` - Your Kaggle API key
91- - `GAGGLE_CACHE_DIR` - Directory for caching datasets (default: system cache dir)
92- - `GAGGLE_VERBOSE` - Enable verbose logging (default: false)
93- - `GAGGLE_HTTP_TIMEOUT` - HTTP timeout in seconds (default: 30)
94-
95- Alternatively, create `~/.kaggle/kaggle.json`:
96- ``` json
97- {
98- "username": "your-username",
99- "key": "your-api-key"
100- }
101- ```
102-
103- See the [ROADMAP.md](ROADMAP.md) for planned features and the [docs](docs/) folder for detailed documentation.
40+ See the [ROADMAP.md](ROADMAP.md) for planned features and the [docs](docs) folder for detailed documentation.
10441
10542> [ !IMPORTANT]
10643> Gaggle is in early development, so bugs and breaking changes are expected.
@@ -148,33 +85,79 @@ make release
14885 > You can download the pre-built binaries from the [releases page](https://github.com/CogitatorTech/gaggle/releases) for
14986> your platform.
15087
151-
15288#### Trying Gaggle
15389
15490``` sql
155- -- 0. Install and load Gaggle
156- -- Skip this step if you built from source and ran `./build/release/duckdb`
157- install gaggle from community;
158- load gaggle;
91+ -- Load the Gaggle extension
92+ load 'build/release/extension/gaggle/gaggle.duckdb_extension';
15993
160- -- 1. Load a simple linear model from a remote URL
161- select gaggle_load_model('linear_model',
162- 'https://github.com/CogitatorTech/gaggle/raw/refs/heads/main/test/models/linear.onnx');
94+ -- Set your Kaggle credentials (or use `~/.kaggle/kaggle.json`)
95+ select gaggle_set_credentials('your-username', 'your-api-key');
16396
164- -- 2. Run a prediction using a very simple linear model
165- -- Model: y = 2*x1 - 1*x2 + 0.5*x3 + 0.25
166- select gaggle_predict('linear_model', 1.0, 2.0, 3.0);
167- -- Expected output: 1.75
97+ -- Get extension version
98+ select gaggle_get_version();
16899
169- -- 3. Unload the model when we're done with it
170- select gaggle_unload_model('linear_model');
100+ -- Download and get local path
101+ select gaggle_download('habedi/flickr-8k-dataset-clean');
171102
172- -- 4. Check the Gaggle version
173- select gaggle_get_version();
103+ -- Get raw JSON results from search
104+ select gaggle_search('flickr', 1, 10) as search_results;
105+
106+ select gaggle_list_files('habedi/flickr-8k-dataset-clean') as files;
107+
108+ select gaggle_info('habedi/flickr-8k-dataset-clean') as metadata;
109+
110+ -- Read a CSV file directly from the local path after download
111+ select *
112+ from read_csv_auto('/path/to/downloaded/dataset/file.csv') limit 10;
174113```
175114
176115[ ![ Simple Demo 1] ( https://asciinema.org/a/745806.svg )] ( https://asciinema.org/a/745806 )
177116
117+ #### API Functions
118+
119+ | Function | Description |
120+ | -----------------------------------------| ------------------------------------------|
121+ | `gaggle_set_credentials(username, key)` | Set Kaggle API credentials |
122+ | `gaggle_search(query, page, page_size)` | Search for datasets on Kaggle |
123+ | `gaggle_download(dataset_path)` | Download a dataset and return local path |
124+ | `gaggle_list_files(dataset_path)` | List files in a dataset (JSON array) |
125+ | `gaggle_info(dataset_path)` | Get dataset metadata (JSON object) |
126+ | `gaggle_get_version()` | Get extension version info |
127+ | `gaggle_clear_cache()` | Clear the local dataset cache |
128+ | `gaggle_get_cache_info()` | Get cache statistics |
129+
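The cache-related functions are not used in the demo above; the short sketch below shows how they might be combined. The exact fields returned by `gaggle_get_cache_info()` are not documented here, so treat its output as illustrative.

```sql
-- Inspect the local dataset cache (returned details are illustrative).
select gaggle_get_cache_info() as cache_info;

-- Remove all cached datasets to free disk space.
select gaggle_clear_cache();

-- A later download re-fetches the dataset from Kaggle and re-populates the cache.
select gaggle_download('habedi/flickr-8k-dataset-clean');
```
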
130+ #### Configuration
131+
132+ Gaggle can be configured via environment variables:
133+
134+ - `KAGGLE_USERNAME` - Your Kaggle username
135+ - `KAGGLE_KEY` - Your Kaggle API key
136+ - `GAGGLE_CACHE_DIR` - Directory for caching datasets (default: system cache dir)
137+ - `GAGGLE_VERBOSE` - Enable verbose logging (default: false)
138+ - `GAGGLE_HTTP_TIMEOUT` - HTTP timeout in seconds (default: 30)
139+
140+ Alternatively, create `~/.kaggle/kaggle.json`:
141+
142+ ``` json
143+ {
144+ "username" : " your-username" ,
145+ "key" : " your-api-key"
146+ }
147+ ```
148+
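With either the environment variables or `~/.kaggle/kaggle.json` in place, an explicit `gaggle_set_credentials` call should not be needed. A quick way to check is to run any function that talks to the Kaggle API; the search term below is just an arbitrary example:

```sql
-- Assumes credentials come from KAGGLE_USERNAME/KAGGLE_KEY or ~/.kaggle/kaggle.json.
-- 'titanic' is an arbitrary example query.
select gaggle_search('titanic', 1, 5) as search_results;
```
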
149+ ##### JSON Parsing
150+
151+ > [!TIP]
152+ > Gaggle returns JSON data for search results, file lists, and metadata.
153+ > For advanced JSON parsing, you can optionally load the JSON DuckDB extension:
154+ > ``` sql
155+ > install json;
156+ > load json;
157+ > select * from json_each(gaggle_search('covid-19', 1, 10));
158+ > ```
159+ > If the JSON extension is not available, you can still access the raw JSON strings and work with them directly.
160+
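For example, with the JSON extension loaded, DuckDB's JSON functions can pull individual values out of the strings Gaggle returns. The metadata field names used below (`title`, `totalBytes`) are guesses at what the Kaggle metadata might contain, not a documented schema:

```sql
install json;
load json;

-- Expand the file list (a JSON array) into one row per entry.
select value as file_entry
from json_each(gaggle_list_files('habedi/flickr-8k-dataset-clean'));

-- Pull assumed fields out of the dataset metadata (field names are guesses).
select
    json_extract_string(gaggle_info('habedi/flickr-8k-dataset-clean'), '$.title') as title,
    json_extract_string(gaggle_info('habedi/flickr-8k-dataset-clean'), '$.totalBytes') as total_bytes;
```
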
178161---
179162
180163### Documentation