Gaggle is a DuckDB extension that allows you to work with Kaggle datasets directly in SQL queries, as if they were DuckDB tables. It is written in Rust and uses the Kaggle API to search, download, and manage the datasets.
Kaggle hosts a large collection of datasets that are useful for data science and machine learning. Accessing these datasets typically involves manually downloading a dataset (as a ZIP file), extracting it, loading its files into your data science environment, and managing storage and dataset updates. This workflow can quickly become complex, especially when working with multiple datasets or when datasets are updated frequently. Gaggle simplifies this process by hiding that complexity and letting you work with the datasets directly inside DuckDB, where you can run fast analytical queries on the data.
In essence, Gaggle turns DuckDB into a SQL-enabled frontend for Kaggle datasets.
- Provides a simple API to interact with Kaggle datasets from DuckDB
- Allows you to search, download, and read datasets from Kaggle
- Supports datasets that contain CSV, Parquet, JSON, and XLSX files
- Supports dataset updates and versioning
- Configurable and has built-in caching support to avoid re-downloading
- Thread-safe, fast, and has a low memory footprint
See the ROADMAP.md for the list of implemented and planned features.
Important
Gaggle is in early development, so bugs and breaking changes are expected. Please use the issues page to report bugs or request features.
You can install and load Gaggle from the DuckDB community extensions repository by running the following SQL commands in the DuckDB shell:
install gaggle from community;
load gaggle;
Alternatively, you can build Gaggle from source and use it by following these steps:
- Clone the repository and build the Gaggle extension from source:
git clone --recursive https://github.com/CogitatorTech/gaggle.git
cd gaggle
# This might take a while to run
make release
- Start DuckDB shell (with Gaggle statically linked to it):
./build/release/duckdb
Note
After building from source, the Gaggle extension binary is located at build/release/extension/gaggle/gaggle.duckdb_extension.
You can load it by running `load 'build/release/extension/gaggle/gaggle.duckdb_extension';` in the DuckDB shell.
Note that the extension binary will only work with the DuckDB version that it was built against.
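As a minimal sketch of loading the locally built binary into a separate DuckDB shell: the path is the one given above, and the session is assumed to have been started with DuckDB's `-unsigned` flag (for example `duckdb -unsigned`), since a stock shell generally refuses extensions that are not signed. The DuckDB version must also match the one the extension was built against.

```sql
-- Assumes the shell was started as: duckdb -unsigned
-- and that the current directory is the root of the gaggle checkout.
load 'build/release/extension/gaggle/gaggle.duckdb_extension';

-- Confirm that the extension is loaded by asking for its version
select gaggle_version();
```

If the load step fails with a version mismatch, rebuilding against the DuckDB version you are running is the likely fix.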
You can also download pre-built binaries for your platform from the releases page.
-- Get extension version
select gaggle_version();
-- List files in the dataset
-- (Note that if the dataset has not been downloaded yet, it will be downloaded and cached)
select *
from gaggle_ls('habedi/flickr-8k-dataset-clean') limit 5;
-- Read a Parquet file from local cache using a prepared statement
-- (DuckDB doesn't allow the use of subqueries in function arguments, so we use a prepared statement)
prepare rp as select * from read_parquet(?) limit 10;
execute rp(gaggle_file_path('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet'));
-- Alternatively, we can use a replacement scan to read directly via `kaggle:` prefix
select count(*)
from 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';
-- Optionally, we check cache info
select gaggle_cache_info();
-- Check whether the cached dataset is current (that is, the newest version)
select gaggle_is_current('habedi/flickr-8k-dataset-clean');
Check out the docs directory for the API documentation, how to build Gaggle from source, and more.
Check out the examples directory for SQL scripts that show how to use Gaggle.
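As one more illustration, the `kaggle:` replacement scan from the quickstart can likely be used to copy a dataset file into a regular DuckDB table, so that later queries do not go through the extension at all. This is only a sketch: it assumes the replacement scan works inside `create table ... as` the same way it does in a plain `select`, and the table name `flickr8k_local` is made up for this example.

```sql
-- Copy the Parquet file into a local table
-- (flickr8k_local is a hypothetical name chosen for illustration)
create table flickr8k_local as
select *
from 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';

-- Inspect the schema and row count of the local copy
describe flickr8k_local;
select count(*) from flickr8k_local;
```

A local copy like this does not follow dataset updates, so `gaggle_is_current` remains useful for deciding when to recreate it.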
See CONTRIBUTING.md for details on how to make a contribution.
Gaggle is available under either of the following licenses:
- MIT License (LICENSE-MIT)
- Apache License, Version 2.0 (LICENSE-APACHE)
- The logo is from here with some modifications.