Description
I did some initial exploration regarding DuckDB, Arrow (and ADBC), and DataFrame: https://github.com/Kotlin/dataframe/tree/duckdb-arrow-exploration
This branch contains the following proof-of-concepts requiring little to no changes to the DataFrame API. Feel free to use them as inspiration for your own projects :)
- DuckDB -> dataframe-jdbc (to be turned into an example)
- PostgreSQL -> DuckDB -> dataframe-jdbc
- MySQL (on H2) -> Arrow ADBC-JDBC -> dataframe-arrow
- PostgreSQL (on H2) -> Arrow ADBC-JDBC -> dataframe-arrow (issues defining arrays and calendar mappings)
- DuckDB -> Arrow -> dataframe-arrow (already present as a test at the moment and as an article by @fb64)
- PostgreSQL -> DuckDB -> Arrow -> dataframe-arrow
- SQLite -> DuckDB -> Arrow -> dataframe-arrow
- And finally some Parquet exploration related to #576 ("Add support for reading parquet file thanks to arrow-dataset") and #577, though this is irrelevant for DuckDB
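
To illustrate the "X -> DuckDB" half of the chains above, here is a minimal Kotlin sketch using plain JDBC. It is not taken from the branch. It assumes the DuckDB JDBC driver (`org.duckdb:duckdb_jdbc`) is on the classpath, that DuckDB can download its `sqlite` extension, and that a SQLite file `example.db` containing a table `people` exists. The file name, table name, alias, and the `attachSql` helper are all hypothetical.

```kotlin
import java.sql.DriverManager

// Hypothetical helper: builds DuckDB's ATTACH statement for an external database.
// Single quotes in the location are doubled so the SQL string literal stays valid.
fun attachSql(location: String, alias: String, type: String): String =
    "ATTACH '${location.replace("'", "''")}' AS $alias (TYPE $type)"

fun main() {
    // "jdbc:duckdb:" opens an in-memory DuckDB database.
    DriverManager.getConnection("jdbc:duckdb:").use { connection ->
        connection.createStatement().use { statement ->
            // Extensions let DuckDB read other databases; INSTALL needs network access once.
            statement.execute("INSTALL sqlite")
            statement.execute("LOAD sqlite")
            statement.execute(attachSql("example.db", "src", "sqlite"))

            // From here on, the SQLite table is queried through DuckDB's SQL dialect and types.
            statement.executeQuery("SELECT * FROM src.people LIMIT 5").use { rows ->
                val columnCount = rows.metaData.columnCount
                while (rows.next()) {
                    println((1..columnCount).joinToString { rows.getString(it) ?: "null" })
                }
            }
        }
    }
}
```

The PostgreSQL variants should follow the same pattern with DuckDB's `postgres` extension and a connection string as the location. The resulting JDBC connection (or an Arrow export of the result) is what then gets handed to dataframe-jdbc or dataframe-arrow.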
It was interesting to see how powerful it can be to support an integration that acts as an "intermediate" layer, like DuckDB and/or Arrow. Fully supporting one of those could allow DataFrame to read from almost any data source while only having to maintain a single type mapping and reading mechanism.
Either way, while we can add full DuckDB support later, let's provide DuckDB -> dataframe-jdbc as an example project or notebook for now. People likely won't need to update their DataFrame version at all to use it, and DuckDB's extensions could allow connecting to even more unsupported databases without having to worry about database-specific SQL quirks (besides DuckDB's own). I'll add some tests to check whether all of DuckDB's types are supported.
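
For reference, the DuckDB -> dataframe-jdbc path might look roughly like this in an example project. This is a sketch under stated assumptions, not the code from the branch. It assumes the DuckDB JDBC driver and `dataframe-jdbc` are on the classpath, and that the installed `dataframe-jdbc` version accepts a `jdbc:duckdb:` connection. Versions that do not recognize the URL may reject it or need extra configuration. The `people` table is made up for illustration.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.print
import org.jetbrains.kotlinx.dataframe.io.readSqlQuery
import java.sql.DriverManager

fun main() {
    // In-memory DuckDB database; nothing is written to disk.
    DriverManager.getConnection("jdbc:duckdb:").use { connection ->
        connection.createStatement().use { statement ->
            // Hypothetical sample data, only here so the query has something to return.
            statement.execute("CREATE TABLE people (name VARCHAR, age INTEGER)")
            statement.execute("INSERT INTO people VALUES ('Alice', 30), ('Bob', 25)")
        }

        // Assumption: dataframe-jdbc can map DuckDB's column types for this connection.
        val df = DataFrame.readSqlQuery(connection, "SELECT name, age FROM people")
        df.print()
    }
}
```

Only simple column types are used here. More exotic DuckDB types (lists, structs, intervals, and so on) are exactly what the planned type-coverage tests would need to check.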