Conversation


@Alex-Monahan Alex-Monahan commented Dec 22, 2025

Hi folks!

This is a new PR meant to solve the same use case as #593, but addressing the PR feedback! Thank you for the guidance - it was super helpful. I am still open to any changes you recommend!

This adds 2 new tables to the DuckLake spec: ducklake_sort_info and ducklake_sort_expression.

ducklake_sort_info keeps a version history of the sort settings for tables over time: it has one row each time a new sort setting is applied to a table. (The prior PR used an option for this, but that has been removed based on the feedback.)

ducklake_sort_expression tracks the details of that sort. Each time a new sort setting is applied, this table gets one row per expression in the ORDER BY. (If I order by column3 ASC, column42 DESC, column64 ASC, there will be 3 rows.)
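To make the shape concrete, here's a rough sketch of how I picture the sort metadata being stored and queried, using Python's sqlite3 as a stand-in for the catalog database. The ducklake_sort_expression columns match the metadata dump later in this thread; the ducklake_sort_info columns are only a guess, not the actual DDL from the spec.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical shape of the version-history table: one row per time a
# table gets a new sort setting applied (column names are guesses).
con.execute("""
    CREATE TABLE ducklake_sort_info (
        table_id BIGINT,
        begin_snapshot BIGINT,
        end_snapshot BIGINT
    )""")
# One row per expression in the ORDER BY; these columns match the
# ducklake_sort_expression dump shown later in the thread.
con.execute("""
    CREATE TABLE ducklake_sort_expression (
        table_id BIGINT,
        begin_snapshot BIGINT,
        end_snapshot BIGINT,
        sort_key_index BIGINT,
        expression VARCHAR,
        sort_direction VARCHAR,
        null_order VARCHAR
    )""")
# ALTER TABLE ... SET SORTED BY (sort_key_1 DESC, sort_key_2 DESC)
# would translate to two rows, one per sort key:
rows = [
    (1, 5, None, 0, "sort_key_1", "DESC", "NULLS_LAST"),
    (1, 5, None, 1, "sort_key_2", "DESC", "NULLS_LAST"),
]
con.executemany(
    "INSERT INTO ducklake_sort_expression VALUES (?,?,?,?,?,?,?)", rows)
# The currently active sort for a table is the set of rows with no
# end_snapshot, ordered by sort_key_index:
active = con.execute("""
    SELECT expression, sort_direction, null_order
    FROM ducklake_sort_expression
    WHERE table_id = 1 AND end_snapshot IS NULL
    ORDER BY sort_key_index""").fetchall()
print(active)
```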

There are still a few limitations with this PR:

  1. This does not order during insert (only during compaction and inline flush).
    • I would love to try to do a follow up PR to add this!
  2. Only explicit column names can be used in the sorting, not expressions.
    • I tried to implement this but could not, so I could use some help with this piece!
    • There is a friendly error message (and tests) to document this limitation.
    • The spec has an expression column, so the intention was to make the spec itself forward-compatible with expression-oriented sorting.
  3. Files are still selected for compaction based on insertion order. It could be better to sort the list of files by min/max metadata before selecting files for compaction.
    • Let me know if this is desirable and I can work on it after the "order during insert" in number 1!
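For limitation 3, a sketch of what min/max-aware selection could look like (file names and stats fields here are made up, not the actual DuckLake metadata structures): sorting the candidate files by the min/max of the leading sort key groups files with overlapping ranges together before compaction, instead of taking them in insertion order.

```python
# Each candidate file carries per-column min/max stats from the metadata.
# Selecting files in (min, max) order instead of insertion order means
# compaction tends to merge files with overlapping key ranges, producing
# output that is already close to sorted.
files = [
    {"name": "f1.parquet", "min": 40, "max": 90},  # inserted first
    {"name": "f2.parquet", "min": 0,  "max": 10},
    {"name": "f3.parquet", "min": 5,  "max": 45},
]
by_insertion = [f["name"] for f in files]
by_min_key = [f["name"] for f in sorted(files, key=lambda f: (f["min"], f["max"]))]
print(by_insertion)  # ['f1.parquet', 'f2.parquet', 'f3.parquet']
print(by_min_key)    # ['f2.parquet', 'f3.parquet', 'f1.parquet']
```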

I believe that I made this fully compatible with the batching code, but I was testing locally on a DuckDB catalog and not on Postgres. Any extra eyes on that side would be great!

If this looks good, I can also do any docs PRs that you recommend - happy to help there.

Thanks folks! CC @philippmd as well as an FYI

@Alex-Monahan
Author

When investigating 3, I found that you are correct that I am currently incrementing the schema_version when altering the sort order. Dang! I think there are a few other cases where we probably want to have the same "alter table, but don't change the schema_version" behavior. I'm thinking:

  • SET_SORT_KEY
  • SET_COMMENT
  • SET_COLUMN_COMMENT

I'll start trying to implement that behavior, but let me know if I need to course correct! From my first prototype, it requires some additional complexity in a few spots (I think we will need a different construct than new_tables for storing those kinds of alters), so I'm open to feedback and ideas!

@pdet
Collaborator

pdet commented Jan 15, 2026

> When investigating 3, I found that you are correct that I am currently incrementing the schema_version when altering the sort order. Dang! I think there are a few other cases where we probably want to have the same "alter table, but don't change the schema_version" behavior. I'm thinking:
>
>   • SET_SORT_KEY
>   • SET_COMMENT
>   • SET_COLUMN_COMMENT
>
> I'll start trying to implement that behavior, but let me know if I need to course correct! From my first prototype, it requires some additional complexity in a few spots (I think we will need a different construct than new_tables for storing those kinds of alters), so I'm open to feedback and ideas!

Yes, I think none of those should impact the schema ranges for compaction. I think the main thing you have to do is to ensure that these statements won't write a new entry into ducklake_schema_versions, as this table keeps track of schema changes per ducklake table.
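To illustrate the rule as a sketch (the alter-type names come from the list above; the helper itself is hypothetical, not the extension's API):

```python
# Alters that only touch metadata (sort settings, comments) should not
# create a new entry in ducklake_schema_versions; alters that change the
# column shape (count/types) should, since that table gates which files
# may be compacted together.
METADATA_ONLY_ALTERS = {"SET_SORT_KEY", "SET_COMMENT", "SET_COLUMN_COMMENT"}

def writes_schema_version_entry(alter_type: str) -> bool:
    return alter_type not in METADATA_ONLY_ALTERS

print(writes_schema_version_entry("ADD_COLUMN"))    # True
print(writes_schema_version_entry("SET_SORT_KEY"))  # False
```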

@Alex-Monahan
Author

Yup, I will prevent that! Unfortunately, I'm also finding that I need to update the local schema cache (since usually a new schema version would be handled differently). I'm making progress!

@Alex-Monahan Alex-Monahan marked this pull request as ready for review January 16, 2026 03:44
@Alex-Monahan
Author

Alex-Monahan commented Jan 16, 2026

I now have the schema_version not incrementing on SET_SORT_KEY, SET_COMMENT, and SET_COLUMN_COMMENT!

I split the tests apart and also added tests for 1, 2, 3, and 5! I added quite a lot of tests to ensure transactionality for 5. Since an ALTER TABLE can't happen in the same transaction as a compaction (we get the deliberate `Transactions can either make changes OR perform compaction - not both` error), there are fewer cases on the compaction side than on the inlining side.

I also added tests for the schema_version non-incrementing for the comment cases.

Let me know what you think!

@Alex-Monahan Alex-Monahan marked this pull request as draft January 16, 2026 03:57
@Alex-Monahan
Author

Alex-Monahan commented Jan 16, 2026

So, to fix the assertion issues on my fork's CI, I had to relax an assertion. Please let me know if I am off base, but I think that the assertion was too tight.

In src/storage/ducklake_catalog.cpp lines 439-461, the table_entry_map can have views added to it. However, there was an assertion in DuckLakeCatalogSet::GetEntryById(TableIndex index) that required a table (and not a view). Allowing a view there appears to solve things.

@Alex-Monahan Alex-Monahan marked this pull request as ready for review January 16, 2026 04:19
@Alex-Monahan
Author

As I am thinking more about this, I'm wondering if the updates I made to the catalog cache would be safe across processes. Is it safe to not update the schema_version? Could other processes use a stale sort order, or reset it back to being unsorted?

@pdet
Collaborator

pdet commented Jan 19, 2026

> So, to fix the assertion issues on my fork's CI, I had to relax an assertion. Please let me know if I am off base, but I think that the assertion was too tight.
>
> In src/storage/ducklake_catalog.cpp lines 439-461, the table_entry_map can have views added to it. However, there was an assertion in DuckLakeCatalogSet::GetEntryById(TableIndex index) that required a table (and not a view). Allowing a view there appears to solve things.

Could you point me to which test broke this requirement?

From what I can tell, there are other parts of the code that require this to return a table, as we dereference it to a DuckLakeTableEntry.

e.g.,

unique_ptr<DuckLakeStats> DuckLakeCatalog::ConstructStatsMap(vector<DuckLakeGlobalStatsInfo> &global_stats,
                                                             DuckLakeCatalogSet &schema) {
	auto lake_stats = make_uniq<DuckLakeStats>();
	for (auto &stats : global_stats) {
		// find the referenced table entry
		auto table_entry = schema.GetEntryById(stats.table_id);

@pdet
Collaborator

pdet commented Jan 19, 2026

> As I am thinking more about this, I'm wondering if the updates I made to the catalog cache would be safe across processes. Is it safe to not update the schema_version? Could other processes use a stale sort order, or reset it back to being unsorted?

Maybe this is something we can test with concurrentloops? (see test/sql/snapshot_info/ducklake_last_commit.test)

@Alex-Monahan
Author

> So, to fix the assertion issues on my fork's CI, I had to relax an assertion. Please let me know if I am off base, but I think that the assertion was too tight.
>
> In src/storage/ducklake_catalog.cpp lines 439-461, the table_entry_map can have views added to it. However, there was an assertion in DuckLakeCatalogSet::GetEntryById(TableIndex index) that required a table (and not a view). Allowing a view there appears to solve things.
>
> Could you point me to which test broke this requirement?
>
> From what I can tell, there are other parts of the code that require this to return a table, as we dereference it to a DuckLakeTableEntry.
>
> e.g.,
>
> unique_ptr<DuckLakeStats> DuckLakeCatalog::ConstructStatsMap(vector<DuckLakeGlobalStatsInfo> &global_stats,
>                                                              DuckLakeCatalogSet &schema) {
> 	auto lake_stats = make_uniq<DuckLakeStats>();
> 	for (auto &stats : global_stats) {
> 		// find the referenced table entry
> 		auto table_entry = schema.GetEntryById(stats.table_id);

Sure! The tests that broke are here in this CI run on my fork. They were all running a query like

COMMENT ON VIEW ducklake.comment_view IS 'con1';

I can create a view-specific GetEntryById function if that would be better!

@Alex-Monahan
Author

Alex-Monahan commented Jan 19, 2026

> As I am thinking more about this, I'm wondering if the updates I made to the catalog cache would be safe across processes. Is it safe to not update the schema_version? Could other processes use a stale sort order, or reset it back to being unsorted?
>
> Maybe this is something we can test with concurrentloops? (see test/sql/snapshot_info/ducklake_last_commit.test)

Unfortunately, I believe that concurrentloop uses the same DuckDB instance with multiple threads, and the DuckLake catalog is only created once per ATTACH, so it is shared across all threads. I think the issue that might exist from no longer incrementing schema_version would be when two totally separate DuckLakeCatalog instances have different sort information in their cache (with the same schema_version). Is there a multi-process version of concurrentloop? Or maybe a C++ or Python test? Do you want me to remove the schema_version modifications and save them for a later PR?

@pdet
Collaborator

pdet commented Jan 19, 2026

> As I am thinking more about this, I'm wondering if the updates I made to the catalog cache would be safe across processes. Is it safe to not update the schema_version? Could other processes use a stale sort order, or reset it back to being unsorted?
>
> Maybe this is something we can test with concurrentloops? (see test/sql/snapshot_info/ducklake_last_commit.test)
>
> Unfortunately, I believe that concurrentloop uses the same DuckDB instance with multiple threads, and the DuckLake catalog is only created once per ATTACH, so it is shared across all threads. I think the issue that might exist from no longer incrementing schema_version would be when two totally separate DuckLakeCatalog instances have different sort information in their cache (with the same schema_version). Is there a multi-process version of concurrentloop? Or maybe a C++ or Python test? Do you want me to remove the schema_version modifications and save them for a later PR?

Could we achieve this with multiple connections then? Because that's also possible within sqltests

@Alex-Monahan
Author

> As I am thinking more about this, I'm wondering if the updates I made to the catalog cache would be safe across processes. Is it safe to not update the schema_version? Could other processes use a stale sort order, or reset it back to being unsorted?
>
> Maybe this is something we can test with concurrentloops? (see test/sql/snapshot_info/ducklake_last_commit.test)
>
> Unfortunately, I believe that concurrentloop uses the same DuckDB instance with multiple threads, and the DuckLake catalog is only created once per ATTACH, so it is shared across all threads. I think the issue that might exist from no longer incrementing schema_version would be when two totally separate DuckLakeCatalog instances have different sort information in their cache (with the same schema_version). Is there a multi-process version of concurrentloop? Or maybe a C++ or Python test? Do you want me to remove the schema_version modifications and save them for a later PR?
>
> Could we achieve this with multiple connections then? Because that's also possible within sqltests

I am not sure! If you have a spot where I can find an example, I can give it a shot.

To understand the behavior, I made a Python script that kicks off 2 separate CLI processes. I found that the sort is ignored by the other process if the schema was already cached ahead of time. The good news is that the catalog DB itself continues to have the right values, but the cache does not get invalidated correctly.

The flow is:

  • Process 1: connects, creates the table and inserts into it
  • Process 2: connects and runs an ALTER TABLE ADD COLUMN, which caches the catalog
  • Process 1: ALTER TABLE SET SORTED BY
  • Process 1: Completes / exits
  • Process 2: Compacts (using the cached catalog)
  • Process 2: Pulls updated table (which does not show the right order, since the cached catalog was used)
  • Process 2: Completes / exits

If I omit the ALTER TABLE ADD COLUMN step in process 2, then there is no issue and the sort occurs correctly.

What do you recommend I do? I've thought about 3 options, but 3 would need some help!

  1. Keep the schema_version from incrementing, but accept this concurrency behavior.
  2. Allow the schema_version to increment, but accept that a compaction barrier gets put in when the sort is changed.
  3. Keep the schema_version from incrementing, but find some other way to correctly invalidate the cache (or use a different key for the cache).
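For option 3, one hedged sketch is to key the catalog cache on a pair like (schema_version, sort_version) instead of schema_version alone, so a sort change invalidates cached entries without bumping the data schema. All names here are hypothetical, not the extension's actual cache:

```python
# A toy catalog cache keyed on (schema_version, sort_version). Bumping
# sort_version on ALTER TABLE ... SET SORTED BY changes the key, which
# forces a reload without touching the data-schema version.
cache = {}

def get_table(schema_version: int, sort_version: int) -> str:
    key = (schema_version, sort_version)
    if key not in cache:
        cache[key] = f"loaded with schema v{schema_version}, sort v{sort_version}"
    return cache[key]

a = get_table(2, 0)  # a process caches the table before the sort change
b = get_table(2, 1)  # after SET SORTED BY: new key, so a fresh load
print(a == b)        # False: the stale entry is not reused
```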

ducklake_set_sorted_multiprocess_add_column.py

The logs for process 1 get printed before those for process 2, but I logged some timestamps to show the true order:

uv run ./test/ducklake_set_sorted_multiprocess_add_column.py
┌─────────┬──────────────────────┐
│ process │     sum("range")     │
│ varchar │        int128        │
├─────────┼──────────────────────┤
│ sql_1   │ 12499999997500000000 │
└─────────┴──────────────────────┘
┌─────────┬─────────────────────────────────────────────────────────────┐
│ process │ (CAST(now() AS VARCHAR) || ' sql_1 finished SET SORTED BY') │
│ varchar │                           varchar                           │
├─────────┼─────────────────────────────────────────────────────────────┤
│ sql_1   │ 2026-01-19 19:16:39.22008-07 sql_1 finished SET SORTED BY   │
└─────────┴─────────────────────────────────────────────────────────────┘


┌─────────┬─────────────────────┐
│ process │    sum("range")     │
│ varchar │       int128        │
├─────────┼─────────────────────┤
│ sql_2   │ 1999999999000000000 │
└─────────┴─────────────────────┘
┌─────────┬─────────────────────────────────────────────────────────────┐
│ process │ (CAST(now() AS VARCHAR) || ' sql_2 finished adding column') │
│ varchar │                           varchar                           │
├─────────┼─────────────────────────────────────────────────────────────┤
│ sql_2   │ 2026-01-19 19:16:35.416129-07 sql_2 finished adding column  │
└─────────┴─────────────────────────────────────────────────────────────┘
┌─────────┬──────────────────────┐
│ process │     sum("range")     │
│ varchar │        int128        │
├─────────┼──────────────────────┤
│ sql_2   │ 40499999995500000000 │
└─────────┴──────────────────────┘
┌─────────┬───────────────────────────────────────────────────────┐
│ process │ (CAST(now() AS VARCHAR) || ' sql_2 about to compact') │
│ varchar │                        varchar                        │
├─────────┼───────────────────────────────────────────────────────┤
│ sql_2   │ 2026-01-19 19:16:46.373997-07 sql_2 about to compact  │
└─────────┴───────────────────────────────────────────────────────┘
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows  │
└─────────┘
┌─────────┬───────────┬────────────┬────────────┬─────────┐
│ process │ unique_id │ sort_key_1 │ sort_key_2 │  bonus  │
│ varchar │   int64   │   int64    │  varchar   │ varchar │
├─────────┼───────────┼────────────┼────────────┼─────────┤
│ sql_2   │         3 │          1 │ woot3      │ NULL    │
│ sql_2   │         2 │          0 │ woot2      │ NULL    │
│ sql_2   │         1 │          1 │ woot1      │ NULL    │
│ sql_2   │         0 │          0 │ woot0      │ NULL    │
│ sql_2   │         7 │          1 │ woot7      │ NULL    │
│ sql_2   │         6 │          0 │ woot6      │ NULL    │
│ sql_2   │         5 │          1 │ woot5      │ NULL    │
│ sql_2   │         4 │          0 │ woot4      │ NULL    │
└─────────┴───────────┴────────────┴────────────┴─────────┘
┌─────────┬─────────────┬────────────────┬────────────────────────────────────────────┐
│ process │ snapshot_id │ schema_version │                  changes                   │
│ varchar │    int64    │     int64      │          map(varchar, varchar[])           │
├─────────┼─────────────┼────────────────┼────────────────────────────────────────────┤
│ sql_2   │           0 │              0 │ {schemas_created=[main]}                   │
│ sql_2   │           1 │              1 │ {tables_created=[main.sort_on_compaction]} │
│ sql_2   │           2 │              1 │ {tables_inserted_into=[1]}                 │
│ sql_2   │           3 │              1 │ {tables_inserted_into=[1]}                 │
│ sql_2   │           4 │              2 │ {tables_altered=[1]}                       │
│ sql_2   │           5 │              2 │ {tables_altered=[1]}                       │
│ sql_2   │           6 │              2 │ {}                                         │
└─────────┴─────────────┴────────────────┴────────────────────────────────────────────┘
┌─────────┬──────────┬────────────────┬──────────────┬────────────────┬────────────┬────────────────┬────────────┐
│ process │ table_id │ begin_snapshot │ end_snapshot │ sort_key_index │ expression │ sort_direction │ null_order │
│ varchar │  int64   │     int64      │    int64     │     int64      │  varchar   │    varchar     │  varchar   │
├─────────┼──────────┼────────────────┼──────────────┼────────────────┼────────────┼────────────────┼────────────┤
│ sql_2   │        1 │              5 │         NULL │              0 │ sort_key_1 │ DESC           │ NULLS_LAST │
│ sql_2   │        1 │              5 │         NULL │              1 │ sort_key_2 │ DESC           │ NULLS_LAST │
└─────────┴──────────┴────────────────┴──────────────┴────────────────┴────────────┴────────────────┴────────────┘

@pdet
Collaborator

pdet commented Jan 21, 2026

Hi @Alex-Monahan, I had another pass on your PR, and it is looking great! I think there was a slight miscommunication wrt the snapshot changes. What I meant is that sort/comment changes, etc. should not impact the ducklake_schema_versions table, because that table is used to ensure that compaction only combines data written under the same data schema (the same number and types of columns).

@Alex-Monahan
Author

Alex-Monahan commented Jan 21, 2026

> Hi @Alex-Monahan, I had another pass on your PR, and it is looking great! I think there was a slight miscommunication wrt the snapshot changes. What I meant is that sort/comment changes, etc. should not impact the ducklake_schema_versions table, because that table is used to ensure that compaction only combines data written under the same data schema (the same number and types of columns).

Hmm, so it is ok if they increment the schema_version inside of ducklake_snapshot, just not ducklake_schema_versions? I'm having trouble detangling how to only prevent updating the schema_version selectively.

@pdet
Collaborator

pdet commented Jan 21, 2026

> Hi @Alex-Monahan, I had another pass on your PR, and it is looking great! I think there was a slight miscommunication wrt the snapshot changes. What I meant is that sort/comment changes, etc. should not impact the ducklake_schema_versions table, because that table is used to ensure that compaction only combines data written under the same data schema (the same number and types of columns).
>
> Hmm, so it is ok if they increment the schema_version inside of ducklake_snapshot, just not ducklake_schema_versions? I'm having trouble detangling how to only prevent updating the schema_version selectively.

I think it's fine that they increase the global counter of the schema_version. I'm really sorry for the confusion and added work!
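So, restating the final agreement as a sketch (table shapes heavily simplified; sqlite3 stands in for the catalog database): every commit may still bump the global schema_version in ducklake_snapshot, but only column-shape changes append a row to ducklake_schema_versions, which is what gates compaction.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ducklake_snapshot (snapshot_id BIGINT, schema_version BIGINT)")
con.execute("CREATE TABLE ducklake_schema_versions (begin_snapshot BIGINT, schema_version BIGINT)")

def commit_snapshot(con, snapshot_id, schema_version, changes_column_shape):
    # Every commit writes a snapshot row, bumping the global counter freely...
    con.execute("INSERT INTO ducklake_snapshot VALUES (?, ?)",
                (snapshot_id, schema_version))
    # ...but only column-shape changes record a new schema range, which
    # decides which files share a data schema and may be compacted together.
    if changes_column_shape:
        con.execute("INSERT INTO ducklake_schema_versions VALUES (?, ?)",
                    (snapshot_id, schema_version))

commit_snapshot(con, 1, 1, changes_column_shape=True)   # e.g. CREATE TABLE
commit_snapshot(con, 4, 2, changes_column_shape=False)  # e.g. SET SORTED BY
print(con.execute("SELECT count(*) FROM ducklake_snapshot").fetchone()[0])         # 2
print(con.execute("SELECT count(*) FROM ducklake_schema_versions").fetchone()[0])  # 1
```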
