Conversation


@Angith Angith commented Sep 1, 2025

What this PR does:

Which issue(s) this PR fixes:
Fixes #6941

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@Angith Angith changed the title Parquetconverter/add sort columns feat(parquetconverter): add support for additional sort columns during Parquet file generation Sep 1, 2025

Angith commented Sep 1, 2025

Hi @yeya24

I’ve raised this PR to add support for additional sort columns during Parquet file generation. A few points I wanted to clarify:

  • As per the contributing guide, I ran make doc, but it ended up deleting some other important documents. I also ran goimports, but it reformatted ~700 files, including files in the vendor folder. I assume these shouldn’t be committed, so I left those changes out. Could you please confirm the expected workflow here?
  • I’ve written unit tests for the new functionality and verified them, but I haven’t yet tested the change end-to-end. Could you guide me on how to run Cortex’s full end-to-end test suite to validate this change?

Thanks in advance for your guidance 🙏


Angith commented Sep 7, 2025

Hi @yeya24, I’ve made updates to address the CI failure. When you get a chance, could you approve the workflow?


@yeya24 yeya24 left a comment


Sorry for the late review. Thanks for the contribution and I think this change looks great!

Just some comments about the configuration: if we already have the sorting columns configurable as limits, we don't need the same config in the parquet converter anymore.

@@ -109,6 +111,7 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
f.IntVar(&cfg.MaxRowsPerRowGroup, "parquet-converter.max-rows-per-row-group", 1e6, "Maximum number of time series per parquet row group. Larger values improve compression but may reduce performance during reads.")
f.DurationVar(&cfg.ConversionInterval, "parquet-converter.conversion-interval", time.Minute, "How often to check for new TSDB blocks to convert to parquet format.")
f.BoolVar(&cfg.FileBufferEnabled, "parquet-converter.file-buffer-enabled", true, "Enable disk-based write buffering to reduce memory consumption during parquet file generation.")
f.Var((*flagext.StringSlice)(&cfg.AdditionalSortColumns), "parquet-converter.additional-sort-columns", "Configure the additional sort columns, in order of precedence, to improve query performance. These will be applied during parquet file generation.")

We don't need this config, as we can just use the config specified in the limits. That limit can specify default values as well.

if len(cfg.AdditionalSortColumns) > 0 {
sortColumns = append(sortColumns, cfg.AdditionalSortColumns...)
}
cfg.AdditionalSortColumns = sortColumns

We can maybe just remove the sorting column from the base converter options if we have the limits.

@@ -430,6 +440,13 @@ func (c *Converter) convertUser(ctx context.Context, logger log.Logger, ring rin

converterOpts := append(c.baseConverterOptions, convert.WithName(b.ULID.String()))

userConfiguredSortColumns := c.limits.ParquetConverterSortColumns(userID)
if len(userConfiguredSortColumns) > 0 {

We can remove the `if` and always use `append(sortColumns, userConfiguredSortColumns...)` as the sorting columns.

@@ -280,6 +287,7 @@ cortex_parquet_queryable_cache_misses_total
1. **Row Group Size**: Adjust `max_rows_per_row_group` based on your query patterns
2. **Cache Size**: Tune `parquet_queryable_shard_cache_size` based on available memory
3. **Concurrency**: Adjust `meta_sync_concurrency` based on object storage performance
4. **Sort Columns**: Configure `sort_columns` based on your most common query filters to improve query performance

Let's use the full config name `parquet_converter_sort_columns`.
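Under that naming, a per-tenant override might look like the following sketch of a Cortex runtime overrides file (the tenant ID and label names are illustrative assumptions, not values from this PR):

```yaml
# Hypothetical runtime overrides using the full limit name
# suggested in the review.
overrides:
  tenant-a:
    parquet_converter_sort_columns:
      - namespace
      - pod
```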


Angith commented Sep 15, 2025

Hi @yeya24, let me work on the comments and get back to you shortly.

Successfully merging this pull request may close these issues.

Allow configuring sorting labels for Parquet Converter