feat(parquetconverter): add support for additional sort columns during Parquet file generation #7003
…g Parquet file generation Signed-off-by: Angith <[email protected]>
…Parquet file generation Signed-off-by: Angith <[email protected]>
Hi @yeya24 I’ve raised this PR to add support for additional sort columns during Parquet file generation. A few points I wanted to clarify:
Thanks in advance for your guidance 🙏
…er/add-sort-columns
…lumns Signed-off-by: Angith <[email protected]>
Hi @yeya24, I’ve made updates to address the CI failure. When you get a chance, could you approve the workflow?
Sorry for the late review. Thanks for the contribution and I think this change looks great!
Just some comments about the configuration. If the sorting columns are already configurable as limits, we don't need the same config in the parquet converter anymore.
@@ -109,6 +111,7 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
f.IntVar(&cfg.MaxRowsPerRowGroup, "parquet-converter.max-rows-per-row-group", 1e6, "Maximum number of time series per parquet row group. Larger values improve compression but may reduce performance during reads.")
f.DurationVar(&cfg.ConversionInterval, "parquet-converter.conversion-interval", time.Minute, "How often to check for new TSDB blocks to convert to parquet format.")
f.BoolVar(&cfg.FileBufferEnabled, "parquet-converter.file-buffer-enabled", true, "Enable disk-based write buffering to reduce memory consumption during parquet file generation.")
f.Var((*flagext.StringSlice)(&cfg.AdditionalSortColumns), "parquet-converter.additional-sort-columns", "Configure the additional sort columns, in order of precedence, to improve query performance. These will be applied during parquet file generation.")
We don't need this config, as we can just use the one specified in the limits. That limit can specify default values as well.
if len(cfg.AdditionalSortColumns) > 0 {
	sortColumns = append(sortColumns, cfg.AdditionalSortColumns...)
}
cfg.AdditionalSortColumns = sortColumns
Since the limits already carry the sort columns, we can probably just remove the sorting columns from the base converter options.
@@ -430,6 +440,13 @@ func (c *Converter) convertUser(ctx context.Context, logger log.Logger, ring rin

converterOpts := append(c.baseConverterOptions, convert.WithName(b.ULID.String()))

userConfiguredSortColumns := c.limits.ParquetConverterSortColumns(userID)
if len(userConfiguredSortColumns) > 0 {
We can remove the `if` and always use `append(sortColumns, userConfiguredSortColumns...)` as sorting columns.
@@ -280,6 +287,7 @@ cortex_parquet_queryable_cache_misses_total
1. **Row Group Size**: Adjust `max_rows_per_row_group` based on your query patterns
2. **Cache Size**: Tune `parquet_queryable_shard_cache_size` based on available memory
3. **Concurrency**: Adjust `meta_sync_concurrency` based on object storage performance
4. **Sort Columns**: Configure `sort_columns` based on your most common query filters to improve query performance
Let's use the full config name, `parquet_converter_sort_columns`.
Hi @yeya24, let me work on the comments and get back to you shortly.
What this PR does:
Which issue(s) this PR fixes:
Fixes #6941
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]