@danieloliveira-shopify commented Aug 8, 2025

Important

This is a work in progress: I am investigating an issue that only shows up in versions after v0.62.0.
TL;DR: Column values get mixed up and land in the wrong database column, i.e. a value from input column A ends up in column B. This does not happen for every table or every write; it usually appears when the input block is large, e.g. over 900k rows.

Summary

Problem: When streaming large datasets using OnInput callbacks, columns could get mixed up (column A receiving values intended for column B). This affected production workloads with complex column types like Map(String, String).
Root Cause: Column type inference was applied only once at the start, but OnInput callbacks reset column data between blocks, causing subsequent blocks to use stale type information.
Solution: Extract the inference logic into an applyInference() function and re-apply it for each block in the streaming loop, ensuring fresh type information while preserving column order.
Testing: Added a comprehensive test suite, including production-scale validation (900k rows) and Map column integration tests.
Impact: Fixes column mixing in high-volume streaming scenarios while maintaining backward compatibility and performance.
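
For context, this is the streaming pattern that triggers the bug. A minimal sketch assuming the ch-go client API; the table name, column names, and data are illustrative:

```go
package main

import (
	"context"
	"io"

	"github.com/ClickHouse/ch-go"
	"github.com/ClickHouse/ch-go/proto"
)

func main() {
	ctx := context.Background()
	conn, err := ch.Dial(ctx, ch.Options{Address: "localhost:9000"})
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	var colA, colB proto.ColStr
	input := proto.Input{
		{Name: "a", Data: &colA},
		{Name: "b", Data: &colB},
	}

	blocks := 0
	if err := conn.Do(ctx, ch.Query{
		Body:  input.Into("example_table"), // generates the INSERT statement
		Input: input,
		// OnInput is called to fill each block of the stream.
		OnInput: func(ctx context.Context) error {
			input.Reset() // columns are reused, so reset them between blocks
			if blocks >= 10 {
				return io.EOF // stop streaming
			}
			for i := 0; i < 100_000; i++ {
				colA.Append("value-for-a")
				colB.Append("value-for-b")
			}
			blocks++
			return nil
		},
	}); err != nil {
		panic(err)
	}
}
```

Each OnInput call resets the shared columns and appends the next block; under the bug, blocks after the first could be encoded with stale inferred type information, which is how values from one column ended up in another.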

CHANGELOG Description
Fixed a column-mixing bug in high-volume streaming scenarios

When streaming large datasets to ClickHouse using the OnInput callback, columns could get mixed up where column A would receive values intended for column B. This was particularly noticeable with complex column types like Map(String, String) and affected production workloads with 900k+ rows.
The issue was caused by column type inference being applied only once at the beginning of the streaming process. When using OnInput callbacks, subsequent blocks would use stale type information, leading to column order corruption.

Changes:
Extract the column inference logic into an applyInference() function
Re-apply inference for each block in the streaming loop (see the sketch after this list)
Preserve column order by processing input columns in their original order
Add a comprehensive test suite, including production-scale validation
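
For illustration, here is the shape of the fixed loop. applyInference and sendInput are the names used in this PR; the Client receiver, the encodeBlock helper, and the exact control flow are simplified assumptions, not the actual source:

```go
// Sketch only: applyInference and sendInput are named in this PR, but
// encodeBlock and the surrounding control flow are simplified assumptions.
func (c *Client) sendInput(ctx context.Context, q Query) error {
	for {
		// Before the fix, inference ran once ahead of this loop; since
		// OnInput resets the columns between blocks, later blocks were
		// encoded with stale type information and values could land in
		// the wrong columns.
		//
		// After the fix, inference is re-applied per block, walking
		// q.Input in its original order so column order is preserved.
		if err := applyInference(q.Input); err != nil {
			return err
		}
		if err := c.encodeBlock(ctx, q.Input); err != nil {
			return err
		}
		if q.OnInput == nil {
			return nil // single block, nothing left to stream
		}
		if err := q.OnInput(ctx); err != nil {
			if errors.Is(err, io.EOF) {
				return nil // callback signalled end of stream
			}
			return err
		}
	}
}
```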

Impact:
Fixes column mixing in high-volume streaming scenarios
Maintains backward compatibility - no API changes
Preserves performance with minimal overhead
Validated with production-scale tests (900k rows, Map columns)

Files Changed:
query.go: Fixed the sendInput function to re-apply inference per block
query_test.go: Added a comprehensive test suite for validation (a regression-test sketch follows below)
This fix ensures that column type inference is applied fresh for each block, preventing data corruption in high-volume streaming scenarios while maintaining the existing API and performance characteristics.
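
A regression test along these lines can catch the mixing. This is a sketch only: the table name, block count, and row scale are illustrative, and it assumes a reachable ClickHouse at localhost:9000:

```go
package ch_test

import (
	"context"
	"io"
	"testing"

	"github.com/ClickHouse/ch-go"
	"github.com/ClickHouse/ch-go/proto"
)

func TestStreamingPreservesColumnOrder(t *testing.T) {
	ctx := context.Background()
	conn, err := ch.Dial(ctx, ch.Options{Address: "localhost:9000"})
	if err != nil {
		t.Skip("ClickHouse not available:", err)
	}
	defer conn.Close()

	for _, ddl := range []string{
		"DROP TABLE IF EXISTS test_col_order",
		"CREATE TABLE test_col_order (a String, b String) ENGINE = Memory",
	} {
		if err := conn.Do(ctx, ch.Query{Body: ddl}); err != nil {
			t.Fatal(err)
		}
	}

	// Stream 1M rows (5 blocks of 200k) with distinguishable values
	// per column.
	var colA, colB proto.ColStr
	input := proto.Input{
		{Name: "a", Data: &colA},
		{Name: "b", Data: &colB},
	}
	blocks := 0
	if err := conn.Do(ctx, ch.Query{
		Body:  input.Into("test_col_order"),
		Input: input,
		OnInput: func(ctx context.Context) error {
			input.Reset()
			if blocks >= 5 {
				return io.EOF
			}
			for i := 0; i < 200_000; i++ {
				colA.Append("A")
				colB.Append("B")
			}
			blocks++
			return nil
		},
	}); err != nil {
		t.Fatal(err)
	}

	// Every value must be in the column it was appended to.
	var mixed proto.ColUInt64
	if err := conn.Do(ctx, ch.Query{
		Body:   "SELECT count() FROM test_col_order WHERE a != 'A' OR b != 'B'",
		Result: proto.Results{{Name: "count()", Data: &mixed}},
	}); err != nil {
		t.Fatal(err)
	}
	if mixed.Row(0) != 0 {
		t.Fatalf("found %d rows with mixed columns", mixed.Row(0))
	}
}
```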

Checklist

Delete items not relevant to your PR:

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided to include in CHANGELOG
  • For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials
