[WIP] Fix query streaming block #1079
> [!IMPORTANT]
> This is a work in progress: I am investigating an issue that is only noticeable in versions after v0.62.0.
>
> TL;DR: Column values get mixed up and end up in the wrong database column, i.e. a value from input column A lands in column B. This does not happen on every table or every write, but typically when the input block is large (over ~900k rows).
## Summary
**Problem:** When streaming large datasets using `OnInput` callbacks, columns could get mixed up (column A receiving values intended for column B). This affected production workloads with complex column types such as `Map(String, String)`. The streaming pattern involved is sketched below.

**Root Cause:** Column type inference was applied only once, at the start of the query, but `OnInput` callbacks reset column data between blocks, causing subsequent blocks to be encoded with stale type information.

**Solution:** Extract the inference logic into an `applyInference()` function and re-apply it for each block in the streaming loop, ensuring fresh type information while preserving column order.

**Testing:** Added a comprehensive test suite, including production-scale validation (900k rows) and `Map` column integration tests.

**Impact:** Fixes column mixing in high-volume streaming scenarios while maintaining backward compatibility and performance.
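
For context, here is a minimal sketch of the kind of `OnInput` streaming insert affected by this bug. It assumes the `ch.Query`/`proto.Input` style API; the table name, column names, and row counts are illustrative only:

```go
package main

import (
	"context"
	"io"

	"github.com/ClickHouse/ch-go"
	"github.com/ClickHouse/ch-go/proto"
)

func main() {
	ctx := context.Background()
	client, err := ch.Dial(ctx, ch.Options{})
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// One plain column plus a Map(String, String) column, the type that
	// most visibly exhibited the mix-up.
	body := new(proto.ColStr)
	attrs := proto.NewMap[string, string](new(proto.ColStr), new(proto.ColStr))
	input := proto.Input{
		{Name: "body", Data: body},
		{Name: "attrs", Data: attrs},
	}

	blocks := 0
	err = client.Do(ctx, ch.Query{
		Body:  input.Into("logs"), // generates the INSERT statement
		Input: input,
		OnInput: func(ctx context.Context) error {
			// Called to prepare each block; columns must be reset
			// manually, which is what invalidated the one-shot type
			// inference before this fix.
			input.Reset()
			if blocks >= 10 {
				return io.EOF // stop streaming
			}
			for i := 0; i < 100_000; i++ {
				body.Append("message")
				attrs.Append(map[string]string{"k": "v"})
			}
			blocks++
			return nil
		},
	})
	if err != nil {
		panic(err)
	}
}
```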
## CHANGELOG Description
**Fixed a column-mixing bug in high-volume streaming scenarios.**

When streaming large datasets to ClickHouse using the `OnInput` callback, columns could get mixed up, with column A receiving values intended for column B. This was particularly noticeable with complex column types such as `Map(String, String)` and affected production workloads with 900k+ rows.

The issue was caused by column type inference being applied only once, at the beginning of the streaming process. When using `OnInput` callbacks, subsequent blocks would use stale type information, corrupting the column order.
**Changes:**
- Extract the column inference logic into an `applyInference()` function
- Re-apply inference for each block in the streaming loop (see the sketch below)
- Preserve column order by processing input columns in their original order
- Add a comprehensive test suite, including production-scale validation
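
The shape of the fix, using the function names from this PR (`sendInput`, `applyInference`), is sketched below. The sketch is illustrative only, with hypothetical stand-ins for the client's internal plumbing; it is not the actual diff:

```go
package sketch

import (
	"context"
	"errors"
	"io"

	"github.com/ClickHouse/ch-go/proto"
)

// applyInference re-derives type information for every input column,
// walking the columns in their original order so the name-to-type
// mapping cannot drift between blocks.
func applyInference(input []proto.InputColumn) error {
	for i := range input {
		_ = input[i] // refresh the inferred type info for this column
	}
	return nil
}

// sendInput streams input blocks. The onInput and sendBlock parameters
// are hypothetical stand-ins for the client's internals.
func sendInput(
	ctx context.Context,
	input []proto.InputColumn,
	onInput func(ctx context.Context) error,
	sendBlock func([]proto.InputColumn) error,
) error {
	for {
		// Before this fix, inference ran once ahead of the loop; after an
		// OnInput callback reset the columns, later blocks were encoded
		// with stale types and values could land in the wrong column.
		if err := applyInference(input); err != nil {
			return err
		}
		if err := sendBlock(input); err != nil {
			return err
		}
		if onInput == nil {
			return nil // single block, nothing more to stream
		}
		if err := onInput(ctx); err != nil {
			if errors.Is(err, io.EOF) {
				return nil // callback signaled end of stream
			}
			return err
		}
	}
}
```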
**Impact:**
- Fixes column mixing in high-volume streaming scenarios
- Maintains backward compatibility: no API changes
- Preserves performance, with minimal overhead
- Validated with production-scale tests (900k rows, `Map` columns)
**Files Changed:**
- `query.go`: fixed the `sendInput` function to re-apply inference per block
- `query_test.go`: added a comprehensive test suite for validation
This fix ensures that column type inference is applied fresh for each block, preventing data corruption in high-volume streaming scenarios while maintaining the existing API and performance characteristics.