kafka: various test fixes#4301
Conversation
Partition consumer goroutines were spawned before k.msgChan was assigned, causing a data race between the goroutine reads (line 104) and the subsequent write (line 304). Move the assignment before the spawn loop. Fixes CON-407
TestRedpandaRecordOrderSoakTest had a hardcoded 3-minute soak duration but didn't account for time already consumed by other tests in the package. When run with a 5-minute timeout alongside other kafka integration tests, the test was killed before it could finish. The soak duration now adapts to the remaining time budget from t.Deadline(), reserving 2 minutes for container setup/teardown. If insufficient time remains (<30s for soaking), the test skips with a clear message. Fixes CON-413
KafkaSeedBroker() returns localhost:<port> which fails when the system DNS resolver cannot resolve "localhost" (e.g. under Docker networking issues). Use 127.0.0.1 to bypass DNS resolution entirely. Fixes CON-425
range_of_partitions was nested inside explicit_partitions as a child t.Run. Go's test runner releases PAUSEd subtests only when the parent function returns, so the direct explicit_partitions subtests were blocked until range_of_partitions completed (~19s), leaving them with a tight remaining deadline that caused flaky 10s timeouts. Moving range_of_partitions to a sibling t.Run ensures both groups get their own share of the parent deadline. Fixes CON-426
TestRedpandaIntegration runs 5 sequential suite groups that each include 1000-message sequential and parallel tests. Later groups (especially manual_partitioner) start with a nearly exhausted parent test deadline because preceding groups consume most of the available time. Partition-specific groups only need to validate routing correctness, not throughput. Reduce their message count from 1000 to 100, keeping the top-level suite at 1000 for thorough coverage. Fixes CON-427
Lower batches from 1M to 100K (still 1M messages) so the test completes in ~2s instead of ~20s, avoiding timeout when CPU-starved by long-running integration tests in the same package. Fixes CON-431
TestIntegrationUnordered runs 4 sequential suite groups that each include 1000-message sequential and parallel tests. Later groups (especially manual_partitioner) start with a nearly exhausted parent test deadline because preceding groups consume most of the available time, causing the redpanda container to become unresponsive under sustained load. Partition-specific groups only need to validate routing correctness, not throughput. Reduce their message count from 1000 to 100, keeping the top-level suite at 1000 for thorough coverage. Mirrors the same fix applied to TestRedpandaIntegration in edb48668f. Fixes CON-439
TestIntegrationUnordered's manual_partitioner group did not call t.Parallel() like its sibling subtests, so it ran sequentially after the top-level suite, adding to the total elapsed time of the kafka integration package. The go test binary-level 5-minute timeout would fire before TestIntegrationUnordered could even complete its pre-test topic creation. Call t.Parallel() in manual_partitioner to match the other groups, and bump the kafka package timeout to 10m in packages.json, matching other heavy integration packages (redpanda, mssqlserver, etc.). Fixes CON-443
TestIntegrationUnorderedSasl's stream/parallel groups each ran 1000 messages, the same as the non-SASL TestIntegrationUnordered. The SASL variant only validates authentication wiring, not throughput, so the heavy message counts just consumed the parent package budget with no additional coverage. When other tests in the kafka package run first under the package timeout, the SASL subtests were left with no time to even connect to the broker. Drop the message count from 1000 to 100 to match the pattern used by the partition-specific groups in TestRedpandaIntegration (commit edb48668f). Standalone runtime drops from ~18s to ~6s. Fixes CON-440
TestRedpandaRecordOrderSoakTest used to consume ~3 minutes of soak (plus 2 minutes of setup/teardown budget) per run, dominating the kafka package runtime. When it ran before TestRedpandaRecordOrderIntegration under the package timeout, the latter was starved and killed by the package-level test timeout before it could complete. The chaos scenarios (leadership transfer + broker restarts) produce ample coverage within 1 minute, and the comment at the top of the test already notes that long soaks are meant to be run explicitly via `-count 1000` overnight. Standalone runtime drops from ~184s to ~72s. Fixes CON-448
|
Commits Review LGTM |
The brokerAddrPort helper introduced in fdebca9 normalises the broker address returned by testcontainers from localhost to 127.0.0.1 to bypass system DNS resolution, but the YAML templates in integration_unordered_test.go (8x), integration_sarama_test.go (8x) and integration_cache_test.go (1x) still hardcoded localhost:$PORT, which kept the tests dependent on the very lookup the helper was meant to avoid. Replace those occurrences with 127.0.0.1:$PORT to match the ordered integration_test.go templates and finish the work for CON-425.
|
|
||
| // Assign msgChan before spawning partition consumer goroutines to avoid a | ||
| // data race: goroutines read k.msgChan immediately upon starting. | ||
| k.msgChan = msgChan |
There was a problem hiding this comment.
Moving k.msgChan = msgChan here fixes the goroutine-spawn race, but leaves k.msgChan set on the error path. If consumer.ConsumePartition fails inside the loop, this function returns an error while k.msgChan is left pointing at an orphaned channel: the deferred cleanup only closes consumer/coordinator/client, and the goroutine that resets k.msgChan = nil is only spawned after the for loop completes.
That breaks the invariant relied on by Connect: the next retry sees k.msgChan != nil, short-circuits with return nil, and ReadBatch then blocks forever on a dead channel. Suggest resetting k.msgChan = nil in the deferred cleanup when err != nil.
There was a problem hiding this comment.
Addressed in 7bf43b9: reset k.msgChan = nil in the deferred cleanup when err != nil so a subsequent Connect retry observes the disconnected sentinel and re-runs the connection setup rather than short-circuiting on a stale channel reference. Verified the kafka integration suite still passes (TestRedpandaIntegration, TestIntegrationUnordered, TestRedpandaRecordOrder{Integration,SoakTest}, TestIntegrationCache{,Standardized}, TestIntegrationSarama{Redpanda,OutputFixedTimestamp,CheckpointOneLockUp}, TestIntegrationUnorderedSasl, TestRedpandaConnectionTest{,Sasl,PrematureConnect}Integration, TestRedpandaSaslIntegration).
|
Commits Review
|
aba69dd set k.msgChan before the partition-consumer spawn loop to fix a data race, but did not unset it on the error path. If consumer.ConsumePartition fails twice for any partition the function returns an error while k.msgChan is left pointing at an orphaned channel. A subsequent Connect retry sees k.msgChan != nil, short-circuits with return nil (the connected sentinel), and ReadBatch then blocks forever on a dead channel. Reset k.msgChan = nil in the deferred cleanup when err != nil.
|
Commits Review LGTM |
Jeffail
left a comment
There was a problem hiding this comment.
Approving. Original PR's headline data-race fix in connectExplicitTopics is correct and the test reshaping (deadline-aware soak, lighter partition suites, package-timeout bump, manual_partitioner parallelism, unit-test scale-down, range-of-partitions un-nest) is sound.
Two follow-up fixes pushed during review:
d8ba78162extends the127.0.0.1substitution tointegration_unordered_test.go,integration_sarama_test.goandintegration_cache_test.go, finishing the work started infdebca9d1so allstartRedpanda(t)-based templates bypass system DNS uniformly.7bf43b995resetsk.msgChan = nilin the deferred cleanup ofconnectExplicitTopicson the error path. Without this, aconsumer.ConsumePartitionfailure inside the spawn loop would leavek.msgChanpointing at an orphaned channel, breaking thek.msgChan != nil-means-connected invariant inConnectand deadlocking the nextReadBatchretry. Addresses the inline reviewer comment oninput_sarama_kafka_parts.go:225.
Full kafka integration suite verified locally: 11 integration tests pass plus the 60s TestIntegrationSaramaCheckpointOneLockUp and the 77s TestRedpandaRecordOrderSoakTest. Lint clean for internal/impl/kafka/....
Commits
Jira