Skip to content

kafka: various test fixes#4301

Merged
Jeffail merged 12 commits into
mainfrom
mmt/kafka-fixes-v2
May 5, 2026
Merged

kafka: various test fixes#4301
Jeffail merged 12 commits into
mainfrom
mmt/kafka-fixes-v2

Conversation

@mmatczuk
Copy link
Copy Markdown
Contributor

Commits

  • fix data race on msgChan in connectExplicitTopics
  • adapt soak test duration to test deadline
  • use 127.0.0.1 instead of localhost in test broker address
  • unnest range_of_partitions from explicit_partitions test group
  • use lighter suite for partition-specific test groups
  • reduce partition cache ordering test batch count
  • use lighter suite for unordered partition-specific test groups
  • parallelize manual_partitioner subtest and raise package timeout
  • use lighter suite for unordered SASL test
  • shorten soak test default duration to 1 minute

Jira

  • CON-407
  • CON-413
  • CON-425
  • CON-426
  • CON-427
  • CON-431
  • CON-439
  • CON-440
  • CON-443
  • CON-448

mmatczuk added 10 commits April 20, 2026 15:26
Partition consumer goroutines were spawned before k.msgChan was assigned,
causing a data race between the goroutine reads (line 104) and the
subsequent write (line 304). Move the assignment before the spawn loop.

Fixes CON-407
TestRedpandaRecordOrderSoakTest had a hardcoded 3-minute soak duration
but didn't account for time already consumed by other tests in the
package. When run with a 5-minute timeout alongside other kafka
integration tests, the test was killed before it could finish.

The soak duration now adapts to the remaining time budget from
t.Deadline(), reserving 2 minutes for container setup/teardown. If
insufficient time remains (<30s for soaking), the test skips with a
clear message.

Fixes CON-413
KafkaSeedBroker() returns localhost:<port> which fails when the system
DNS resolver cannot resolve "localhost" (e.g. under Docker networking
issues). Use 127.0.0.1 to bypass DNS resolution entirely.

Fixes CON-425
range_of_partitions was nested inside explicit_partitions as a child
t.Run. Go's test runner releases PAUSEd subtests only when the parent
function returns, so the direct explicit_partitions subtests were blocked
until range_of_partitions completed (~19s), leaving them with a tight
remaining deadline that caused flaky 10s timeouts.

Moving range_of_partitions to a sibling t.Run ensures both groups get
their own share of the parent deadline.

Fixes CON-426
TestRedpandaIntegration runs 5 sequential suite groups that each include
1000-message sequential and parallel tests. Later groups (especially
manual_partitioner) start with a nearly exhausted parent test deadline
because preceding groups consume most of the available time.

Partition-specific groups only need to validate routing correctness, not
throughput. Reduce their message count from 1000 to 100, keeping the
top-level suite at 1000 for thorough coverage.

Fixes CON-427
Lower batches from 1M to 100K (still 1M messages) so the test
completes in ~2s instead of ~20s, avoiding timeout when CPU-starved
by long-running integration tests in the same package.

Fixes CON-431
TestIntegrationUnordered runs 4 sequential suite groups that each include
1000-message sequential and parallel tests. Later groups (especially
manual_partitioner) start with a nearly exhausted parent test deadline
because preceding groups consume most of the available time, causing
the redpanda container to become unresponsive under sustained load.

Partition-specific groups only need to validate routing correctness,
not throughput. Reduce their message count from 1000 to 100, keeping
the top-level suite at 1000 for thorough coverage. Mirrors the same
fix applied to TestRedpandaIntegration in edb48668f.

Fixes CON-439
TestIntegrationUnordered's manual_partitioner group did not call
t.Parallel() like its sibling subtests, so it ran sequentially after
the top-level suite, adding to the total elapsed time of the kafka
integration package. The go test binary-level 5-minute timeout would
fire before TestIntegrationUnordered could even complete its pre-test
topic creation.

Call t.Parallel() in manual_partitioner to match the other groups,
and bump the kafka package timeout to 10m in packages.json, matching
other heavy integration packages (redpanda, mssqlserver, etc.).

Fixes CON-443
TestIntegrationUnorderedSasl's stream/parallel groups each ran 1000
messages, the same as the non-SASL TestIntegrationUnordered. The SASL
variant only validates authentication wiring, not throughput, so the
heavy message counts just consumed the parent package budget with no
additional coverage. When other tests in the kafka package run first
under the package timeout, the SASL subtests were left with no time
to even connect to the broker.

Drop the message count from 1000 to 100 to match the pattern used by
the partition-specific groups in TestRedpandaIntegration (commit
edb48668f). Standalone runtime drops from ~18s to ~6s.

Fixes CON-440
TestRedpandaRecordOrderSoakTest used to consume ~3 minutes of soak
(plus 2 minutes of setup/teardown budget) per run, dominating the
kafka package runtime. When it ran before TestRedpandaRecordOrderIntegration
under the package timeout, the latter was starved and killed by the
package-level test timeout before it could complete.

The chaos scenarios (leadership transfer + broker restarts) produce
ample coverage within 1 minute, and the comment at the top of the
test already notes that long soaks are meant to be run explicitly
via `-count 1000` overnight. Standalone runtime drops from ~184s to
~72s.

Fixes CON-448
@claude
Copy link
Copy Markdown

claude Bot commented Apr 20, 2026

Commits
LGTM

Review
Bundle of test-stability fixes for the kafka integration package: real data race fix in connectExplicitTopics, adaptive soak-test duration, lighter suites for partition/SASL subtests, test-group restructuring, container-addr helper, and a package timeout bump. All changes are well-motivated by the commit messages and look sound.

LGTM

This was referenced Apr 20, 2026
The brokerAddrPort helper introduced in fdebca9 normalises the broker
address returned by testcontainers from localhost to 127.0.0.1 to bypass
system DNS resolution, but the YAML templates in
integration_unordered_test.go (8x), integration_sarama_test.go (8x) and
integration_cache_test.go (1x) still hardcoded localhost:$PORT, which kept
the tests dependent on the very lookup the helper was meant to avoid.

Replace those occurrences with 127.0.0.1:$PORT to match the ordered
integration_test.go templates and finish the work for CON-425.

// Assign msgChan before spawning partition consumer goroutines to avoid a
// data race: goroutines read k.msgChan immediately upon starting.
k.msgChan = msgChan
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving k.msgChan = msgChan here fixes the goroutine-spawn race, but leaves k.msgChan set on the error path. If consumer.ConsumePartition fails inside the loop, this function returns an error while k.msgChan is left pointing at an orphaned channel: the deferred cleanup only closes consumer/coordinator/client, and the goroutine that resets k.msgChan = nil is only spawned after the for loop completes.

That breaks the invariant relied on by Connect: the next retry sees k.msgChan != nil, short-circuits with return nil, and ReadBatch then blocks forever on a dead channel. Suggest resetting k.msgChan = nil in the deferred cleanup when err != nil.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 7bf43b9: reset k.msgChan = nil in the deferred cleanup when err != nil so a subsequent Connect retry observes the disconnected sentinel and re-runs the connection setup rather than short-circuiting on a stale channel reference. Verified the kafka integration suite still passes (TestRedpandaIntegration, TestIntegrationUnordered, TestRedpandaRecordOrder{Integration,SoakTest}, TestIntegrationCache{,Standardized}, TestIntegrationSarama{Redpanda,OutputFixedTimestamp,CheckpointOneLockUp}, TestIntegrationUnorderedSasl, TestRedpandaConnectionTest{,Sasl,PrematureConnect}Integration, TestRedpandaSaslIntegration).

@claude
Copy link
Copy Markdown

claude Bot commented May 5, 2026

Commits
LGTM

Review
A bundle of well-scoped kafka integration test fixes plus one production fix for a msgChan data race. The commit hygiene is excellent — each commit is a single self-contained change with a clear rationale tied to a Jira ticket.

  1. The k.msgChan = msgChan reordering in connectExplicitTopics correctly fixes the goroutine-spawn race, but leaves k.msgChan set to an orphaned channel on the consumer.ConsumePartition error path, breaking the k.msgChan \!= nil-means-connected invariant in Connect. Inline comment posted.

aba69dd set k.msgChan before the partition-consumer spawn loop to
fix a data race, but did not unset it on the error path. If
consumer.ConsumePartition fails twice for any partition the function
returns an error while k.msgChan is left pointing at an orphaned
channel. A subsequent Connect retry sees k.msgChan != nil,
short-circuits with return nil (the connected sentinel), and
ReadBatch then blocks forever on a dead channel.

Reset k.msgChan = nil in the deferred cleanup when err != nil.
@claude
Copy link
Copy Markdown

claude Bot commented May 5, 2026

Commits
LGTM

Review
Test-reliability cleanup for the kafka integration package plus a focused fix to connectExplicitTopics. The msgChan race fix is sound (assign before goroutines spawn, reset to nil on the error path via the named-return err), and the lighter partition-specific suites and deadline-aware soak duration are well-motivated.

LGTM

Copy link
Copy Markdown
Contributor

@Jeffail Jeffail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. Original PR's headline data-race fix in connectExplicitTopics is correct and the test reshaping (deadline-aware soak, lighter partition suites, package-timeout bump, manual_partitioner parallelism, unit-test scale-down, range-of-partitions un-nest) is sound.

Two follow-up fixes pushed during review:

  • d8ba78162 extends the 127.0.0.1 substitution to integration_unordered_test.go, integration_sarama_test.go and integration_cache_test.go, finishing the work started in fdebca9d1 so all startRedpanda(t)-based templates bypass system DNS uniformly.
  • 7bf43b995 resets k.msgChan = nil in the deferred cleanup of connectExplicitTopics on the error path. Without this, a consumer.ConsumePartition failure inside the spawn loop would leave k.msgChan pointing at an orphaned channel, breaking the k.msgChan != nil-means-connected invariant in Connect and deadlocking the next ReadBatch retry. Addresses the inline reviewer comment on input_sarama_kafka_parts.go:225.

Full kafka integration suite verified locally: 11 integration tests pass plus the 60s TestIntegrationSaramaCheckpointOneLockUp and the 77s TestRedpandaRecordOrderSoakTest. Lint clean for internal/impl/kafka/....

@Jeffail Jeffail merged commit a9f95f9 into main May 5, 2026
7 checks passed
@Jeffail Jeffail deleted the mmt/kafka-fixes-v2 branch May 5, 2026 12:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants