
flaky test: ddl_for_split_tables_with_merge_and_split fails after TiCDC capture suicide #5176

@3AceShowHand


Failure

PR: #5098
CI: https://prow.tidb.net/jenkins/job/pingcap/job/ticdc/job/pull_cdc_mysql_integration_heavy/2058/display/redirect
Job: pull_cdc_mysql_integration_heavy #2058
Failed group: G05
Case: ddl_for_split_tables_with_merge_and_split with mysql sink

Evidence

Jenkins failed in the Test stage for TEST_GROUP = 'G05' with "script returned exit code 1".

The failure came from the split-table helper path, where merge_table_with_retry repeatedly failed for table 119:

{ "success": false, "error": "Can't not find maintainer for changefeed: test" }
{ "success": false, "error": "[CDC:ErrTableIsNotFounded]table is not found%!(EXTRA string=tableID, int64=119)" }
merge table 119 failed after 10 retries

The case logs show both TiCDC captures exited before the helper finished:

Error: [CDC:ErrCaptureSuicide]capture suicide

cdc0.log shows the direct cause as etcd session loss:

[WARN] [etcd_watcher.go:70] ["session is disconnected"] [error="[CDC:ErrEtcdSessionDone]the etcd session is done"]
[ERROR] [server.go:152] ["cdc server exits with error"] [error="[CDC:ErrCaptureSuicide]capture suicide"]

cdc1.log exits with the same ErrCaptureSuicide. Around the same time, down_pd.log shows etcd/PD instability: slow linearizable reads, ReadIndex retry, TSO save timestamp failure, not leader, and slow fsync.
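
To make this kind of triage repeatable, the error codes can be pulled out of the capture logs mechanically. The following is a minimal sketch, not part of the TiCDC test suite: the function name extract_cdc_errors and the regular expression are assumptions for illustration, and the sample lines are the two cdc0.log entries quoted above.

```python
import re
from typing import Iterable, List

# Assumption: TiCDC error codes appear in log text as "[CDC:ErrSomething]",
# as they do in the lines quoted in this issue.
CDC_ERROR_PATTERN = re.compile(r"\[CDC:(Err\w+)\]")


def extract_cdc_errors(lines: Iterable[str]) -> List[str]:
    """Return the distinct CDC error codes in first-seen order."""
    codes: List[str] = []
    for line in lines:
        for code in CDC_ERROR_PATTERN.findall(line):
            if code not in codes:
                codes.append(code)
    return codes


# The two cdc0.log lines from the Evidence section.
sample = [
    '[WARN] [etcd_watcher.go:70] ["session is disconnected"] '
    '[error="[CDC:ErrEtcdSessionDone]the etcd session is done"]',
    '[ERROR] [server.go:152] ["cdc server exits with error"] '
    '[error="[CDC:ErrCaptureSuicide]capture suicide"]',
]

print(extract_cdc_errors(sample))  # ['ErrEtcdSessionDone', 'ErrCaptureSuicide']
```

Keeping first-seen order matters here: the etcd session loss shows up before the capture suicide, which matches the cause-and-effect reading of cdc0.log.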

I did not find a table-route conflict / RouteAdmin error in the captured CDC logs. The observed failure is that the CDC cluster lost its maintainer after both captures exited with ErrCaptureSuicide, and the test helper then kept calling merge-table against a cluster with no running capture.

Expected

The test should not fail as a table scheduling failure when the underlying CDC captures have already exited due to PD/etcd session loss. It should either tolerate transient maintainer unavailability with a meaningful wait/retry path, or report the capture-suicide root cause directly.
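
One possible shape for that behavior is sketched below. It is only an illustration under stated assumptions, not the existing merge_table_with_retry helper: merge_with_retry, CaptureExitedError, and the merge_once / captures_alive callbacks are hypothetical names, and how a real test would detect a live capture (process check, log scan, API probe) is left open.

```python
import time
from typing import Callable, Dict


class CaptureExitedError(RuntimeError):
    """Raised when retrying is pointless because no capture is running."""


def merge_with_retry(
    merge_once: Callable[[], Dict],        # hypothetical: one merge-table call
    captures_alive: Callable[[], bool],    # hypothetical: liveness probe
    retries: int = 10,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return the number of attempts used.

    Raises CaptureExitedError as soon as the captures are gone, so the
    failure names the root cause instead of a scheduling error.
    """
    last_error = ""
    for attempt in range(1, retries + 1):
        if not captures_alive():
            raise CaptureExitedError(
                f"captures exited before merge attempt {attempt}; "
                f"last merge error: {last_error or 'none'}"
            )
        result = merge_once()
        if result.get("success"):
            return attempt
        # Transient errors such as a missing maintainer are retried.
        last_error = result.get("error", "")
        sleep(delay_seconds)
    raise TimeoutError(f"merge failed after {retries} retries: {last_error}")
```

With a check like this, the run in this issue would have stopped at the first attempt after both captures exited and reported that, rather than ending with "merge table 119 failed after 10 retries".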

Notes

This looks like a flaky integration-test/environment failure rather than a regression in the table-route conflict detector branch. The failure URL above should be kept as the reproduction evidence.
