Bug: [202511][dualtor] tunnel route leftovers are seen after test_stress_arp.py #25699

@nazariig

Is it platform specific

generic

Importance or Severity

High

Description of the bug

Why it happens:

  • arp_update runs in a loop: it periodically compares APPL_DB neighbors with the kernel and, for any (IP, interface) that is in APPL_DB but not in the kernel, it pings to “repair” the mismatch.
  • When the test runs ip neigh flush, the kernel table is cleared while arp_update is in the middle of that loop (or right before its next pass).
  • Right after the flush, the kernel has no neighbors, but APPL_DB still has the entries that existed before the flush (e.g. 172.16.x.x from the test).
  • arp_update then sees a large “mismatch”: many APPL_DB entries are missing from the kernel. It treats that as “kernel is missing these neighbors” and starts pinging those IPs to repopulate the kernel.
  • Those pings recreate neighbor entries (which go FAILED/INCOMPLETE again for 172.16.x.x). Neighsyncd and/or orchagent react to that and recreate the tunnel routes.
  • So the flush clears the kernel once, but arp_update’s “mismatch” logic immediately refills the kernel and brings tunnel routes back.
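
The sequence above can be sketched as a set difference. This is only an illustration of the described logic, not the actual arp_update script: the function name `find_mismatches` and the sample data are hypothetical, and the real script reads APPL_DB and the kernel table rather than in-memory sets.

```python
def find_mismatches(appl_db_neighbors, kernel_neighbors):
    """Return (ip, interface) pairs present in APPL_DB but absent from the kernel."""
    return sorted(set(appl_db_neighbors) - set(kernel_neighbors))


# State before the flush: both tables agree, so nothing would be pinged.
appl_db = {("172.16.12.50", "Vlan1000"), ("172.16.4.147", "Vlan1000")}
kernel = set(appl_db)
before_flush = find_mismatches(appl_db, kernel)

# `ip neigh flush all` empties the kernel table; APPL_DB has not caught up yet.
kernel.clear()
after_flush = find_mismatches(appl_db, kernel)

# Every stale APPL_DB entry now counts as a mismatch and would be pinged,
# which re-creates the kernel neighbor entry the flush just removed.
for ip, interface in after_flush:
    print(f"mismatch arp entry, pinging {ip} on {interface}")
```

The point of the sketch is that the comparison cannot tell "the kernel lost this neighbor" apart from "this neighbor was deliberately flushed and APPL_DB is simply stale".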

The race between neighsyncd and arp_update has two implications:

  1. Test case failures due to leftovers
  2. Real deployments, where neighbor and tunnel route synchronization may repeat endlessly depending on timing

Steps to Reproduce

  1. Run test_stress_arp.py
python3 -m pytest arp/test_stress_arp.py --inventory="../ansible/inventory,../ansible/veos" --host-pattern <dut-1>,<dut-2> --module-path ../ansible/library/ --testbed <testbed_name> --setup_name=<setup_name> --testbed_file ../ansible/testbed.yaml --allow_recover --assert plain --log-cli-level info --show-capture=no -ra --showlocals --skip_sanity --store_la_logs --ignore_la_failure -k "ipv4"

Actual Behavior and Expected Behavior

SONiC:

root@sonic:/home/admin# ip -4 ne
172.16.27.98 dev Vlan1000 FAILED
172.16.31.31 dev Vlan1000 FAILED
172.16.12.153 dev Vlan1000 FAILED
172.16.26.112 dev Vlan1000 FAILED
172.16.35.201 dev Vlan1000 FAILED
172.16.15.234 dev Vlan1000 FAILED
172.16.30.109 dev Vlan1000 FAILED
172.16.21.65 dev Vlan1000 FAILED
172.16.24.35 dev Vlan1000 INCOMPLETE
172.16.12.2 dev Vlan1000 FAILED
172.16.13.93 dev Vlan1000 FAILED
172.16.34.215 dev Vlan1000 FAILED
...

root@sonic:/home/admin# redis-cli -n 1 KEYS "*" | grep ":.172.16\|:.fc02:1000" | wc -l
921

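Expected: the neighbors and tunnel routes are removed after ip neigh flush all and are not recreated. Actual: the FAILED/INCOMPLETE neighbors and the APPL_DB entries shown above remain.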
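
A check for this expectation could look roughly like the following. It is a minimal sketch, not part of test_stress_arp.py: the helper name `leftover_neighbors` is hypothetical, and the 172.16.0.0/16 subnet is an assumption inferred from the addresses in the output above.

```python
import ipaddress

# Assumption: the stress test uses addresses inside 172.16.0.0/16.
TEST_SUBNET = ipaddress.ip_network("172.16.0.0/16")


def leftover_neighbors(ip_neigh_output, subnet=TEST_SUBNET):
    """Parse `ip -4 neigh` output and return (ip, device, state) for entries in subnet."""
    leftovers = []
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        # Expected shape: "<ip> dev <device> [lladdr <mac>] <state>"
        if len(fields) < 4 or fields[1] != "dev":
            continue
        try:
            address = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # skip lines such as "..." that do not start with an address
        if address in subnet:
            leftovers.append((fields[0], fields[2], fields[-1]))
    return leftovers


sample = """\
172.16.27.98 dev Vlan1000 FAILED
172.16.24.35 dev Vlan1000 INCOMPLETE
192.168.0.1 dev Vlan1000 lladdr 00:11:22:33:44:55 REACHABLE
"""
result = leftover_neighbors(sample)
```

After a successful flush the returned list should be empty; any remaining entry is a leftover of the kind reported here.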

Relevant log output

SYSLOG:

syslog:2026 Feb 25 16:40:55.641700 sonic INFO python3.13[679724]: ansible-ansible.legacy.command Invoked with _raw_params=ip  -stats neigh flush all _uses_shell=True expand_argument_vars=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
syslog:2026 Feb 25 16:40:55.649495 sonic NOTICE swss#arp_update[17851]: 114 mismatch arp entry, pinging 172.16.12.50 on Vlan1000
syslog:2026 Feb 25 16:40:55.792361 sonic NOTICE swss#orchagent: :- create_route: Created tunnel route to 172.16.24.124/32
syslog:2026 Feb 25 16:40:55.860267 sonic NOTICE swss#arp_update[17858]: 114 mismatch arp entry, pinging 172.16.4.147 on Vlan1000
syslog:2026 Feb 25 16:40:55.867902 sonic NOTICE swss#orchagent: :- create_route: Created tunnel route to 172.16.28.159/32
syslog:2026 Feb 25 16:40:55.918404 sonic NOTICE swss#orchagent: :- create_route: Created tunnel route to 172.16.31.216/32
syslog:2026 Feb 25 16:40:56.071206 sonic NOTICE swss#arp_update[17865]: 114 mismatch arp entry, pinging 172.16.25.81 on Vlan1000
syslog:2026 Feb 25 16:40:56.120168 sonic NOTICE swss#orchagent: :- create_route: Created tunnel route to 172.16.4.199/32
syslog:2026 Feb 25 16:40:56.281553 sonic NOTICE swss#arp_update[17872]: 114 mismatch arp entry, pinging 172.16.31.128 on Vlan1000
syslog:2026 Feb 25 16:40:56.304025 sonic NOTICE swss#orchagent: :- create_route: Created tunnel route to 172.16.5.236/32
syslog:2026 Feb 25 16:40:56.317230 sonic NOTICE swss#orchagent: :- remove_route: Removed tunnel route to 172.16.7.101/32
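
To quantify the interleaving in a longer log, the messages above can be tallied with a short script. This is a sketch under the assumption that the message formats match the lines quoted in this report; the regular expressions and the `summarize` helper are illustrative and not part of any SONiC tool.

```python
import re
from collections import Counter

# Patterns derived from the syslog lines quoted above (assumed stable formats).
PING_RE = re.compile(r"arp_update\[\d+\]: .*mismatch arp entry, pinging (\S+) on (\S+)")
ROUTE_RE = re.compile(
    r"orchagent: :- (create_route|remove_route): (?:Created|Removed) tunnel route to (\S+)"
)


def summarize(lines):
    """Count arp_update pings and tunnel route create/remove events."""
    counts = Counter()
    for line in lines:
        if PING_RE.search(line):
            counts["ping"] += 1
            continue
        match = ROUTE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


log = [
    "syslog:2026 Feb 25 16:40:55.649495 sonic NOTICE swss#arp_update[17851]: 114 mismatch arp entry, pinging 172.16.12.50 on Vlan1000",
    "syslog:2026 Feb 25 16:40:55.792361 sonic NOTICE swss#orchagent: :- create_route: Created tunnel route to 172.16.24.124/32",
    "syslog:2026 Feb 25 16:40:56.317230 sonic NOTICE swss#orchagent: :- remove_route: Removed tunnel route to 172.16.7.101/32",
]
counts = summarize(log)
```

A `create_route` count that keeps growing after the flush timestamp would indicate that tunnel routes are being re-created rather than drained.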

Output of show version, show techsupport

  • N/A

Attach files (if any)

  • N/A
