Skip to content

Conversation

@cliping
Copy link
Contributor

@cliping cliping commented Oct 19, 2025

Abort migration job before qemu layer migration starts, then migrate vm again.

VIRT-298398 - [VM migration][precopy] abort migration job before qemu layer migration starts

Test result:
(1/2) type_specific.io-github-autotest-libvirt.migration.async_job.abort_before_qemu_layer_mig_start.domjobabort.tcp.with_precopy.p2p: STARTED
(1/2) type_specific.io-github-autotest-libvirt.migration.async_job.abort_before_qemu_layer_mig_start.domjobabort.tcp.with_precopy.p2p: PASS (228.03 s)
(2/2) type_specific.io-github-autotest-libvirt.migration.async_job.abort_before_qemu_layer_mig_start.domjobabort.unix.with_precopy.p2p: STARTED
(2/2) type_specific.io-github-autotest-libvirt.migration.async_job.abort_before_qemu_layer_mig_start.domjobabort.unix.with_precopy.p2p: PASS (188.65 s)

Summary by CodeRabbit

  • Tests
    • Added migration tests that abort before the hypervisor layer starts and then retry to verify recovery.
    • Exercises multiple transport methods (p2p, tcp, unix proxy) and abort flows with expected error handling and state verification.
    • Includes console/screenshot checks, remote service interruption scenarios, and end-to-end result validation.

@coderabbitai
Copy link

coderabbitai bot commented Oct 19, 2025

Walkthrough

Adds a new Libvirt migration test config and test module that runs a background migration, aborts it before the QEMU layer starts (via domjobabort or killing virtqemud), collects and validates the result, then reruns the migration and performs cleanup.

Changes

Cohort / File(s) Summary
Migration async_job test configuration
libvirt/tests/cfg/migration/async_job/abort_before_qemu_layer_mig_start.cfg
New config declaring abort_before_qemu_layer_mig_start scenario with NFS/storage, console/screendump, SSH/virsh options, remote credentials, expected statuses/errors, and data-driven variants for transports (p2p, with_precopy, tcp, unix) and an abort_mig branch.
Migration async_job test implementation
libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py
New test module adding run_migration(vm_name, dest_uri, option, extra) and run(test, params, env); starts migration in a background thread, aborts migration (domjobabort / kill virtqemud), uses a queue for thread result communication, validates via libvirt.check_result, reruns migration, and ensures connection cleanup.

Sequence Diagram(s)

sequenceDiagram
    participant Runner as Test Runner
    participant Conn as MigrationBase / Connections
    participant Thread as Migration Thread
    participant VM as Source VM (virsh)
    participant Queue as Result Queue

    Note over Runner,Conn: Setup
    Runner->>Conn: setup_connection()
    Runner->>Conn: terminate remote virtqemud

    Note over Runner,Thread: Start migration
    Runner->>Thread: start run_migration(vm_name, dest_uri, option, extra)
    Thread->>VM: virsh migrate (background)
    Thread->>Queue: push interim status/results

    alt Abort via domjobabort
        Runner->>VM: virsh domjobabort
    else Abort via kill virtqemud
        Runner->>Conn: pkill/terminate virtqemud on dest
    end

    Thread->>Queue: push final migration result
    Runner->>Thread: join
    Runner->>Queue: pop result
    Runner->>Runner: libvirt.check_result(result, err_msg)

    Note over Runner: Rerun and verify
    Runner->>Conn: run_migration_again()
    Runner->>Conn: verify_default()
    Runner->>Conn: cleanup_connection()
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • chloerh
  • nanli1

Poem

🐰 I launch a thread and watch it roam,
A sudden stop — I guide it safely home.
Queues and signals, retry with care,
I nibble logs and tidy every snare. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "migration: Add case testing the abort operation" is clearly related to the pull request content. It accurately describes the primary change: adding new test configuration and test module files that exercise aborting a migration job. The title appropriately uses the "migration:" prefix to indicate the subsystem and conveys the key action (adding a test case for abort operation). While the title could be more specific about the distinguishing detail that the abort occurs before the QEMU layer migration starts (as highlighted in the PR objectives), the current wording remains clear and accurately represents the main change without being vague or misleading.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (4)
libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py (3)

69-70: Replace fixed sleep with a deterministic wait for “job started”.

Polling virsh domjobinfo (or pgrep existence) reduces flakiness across hosts.

Example:

utils_misc.wait_for(
    lambda: process.run('pgrep -f "virsh migrate"', shell=False,
                        ignore_status=True).exit_status == 0,
    timeout=int(params.get("ssh_timeout", 60)),
    step=1
)

25-25: Scope the queue to the test to avoid cross‑test leakage.

Make msg_queue local and pass it to the worker to reduce global state.

Example:

-msg_queue = queue.Queue()
+ # inside run():
+ msg_queue = queue.Queue()
@@
-def run_migration(vm_name, dest_uri, option, extra):
+def run_migration(vm_name, dest_uri, option, extra, outq):
@@
-    msg_queue.put(ret)
+    outq.put(ret)
@@
-        mig_t = threading.Thread(target=run_migration, args=(vm_name, dest_uri, option, extra))
+        mig_t = threading.Thread(target=run_migration, args=(vm_name, dest_uri, option, extra, msg_queue))

52-55: Reduce blast radius: target virtqemud main PID via systemd.

pkill affects all virtqemud processes. Prefer systemctl kill to only pause/resume the managed service.

Example:

migrate_vm.run_remote_cmd("systemctl kill -s STOP --kill-who=main virtqemud", server_ip, server_user, server_pwd)
...
migrate_vm.run_remote_cmd("systemctl kill -s CONT --kill-who=main virtqemud", server_ip, server_user, server_pwd)

Also applies to: 80-81

libvirt/tests/cfg/migration/async_job/abort_before_qemu_layer_mig_start.cfg (1)

27-27: Relax error matching to reduce flakiness across libvirt versions/locales.

Match the stable substring only.

Apply:

-    err_msg = "error: operation aborted: migration out: canceled by client"
+    err_msg = "canceled by client"

Please confirm this substring appears on your target libvirt builds; if not, adjust to the most stable fragment observed in stderr.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 043000c and 8722813.

📒 Files selected for processing (2)
  • libvirt/tests/cfg/migration/async_job/abort_before_qemu_layer_mig_start.cfg (1 hunks)
  • libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py (2)
provider/migration/base_steps.py (4)
  • setup_connection (343-364)
  • MigrationBase (27-477)
  • verify_default (309-324)
  • cleanup_connection (366-378)
libvirt/tests/src/migration/migrate_vm.py (1)
  • run_remote_cmd (759-776)
🪛 Ruff (0.14.0)
libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py

74-74: Function call with shell=True parameter identified, security issue

(S604)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Python 3.8
  • GitHub Check: Python 3.9
  • GitHub Check: Python 3.11
  • GitHub Check: Python 3.12

:param vm_name: VM name
:param dest_uri: virsh uri
:param option: virsh migrate option parameters
:param extra: virsh migrate extra paramsters
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Typo in docstring (“paramsters”).

Fix spelling for clarity.

Apply:

-    :param extra: virsh migrate extra paramsters
+    :param extra: virsh migrate extra parameters
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
:param extra: virsh migrate extra paramsters
:param extra: virsh migrate extra parameters
🤖 Prompt for AI Agents
In libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py
around line 35, the docstring has a typo: "paramsters" should be "parameters";
update the docstring line to correct the spelling for the :param extra:
description so it reads ":param extra: virsh migrate extra parameters".

Comment on lines +52 to +51
test.log.info("Setup steps.")
migration_obj.setup_connection()
remote.run_remote_cmd("pkill -19 virtqemud", params)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Wrong remote command helper/signature. Use migrate_vm.run_remote_cmd with server creds.

remote.run_remote_cmd(params) is not the exported helper; call migrate_vm.run_remote_cmd(command, server_ip, server_user, server_pwd).

Apply:

@@
-from virttest import remote
+from virttest import utils_misc
+from virttest import virsh
+from virttest.utils_test import libvirt
+from provider.migration import base_steps
+from libvirt.tests.src.migration import migrate_vm
@@
-        migration_obj.setup_connection()
-        remote.run_remote_cmd("pkill -19 virtqemud", params)
+        migration_obj.setup_connection()
+        migrate_vm.run_remote_cmd(
+            "pkill -19 virtqemud",
+            params.get("server_ip"),
+            params.get("server_user"),
+            params.get("server_pwd"),
+        )

Note: Keep the new import; repeat the same change for the resume step below.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py
around lines 52 to 55, the test is calling the wrong remote helper
(remote.run_remote_cmd with params); replace that call with
migrate_vm.run_remote_cmd(command, server_ip, server_user, server_pwd) — e.g.
migrate_vm.run_remote_cmd("pkill -19 virtqemud", server_ip, server_user,
server_pwd) — and ensure the same change is applied to the resume step later in
the file; keep the new import for migrate_vm as requested.

Comment on lines 74 to 76
ret = process.run("pgrep -f 'virsh migrate'", shell=True, verbose=True).stdout_text.strip()
utils_misc.safe_kill(ret, signal.SIGINT)
time.sleep(60)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Harden Ctrl‑C abort: avoid shell=True, handle multiple PIDs, and remove fixed sleep.

Addresses Ruff S604, multi‑PID cases, and unnecessary 60s delay.

Apply:

-            ret = process.run("pgrep -f 'virsh migrate'", shell=True, verbose=True).stdout_text.strip()
-            utils_misc.safe_kill(ret, signal.SIGINT)
-            time.sleep(60)
+            res = process.run('pgrep -f "virsh migrate"', shell=False,
+                              verbose=True, ignore_status=True)
+            pids = [p for p in res.stdout_text.split()]
+            if not pids:
+                test.error("No 'virsh migrate' PID found to SIGINT")
+            for pid in pids:
+                utils_misc.safe_kill(int(pid), signal.SIGINT)

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.0)

74-74: Function call with shell=True parameter identified, security issue

(S604)

🤖 Prompt for AI Agents
In libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py
around lines 74 to 76, avoid using shell=True with pgrep, handle multiple PIDs
returned, and remove the fixed 60s sleep: run pgrep as a list command (no
shell), check its return code and split stdout into lines to get all PIDs, call
utils_misc.safe_kill for each PID with signal.SIGINT, then replace
time.sleep(60) with a short polling loop that checks for remaining migration
processes (e.g., re-run pgrep or check process existence) with a reasonable
timeout and small sleep interval so the test proceeds as soon as processes exit.

Comment on lines +80 to +72
remote.run_remote_cmd("pkill -18 virtqemud", params)
mig_t.join()

output = msg_queue.get()
libvirt.check_result(output, err_msg)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid hangs: add timeouts to join() and queue.get().

Ensure the test fails fast instead of blocking indefinitely.

Apply:

-        remote.run_remote_cmd("pkill -18 virtqemud", params)
-        mig_t.join()
-
-        output = msg_queue.get()
+        migrate_vm.run_remote_cmd(
+            "pkill -18 virtqemud",
+            params.get("server_ip"),
+            params.get("server_user"),
+            params.get("server_pwd"),
+        )
+        mig_t.join(timeout=int(params.get("ssh_timeout", 60)))
+        if mig_t.is_alive():
+            test.error("Migration thread did not exit in time after abort")
+        try:
+            output = msg_queue.get(timeout=int(params.get("ssh_timeout", 60)))
+        except queue.Empty:
+            test.error("No migrate result retrieved from worker thread")
         libvirt.check_result(output, err_msg)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py
around lines 80 to 85, the test calls mig_t.join() and msg_queue.get() without
timeouts which can hang indefinitely; update to pass timeouts (e.g.
join(timeout=... ) and get(timeout=... )) and check their results so the test
fails fast: after join(timeout) assert the thread is not alive (or raise/fail if
it is), and wrap msg_queue.get(timeout) in a try/except for queue.Empty (or
check for None) to raise a clear test failure if no message arrives within the
timeout. Ensure the chosen timeout values are reasonable for the test
environment and that failure paths log/raise an informative error.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
libvirt/tests/cfg/migration/async_job/abort_before_qemu_layer_mig_start.cfg (5)

8-11: Normalize cfg types: quote vars and use yes/no booleans.

Avoid mixed boolean styles and unquoted vars to prevent parser surprises.

-    mnt_path_name = ${nfs_mount_dir}
-    only_pty = True
-    take_regular_screendumps = no
+    mnt_path_name = "${nfs_mount_dir}"
+    only_pty = "yes"
+    take_regular_screendumps = "no"

27-27: Make error check robust across libvirt versions.

Exact strings can differ slightly; match a stable substring.

-    err_msg = "error: operation aborted: migration out: canceled by client"
+    err_msg = "operation aborted"

If the harness supports regex, prefer a pattern like: operation aborted.*canceled by client.


37-39: Add a tcp health check port.

You define migrate_desturi_port but don't check it; add port_to_check for tcp.

         - tcp:
             migrate_desturi_port = "16509"
             migrate_desturi_type = "tcp"
+            port_to_check = "${migrate_desturi_port}"
             virsh_migrate_desturi = "qemu+tcp://${migrate_dest_host}/system"

30-36: Confirm intended cartesian product of variants.

Current layout yields p2p × with_precopy × {tcp, unix} × abort_mig. Ensure this expansion is desired (test time explosion, naming). If not, consider scoping or naming variant dimensions to control combinations.


51-52: Add SIGINT abort path to match PR scope.

PR summary mentions domjobabort and SIGINT; config only has domjobabort.

     variants abort_mig:
         - domjobabort:
+        - sigint:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8722813 and c22e51f.

📒 Files selected for processing (2)
  • libvirt/tests/cfg/migration/async_job/abort_before_qemu_layer_mig_start.cfg (1 hunks)
  • libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Python 3.12
  • GitHub Check: Python 3.11
  • GitHub Check: Python 3.8
  • GitHub Check: Python 3.9
🔇 Additional comments (2)
libvirt/tests/cfg/migration/async_job/abort_before_qemu_layer_mig_start.cfg (2)

43-50: Verify unix_proxy socket paths across distros.

Paths under /var/lib/libvirt/qemu may differ (e.g., /run/libvirt/qemu on newer systems). Confirm these are correct for your target platforms.


18-19: Validate final domain states correspond to the second (successful) run.

With status_error="yes" then migrate_again_status_error="no", confirm dest=running and src="shut off" are evaluated after the retry, not the initial abort.

Abort migration job before qemu layer migration starts, then
migrate vm again.

VIRT-298398 - [VM migration][precopy] abort migration job before
qemu layer migration starts

Signed-off-by: lcheng <[email protected]>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (5)
libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py (5)

15-19: Missing required import for remote command execution.

The code at lines 50 and 67 needs migrate_vm.run_remote_cmd, but the migrate_vm module is not imported.


50-50: Wrong remote command API - will fail at runtime.

remote.run_remote_cmd(command, params) is not the correct signature. Use migrate_vm.run_remote_cmd(command, server_ip, server_user, server_pwd) instead.


67-67: Wrong remote command API - will fail at runtime.

Same issue as line 50: use migrate_vm.run_remote_cmd(command, server_ip, server_user, server_pwd) instead of remote.run_remote_cmd(command, params).


68-68: Add timeout to thread join to prevent indefinite hang.

Without a timeout, the test can block forever if the migration thread doesn't exit. Add a timeout and check if the thread is still alive.


70-70: Add timeout to queue.get() to prevent indefinite hang.

Without a timeout, this call can block forever if the migration thread fails to put a result. Add a timeout and handle the queue.Empty exception.

🧹 Nitpick comments (1)
libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py (1)

21-21: Consider moving queue to local scope to avoid shared state.

The module-level msg_queue could retain state across test runs if the module is reused. Moving it inside run() ensures clean state per test execution.

Apply this diff to move the queue to local scope:

-msg_queue = queue.Queue()
-
-
 def run_migration(vm_name, dest_uri, option, extra):
     """
     Run migration
 
     :param vm_name: VM name
     :param dest_uri: virsh uri
     :param option: virsh migrate option parameters
     :param extra: virsh migrate extra parameters
     """
+    msg_queue = threading.current_thread().msg_queue
     ret = virsh.migrate(vm_name, dest_uri, option, extra, ignore_status=True, debug=True)
     msg_queue.put(ret)

Then in run_test():

+        msg_queue = queue.Queue()
         mig_t = threading.Thread(target=run_migration, args=(vm_name, dest_uri, option, extra))
+        mig_t.msg_queue = msg_queue
         mig_t.start()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c22e51f and 6534c53.

📒 Files selected for processing (2)
  • libvirt/tests/cfg/migration/async_job/abort_before_qemu_layer_mig_start.cfg (1 hunks)
  • libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py (2)
provider/migration/base_steps.py (4)
  • setup_connection (343-364)
  • MigrationBase (27-477)
  • verify_default (309-324)
  • cleanup_connection (366-378)
libvirt/tests/src/migration/migrate_vm.py (1)
  • run_remote_cmd (759-776)
🔇 Additional comments (5)
libvirt/tests/src/migration/async_job/abort_before_qemu_layer_mig_start.py (2)

24-34: LGTM!

The function correctly uses ignore_status=True to ensure a result is always captured, preventing queue deadlock when migration is aborted.


73-84: LGTM!

The main execution flow correctly initializes the migration object, runs setup and test, executes migration again, verifies results, and ensures cleanup in the finally block.

libvirt/tests/cfg/migration/async_job/abort_before_qemu_layer_mig_start.cfg (3)

1-29: LGTM!

Base configuration parameters are comprehensive and appropriate for testing migration abort before QEMU layer starts, including NFS storage setup, SSH timeouts, expected states, and error messages.


30-50: LGTM!

The variant configurations correctly cover multiple transport methods (p2p, tcp, unix) with appropriate parameters. The service_to_check = "" at line 42 correctly uses an empty string (past issue resolved).


51-52: LGTM!

The abort_mig variant correctly defines the domjobabort test scenario.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant