[PATCH v10 18/19] multifd: Fix hang if send thread errors during sync

Lukas Straub posted 19 patches 1 week ago
Maintainers: Pierrick Bouvier <pierrick.bouvier@linaro.org>, Lukas Straub <lukasstraub2@web.de>, Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>, Laurent Vivier <lvivier@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>
Posted by Lukas Straub 1 week ago
When a send thread encounters an error (as is the case with yank),
it sets multifd_send_state->exiting and the other threads exit too.
This races with multifd_send_sync_main(), which then hangs at
qemu_sem_wait(&p->sem_sync) (multifd_send_sync_main(), line 647)
because it waits for threads that have already exited.

Fix this by kicking the semaphores when exiting the send threads.

I encountered this hang when stress testing the colo unit test,
though I was unable to write a migration test to reliably hit this.

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
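Note for readers: the race described above can be reproduced outside of
QEMU with a small stand-alone model. The sketch below is not QEMU code. It
uses plain POSIX semaphores in place of QemuSemaphore, and every name in it
(Channel, send_thread, count_stuck_syncs, SYNC_TIMEOUT_MS) is made up for
illustration. Channel 0 simulates the erroring thread; the other channels
only notice the shared exiting flag. The main thread uses a timed wait
purely so that the sketch can count stuck syncs instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

#define NUM_CHANNELS 4
#define SYNC_TIMEOUT_MS 300   /* assumption: long enough for threads to run */

typedef struct {
    int id;
    bool always_kick;  /* true models the patched exit path */
    sem_t sem;         /* wakes the worker */
    sem_t sem_sync;    /* worker -> main: "synced" or "I am exiting" */
} Channel;

static Channel channels[NUM_CHANNELS];
static atomic_bool exiting;

static void *send_thread(void *opaque)
{
    Channel *p = opaque;
    bool error = (p->id == 0);  /* channel 0 simulates the failing thread */

    if (error) {
        /* Error path: flag the exit and wake every worker. */
        atomic_store(&exiting, true);
        for (int i = 0; i < NUM_CHANNELS; i++) {
            sem_post(&channels[i].sem);
        }
    } else {
        /* Healthy worker: sleep until woken, then notice the exit flag. */
        while (!atomic_load(&exiting)) {
            sem_wait(&p->sem);
        }
    }

    /* Old behaviour: only the erroring thread kicks the main thread. */
    if (error || p->always_kick) {
        sem_post(&p->sem_sync);
    }
    return NULL;
}

/* Wait for a sync kick; returns false if none arrives before the timeout. */
static bool wait_sync(sem_t *s)
{
    struct timespec ts;
    int r;

    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_nsec += SYNC_TIMEOUT_MS * 1000000L;
    ts.tv_sec += ts.tv_nsec / 1000000000L;
    ts.tv_nsec %= 1000000000L;
    while ((r = sem_timedwait(s, &ts)) == -1 && errno == EINTR) {
        /* retry after a signal */
    }
    return r == 0;
}

/* Returns how many channels the main thread would have hung on. */
int count_stuck_syncs(bool always_kick)
{
    pthread_t tids[NUM_CHANNELS];
    int stuck = 0;

    atomic_store(&exiting, false);
    /* Initialise every channel before any thread can post to it. */
    for (int i = 0; i < NUM_CHANNELS; i++) {
        channels[i].id = i;
        channels[i].always_kick = always_kick;
        sem_init(&channels[i].sem, 0, 0);
        sem_init(&channels[i].sem_sync, 0, 0);
    }
    for (int i = 0; i < NUM_CHANNELS; i++) {
        int rc = pthread_create(&tids[i], NULL, send_thread, &channels[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_CHANNELS; i++) {
        if (!wait_sync(&channels[i].sem_sync)) {
            stuck++;
        }
    }
    for (int i = 0; i < NUM_CHANNELS; i++) {
        pthread_join(tids[i], NULL);
        sem_destroy(&channels[i].sem);
        sem_destroy(&channels[i].sem_sync);
    }
    return stuck;
}
```

Calling count_stuck_syncs(false) models the code before this patch: the
three healthy workers exit without posting sem_sync, so the main thread
would block on each of them. Calling count_stuck_syncs(true) models the
unconditional kick, and every wait completes. Build with -pthread on Linux
(unnamed POSIX semaphores are assumed to be available).
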
 migration/multifd.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 220ed8564960fdabc58e4baa069dd252c8ad293c..7762aab8e0702672d3730f27e9c9ee3b86500f0c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -772,9 +772,14 @@ out:
         assert(local_err);
         trace_multifd_send_error(p->id);
         multifd_send_error_propagate(local_err);
-        multifd_send_kick_main(p);
     }
 
+    /*
+     * Always kick the main thread: The main thread might wait on this thread
+     * while another thread encounters an error and signals this thread to exit.
+     */
+    multifd_send_kick_main(p);
+
     rcu_unregister_thread();
     trace_multifd_send_thread_end(p->id, p->packets_sent);
 

-- 
2.39.5
Re: [PATCH v10 18/19] multifd: Fix hang if send thread errors during sync
Posted by Peter Xu 1 day, 9 hours ago
On Fri, Feb 20, 2026 at 08:51:40PM +0100, Lukas Straub wrote:
> When a send thread encounters an error (as is the case with yank),
> it sets multifd_send_state->exiting and the other threads exit too.
> This races with multifd_send_sync_main(), which then hangs at
> qemu_sem_wait(&p->sem_sync) (multifd_send_sync_main(), line 647)
> because it waits for threads that have already exited.
> 
> Fix this by kicking the semaphores when exiting the send threads.
> 
> I encountered this hang when stress testing the colo unit test,
> though I was unable to write a migration test to reliably hit this.
> 
> Signed-off-by: Lukas Straub <lukasstraub2@web.de>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu