[PATCH v9 18/19] multifd: Fix hang if send thread errors during sync

Lukas Straub posted 19 patches 1 month, 3 weeks ago
Maintainers: Pierrick Bouvier <pierrick.bouvier@linaro.org>, Lukas Straub <lukasstraub2@web.de>, Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>, Laurent Vivier <lvivier@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>
There is a newer version of this series
Posted by Lukas Straub 1 month, 3 weeks ago
When a send thread encounters an error (as is the case with yank),
it sets multifd_send_state->exiting and the other threads exit too.
This races with multifd_send_sync_main(), which then hangs at
qemu_sem_wait(&p->sem_sync) (line 647), waiting for threads that
have already exited.

Fix this by kicking the semaphores when exiting the send threads.

I encountered this hang when stress testing the colo unit test,
though I was unable to write a migration test to reliably hit this.

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
 migration/multifd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/migration/multifd.c b/migration/multifd.c
index 220ed8564960fdabc58e4baa069dd252c8ad293c..e8c85cb6c48deaee2c9bda7b821a976166d78c9c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -677,6 +677,7 @@ static void *multifd_send_thread(void *opaque)
         qemu_sem_wait(&p->sem);
 
         if (multifd_send_should_exit()) {
+            multifd_send_kick_main(p);
             break;
         }
 

-- 
2.39.5
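
The race described in the commit message can be reproduced in miniature outside QEMU. The following is only a sketch under stated assumptions: it uses plain POSIX threads and semaphores rather than QEMU's qemu_sem_* wrappers, and Channel, send_thread() and sync_completes() are hypothetical names invented for illustration, not code from migration/multifd.c. A timed wait stands in for the real qemu_sem_wait() so that the hang shows up as a timeout instead of blocking forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for one send channel; not QEMU's MultiFDSendParams. */
typedef struct {
    sem_t sem;            /* main -> worker: wake up and re-check state */
    sem_t sem_sync;       /* worker -> main: sync request handled */
    atomic_bool exiting;  /* set when some thread hit an error */
    bool kick_on_exit;    /* true models the behaviour added by the patch */
} Channel;

static void *send_thread(void *opaque)
{
    Channel *c = opaque;

    for (;;) {
        sem_wait(&c->sem);
        if (atomic_load(&c->exiting)) {
            if (c->kick_on_exit) {
                sem_post(&c->sem_sync);  /* wake a main thread stuck in sync */
            }
            break;
        }
        sem_post(&c->sem_sync);          /* normal path: acknowledge the sync */
    }
    return NULL;
}

/* Returns 1 if the wait on sem_sync completed, 0 if it timed out (the hang). */
static int sync_completes(bool kick_on_exit)
{
    Channel c = { .kick_on_exit = kick_on_exit };
    struct timespec deadline;
    pthread_t tid;
    int rc;

    sem_init(&c.sem, 0, 0);
    sem_init(&c.sem_sync, 0, 0);
    atomic_init(&c.exiting, false);
    rc = pthread_create(&tid, NULL, send_thread, &c);
    assert(rc == 0);
    (void)rc;

    /* An error elsewhere sets the flag before the worker sees the request. */
    atomic_store(&c.exiting, true);
    sem_post(&c.sem);

    /* Main waits for the sync acknowledgement, giving up after one second. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;
    do {
        rc = sem_timedwait(&c.sem_sync, &deadline);
    } while (rc == -1 && errno == EINTR);

    pthread_join(tid, NULL);
    sem_destroy(&c.sem);
    sem_destroy(&c.sem_sync);
    return rc == 0;
}
```

Calling sync_completes(false) models the behaviour before the patch: the worker exits without posting sem_sync and the waiter times out. Calling sync_completes(true) models the extra kick, and the waiter returns immediately.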
Re: [PATCH v9 18/19] multifd: Fix hang if send thread errors during sync
Posted by Peter Xu 1 month, 3 weeks ago
On Wed, Feb 18, 2026 at 10:29:38PM +0100, Lukas Straub wrote:
> When a send thread encounters an error (as is the case with yank),
> it sets multifd_send_state->exiting and the other threads exit too.
> This races with multifd_send_sync_main() which now hangs at
> qemu_sem_wait(&p->sem_sync) in multifd_send_sync_main() line 647
> as it waits for threads that have exited.
> 
> Fix this by kicking the semaphores when exiting the send threads.
> 
> I encountered this hang when stress testing the colo unit test,
> though I was unable to write a migration test to reliably hit this.
> 
> Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> ---
>  migration/multifd.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 220ed8564960fdabc58e4baa069dd252c8ad293c..e8c85cb6c48deaee2c9bda7b821a976166d78c9c 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -677,6 +677,7 @@ static void *multifd_send_thread(void *opaque)
>          qemu_sem_wait(&p->sem);
>  
>          if (multifd_send_should_exit()) {
> +            multifd_send_kick_main(p);
>              break;
>          }

It looks like normal migration cancellation only errors out the main
channel, not the multifd ones, hence the main sync will always be properly
done via sem_sync.  So maybe yank does behave differently, and fewer people
use yank in multifd migrations.  An extra kick on this path looks fine, as
long as we destroy the two semaphores later for each migration attempt.

That said, special-casing this path looks weird.

Could we move the kick of the main thread at the end out of the "err" case,
so that we always kick it?  We can add a comment explaining that.

-- 
Peter Xu