[PULL 26/34] migration/multifd: Join the TLS thread

peterx@redhat.com posted 34 patches 9 months, 3 weeks ago
Maintainers: "Alex Bennée" <alex.bennee@linaro.org>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, Thomas Huth <thuth@redhat.com>, Wainer dos Santos Moschetta <wainersm@redhat.com>, Beraldo Leal <bleal@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>, Laurent Vivier <lvivier@redhat.com>
[PULL 26/34] migration/multifd: Join the TLS thread
Posted by peterx@redhat.com 9 months, 3 weeks ago
From: Fabiano Rosas <farosas@suse.de>

We're currently leaking the resources of the TLS thread by not joining
it and also overwriting the p->thread pointer altogether.

Fixes: a1af605bd5 ("migration/multifd: fix hangup with TLS-Multifd due to blocking handshake")
Cc: qemu-stable <qemu-stable@nongnu.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20240206215118.6171-2-farosas@suse.de
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.h | 2 ++
 migration/multifd.c | 8 +++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 78a2317263..720c9d50db 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -73,6 +73,8 @@ typedef struct {
     char *name;
     /* channel thread id */
     QemuThread thread;
+    QemuThread tls_thread;
+    bool tls_thread_created;
     /* communication channel */
     QIOChannel *c;
     /* is the yank function registered */
diff --git a/migration/multifd.c b/migration/multifd.c
index fbdb129088..5551711a2a 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -630,6 +630,10 @@ static void multifd_send_terminate_threads(void)
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
+        if (p->tls_thread_created) {
+            qemu_thread_join(&p->tls_thread);
+        }
+
         if (p->running) {
             qemu_thread_join(&p->thread);
         }
@@ -921,7 +925,9 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p,
     trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname);
     qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing");
     p->c = QIO_CHANNEL(tioc);
-    qemu_thread_create(&p->thread, "multifd-tls-handshake-worker",
+
+    p->tls_thread_created = true;
+    qemu_thread_create(&p->tls_thread, "multifd-tls-handshake-worker",
                        multifd_tls_handshake_thread, p,
                        QEMU_THREAD_JOINABLE);
     return true;
-- 
2.43.0
Re: [PULL 26/34] migration/multifd: Join the TLS thread
Posted by Michael Tokarev 9 months, 2 weeks ago
08.02.2024 06:05, peterx@redhat.com :
> From: Fabiano Rosas <farosas@suse.de>
> 
> We're currently leaking the resources of the TLS thread by not joining
> it and also overwriting the p->thread pointer altogether.
> 
> Fixes: a1af605bd5 ("migration/multifd: fix hangup with TLS-Multifd due to blocking handshake")
> Cc: qemu-stable <qemu-stable@nongnu.org>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> Link: https://lore.kernel.org/r/20240206215118.6171-2-farosas@suse.de
> Signed-off-by: Peter Xu <peterx@redhat.com>

This change, which is suggested for -stable, while simple by its own, seems
to depend on the previous changes in this series, which are not for -stable.
In particular, whole "Finally recycle all the threads" loop in multifd_send_terminate_threads()
(to which the join is being added by this change) is moved from elsewhere by
12808db3b8 "migration/multifd: Cleanup multifd_save_cleanup()" (patch 24 in
this same series).

We can probably add the missing join right into the previous location of this
loop (before 12808db3b8).  I did this in the attached variant for 8.2, is
this correct?

Should we pick this up for 7.2 too?

Thanks,

/mjt

>   migration/multifd.h | 2 ++
>   migration/multifd.c | 8 +++++++-
>   2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 78a2317263..720c9d50db 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -73,6 +73,8 @@ typedef struct {
>       char *name;
>       /* channel thread id */
>       QemuThread thread;
> +    QemuThread tls_thread;
> +    bool tls_thread_created;
>       /* communication channel */
>       QIOChannel *c;
>       /* is the yank function registered */
> diff --git a/migration/multifd.c b/migration/multifd.c
> index fbdb129088..5551711a2a 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -630,6 +630,10 @@ static void multifd_send_terminate_threads(void)
>       for (i = 0; i < migrate_multifd_channels(); i++) {
>           MultiFDSendParams *p = &multifd_send_state->params[i];
>   
> +        if (p->tls_thread_created) {
> +            qemu_thread_join(&p->tls_thread);
> +        }
> +
>           if (p->running) {
>               qemu_thread_join(&p->thread);
>           }
> @@ -921,7 +925,9 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p,
>       trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname);
>       qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing");
>       p->c = QIO_CHANNEL(tioc);
> -    qemu_thread_create(&p->thread, "multifd-tls-handshake-worker",
> +
> +    p->tls_thread_created = true;
> +    qemu_thread_create(&p->tls_thread, "multifd-tls-handshake-worker",
>                          multifd_tls_handshake_thread, p,
>                          QEMU_THREAD_JOINABLE);
>       return true;
Re: [PULL 26/34] migration/multifd: Join the TLS thread
Posted by Michael Tokarev 9 months, 2 weeks ago
10.02.2024 12:18, Michael Tokarev:
> 08.02.2024 06:05, peterx@redhat.com :
>> From: Fabiano Rosas <farosas@suse.de>
>>
>> We're currently leaking the resources of the TLS thread by not joining
>> it and also overwriting the p->thread pointer altogether.
>>
>> Fixes: a1af605bd5 ("migration/multifd: fix hangup with TLS-Multifd due to blocking handshake")
>> Cc: qemu-stable <qemu-stable@nongnu.org>
>> Reviewed-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> Link: https://lore.kernel.org/r/20240206215118.6171-2-farosas@suse.de
>> Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> This change, which is suggested for -stable, while simple by its own, seems
> to depend on the previous changes in this series, which are not for -stable.
> In particular, whole "Finally recycle all the threads" loop in multifd_send_terminate_threads()
> (to which the join is being added by this change) is moved from elsewhere by
> 12808db3b8 "migration/multifd: Cleanup multifd_save_cleanup()" (patch 24 in
> this same series).
> 
> We can probably add the missing join right into the previous location of this
> loop (before 12808db3b8).  I did this in the attached variant for 8.2, is
> this correct?

And this does not pass even the basic tests, so it's not that simple :)

The following patch (27/34) is more questionable than this one.

Thanks!

/mjt

Re: [PULL 26/34] migration/multifd: Join the TLS thread
Posted by Fabiano Rosas 9 months, 2 weeks ago
Michael Tokarev <mjt@tls.msk.ru> writes:

> 10.02.2024 12:18, Michael Tokarev:
>> 08.02.2024 06:05, peterx@redhat.com :
>>> From: Fabiano Rosas <farosas@suse.de>
>>>
>>> We're currently leaking the resources of the TLS thread by not joining
>>> it and also overwriting the p->thread pointer altogether.
>>>
>>> Fixes: a1af605bd5 ("migration/multifd: fix hangup with TLS-Multifd due to blocking handshake")
>>> Cc: qemu-stable <qemu-stable@nongnu.org>
>>> Reviewed-by: Peter Xu <peterx@redhat.com>
>>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>>> Link: https://lore.kernel.org/r/20240206215118.6171-2-farosas@suse.de
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> 
>> This change, which is suggested for -stable, while simple by its own, seems
>> to depend on the previous changes in this series, which are not for -stable.
>> In particular, whole "Finally recycle all the threads" loop in multifd_send_terminate_threads()
>> (to which the join is being added by this change) is moved from elsewhere by
>> 12808db3b8 "migration/multifd: Cleanup multifd_save_cleanup()" (patch 24 in
>> this same series).
>> 
> We can probably add the missing join right into the previous location of this
> loop (before 12808db3b8).  I did this in the attached variant for 8.2, is
> this correct?

It should work. This was originally developed without the rest of the
changes on this PR.

>
> And this does not pass even the basic tests, so it's not that simple :)

Do you have a log of what failed?

Anyway, I could prepare a backport on top of 8.2 for you.

>
> The following patch (27/34) is more questionable than this one.
>
> Thanks!
>
> /mjt
Re: [PULL 26/34] migration/multifd: Join the TLS thread
Posted by Michael Tokarev 9 months, 2 weeks ago
14.02.2024 16:27, Fabiano Rosas :
> Michael Tokarev <mjt@tls.msk.ru> writes:
..>>> This change, which is suggested for -stable, while simple by its own, seems
>>> to depend on the previous changes in this series, which are not for -stable.
>>> In particular, whole "Finally recycle all the threads" loop in multifd_send_terminate_threads()
>>> (to which the join is being added by this change) is moved from elsewhere by
>>> 12808db3b8 "migration/multifd: Cleanup multifd_save_cleanup()" (patch 24 in
>>> this same series).
>>>
>> We can probably add the missing join right into the previous location of this
>> loop (before 12808db3b8).  I did this in the attached variant for 8.2, is
>> this correct?

I forgot to attach the patch.  It just moves the join from multifd_send_terminate_threads()
back to multifd_save_cleanup.  Attached now.

> It should work. This was originally developed without the rest of the
> changes on this PR.
> 
>> And this does not pass even the basic tests, so it's not that simple :)
> 
> Do you have a log of what failed?

Re-running it again...  I haven't even tried to push it somewhere for CI to run,
I run local `ninja test', which painted some migration tests in red.  Here:

202/844 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test   ERROR   70.26s   killed by signal 6 SIGABRT
330/844 qemu:qtest+qtest-i386 / qtest-i386/migration-test         ERROR   85.33s   killed by signal 6 SIGABRT
454/844 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test     ERROR  101.02s   killed by signal 6 SIGABRT

Unfortunately I don't see anything interesting in the log:

# starting QEMU: exec ./qemu-system-x86_64 -qtest unix:/tmp/qtest-463614.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-463614.qmp,id=char0 
-mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-q35-8.2, -name target,debug-threads=on -m 150M -serial 
file:/tmp/migration-test-SPJTI2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-SPJTI2/bootsect,format=raw -device 
ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
----------------------------------- stderr -----------------------------------
../../build/qemu/8.2/tests/qtest/libqtest.c:204: kill_qemu() detected QEMU death from signal 6 (Aborted)
(test program exited with status code -6)

Without the attached patch it works.

> Anyway, I could prepare a backport on top of 8.2 for you.

Well, that would definitely be helpful, if you think it's worth to
provide backports for 8.2 for these.   As my attempt apparently isn't
very successful :)

>> The following patch (27/34) is more questionable than this one.

Thank you!

/mjt
Re: [PULL 26/34] migration/multifd: Join the TLS thread
Posted by Fabiano Rosas 9 months, 2 weeks ago
Michael Tokarev <mjt@tls.msk.ru> writes:

> 14.02.2024 16:27, Fabiano Rosas :
>> Michael Tokarev <mjt@tls.msk.ru> writes:
> ..>>> This change, which is suggested for -stable, while simple by its own, seems
>>>> to depend on the previous changes in this series, which are not for -stable.
>>>> In particular, whole "Finally recycle all the threads" loop in multifd_send_terminate_threads()
>>>> (to which the join is being added by this change) is moved from elsewhere by
>>>> 12808db3b8 "migration/multifd: Cleanup multifd_save_cleanup()" (patch 24 in
>>>> this same series).
>>>>
>>> We can probably add the missing join right into the previous location of this
>>> loop (before 12808db3b8).  I did this in the attached variant for 8.2, is
>>> this correct?
>
> I forgot to attach the patch.  It just moves the join from multifd_send_terminate_threads()
> back to multifd_save_cleanup.  Attached now.
>
>> It should work. This was originally developed without the rest of the
>> changes on this PR.
>> 
>>> And this does not pass even the basic tests, so it's not that simple :)
>> 
>> Do you have a log of what failed?
>
> Re-running it again...  I haven't even tried to push it somewhere for CI to run,
> I run local `ninja test', which painted some migration tests in red.  Here:
>
> 202/844 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test   ERROR   70.26s   killed by signal 6 SIGABRT
> 330/844 qemu:qtest+qtest-i386 / qtest-i386/migration-test         ERROR   85.33s   killed by signal 6 SIGABRT
> 454/844 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test     ERROR  101.02s   killed by signal 6 SIGABRT
>
> Unfortunately I don't see anything interesting in the log:
>
> # starting QEMU: exec ./qemu-system-x86_64 -qtest unix:/tmp/qtest-463614.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-463614.qmp,id=char0 
> -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-q35-8.2, -name target,debug-threads=on -m 150M -serial 
> file:/tmp/migration-test-SPJTI2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-SPJTI2/bootsect,format=raw -device 
> ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
> ----------------------------------- stderr -----------------------------------
> ../../build/qemu/8.2/tests/qtest/libqtest.c:204: kill_qemu() detected QEMU death from signal 6 (Aborted)
> (test program exited with status code -6)
>
> Without the attached patch it works.

Ah ok, this is hitting the bug fixed by patch 31. Let's leave patches
26, 27 and 31 out of stable, it would be too risky to backport.