When a multifd migration is cancelled with migrate_cancel, the following
sequence can occur:

[source]                          [destination]

multifd_send_sync_main [finish]
                                  multifd_recv_thread waits on &p->sem_sync
shutdown to_dst_file
                                  detect error on from_src_file
send RAM_SAVE_FLAG_EOS [fail]     [no chance to run multifd_recv_sync_main]
                                  multifd_load_cleanup
                                  join multifd receive thread forever

This leaves the destination qemu hung with the following stack:

pthread_join
qemu_thread_join
multifd_load_cleanup
process_incoming_migration_co
coroutine_trampoline
Signed-off-by: Ivan Ren <ivanren@tencent.com>
---
migration/ram.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/migration/ram.c b/migration/ram.c
index e4eb9c441f..504c8ccb03 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1291,6 +1291,11 @@ int multifd_load_cleanup(Error **errp)
         MultiFDRecvParams *p = &multifd_recv_state->params[i];

         if (p->running) {
+            /*
+             * multifd_recv_thread may be blocked in the MULTIFD_FLAG_SYNC
+             * handling code, waiting on p->sem_sync.  Wake it up here so
+             * it can exit; the extra post is harmless during cleanup.
+             */
+            qemu_sem_post(&p->sem_sync);
             qemu_thread_join(&p->thread);
         }
         object_unref(OBJECT(p->c));
--
2.17.2 (Apple Git-113)
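To make the failure mode concrete outside of QEMU, here is a minimal
standalone sketch of the same pattern (plain POSIX threads and semaphores,
not QEMU code): a worker parks on a "sync" semaphore that only the normal
data path would post, so when the stream dies first, the cleanup path has
to post the semaphore itself or the join never returns.

#include <pthread.h>
#include <semaphore.h>
#include <stdbool.h>
#include <stdio.h>

static sem_t sem_sync;            /* plays the role of p->sem_sync */
static bool stream_failed = true; /* pretend the source shut the channel down */

static void *recv_thread(void *opaque)
{
    (void)opaque;
    /* Reached the MULTIFD_FLAG_SYNC-like point: wait to be released. */
    sem_wait(&sem_sync);
    printf("recv thread released, exiting\n");
    return NULL;
}

int main(void)
{
    pthread_t th;

    sem_init(&sem_sync, 0, 0);
    pthread_create(&th, NULL, recv_thread, NULL);

    if (stream_failed) {
        /*
         * Cleanup path.  Without this post, the join below blocks forever,
         * which is exactly the reported hang; with it, the worker wakes up
         * and can be joined.
         */
        sem_post(&sem_sync);
    }
    pthread_join(&th, NULL);
    printf("cleanup finished\n");
    return 0;
}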
Ping for review. The problem is still present in qemu-4.1.0-rc2; the
destination's top output and backtraces are below.
Threads: 24 total, 0 running, 24 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 39434172+total, 36798950+free, 2948836 used, 23403388 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 38926476+avail Mem

   PID USER  PR  NI   VIRT    RES   SHR S %CPU %MEM   TIME+ COMMAND
286108 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:01.19 qemu-system-x86
286109 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 qemu-system-x86
286113 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 IO mon_iothread
286114 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 0/KVM
286115 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 1/KVM
286116 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 2/KVM
286117 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 3/KVM
286118 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 4/KVM
286119 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 5/KVM
286120 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 6/KVM
286121 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 7/KVM
286122 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 8/KVM
286123 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 9/KVM
286124 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 10/KVM
286125 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 11/KVM
286126 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 12/KVM
286127 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 13/KVM
286128 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 14/KVM
286129 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 CPU 15/KVM
286131 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.00 vnc_worker
286132 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.01 multifdrecv_0
286133 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.01 multifdrecv_1
286134 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.01 multifdrecv_2
286136 root  20   0  0.127t 110596 12684 S  0.0  0.0 0:00.01 multifdrecv_3
Thread 2 (Thread 0x7f9d075fe700 (LWP 286136)):
#0  0x00007fbd67123a0b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007fbd67123a9f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fbd67123b3b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00005582236e2514 in qemu_sem_wait (sem=sem@entry=0x558226364dc8) at util/qemu-thread-posix.c:319
#4  0x00005582232efb67 in multifd_recv_thread (opaque=opaque@entry=0x558226364d30) at /qemu-4.1.0-rc2/migration/ram.c:1356
#5  0x00005582236e1b26 in qemu_thread_start (args=<optimized out>) at util/qemu-thread-posix.c:502
#6  0x00007fbd6711de25 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fbd66e4a35d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fbd6c7a4cc0 (LWP 286108)):
#0  0x00007fbd6711ef57 in pthread_join () from /lib64/libpthread.so.0
#1  0x00005582236e28ff in qemu_thread_join (thread=thread@entry=0x558226364b00) at util/qemu-thread-posix.c:570
#2  0x00005582232f36b4 in multifd_load_cleanup (errp=errp@entry=0x7fbd341fff58) at /qemu-4.1.0-rc2/migration/ram.c:1259
#3  0x00005582235765a9 in process_incoming_migration_co (opaque=<optimized out>) at migration/migration.c:510
#4  0x00005582236f4c0a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:115
#5  0x00007fbd66d98d40 in ?? () from /lib64/libc.so.6
#6  0x00007ffec0bf24d0 in ?? ()
#7  0x0000000000000000 in ?? ()
Ivan Ren <renyime@gmail.com> wrote:
> When a multifd migration is cancelled with migrate_cancel, the following
> sequence can occur:
>
> [source]                          [destination]
>
> multifd_send_sync_main [finish]
>                                   multifd_recv_thread waits on &p->sem_sync
> shutdown to_dst_file
>                                   detect error on from_src_file
> send RAM_SAVE_FLAG_EOS [fail]     [no chance to run multifd_recv_sync_main]
>                                   multifd_load_cleanup
>                                   join multifd receive thread forever
>
> This leaves the destination qemu hung with the following stack:
>
> pthread_join
> qemu_thread_join
> multifd_load_cleanup
> process_incoming_migration_co
> coroutine_trampoline
>
> Signed-off-by: Ivan Ren <ivanren@tencent.com>
I think this one is not enough. We need to set some error code, or
disable the running bit at that point.
> ---
> migration/ram.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index e4eb9c441f..504c8ccb03 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1291,6 +1291,11 @@ int multifd_load_cleanup(Error **errp)
>          MultiFDRecvParams *p = &multifd_recv_state->params[i];
>
>          if (p->running) {
> +            /*
> +             * multifd_recv_thread may be blocked in the MULTIFD_FLAG_SYNC
> +             * handling code, waiting on p->sem_sync.  Wake it up here so
> +             * it can exit; the extra post is harmless during cleanup.
> +             */
> +            qemu_sem_post(&p->sem_sync);
>              qemu_thread_join(&p->thread);
>          }
>          object_unref(OBJECT(p->c));
Let's see where we wait for p->sem_sync:
static void *multifd_recv_thread(void *opaque)
{
    ....
    while (true) {
        uint32_t used;
        uint32_t flags;

        ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
                                       p->packet_len, &local_err);
        .....

        if (flags & MULTIFD_FLAG_SYNC) {
            qemu_sem_post(&multifd_recv_state->sem_sync);
            qemu_sem_wait(&p->sem_sync);
        }
    }
    if (local_err) {
        multifd_recv_terminate_threads(local_err);
    }
    qemu_mutex_lock(&p->mutex);
    p->running = false;
    qemu_mutex_unlock(&p->mutex);

    rcu_unregister_thread();
    trace_multifd_recv_thread_end(p->id, p->num_packets, p->num_pages);

    return NULL;
}
If we just post it there, we get out of the wait (that bit is ok), but
then we go back to the beginning of the loop, (probably) get an error
from qio_channel_read_all_eof(), and end up in
multifd_recv_terminate_threads(), or wait there again.
I think it is better to *also* set a p->quit variable there, and not
even try to receive anything on that channel.
I will send a patch later.
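For illustration only (this is not the actual follow-up patch; the quit
field and the exact places it would be set are assumptions here), that
direction might look roughly like this:

/* in the terminate/cleanup path, for each channel */
qemu_mutex_lock(&p->mutex);
p->quit = true;                 /* assumed per-channel flag */
qemu_mutex_unlock(&p->mutex);
qemu_sem_post(&p->sem_sync);    /* release a thread parked in MULTIFD_FLAG_SYNC */

/* in multifd_recv_thread(), at the top of the loop */
while (true) {
    qemu_mutex_lock(&p->mutex);
    if (p->quit) {
        qemu_mutex_unlock(&p->mutex);
        break;                  /* don't touch a channel that is being torn down */
    }
    qemu_mutex_unlock(&p->mutex);
    /* ... receive and handle the packet as before ... */
}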
Good catch.
Later, Juan.
> If we just post it there, we get out of the wait (that bit is ok), but
> then we go back to the beginning of the loop, (probably) get an error
> from qio_channel_read_all_eof(), and end up in
> multifd_recv_terminate_threads(), or wait there again.
>
> I think it is better to *also* set a p->quit variable there, and not
> even try to receive anything on that channel.
>
> I will send a patch later.
Yes, agreed.
Thanks for the review.