The ISA serial port device's output can hang when the pipe on stdout
becomes full. This is a race condition where the vcpu thread executing
serial emulation code adds a watch on stdout while the main loop thread
is blocked in ppoll(2). If no timer or other event wakes up the main
loop, there will be no further output from the serial device even when
the pipe becomes writable.
Richard W. M. Jones was able to reproduce the hang on recent versions of
guestfs-tools-c and libglib2 on Fedora 26 hosts.
This patch kicks the main loop so the next iteration invokes ppoll(2)
with the watch fd.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432
Reported-by: Richard W. M. Jones <rjones@redhat.com>
Tested-by: Richard W. M. Jones <rjones@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
chardev/char.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/chardev/char.c b/chardev/char.c
index 3df1163..6c99c34 100644
--- a/chardev/char.c
+++ b/chardev/char.c
@@ -1059,6 +1059,11 @@ guint qemu_chr_fe_add_watch(CharBackend *be, GIOCondition cond,
tag = g_source_attach(src, NULL);
g_source_unref(src);
+ /* The main loop may be in blocked waiting on events in another thread.
+ * Kick it so the new watch will be added.
+ */
+ qemu_notify_event();
+
return tag;
}
--
2.9.3
On 31/03/2017 18:43, Stefan Hajnoczi wrote: > The ISA serial port device's output can hang when the pipe on stdout > becomes full. This is a race condition where the vcpu thread executing > serial emulation code adds a watch on stdout while the main loop thread > is blocked in ppoll(2). If no timer or other event wakes up the main > loop, there will be no further output from the serial device even when > the pipe becomes writable. > > Richard W. M. Jones was able to reproduce the hang on recent versions of > guestfs-tools-c and libglib2 on Fedora 26 hosts. > > This patch kicks the main loop so the next iteration invokes ppoll(2) > with the watch fd. > > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432 > Reported-by: Richard W. M. Jones <rjones@redhat.com> > Tested-by: Richard W. M. Jones <rjones@redhat.com> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > --- > chardev/char.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/chardev/char.c b/chardev/char.c > index 3df1163..6c99c34 100644 > --- a/chardev/char.c > +++ b/chardev/char.c > @@ -1059,6 +1059,11 @@ guint qemu_chr_fe_add_watch(CharBackend *be, GIOCondition cond, > tag = g_source_attach(src, NULL); > g_source_unref(src); > > + /* The main loop may be in blocked waiting on events in another thread. > + * Kick it so the new watch will be added. > + */ > + qemu_notify_event(); > + > return tag; > } > > Thanks for looking at this, I was quite stuck and now I understand what's going on. However, I don't believe your patch is the right solution. According to Richard's bisection, the bug was introduced by the patch at https://bug761102.bugzilla-attachments.gnome.org/attachment.cgi?id=319699. The g_wakeup_signal that is removed (actually made conditional) in that patch is doing exactly the same thing as qemu_notify_event, which is fishy... It would still be a QEMU bug according to the theory below but, depending on how they handle backwards-compatibility, they might consider undoing this change. glib is expecting QEMU to use g_main_context_acquire around accesses to GMainContext. However QEMU is not doing that, instead it is taking its own mutex. So we should add g_main_context_acquire and g_main_context_release in the two implementations of os_host_main_loop_wait; these should undo the effect of Frediano's glib patch. In all fairness, the docs do say "You must be the owner of a context before you can call g_main_context_prepare(), g_main_context_query(), g_main_context_check(), g_main_context_dispatch()". However, it has worked until now and the documentation does not say exactly why that is necessary. Paolo
On Fri, Mar 31, 2017 at 06:53:56PM +0200, Paolo Bonzini wrote: > glib is expecting QEMU to use g_main_context_acquire around accesses to > GMainContext. However QEMU is not doing that, instead it is taking its > own mutex. So we should add g_main_context_acquire and > g_main_context_release in the two implementations of > os_host_main_loop_wait; these should undo the effect of Frediano's > glib patch. Based on this paragraph, I'm testing the attached patch, and it does also appear to solve the hanging serial port problem. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org From d46afa25dae0f1d057cfac3a0b6ae94771be51f2 Mon Sep 17 00:00:00 2001 From: "Richard W.M. Jones" <rjones@redhat.com> Date: Fri, 31 Mar 2017 18:29:01 +0100 Subject: [PATCH] main-loop: Acquire main_context lock around os_host_main_loop_wait. --- util/main-loop.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/util/main-loop.c b/util/main-loop.c index 4534c89..19cad6b 100644 --- a/util/main-loop.c +++ b/util/main-loop.c @@ -218,9 +218,12 @@ static void glib_pollfds_poll(void) static int os_host_main_loop_wait(int64_t timeout) { + GMainContext *context = g_main_context_default(); int ret; static int spin_counter; + g_main_context_acquire(context); + glib_pollfds_fill(&timeout); /* If the I/O thread is very busy or we are incorrectly busy waiting in @@ -256,6 +259,9 @@ static int os_host_main_loop_wait(int64_t timeout) } glib_pollfds_poll(); + + g_main_context_release(context); + return ret; } #else @@ -412,12 +418,15 @@ static int os_host_main_loop_wait(int64_t timeout) fd_set rfds, wfds, xfds; int nfds; + g_main_context_acquire(context); + /* XXX: need to suppress polling by better using win32 events */ ret = 0; for (pe = first_polling_entry; pe != NULL; pe = pe->next) { ret |= pe->func(pe->opaque); } if (ret != 0) { + g_main_context_release(context); return ret; } @@ -472,6 +481,8 @@ static int os_host_main_loop_wait(int64_t timeout) g_main_context_dispatch(context); } + g_main_context_release(context); + return select_ret || g_poll_ret; } #endif -- 2.9.3
On 31/03/2017 19:31, Richard W.M. Jones wrote: > On Fri, Mar 31, 2017 at 06:53:56PM +0200, Paolo Bonzini wrote: >> glib is expecting QEMU to use g_main_context_acquire around accesses to >> GMainContext. However QEMU is not doing that, instead it is taking its >> own mutex. So we should add g_main_context_acquire and >> g_main_context_release in the two implementations of >> os_host_main_loop_wait; these should undo the effect of Frediano's >> glib patch. > > Based on this paragraph, I'm testing the attached patch, and it does > also appear to solve the hanging serial port problem. Great, can you do more testing and/or send the patch formally as a toplevel message? Thanks, Paolo
On Fri, Mar 31, 2017 at 07:31:57PM +0200, Paolo Bonzini wrote: > > > On 31/03/2017 19:31, Richard W.M. Jones wrote: > > On Fri, Mar 31, 2017 at 06:53:56PM +0200, Paolo Bonzini wrote: > >> glib is expecting QEMU to use g_main_context_acquire around accesses to > >> GMainContext. However QEMU is not doing that, instead it is taking its > >> own mutex. So we should add g_main_context_acquire and > >> g_main_context_release in the two implementations of > >> os_host_main_loop_wait; these should undo the effect of Frediano's > >> glib patch. > > > > Based on this paragraph, I'm testing the attached patch, and it does > > also appear to solve the hanging serial port problem. > > Great, can you do more testing and/or send the patch formally as a > toplevel message? Sure, I'm going to test it a lot more overnight, and then if it survives all that I'll post it properly. Thanks, Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
On 31/03/2017 19:31, Richard W.M. Jones wrote: > On Fri, Mar 31, 2017 at 06:53:56PM +0200, Paolo Bonzini wrote: >> glib is expecting QEMU to use g_main_context_acquire around accesses to >> GMainContext. However QEMU is not doing that, instead it is taking its >> own mutex. So we should add g_main_context_acquire and >> g_main_context_release in the two implementations of >> os_host_main_loop_wait; these should undo the effect of Frediano's >> glib patch. > > Based on this paragraph, I'm testing the attached patch, and it does > also appear to solve the hanging serial port problem. Turn out that it may actually be a glib bug. Your patch would enable Frediano's optimizations so it's worth applying anyway. https://bugzilla.gnome.org/show_bug.cgi?id=761102 Paolo
On Fri, Mar 31, 2017 at 06:53:56PM +0200, Paolo Bonzini wrote: > > > On 31/03/2017 18:43, Stefan Hajnoczi wrote: > > The ISA serial port device's output can hang when the pipe on stdout > > becomes full. This is a race condition where the vcpu thread executing > > serial emulation code adds a watch on stdout while the main loop thread > > is blocked in ppoll(2). If no timer or other event wakes up the main > > loop, there will be no further output from the serial device even when > > the pipe becomes writable. > > > > Richard W. M. Jones was able to reproduce the hang on recent versions of > > guestfs-tools-c and libglib2 on Fedora 26 hosts. > > > > This patch kicks the main loop so the next iteration invokes ppoll(2) > > with the watch fd. > > > > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432 > > Reported-by: Richard W. M. Jones <rjones@redhat.com> > > Tested-by: Richard W. M. Jones <rjones@redhat.com> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > > --- > > chardev/char.c | 5 +++++ > > 1 file changed, 5 insertions(+) > > > > diff --git a/chardev/char.c b/chardev/char.c > > index 3df1163..6c99c34 100644 > > --- a/chardev/char.c > > +++ b/chardev/char.c > > @@ -1059,6 +1059,11 @@ guint qemu_chr_fe_add_watch(CharBackend *be, GIOCondition cond, > > tag = g_source_attach(src, NULL); > > g_source_unref(src); > > > > + /* The main loop may be in blocked waiting on events in another thread. > > + * Kick it so the new watch will be added. > > + */ > > + qemu_notify_event(); > > + > > return tag; > > } > > > > > > Thanks for looking at this, I was quite stuck and now I understand > what's going on. However, I don't believe your patch is the right > solution. > > According to Richard's bisection, the bug was introduced by the patch > at https://bug761102.bugzilla-attachments.gnome.org/attachment.cgi?id=319699. > > The g_wakeup_signal that is removed (actually made conditional) in that > patch is doing exactly the same thing as qemu_notify_event, which is > fishy... It would still be a QEMU bug according to the theory below but, > depending on how they handle backwards-compatibility, they might > consider undoing this change. > > glib is expecting QEMU to use g_main_context_acquire around accesses to > GMainContext. However QEMU is not doing that, instead it is taking its > own mutex. So we should add g_main_context_acquire and > g_main_context_release in the two implementations of > os_host_main_loop_wait; these should undo the effect of Frediano's > glib patch. > > In all fairness, the docs do say "You must be the owner of a context > before you can call g_main_context_prepare(), g_main_context_query(), > g_main_context_check(), g_main_context_dispatch()". However, it has > worked until now and the documentation does not say exactly why that > is necessary. Thanks Paolo, very interesting. I never realized that glib has its own inter-thread signal. NACK Stefan
On Fri, Mar 31, 2017 at 05:43:22PM +0100, Stefan Hajnoczi wrote: > The ISA serial port device's output can hang when the pipe on stdout > becomes full. This is a race condition where the vcpu thread executing > serial emulation code adds a watch on stdout while the main loop thread > is blocked in ppoll(2). If no timer or other event wakes up the main > loop, there will be no further output from the serial device even when > the pipe becomes writable. > > Richard W. M. Jones was able to reproduce the hang on recent versions of > guestfs-tools-c and libglib2 on Fedora 26 hosts. > > This patch kicks the main loop so the next iteration invokes ppoll(2) > with the watch fd. > > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432 > Reported-by: Richard W. M. Jones <rjones@redhat.com> > Tested-by: Richard W. M. Jones <rjones@redhat.com> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> > --- > chardev/char.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/chardev/char.c b/chardev/char.c > index 3df1163..6c99c34 100644 > --- a/chardev/char.c > +++ b/chardev/char.c > @@ -1059,6 +1059,11 @@ guint qemu_chr_fe_add_watch(CharBackend *be, GIOCondition cond, > tag = g_source_attach(src, NULL); > g_source_unref(src); > > + /* The main loop may be in blocked waiting on events in another thread. > + * Kick it so the new watch will be added. > + */ > + qemu_notify_event(); > + > return tag; > } > Obviously Paolo has provided further insights in the other part of this thread, but from my point of view this does appear to fix the problem. I tested it on 3 different physical machines which previously showed hangs, and on all 3 the test ran for > 30 mins [before I killed the test] which is far longer than seen before. Therefore: ACK. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org
© 2016 - 2024 Red Hat, Inc.