[Qemu-devel] [PATCH] char: kick main loop after adding a watch

Stefan Hajnoczi posted 1 patch 6 years, 12 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20170331164322.24020-1-stefanha@redhat.com
Test checkpatch passed
Test docker passed
Test s390x passed
chardev/char.c | 5 +++++
1 file changed, 5 insertions(+)
[Qemu-devel] [PATCH] char: kick main loop after adding a watch
Posted by Stefan Hajnoczi 6 years, 12 months ago
The ISA serial port device's output can hang when the pipe on stdout
becomes full.  This is a race condition where the vcpu thread executing
serial emulation code adds a watch on stdout while the main loop thread
is blocked in ppoll(2).  If no timer or other event wakes up the main
loop, there will be no further output from the serial device even when
the pipe becomes writable.

Richard W. M. Jones was able to reproduce the hang on recent versions of
guestfs-tools-c and libglib2 on Fedora 26 hosts.

This patch kicks the main loop so the next iteration invokes ppoll(2)
with the watch fd.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432
Reported-by: Richard W. M. Jones <rjones@redhat.com>
Tested-by: Richard W. M. Jones <rjones@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 chardev/char.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/chardev/char.c b/chardev/char.c
index 3df1163..6c99c34 100644
--- a/chardev/char.c
+++ b/chardev/char.c
@@ -1059,6 +1059,11 @@ guint qemu_chr_fe_add_watch(CharBackend *be, GIOCondition cond,
     tag = g_source_attach(src, NULL);
     g_source_unref(src);
 
+    /* The main loop may be in blocked waiting on events in another thread.
+     * Kick it so the new watch will be added.
+     */
+    qemu_notify_event();
+
     return tag;
 }
 
-- 
2.9.3


Re: [Qemu-devel] [PATCH] char: kick main loop after adding a watch
Posted by Paolo Bonzini 6 years, 12 months ago

On 31/03/2017 18:43, Stefan Hajnoczi wrote:
> The ISA serial port device's output can hang when the pipe on stdout
> becomes full.  This is a race condition where the vcpu thread executing
> serial emulation code adds a watch on stdout while the main loop thread
> is blocked in ppoll(2).  If no timer or other event wakes up the main
> loop, there will be no further output from the serial device even when
> the pipe becomes writable.
> 
> Richard W. M. Jones was able to reproduce the hang on recent versions of
> guestfs-tools-c and libglib2 on Fedora 26 hosts.
> 
> This patch kicks the main loop so the next iteration invokes ppoll(2)
> with the watch fd.
> 
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432
> Reported-by: Richard W. M. Jones <rjones@redhat.com>
> Tested-by: Richard W. M. Jones <rjones@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  chardev/char.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/chardev/char.c b/chardev/char.c
> index 3df1163..6c99c34 100644
> --- a/chardev/char.c
> +++ b/chardev/char.c
> @@ -1059,6 +1059,11 @@ guint qemu_chr_fe_add_watch(CharBackend *be, GIOCondition cond,
>      tag = g_source_attach(src, NULL);
>      g_source_unref(src);
>  
> +    /* The main loop may be in blocked waiting on events in another thread.
> +     * Kick it so the new watch will be added.
> +     */
> +    qemu_notify_event();
> +
>      return tag;
>  }
>  
> 

Thanks for looking at this, I was quite stuck and now I understand 
what's going on.  However, I don't believe your patch is the right 
solution.

According to Richard's bisection, the bug was introduced by the patch 
at https://bug761102.bugzilla-attachments.gnome.org/attachment.cgi?id=319699.

The g_wakeup_signal that is removed (actually made conditional) in that 
patch is doing exactly the same thing as qemu_notify_event, which is
fishy...  It would still be a QEMU bug according to the theory below but,
depending on how they handle backwards-compatibility, they might 
consider undoing this change.

glib is expecting QEMU to use g_main_context_acquire around accesses to 
GMainContext.  However QEMU is not doing that, instead it is taking its 
own mutex.  So we should add g_main_context_acquire and
g_main_context_release in the two implementations of 
os_host_main_loop_wait; these should undo the effect of Frediano's 
glib patch.

In all fairness, the docs do say "You must be the owner of a context 
before you can call g_main_context_prepare(), g_main_context_query(), 
g_main_context_check(), g_main_context_dispatch()".  However, it has
worked until now and the documentation does not say exactly why that
is necessary.

Paolo

Re: [Qemu-devel] [PATCH] char: kick main loop after adding a watch
Posted by Richard W.M. Jones 6 years, 12 months ago
On Fri, Mar 31, 2017 at 06:53:56PM +0200, Paolo Bonzini wrote:
> glib is expecting QEMU to use g_main_context_acquire around accesses to 
> GMainContext.  However QEMU is not doing that, instead it is taking its 
> own mutex.  So we should add g_main_context_acquire and
> g_main_context_release in the two implementations of 
> os_host_main_loop_wait; these should undo the effect of Frediano's 
> glib patch.

Based on this paragraph, I'm testing the attached patch, and it does
also appear to solve the hanging serial port problem.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
From d46afa25dae0f1d057cfac3a0b6ae94771be51f2 Mon Sep 17 00:00:00 2001
From: "Richard W.M. Jones" <rjones@redhat.com>
Date: Fri, 31 Mar 2017 18:29:01 +0100
Subject: [PATCH] main-loop: Acquire main_context lock around
 os_host_main_loop_wait.

---
 util/main-loop.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/util/main-loop.c b/util/main-loop.c
index 4534c89..19cad6b 100644
--- a/util/main-loop.c
+++ b/util/main-loop.c
@@ -218,9 +218,12 @@ static void glib_pollfds_poll(void)
 
 static int os_host_main_loop_wait(int64_t timeout)
 {
+    GMainContext *context = g_main_context_default();
     int ret;
     static int spin_counter;
 
+    g_main_context_acquire(context);
+
     glib_pollfds_fill(&timeout);
 
     /* If the I/O thread is very busy or we are incorrectly busy waiting in
@@ -256,6 +259,9 @@ static int os_host_main_loop_wait(int64_t timeout)
     }
 
     glib_pollfds_poll();
+
+    g_main_context_release(context);
+
     return ret;
 }
 #else
@@ -412,12 +418,15 @@ static int os_host_main_loop_wait(int64_t timeout)
     fd_set rfds, wfds, xfds;
     int nfds;
 
+    g_main_context_acquire(context);
+
     /* XXX: need to suppress polling by better using win32 events */
     ret = 0;
     for (pe = first_polling_entry; pe != NULL; pe = pe->next) {
         ret |= pe->func(pe->opaque);
     }
     if (ret != 0) {
+        g_main_context_release(context);
         return ret;
     }
 
@@ -472,6 +481,8 @@ static int os_host_main_loop_wait(int64_t timeout)
         g_main_context_dispatch(context);
     }
 
+    g_main_context_release(context);
+
     return select_ret || g_poll_ret;
 }
 #endif
-- 
2.9.3

Re: [Qemu-devel] [PATCH] char: kick main loop after adding a watch
Posted by Paolo Bonzini 6 years, 12 months ago

On 31/03/2017 19:31, Richard W.M. Jones wrote:
> On Fri, Mar 31, 2017 at 06:53:56PM +0200, Paolo Bonzini wrote:
>> glib is expecting QEMU to use g_main_context_acquire around accesses to 
>> GMainContext.  However QEMU is not doing that, instead it is taking its 
>> own mutex.  So we should add g_main_context_acquire and
>> g_main_context_release in the two implementations of 
>> os_host_main_loop_wait; these should undo the effect of Frediano's 
>> glib patch.
> 
> Based on this paragraph, I'm testing the attached patch, and it does
> also appear to solve the hanging serial port problem.

Great, can you do more testing and/or send the patch formally as a
toplevel message?

Thanks,

Paolo

Re: [Qemu-devel] [PATCH] char: kick main loop after adding a watch
Posted by Richard W.M. Jones 6 years, 12 months ago
On Fri, Mar 31, 2017 at 07:31:57PM +0200, Paolo Bonzini wrote:
> 
> 
> On 31/03/2017 19:31, Richard W.M. Jones wrote:
> > On Fri, Mar 31, 2017 at 06:53:56PM +0200, Paolo Bonzini wrote:
> >> glib is expecting QEMU to use g_main_context_acquire around accesses to 
> >> GMainContext.  However QEMU is not doing that, instead it is taking its 
> >> own mutex.  So we should add g_main_context_acquire and
> >> g_main_context_release in the two implementations of 
> >> os_host_main_loop_wait; these should undo the effect of Frediano's 
> >> glib patch.
> > 
> > Based on this paragraph, I'm testing the attached patch, and it does
> > also appear to solve the hanging serial port problem.
> 
> Great, can you do more testing and/or send the patch formally as a
> toplevel message?

Sure, I'm going to test it a lot more overnight, and then
if it survives all that I'll post it properly.

Thanks,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top

Re: [Qemu-devel] [PATCH] char: kick main loop after adding a watch
Posted by Paolo Bonzini 6 years, 12 months ago

On 31/03/2017 19:31, Richard W.M. Jones wrote:
> On Fri, Mar 31, 2017 at 06:53:56PM +0200, Paolo Bonzini wrote:
>> glib is expecting QEMU to use g_main_context_acquire around accesses to 
>> GMainContext.  However QEMU is not doing that, instead it is taking its 
>> own mutex.  So we should add g_main_context_acquire and
>> g_main_context_release in the two implementations of 
>> os_host_main_loop_wait; these should undo the effect of Frediano's 
>> glib patch.
> 
> Based on this paragraph, I'm testing the attached patch, and it does
> also appear to solve the hanging serial port problem.

Turn out that it may actually be a glib bug.  Your patch would enable
Frediano's optimizations so it's worth applying anyway.

https://bugzilla.gnome.org/show_bug.cgi?id=761102

Paolo

Re: [Qemu-devel] [PATCH] char: kick main loop after adding a watch
Posted by Stefan Hajnoczi 6 years, 11 months ago
On Fri, Mar 31, 2017 at 06:53:56PM +0200, Paolo Bonzini wrote:
> 
> 
> On 31/03/2017 18:43, Stefan Hajnoczi wrote:
> > The ISA serial port device's output can hang when the pipe on stdout
> > becomes full.  This is a race condition where the vcpu thread executing
> > serial emulation code adds a watch on stdout while the main loop thread
> > is blocked in ppoll(2).  If no timer or other event wakes up the main
> > loop, there will be no further output from the serial device even when
> > the pipe becomes writable.
> > 
> > Richard W. M. Jones was able to reproduce the hang on recent versions of
> > guestfs-tools-c and libglib2 on Fedora 26 hosts.
> > 
> > This patch kicks the main loop so the next iteration invokes ppoll(2)
> > with the watch fd.
> > 
> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432
> > Reported-by: Richard W. M. Jones <rjones@redhat.com>
> > Tested-by: Richard W. M. Jones <rjones@redhat.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  chardev/char.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/chardev/char.c b/chardev/char.c
> > index 3df1163..6c99c34 100644
> > --- a/chardev/char.c
> > +++ b/chardev/char.c
> > @@ -1059,6 +1059,11 @@ guint qemu_chr_fe_add_watch(CharBackend *be, GIOCondition cond,
> >      tag = g_source_attach(src, NULL);
> >      g_source_unref(src);
> >  
> > +    /* The main loop may be in blocked waiting on events in another thread.
> > +     * Kick it so the new watch will be added.
> > +     */
> > +    qemu_notify_event();
> > +
> >      return tag;
> >  }
> >  
> > 
> 
> Thanks for looking at this, I was quite stuck and now I understand 
> what's going on.  However, I don't believe your patch is the right 
> solution.
> 
> According to Richard's bisection, the bug was introduced by the patch 
> at https://bug761102.bugzilla-attachments.gnome.org/attachment.cgi?id=319699.
> 
> The g_wakeup_signal that is removed (actually made conditional) in that 
> patch is doing exactly the same thing as qemu_notify_event, which is
> fishy...  It would still be a QEMU bug according to the theory below but,
> depending on how they handle backwards-compatibility, they might 
> consider undoing this change.
> 
> glib is expecting QEMU to use g_main_context_acquire around accesses to 
> GMainContext.  However QEMU is not doing that, instead it is taking its 
> own mutex.  So we should add g_main_context_acquire and
> g_main_context_release in the two implementations of 
> os_host_main_loop_wait; these should undo the effect of Frediano's 
> glib patch.
> 
> In all fairness, the docs do say "You must be the owner of a context 
> before you can call g_main_context_prepare(), g_main_context_query(), 
> g_main_context_check(), g_main_context_dispatch()".  However, it has
> worked until now and the documentation does not say exactly why that
> is necessary.

Thanks Paolo, very interesting.  I never realized that glib has its own
inter-thread signal.

NACK

Stefan
Re: [Qemu-devel] [PATCH] char: kick main loop after adding a watch
Posted by Richard W.M. Jones 6 years, 12 months ago
On Fri, Mar 31, 2017 at 05:43:22PM +0100, Stefan Hajnoczi wrote:
> The ISA serial port device's output can hang when the pipe on stdout
> becomes full.  This is a race condition where the vcpu thread executing
> serial emulation code adds a watch on stdout while the main loop thread
> is blocked in ppoll(2).  If no timer or other event wakes up the main
> loop, there will be no further output from the serial device even when
> the pipe becomes writable.
> 
> Richard W. M. Jones was able to reproduce the hang on recent versions of
> guestfs-tools-c and libglib2 on Fedora 26 hosts.
> 
> This patch kicks the main loop so the next iteration invokes ppoll(2)
> with the watch fd.
> 
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432
> Reported-by: Richard W. M. Jones <rjones@redhat.com>
> Tested-by: Richard W. M. Jones <rjones@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  chardev/char.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/chardev/char.c b/chardev/char.c
> index 3df1163..6c99c34 100644
> --- a/chardev/char.c
> +++ b/chardev/char.c
> @@ -1059,6 +1059,11 @@ guint qemu_chr_fe_add_watch(CharBackend *be, GIOCondition cond,
>      tag = g_source_attach(src, NULL);
>      g_source_unref(src);
>  
> +    /* The main loop may be in blocked waiting on events in another thread.
> +     * Kick it so the new watch will be added.
> +     */
> +    qemu_notify_event();
> +
>      return tag;
>  }
>  

Obviously Paolo has provided further insights in the other part of
this thread, but from my point of view this does appear to fix the
problem.  I tested it on 3 different physical machines which
previously showed hangs, and on all 3 the test ran for > 30 mins
[before I killed the test] which is far longer than seen before.

Therefore: ACK.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org