[Qemu-devel] [PATCH for-2.12] iothread: workaround glib bug which hangs qmp-test

Peter Xu posted 1 patch 5 years, 11 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20180404065346.3252-1-peterx@redhat.com
Test checkpatch passed
Test docker-build@min-glib passed
Test docker-mingw@fedora passed
Test s390x passed
There is a newer version of this series
iothread.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
[Qemu-devel] [PATCH for-2.12] iothread: workaround glib bug which hangs qmp-test
Posted by Peter Xu 5 years, 11 months ago
Free the AIO context earlier than the GMainContext (if we have) to
workaround a possible Glib bug.  No functional change at all.

We encountered a qmp-test hang with oob:

  #0  0x00007f35ffe45334 in __lll_lock_wait () from /lib64/libpthread.so.0
  #1  0x00007f35ffe405d8 in _L_lock_854 () from /lib64/libpthread.so.0
  #2  0x00007f35ffe404a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
  #3  0x00007f35fc5b9c9d in g_source_unref_internal (source=0x24f0600, context=0x7f35f0000960, have_lock=0) at gmain.c:1685
  #4  0x0000000000aa6672 in aio_context_unref (ctx=0x24f0600) at /root/qemu/util/async.c:497
  #5  0x000000000065851c in iothread_instance_finalize (obj=0x24f0380) at /root/qemu/iothread.c:129
  #6  0x0000000000962d79 in object_deinit (obj=0x24f0380, type=0x242e960) at /root/qemu/qom/object.c:462
  #7  0x0000000000962e0d in object_finalize (data=0x24f0380) at /root/qemu/qom/object.c:476
  #8  0x0000000000964146 in object_unref (obj=0x24f0380) at /root/qemu/qom/object.c:924
  #9  0x0000000000965880 in object_finalize_child_property (obj=0x24ec640, name=0x24efca0 "mon_iothread", opaque=0x24f0380) at /root/qemu/qom/object.c:1436
  #10 0x0000000000962c33 in object_property_del_child (obj=0x24ec640, child=0x24f0380, errp=0x0) at /root/qemu/qom/object.c:436
  #11 0x0000000000962d26 in object_unparent (obj=0x24f0380) at /root/qemu/qom/object.c:455
  #12 0x0000000000658f00 in iothread_destroy (iothread=0x24f0380) at /root/qemu/iothread.c:365
  #13 0x00000000004c67a8 in monitor_cleanup () at /root/qemu/monitor.c:4663
  #14 0x0000000000669e27 in main (argc=16, argv=0x7ffc8b1ae2f8, envp=0x7ffc8b1ae380) at /root/qemu/vl.c:4749

With glib version 2.28.8-9 (current default version on centos6) we might
encounter above with the old code. It is verified that glib version
2.50.3-3 won't trigger that bug again, but since we are still supporting
glib 2.28.8-9, we may want this workaround.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 iothread.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/iothread.c b/iothread.c
index e675c38442..d41d661cdf 100644
--- a/iothread.c
+++ b/iothread.c
@@ -117,16 +117,26 @@ static void iothread_instance_finalize(Object *obj)
     IOThread *iothread = IOTHREAD(obj);
 
     iothread_stop(iothread);
+    /*
+     * With glib version 2.28.8-9 (current default version on centos6)
+     * we might encounter problem of qmp-test OOB hang if we unref the
+     * AIO context later than the GMainContext below.  Let's free the
+     * AIO context earlier to bypass that possible glib bug.
+     *
+     * It is verified that glib version 2.50.3-3 (or even earlier)
+     * won't trigger that bug again, but since we are still supporting
+     * glib 2.28.8-9, we need this workaround.
+     */
+    if (iothread->ctx) {
+        aio_context_unref(iothread->ctx);
+        iothread->ctx = NULL;
+    }
     if (iothread->worker_context) {
         g_main_context_unref(iothread->worker_context);
         iothread->worker_context = NULL;
     }
     qemu_cond_destroy(&iothread->init_done_cond);
     qemu_mutex_destroy(&iothread->init_done_lock);
-    if (!iothread->ctx) {
-        return;
-    }
-    aio_context_unref(iothread->ctx);
 }
 
 static void iothread_complete(UserCreatable *obj, Error **errp)
-- 
2.14.3


Re: [Qemu-devel] [PATCH for-2.12] iothread: workaround glib bug which hangs qmp-test
Posted by Fam Zheng 5 years, 11 months ago
On Wed, 04/04 14:53, Peter Xu wrote:
> Free the AIO context earlier than the GMainContext (if we have) to
> workaround a possible Glib bug.  No functional change at all.
> 
> We encountered a qmp-test hang with oob:
> 
>   #0  0x00007f35ffe45334 in __lll_lock_wait () from /lib64/libpthread.so.0
>   #1  0x00007f35ffe405d8 in _L_lock_854 () from /lib64/libpthread.so.0
>   #2  0x00007f35ffe404a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
>   #3  0x00007f35fc5b9c9d in g_source_unref_internal (source=0x24f0600, context=0x7f35f0000960, have_lock=0) at gmain.c:1685
>   #4  0x0000000000aa6672 in aio_context_unref (ctx=0x24f0600) at /root/qemu/util/async.c:497
>   #5  0x000000000065851c in iothread_instance_finalize (obj=0x24f0380) at /root/qemu/iothread.c:129
>   #6  0x0000000000962d79 in object_deinit (obj=0x24f0380, type=0x242e960) at /root/qemu/qom/object.c:462
>   #7  0x0000000000962e0d in object_finalize (data=0x24f0380) at /root/qemu/qom/object.c:476
>   #8  0x0000000000964146 in object_unref (obj=0x24f0380) at /root/qemu/qom/object.c:924
>   #9  0x0000000000965880 in object_finalize_child_property (obj=0x24ec640, name=0x24efca0 "mon_iothread", opaque=0x24f0380) at /root/qemu/qom/object.c:1436
>   #10 0x0000000000962c33 in object_property_del_child (obj=0x24ec640, child=0x24f0380, errp=0x0) at /root/qemu/qom/object.c:436
>   #11 0x0000000000962d26 in object_unparent (obj=0x24f0380) at /root/qemu/qom/object.c:455
>   #12 0x0000000000658f00 in iothread_destroy (iothread=0x24f0380) at /root/qemu/iothread.c:365
>   #13 0x00000000004c67a8 in monitor_cleanup () at /root/qemu/monitor.c:4663
>   #14 0x0000000000669e27 in main (argc=16, argv=0x7ffc8b1ae2f8, envp=0x7ffc8b1ae380) at /root/qemu/vl.c:4749
> 
> With glib version 2.28.8-9 (current default version on centos6) we might
> encounter above with the old code. It is verified that glib version
> 2.50.3-3 won't trigger that bug again, but since we are still supporting
> glib 2.28.8-9, we may want this workaround.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  iothread.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/iothread.c b/iothread.c
> index e675c38442..d41d661cdf 100644
> --- a/iothread.c
> +++ b/iothread.c
> @@ -117,16 +117,26 @@ static void iothread_instance_finalize(Object *obj)
>      IOThread *iothread = IOTHREAD(obj);
>  
>      iothread_stop(iothread);
> +    /*
> +     * With glib version 2.28.8-9 (current default version on centos6)
> +     * we might encounter problem of qmp-test OOB hang if we unref the
> +     * AIO context later than the GMainContext below.  Let's free the
> +     * AIO context earlier to bypass that possible glib bug.
> +     *
> +     * It is verified that glib version 2.50.3-3 (or even earlier)
> +     * won't trigger that bug again, but since we are still supporting
> +     * glib 2.28.8-9, we need this workaround.
> +     */
> +    if (iothread->ctx) {
> +        aio_context_unref(iothread->ctx);
> +        iothread->ctx = NULL;
> +    }
>      if (iothread->worker_context) {
>          g_main_context_unref(iothread->worker_context);
>          iothread->worker_context = NULL;
>      }
>      qemu_cond_destroy(&iothread->init_done_cond);
>      qemu_mutex_destroy(&iothread->init_done_lock);
> -    if (!iothread->ctx) {
> -        return;
> -    }
> -    aio_context_unref(iothread->ctx);
>  }
>  
>  static void iothread_complete(UserCreatable *obj, Error **errp)
> -- 
> 2.14.3
> 

Reviewed-by: Fam Zheng <famz@redhat.com>


Re: [Qemu-devel] [PATCH for-2.12] iothread: workaround glib bug which hangs qmp-test
Posted by Eric Blake 5 years, 11 months ago
On 04/04/2018 02:00 AM, Fam Zheng wrote:
> On Wed, 04/04 14:53, Peter Xu wrote:
>> Free the AIO context earlier than the GMainContext (if we have) to
>> workaround a possible Glib bug.  No functional change at all.
>>

>>
>> With glib version 2.28.8-9 (current default version on centos6) we might
>> encounter above with the old code. It is verified that glib version
>> 2.50.3-3 won't trigger that bug again, but since we are still supporting
>> glib 2.28.8-9, we may want this workaround.
>>

> 
> Reviewed-by: Fam Zheng <famz@redhat.com>

Thanks; will queue through my qapi tree for 2.12 (might be -rc3, since I
already sent a pull request for -rc2)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH for-2.12] iothread: workaround glib bug which hangs qmp-test
Posted by Stefan Hajnoczi 5 years, 11 months ago
On Wed, Apr 04, 2018 at 02:53:46PM +0800, Peter Xu wrote:
> Free the AIO context earlier than the GMainContext (if we have) to
> workaround a possible Glib bug.  No functional change at all.
> 
> We encountered a qmp-test hang with oob:
> 
>   #0  0x00007f35ffe45334 in __lll_lock_wait () from /lib64/libpthread.so.0
>   #1  0x00007f35ffe405d8 in _L_lock_854 () from /lib64/libpthread.so.0
>   #2  0x00007f35ffe404a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
>   #3  0x00007f35fc5b9c9d in g_source_unref_internal (source=0x24f0600, context=0x7f35f0000960, have_lock=0) at gmain.c:1685
>   #4  0x0000000000aa6672 in aio_context_unref (ctx=0x24f0600) at /root/qemu/util/async.c:497
>   #5  0x000000000065851c in iothread_instance_finalize (obj=0x24f0380) at /root/qemu/iothread.c:129
>   #6  0x0000000000962d79 in object_deinit (obj=0x24f0380, type=0x242e960) at /root/qemu/qom/object.c:462
>   #7  0x0000000000962e0d in object_finalize (data=0x24f0380) at /root/qemu/qom/object.c:476
>   #8  0x0000000000964146 in object_unref (obj=0x24f0380) at /root/qemu/qom/object.c:924
>   #9  0x0000000000965880 in object_finalize_child_property (obj=0x24ec640, name=0x24efca0 "mon_iothread", opaque=0x24f0380) at /root/qemu/qom/object.c:1436
>   #10 0x0000000000962c33 in object_property_del_child (obj=0x24ec640, child=0x24f0380, errp=0x0) at /root/qemu/qom/object.c:436
>   #11 0x0000000000962d26 in object_unparent (obj=0x24f0380) at /root/qemu/qom/object.c:455
>   #12 0x0000000000658f00 in iothread_destroy (iothread=0x24f0380) at /root/qemu/iothread.c:365
>   #13 0x00000000004c67a8 in monitor_cleanup () at /root/qemu/monitor.c:4663
>   #14 0x0000000000669e27 in main (argc=16, argv=0x7ffc8b1ae2f8, envp=0x7ffc8b1ae380) at /root/qemu/vl.c:4749
> 
> With glib version 2.28.8-9 (current default version on centos6) we might
> encounter above with the old code. It is verified that glib version
> 2.50.3-3 won't trigger that bug again, but since we are still supporting
> glib 2.28.8-9, we may want this workaround.

This patch does not contain enough information to explain what this
"possible Glib bug" is.  Please provide information on the root cause.

Without understanding the problem, it's hard for anyone to review this
patch and for other developers to avoid regressions in the future.

> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  iothread.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/iothread.c b/iothread.c
> index e675c38442..d41d661cdf 100644
> --- a/iothread.c
> +++ b/iothread.c
> @@ -117,16 +117,26 @@ static void iothread_instance_finalize(Object *obj)
>      IOThread *iothread = IOTHREAD(obj);
>  
>      iothread_stop(iothread);
> +    /*
> +     * With glib version 2.28.8-9 (current default version on centos6)
> +     * we might encounter problem of qmp-test OOB hang if we unref the
> +     * AIO context later than the GMainContext below.  Let's free the
> +     * AIO context earlier to bypass that possible glib bug.
> +     *
> +     * It is verified that glib version 2.50.3-3 (or even earlier)
> +     * won't trigger that bug again, but since we are still supporting
> +     * glib 2.28.8-9, we need this workaround.
> +     */
> +    if (iothread->ctx) {
> +        aio_context_unref(iothread->ctx);
> +        iothread->ctx = NULL;
> +    }
>      if (iothread->worker_context) {
>          g_main_context_unref(iothread->worker_context);
>          iothread->worker_context = NULL;
>      }
>      qemu_cond_destroy(&iothread->init_done_cond);
>      qemu_mutex_destroy(&iothread->init_done_lock);
> -    if (!iothread->ctx) {
> -        return;
> -    }
> -    aio_context_unref(iothread->ctx);
>  }
>  
>  static void iothread_complete(UserCreatable *obj, Error **errp)
> -- 
> 2.14.3
> 
> 
Re: [Qemu-devel] [PATCH for-2.12] iothread: workaround glib bug which hangs qmp-test
Posted by Peter Xu 5 years, 11 months ago
On Wed, Apr 04, 2018 at 03:53:05PM +0100, Stefan Hajnoczi wrote:
> On Wed, Apr 04, 2018 at 02:53:46PM +0800, Peter Xu wrote:
> > Free the AIO context earlier than the GMainContext (if we have) to
> > workaround a possible Glib bug.  No functional change at all.
> > 
> > We encountered a qmp-test hang with oob:
> > 
> >   #0  0x00007f35ffe45334 in __lll_lock_wait () from /lib64/libpthread.so.0
> >   #1  0x00007f35ffe405d8 in _L_lock_854 () from /lib64/libpthread.so.0
> >   #2  0x00007f35ffe404a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> >   #3  0x00007f35fc5b9c9d in g_source_unref_internal (source=0x24f0600, context=0x7f35f0000960, have_lock=0) at gmain.c:1685
> >   #4  0x0000000000aa6672 in aio_context_unref (ctx=0x24f0600) at /root/qemu/util/async.c:497
> >   #5  0x000000000065851c in iothread_instance_finalize (obj=0x24f0380) at /root/qemu/iothread.c:129
> >   #6  0x0000000000962d79 in object_deinit (obj=0x24f0380, type=0x242e960) at /root/qemu/qom/object.c:462
> >   #7  0x0000000000962e0d in object_finalize (data=0x24f0380) at /root/qemu/qom/object.c:476
> >   #8  0x0000000000964146 in object_unref (obj=0x24f0380) at /root/qemu/qom/object.c:924
> >   #9  0x0000000000965880 in object_finalize_child_property (obj=0x24ec640, name=0x24efca0 "mon_iothread", opaque=0x24f0380) at /root/qemu/qom/object.c:1436
> >   #10 0x0000000000962c33 in object_property_del_child (obj=0x24ec640, child=0x24f0380, errp=0x0) at /root/qemu/qom/object.c:436
> >   #11 0x0000000000962d26 in object_unparent (obj=0x24f0380) at /root/qemu/qom/object.c:455
> >   #12 0x0000000000658f00 in iothread_destroy (iothread=0x24f0380) at /root/qemu/iothread.c:365
> >   #13 0x00000000004c67a8 in monitor_cleanup () at /root/qemu/monitor.c:4663
> >   #14 0x0000000000669e27 in main (argc=16, argv=0x7ffc8b1ae2f8, envp=0x7ffc8b1ae380) at /root/qemu/vl.c:4749
> > 
> > With glib version 2.28.8-9 (current default version on centos6) we might
> > encounter above with the old code. It is verified that glib version
> > 2.50.3-3 won't trigger that bug again, but since we are still supporting
> > glib 2.28.8-9, we may want this workaround.
> 
> This patch does not contain enough information to explain what this
> "possible Glib bug" is.  Please provide information on the root cause.
> 
> Without understanding the problem, it's hard for anyone to review this
> patch and for other developers to avoid regressions in the future.

I suspect this can be the fix of the problem (commit ID of glib
repository, https://github.com/GNOME/glib):

    commit 26056558be4656ee6e891a4fae5d4198de7519cf
    Author: Dan Winship <danw@gnome.org>
    Date:   Mon Jul 30 08:06:57 2012 -0400

    gmain: allow g_source_get_context() on destroyed sources

The thing is that before this commit glib won't zero the
source->context fields of bound gsources when a context is
unreferenced, and after this patch it did.  Here I suspect the old
glib will try to take a lock of context which has already freed, hence
hanged death.

The commit is included in glib 2.33.10 or later, which seems
reasonable (the bad version I tried was 2.28.8, and I also verified
one of the good versions is 2.50.3, and 2.33.10 lies in them). I
didn't further verify the commit and rebuild glib/qemu, hopefully this
can be a valid explanation already.

If this is correct, then the rule of thumb would be: let's destroy all
the gsources that bound to the context before destroying the context
itself, until we drop support for glib 2.33.10.

I understand your worry about not having everything clear in the
commit message. Indeed we'd better know why the bug happened and then
we can avoid that and even use that information when we want to
increase the minimum version of glib we support.  However IMHO that's
something extra, it's not a reason to not merge the patch, it's not
the reason to refuse to fix QEMU which can at least let patchew run
nicely with QEMU 2.12 on centos6.

After all, the thing unclear is out of QEMU, we can't force every QEMU
developer to be fluent with internals of every library that QEMU uses
(glib is only one of them)...  And it may take a lot of time to dig
the real thing out for a QEMU developer.  So, if the patch can well
explain itself (IMHO this patch does - it only switches which object
to destroy first, nothing else is changed) and the patch is tested,
and can prove that it fixes something wrong has happened beneath QEMU,
then IMHO we should merge it.

So I would still like that we merge this patch for 2.12 (we can for
sure enhance the commit message a bit, though). In all cases, thanks
for your review.

-- 
Peter Xu