The unix domain socket remains even after the VM is destroyed

Masayoshi Mizuma posted 1 patch 2 years, 3 months ago
Test syntax-check failed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/libvirt tags/patchew/YfBNBRkbXAeG2wC+@gabell
The unix domain socket remains even after the VM is destroyed
Posted by Masayoshi Mizuma 2 years, 3 months ago
Hello,

I found an issue that libvirt isn't close an unix domain socket to connect to the qemu monitor
even after the VM is destroyed.
This issue happens since commit 695bdb3841 ("src: ensure GSource background unref happens in correct event loop")
on the system whose glib version is 2.56.
I would appreciate it if you could give any ideas to solve the issue.

The socket is allocated in qemuMonitorOpenUnix(), and used by the monitor->socket
and monitor->watch:

    qemuMonitorOpen
      qemuMonitorOpenUnix
        if ((monfd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {

      qemuMonitorOpenInternal
        mon->socket = g_socket_new_from_fd(fd, &gerr);

        qemuMonitorRegister
          mon->watch = g_socket_create_source(mon->socket,

Usually, the socket is closed when the reference counter of the glib object
(mon->socket and mon->watch) gets 0:

    qemuMonitorClose
      qemuMonitorUnregister
        vir_g_source_unref(mon->watch, mon->context);
          g_source_set_callback(idle, virEventGLibSourceUnrefIdle, src, NULL);
            virEventGLibSourceUnrefIdle
              g_source_unref(src); <== unref monitor->watch

      g_object_unref(mon->socket); <== unref monitor->socket

It seems that the callback virEventGLibSourceUnrefIdle() to unref the monitor->watch doesn't
work when qemuMonitorUnregister() is called via qemuProcessStop(), so the socket isn't closed.
I'm not sure why the callback doesn't work at that time. I suspect that the VM is closing
so the main loop of the monitor doesn't work any more.

We can close the socket to add g_socket_close() before unref the mon->socket, however,
it may remain a memory leak because of mon->watch (GSource object), so probably
it isn't a good idea to close the socket.

We can unref the mon->watch to set NULL to the second argument of vir_g_source_unref()
because the default main loop still works at that time, however, I'm not
sure it's an appropriate way to avoid the gobject issue which the commit solves...

I found this issue on the qemu monitor, and probably the qemu agent has the same issue
because the socket procedure is almost the same as the monitor.

I would appreciate it if you could give any ideas to solve this issue.

Following is to observe the callback working:
---
diff --git a/src/util/glibcompat.c b/src/util/glibcompat.c
index eb6dcc0111..b8b1770424 100644
--- a/src/util/glibcompat.c
+++ b/src/util/glibcompat.c
@@ -24,6 +24,9 @@
 
 #include "glibcompat.h"
 
+#include "virlog.h"
+
+VIR_LOG_INIT("util.glibcompat");
 /*
  * Note that because of the GLIB_VERSION_MAX_ALLOWED constant in
  * config-post.h, allowing use of functions from newer GLib via
@@ -244,6 +247,7 @@ virEventGLibSourceUnrefIdle(gpointer data)
     GSource *src = data;
 
     g_source_unref(src);
+    VIR_DEBUG("unrefed: %p", src);
 
     return FALSE;
 }
@@ -257,6 +261,7 @@ void vir_g_source_unref(GSource *src, GMainContext *ctx)
     g_source_attach(idle, ctx);
 
     g_source_unref(idle);
+    VIR_DEBUG("unref registered: %p ctx: %p", src, ctx);
 }
 
 #endif
---

Case the mon->watch (0xffff28008af0) is unreffed correctly
(via qemuMonitorUpdateWatch()):

18:54:15.403+0000: 16845: debug : qemuMonitorEmitResume:1159 : mon=0xffff683ac020
18:54:15.403+0000: 16845: debug : qemuProcessHandleResume:713 : Transitioned guest test-vm into running state, reason 'booted', event detail 0
18:54:15.404+0000: 16845: debug : vir_g_source_unref:264 : unref registered: 0xffff28008af0 ctx: 0xffff780169a0
18:54:15.404+0000: 16845: debug : qemuMonitorJSONIOProcessLine:222 : Line [{"return": {}, "id": "libvirt-10"}]
18:54:15.404+0000: 16845: info : qemuMonitorJSONIOProcessLine:242 : QEMU_MONITOR_RECV_REPLY: mon=0xffff683ac020 reply={"return": {}, "id": "libvirt-10"}
18:54:15.404+0000: 16845: debug : vir_g_source_unref:264 : unref registered: 0xffff2819a260 ctx: 0xffff780169a0
18:54:15.404+0000: 16845: debug : virEventGLibSourceUnrefIdle:250 : unrefed: 0xffff28008af0

Case the mon->watch (0xffff7802bb30) isn't unreffed
(via qemuProcessStop()):

18:54:15.642+0000: 16589: debug : qemuProcessStop:8008 : Shutting down vm=0xaaaad40edec0 name=test-vm id=3 pid=16842, reason=destroyed, asyncJob=none, flags=0x0
18:54:15.642+0000: 16589: debug : qemuDomainLogAppendMessage:6733 : Append log message (vm='test-vm' message='2022-01-25 18:54:15.642+0000: shutting down, reason=destroyed) stdioLogD=1
18:54:15.643+0000: 16589: info : qemuMonitorClose:834 : QEMU_MONITOR_CLOSE: mon=0xffff683ac020
18:54:15.643+0000: 16589: debug : vir_g_source_unref:264 : unref registered: 0xffff7802bb30 ctx: 0xffff780169a0
18:54:15.643+0000: 16845: debug : qemuMonitorJSONIOProcessEvent:209 : handle SHUTDOWN handler=0xffff4ef057c0 data=0xffff28007da0
18:54:15.643+0000: 16845: debug : qemuMonitorEmitShutdown:1132 : mon=0xffff683ac020 guest=2
18:54:15.643+0000: 16845: debug : qemuProcessHandleShutdown:572 : vm=0xaaaad40edec0
18:54:15.643+0000: 16845: debug : qemuProcessHandleShutdown:592 : Transitioned guest test-vm to shutdown state
18:54:15.643+0000: 16845: debug : qemuProcessKill:7921 : vm=0xaaaad40edec0 name=test-vm pid=16842 flags=0x2
18:54:15.643+0000: 16845: debug : qemuMonitorDispose:216 : mon=0xffff683ac020
18:54:15.643+0000: 16589: debug : qemuProcessKill:7921 : vm=0xaaaad40edec0 name=test-vm pid=16842 flags=0x5
18:54:15.643+0000: 16589: debug : qemuDomainCleanupRun:7313 : driver=0xffff281441b0, vm=test-vm
18:54:15.644+0000: 16589: debug : qemuProcessAutoDestroyRemove:8395 : vm=test-vm
18:54:15.645+0000: 16589: debug : qemuRemoveCgroup:1221 : Failed to terminate cgroup for test-vm
18:54:15.647+0000: 16589: debug : qemuDomainObjEndJob:1141 : Stopping job: destroy (async=none vm=0xaaaad40edec0 name=test-vm)

Thanks!
Masa

Re: The unix domain socket remains even after the VM is destroyed
Posted by Martin Kletzander 2 years, 2 months ago
On Tue, Jan 25, 2022 at 02:18:29PM -0500, Masayoshi Mizuma wrote:
>Hello,
>
>I found an issue that libvirt isn't close an unix domain socket to connect to the qemu monitor
>even after the VM is destroyed.
>This issue happens since commit 695bdb3841 ("src: ensure GSource background unref happens in correct event loop")
>on the system whose glib version is 2.56.
>I would appreciate it if you could give any ideas to solve the issue.
>

My very rough and fast guess is that this happens due to our workaround
while running on a system on which the fix was actually backported.  You
can remove the workaround (which is implemented for glib < 2.64.0), keep
only the new code and see if it reproduces.  I know it's not very
scientific, but we already faced so many workarounds for old buggy glib
that it's tiresome to try and fix all that when it goes away with update
of glib.  If the above does not help, then I would look at what could
actually happening (maybe the unref does not get scheduled because the
event loop it is requested on does not run any more, but that's just a
guess).

>The socket is allocated in qemuMonitorOpenUnix(), and used by the monitor->socket
>and monitor->watch:
>
>    qemuMonitorOpen
>      qemuMonitorOpenUnix
>        if ((monfd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
>
>      qemuMonitorOpenInternal
>        mon->socket = g_socket_new_from_fd(fd, &gerr);
>
>        qemuMonitorRegister
>          mon->watch = g_socket_create_source(mon->socket,
>
>Usually, the socket is closed when the reference counter of the glib object
>(mon->socket and mon->watch) gets 0:
>
>    qemuMonitorClose
>      qemuMonitorUnregister
>        vir_g_source_unref(mon->watch, mon->context);
>          g_source_set_callback(idle, virEventGLibSourceUnrefIdle, src, NULL);
>            virEventGLibSourceUnrefIdle
>              g_source_unref(src); <== unref monitor->watch
>
>      g_object_unref(mon->socket); <== unref monitor->socket
>
>It seems that the callback virEventGLibSourceUnrefIdle() to unref the monitor->watch doesn't
>work when qemuMonitorUnregister() is called via qemuProcessStop(), so the socket isn't closed.
>I'm not sure why the callback doesn't work at that time. I suspect that the VM is closing
>so the main loop of the monitor doesn't work any more.
>
>We can close the socket to add g_socket_close() before unref the mon->socket, however,
>it may remain a memory leak because of mon->watch (GSource object), so probably
>it isn't a good idea to close the socket.
>
>We can unref the mon->watch to set NULL to the second argument of vir_g_source_unref()
>because the default main loop still works at that time, however, I'm not
>sure it's an appropriate way to avoid the gobject issue which the commit solves...
>
>I found this issue on the qemu monitor, and probably the qemu agent has the same issue
>because the socket procedure is almost the same as the monitor.
>
>I would appreciate it if you could give any ideas to solve this issue.
>
>Following is to observe the callback working:
>---
>diff --git a/src/util/glibcompat.c b/src/util/glibcompat.c
>index eb6dcc0111..b8b1770424 100644
>--- a/src/util/glibcompat.c
>+++ b/src/util/glibcompat.c
>@@ -24,6 +24,9 @@
>
> #include "glibcompat.h"
>
>+#include "virlog.h"
>+
>+VIR_LOG_INIT("util.glibcompat");
> /*
>  * Note that because of the GLIB_VERSION_MAX_ALLOWED constant in
>  * config-post.h, allowing use of functions from newer GLib via
>@@ -244,6 +247,7 @@ virEventGLibSourceUnrefIdle(gpointer data)
>     GSource *src = data;
>
>     g_source_unref(src);
>+    VIR_DEBUG("unrefed: %p", src);
>
>     return FALSE;
> }
>@@ -257,6 +261,7 @@ void vir_g_source_unref(GSource *src, GMainContext *ctx)
>     g_source_attach(idle, ctx);
>
>     g_source_unref(idle);
>+    VIR_DEBUG("unref registered: %p ctx: %p", src, ctx);
> }
>
> #endif
>---
>
>Case the mon->watch (0xffff28008af0) is unreffed correctly
>(via qemuMonitorUpdateWatch()):
>
>18:54:15.403+0000: 16845: debug : qemuMonitorEmitResume:1159 : mon=0xffff683ac020
>18:54:15.403+0000: 16845: debug : qemuProcessHandleResume:713 : Transitioned guest test-vm into running state, reason 'booted', event detail 0
>18:54:15.404+0000: 16845: debug : vir_g_source_unref:264 : unref registered: 0xffff28008af0 ctx: 0xffff780169a0
>18:54:15.404+0000: 16845: debug : qemuMonitorJSONIOProcessLine:222 : Line [{"return": {}, "id": "libvirt-10"}]
>18:54:15.404+0000: 16845: info : qemuMonitorJSONIOProcessLine:242 : QEMU_MONITOR_RECV_REPLY: mon=0xffff683ac020 reply={"return": {}, "id": "libvirt-10"}
>18:54:15.404+0000: 16845: debug : vir_g_source_unref:264 : unref registered: 0xffff2819a260 ctx: 0xffff780169a0
>18:54:15.404+0000: 16845: debug : virEventGLibSourceUnrefIdle:250 : unrefed: 0xffff28008af0
>
>Case the mon->watch (0xffff7802bb30) isn't unreffed
>(via qemuProcessStop()):
>
>18:54:15.642+0000: 16589: debug : qemuProcessStop:8008 : Shutting down vm=0xaaaad40edec0 name=test-vm id=3 pid=16842, reason=destroyed, asyncJob=none, flags=0x0
>18:54:15.642+0000: 16589: debug : qemuDomainLogAppendMessage:6733 : Append log message (vm='test-vm' message='2022-01-25 18:54:15.642+0000: shutting down, reason=destroyed) stdioLogD=1
>18:54:15.643+0000: 16589: info : qemuMonitorClose:834 : QEMU_MONITOR_CLOSE: mon=0xffff683ac020
>18:54:15.643+0000: 16589: debug : vir_g_source_unref:264 : unref registered: 0xffff7802bb30 ctx: 0xffff780169a0
>18:54:15.643+0000: 16845: debug : qemuMonitorJSONIOProcessEvent:209 : handle SHUTDOWN handler=0xffff4ef057c0 data=0xffff28007da0
>18:54:15.643+0000: 16845: debug : qemuMonitorEmitShutdown:1132 : mon=0xffff683ac020 guest=2
>18:54:15.643+0000: 16845: debug : qemuProcessHandleShutdown:572 : vm=0xaaaad40edec0
>18:54:15.643+0000: 16845: debug : qemuProcessHandleShutdown:592 : Transitioned guest test-vm to shutdown state
>18:54:15.643+0000: 16845: debug : qemuProcessKill:7921 : vm=0xaaaad40edec0 name=test-vm pid=16842 flags=0x2
>18:54:15.643+0000: 16845: debug : qemuMonitorDispose:216 : mon=0xffff683ac020
>18:54:15.643+0000: 16589: debug : qemuProcessKill:7921 : vm=0xaaaad40edec0 name=test-vm pid=16842 flags=0x5
>18:54:15.643+0000: 16589: debug : qemuDomainCleanupRun:7313 : driver=0xffff281441b0, vm=test-vm
>18:54:15.644+0000: 16589: debug : qemuProcessAutoDestroyRemove:8395 : vm=test-vm
>18:54:15.645+0000: 16589: debug : qemuRemoveCgroup:1221 : Failed to terminate cgroup for test-vm
>18:54:15.647+0000: 16589: debug : qemuDomainObjEndJob:1141 : Stopping job: destroy (async=none vm=0xaaaad40edec0 name=test-vm)
>
>Thanks!
>Masa
>