[PATCH 2/2] qemu: Use virEventThreadStop() in qemuProcessStop()

Michal Privoznik posted 2 patches 2 months, 3 weeks ago
There is a newer version of this series
[PATCH 2/2] qemu: Use virEventThreadStop() in qemuProcessStop()
Posted by Michal Privoznik 2 months, 3 weeks ago
Currently, qemuProcessStop() unlocks given domain object right in
the middle of cleanup process. This is dangerous because there
might be another thread which is executing virDomainObjListAdd().
And since the domain object is on the list of domain objects AND
by the time qemuProcessStop() unlocks it the object is also
marked as inactive, the other thread acquires the lock and
switches vm->def pointer.

The unlocking of domain object is needed though, to allow even
processing thread finish its queue. Well, the processing can be
done before any cleanup is attempted.

Therefore, use freshly introduced virEventThreadStop() to join
the event thread and drop lock/unlock from the middle of
qemuProcessStop().

Fixes: 3865410e7f67ca4ec66e9a905e75f452762a97f0
Resolves: https://issues.redhat.com/browse/RHEL-49607
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_process.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 6255ba92e7..0869307a90 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -8615,6 +8615,14 @@ void qemuProcessStop(virQEMUDriver *driver,
      * reporting so we don't squash a legit error. */
     virErrorPreserveLast(&orig_err);
 
+    /* By unlocking the domain object the events processing thread is
+     * allowed to finish its job. */
+    if (priv->eventThread) {
+        virObjectUnlock(vm);
+        virEventThreadStop(priv->eventThread);
+        virObjectLock(vm);
+    }
+
     if (asyncJob != VIR_ASYNC_JOB_NONE) {
         if (virDomainObjBeginNestedJob(vm, asyncJob) < 0)
             goto cleanup;
@@ -8681,10 +8689,8 @@ void qemuProcessStop(virQEMUDriver *driver,
     /* Wake up anything waiting on domain condition */
     virDomainObjBroadcast(vm);
 
-    /* IMPORTANT: qemuDomainObjStopWorker() unlocks @vm in order to prevent
-     * deadlocks with the per-VM event loop thread. This MUST be done after
-     * marking the VM as dead */
-    qemuDomainObjStopWorker(vm);
+    if (priv->eventThread)
+        g_object_unref(g_steal_pointer(&priv->eventThread));
 
     if (!!g_atomic_int_dec_and_test(&driver->nactive) && driver->inhibitCallback)
         driver->inhibitCallback(false, driver->inhibitOpaque);
-- 
2.44.2
Re: [PATCH 2/2] qemu: Use virEventThreadStop() in qemuProcessStop()
Posted by Peter Krempa 2 months, 3 weeks ago
On Thu, Jul 25, 2024 at 12:57:59 +0200, Michal Privoznik wrote:
> Currently, qemuProcessStop() unlocks given domain object right in
> the middle of cleanup process. This is dangerous because there
> might be another thread which is executing virDomainObjListAdd().
> And since the domain object is on the list of domain objects AND
> by the time qemuProcessStop() unlocks it the object is also
> marked as inactive, the other thread acquires the lock and
> switches vm->def pointer.
> 
> The unlocking of domain object is needed though, to allow even
> processing thread finish its queue. Well, the processing can be
> done before any cleanup is attempted.
> 
> Therefore, use freshly introduced virEventThreadStop() to join
> the event thread and drop lock/unlock from the middle of
> qemuProcessStop().
> 
> Fixes: 3865410e7f67ca4ec66e9a905e75f452762a97f0
> Resolves: https://issues.redhat.com/browse/RHEL-49607
> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> ---
>  src/qemu/qemu_process.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> index 6255ba92e7..0869307a90 100644
> --- a/src/qemu/qemu_process.c
> +++ b/src/qemu/qemu_process.c
> @@ -8615,6 +8615,14 @@ void qemuProcessStop(virQEMUDriver *driver,
>       * reporting so we don't squash a legit error. */
>      virErrorPreserveLast(&orig_err);
>  
> +    /* By unlocking the domain object the events processing thread is
> +     * allowed to finish its job. */
> +    if (priv->eventThread) {
> +        virObjectUnlock(vm);
> +        virEventThreadStop(priv->eventThread);
> +        virObjectLock(vm);
> +    }

Moving this unlocking above the line where we reset vm->def->id below
directly contradicts ...


> +
>      if (asyncJob != VIR_ASYNC_JOB_NONE) {
>          if (virDomainObjBeginNestedJob(vm, asyncJob) < 0)
>              goto cleanup;
> @@ -8681,10 +8689,8 @@ void qemuProcessStop(virQEMUDriver *driver,
>      /* Wake up anything waiting on domain condition */
>      virDomainObjBroadcast(vm);
>  
> -    /* IMPORTANT: qemuDomainObjStopWorker() unlocks @vm in order to prevent
> -     * deadlocks with the per-VM event loop thread. This MUST be done after
> -     * marking the VM as dead */

... this message here. Now due to other patches I've pushed, this is now
not a problem directly as we now keep the 'beingDestroyed' flag set
during the whole time thanks to my patch and Jirka's patch fixing few
cases where the flag would be leaked.

What I'm missing is a note why this is safe which I'd expect in the
commit message:

 This patch moves the unlocking of the VM object once again above the
 code which marks the VM as inactive by clearing 'vm->def->id'.

 At this point it's now safe to do as other qemu driver code which may
 be synchronized on the VM lock now also checks the
 'priv->beingDestroyed' flag in addition to 'vm->def->id'.

I'd also most likely prefer if 'qemuProcessStop' explicitly sets
'priv->beingDestroyed' to true before unlocking. The flag will be later
cleared but I didn't really go through all places where it's called to
see if we assert the flag before.


> -    qemuDomainObjStopWorker(vm);
> +    if (priv->eventThread)
> +        g_object_unref(g_steal_pointer(&priv->eventThread));
>  
>      if (!!g_atomic_int_dec_and_test(&driver->nactive) && driver->inhibitCallback)
>          driver->inhibitCallback(false, driver->inhibitOpaque);
> -- 
> 2.44.2
>
Re: [PATCH 2/2] qemu: Use virEventThreadStop() in qemuProcessStop()
Posted by Michal Prívozník 2 months, 3 weeks ago
On 7/25/24 14:33, Peter Krempa wrote:
> On Thu, Jul 25, 2024 at 12:57:59 +0200, Michal Privoznik wrote:
>> Currently, qemuProcessStop() unlocks given domain object right in
>> the middle of cleanup process. This is dangerous because there
>> might be another thread which is executing virDomainObjListAdd().
>> And since the domain object is on the list of domain objects AND
>> by the time qemuProcessStop() unlocks it the object is also
>> marked as inactive, the other thread acquires the lock and
>> switches vm->def pointer.
>>
>> The unlocking of domain object is needed though, to allow even
>> processing thread finish its queue. Well, the processing can be
>> done before any cleanup is attempted.
>>
>> Therefore, use freshly introduced virEventThreadStop() to join
>> the event thread and drop lock/unlock from the middle of
>> qemuProcessStop().
>>
>> Fixes: 3865410e7f67ca4ec66e9a905e75f452762a97f0
>> Resolves: https://issues.redhat.com/browse/RHEL-49607
>> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
>> ---
>>  src/qemu/qemu_process.c | 14 ++++++++++----
>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>> index 6255ba92e7..0869307a90 100644
>> --- a/src/qemu/qemu_process.c
>> +++ b/src/qemu/qemu_process.c
>> @@ -8615,6 +8615,14 @@ void qemuProcessStop(virQEMUDriver *driver,
>>       * reporting so we don't squash a legit error. */
>>      virErrorPreserveLast(&orig_err);
>>  
>> +    /* By unlocking the domain object the events processing thread is
>> +     * allowed to finish its job. */
>> +    if (priv->eventThread) {
>> +        virObjectUnlock(vm);
>> +        virEventThreadStop(priv->eventThread);
>> +        virObjectLock(vm);
>> +    }
> 
> Moving this unlocking above the line where we reset vm->def->id below
> directly contradicts ...
> 
> 
>> +
>>      if (asyncJob != VIR_ASYNC_JOB_NONE) {
>>          if (virDomainObjBeginNestedJob(vm, asyncJob) < 0)
>>              goto cleanup;
>> @@ -8681,10 +8689,8 @@ void qemuProcessStop(virQEMUDriver *driver,
>>      /* Wake up anything waiting on domain condition */
>>      virDomainObjBroadcast(vm);
>>  
>> -    /* IMPORTANT: qemuDomainObjStopWorker() unlocks @vm in order to prevent
>> -     * deadlocks with the per-VM event loop thread. This MUST be done after
>> -     * marking the VM as dead */
> 
> ... this message here. Now due to other patches I've pushed, this is now
> not a problem directly as we now keep the 'beingDestroyed' flag set
> during the whole time thanks to my patch and Jirka's patch fixing few
> cases where the flag would be leaked.
> 
> What I'm missing is a note why this is safe which I'd expect in the
> commit message:
> 
>  This patch moves the unlocking of the VM object once again above the
>  code which marks the VM as inactive by clearing 'vm->def->id'.
> 
>  At this point it's now safe to do as other qemu driver code which may
>  be synchronized on the VM lock now also checks the
>  'priv->beingDestroyed' flag in addition to 'vm->def->id'.
> 
> I'd also most likely prefer if 'qemuProcessStop' explicitly sets
> 'priv->beingDestroyed' to true before unlocking. The flag will be later
> cleared but I didn't really go through all places where it's called to
> see if we assert the flag before.

I can set the flag, sure. And to some extent it feels safer. But, since
the domain object is no longer unlocked I'm wondering whether the flag
is needed at all?

I mean, with this patch merged, we no longer have a race condition, do we?

Michal
Re: [PATCH 2/2] qemu: Use virEventThreadStop() in qemuProcessStop()
Posted by Peter Krempa 2 months, 3 weeks ago
On Thu, Jul 25, 2024 at 15:04:51 +0200, Michal Prívozník wrote:
> On 7/25/24 14:33, Peter Krempa wrote:
> > On Thu, Jul 25, 2024 at 12:57:59 +0200, Michal Privoznik wrote:
> >> Currently, qemuProcessStop() unlocks given domain object right in
> >> the middle of cleanup process. This is dangerous because there
> >> might be another thread which is executing virDomainObjListAdd().
> >> And since the domain object is on the list of domain objects AND
> >> by the time qemuProcessStop() unlocks it the object is also
> >> marked as inactive, the other thread acquires the lock and
> >> switches vm->def pointer.
> >>
> >> The unlocking of domain object is needed though, to allow even
> >> processing thread finish its queue. Well, the processing can be
> >> done before any cleanup is attempted.
> >>
> >> Therefore, use freshly introduced virEventThreadStop() to join
> >> the event thread and drop lock/unlock from the middle of
> >> qemuProcessStop().
> >>
> >> Fixes: 3865410e7f67ca4ec66e9a905e75f452762a97f0
> >> Resolves: https://issues.redhat.com/browse/RHEL-49607
> >> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> >> ---
> >>  src/qemu/qemu_process.c | 14 ++++++++++----
> >>  1 file changed, 10 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> >> index 6255ba92e7..0869307a90 100644
> >> --- a/src/qemu/qemu_process.c
> >> +++ b/src/qemu/qemu_process.c
> >> @@ -8615,6 +8615,14 @@ void qemuProcessStop(virQEMUDriver *driver,


So we're still in qemuProcessStop ...


> >>       * reporting so we don't squash a legit error. */
> >>      virErrorPreserveLast(&orig_err);
> >>  
> >> +    /* By unlocking the domain object the events processing thread is
> >> +     * allowed to finish its job. */
> >> +    if (priv->eventThread) {
> >> +        virObjectUnlock(vm);

... unlocking the VM object, before 'vm->def->id' is set to -1 ...


> >> +        virEventThreadStop(priv->eventThread);
> >> +        virObjectLock(vm);
> >> +    }
> > 
> > Moving this unlocking above the line where we reset vm->def->id below
> > directly contradicts ...
> > 
> > 
> >> +
> >>      if (asyncJob != VIR_ASYNC_JOB_NONE) {
> >>          if (virDomainObjBeginNestedJob(vm, asyncJob) < 0)
> >>              goto cleanup;
> >> @@ -8681,10 +8689,8 @@ void qemuProcessStop(virQEMUDriver *driver,
> >>      /* Wake up anything waiting on domain condition */
> >>      virDomainObjBroadcast(vm);
> >>  
> >> -    /* IMPORTANT: qemuDomainObjStopWorker() unlocks @vm in order to prevent
> >> -     * deadlocks with the per-VM event loop thread. This MUST be done after
> >> -     * marking the VM as dead */
> > 
> > ... this message here. Now due to other patches I've pushed, this is now
> > not a problem directly as we now keep the 'beingDestroyed' flag set
> > during the whole time thanks to my patch and Jirka's patch fixing few
> > cases where the flag would be leaked.
> > 
> > What I'm missing is a note why this is safe which I'd expect in the
> > commit message:
> > 
> >  This patch moves the unlocking of the VM object once again above the
> >  code which marks the VM as inactive by clearing 'vm->def->id'.
> > 
> >  At this point it's now safe to do as other qemu driver code which may
> >  be synchronized on the VM lock now also checks the
> >  'priv->beingDestroyed' flag in addition to 'vm->def->id'.
> > 
> > I'd also most likely prefer if 'qemuProcessStop' explicitly sets
> > 'priv->beingDestroyed' to true before unlocking. The flag will be later
> > cleared but I didn't really go through all places where it's called to
> > see if we assert the flag before.
> 
> I can set the flag, sure. And to some extent it feels safer. But, since
> the domain object is no longer unlocked I'm wondering whether the flag
> is needed at all?

...

the original bug (crash if VM shut down itself during migration)
happened when qemuProcessStop() unlocked @vm before setting vm->def->id
to -1.

Now the impacted code path did have a check that 'priv->beingDestroyed'
is set, but that was not the case before one of my patches following the
fix when I've moved the unlocking *after* setting vm->def->id  to -1.

The original migration bug is no longer possible because
'priv->beingDestroyed' is asserted before entering qemuProcessStop. I
didn't inspect all other callers of qemuProcessStop though whether it
does set 'priv->beingDestroyed'. Since you're moving the locking before
clearing of 'vm->def->id' and qemuProcessStop itself will
unconditionally reset 'priv->beingDestroyed' it's IMO easier just to
set 'priv->beingDestroyed' to be sure before unlocking.

> I mean, with this patch merged, we no longer have a race condition, do we?

As noted above, it depends on which race you mean. The but you are
fixing will be surely fixed, but this patch exactly undoes the fix for
the previous bug, and we'd only not regress in that regard because of
additional hardening later.

> 
> Michal
> 
Re: [PATCH 2/2] qemu: Use virEventThreadStop() in qemuProcessStop()
Posted by Peter Krempa 2 months, 3 weeks ago
On Fri, Jul 26, 2024 at 11:01:43 +0200, Peter Krempa wrote:
> On Thu, Jul 25, 2024 at 15:04:51 +0200, Michal Prívozník wrote:
> > On 7/25/24 14:33, Peter Krempa wrote:
> > > On Thu, Jul 25, 2024 at 12:57:59 +0200, Michal Privoznik wrote:
> > >> Currently, qemuProcessStop() unlocks given domain object right in
> > >> the middle of cleanup process. This is dangerous because there
> > >> might be another thread which is executing virDomainObjListAdd().
> > >> And since the domain object is on the list of domain objects AND
> > >> by the time qemuProcessStop() unlocks it the object is also
> > >> marked as inactive, the other thread acquires the lock and
> > >> switches vm->def pointer.
> > >>
> > >> The unlocking of domain object is needed though, to allow even
> > >> processing thread finish its queue. Well, the processing can be
> > >> done before any cleanup is attempted.
> > >>
> > >> Therefore, use freshly introduced virEventThreadStop() to join
> > >> the event thread and drop lock/unlock from the middle of
> > >> qemuProcessStop().
> > >>
> > >> Fixes: 3865410e7f67ca4ec66e9a905e75f452762a97f0
> > >> Resolves: https://issues.redhat.com/browse/RHEL-49607
> > >> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> > >> ---
> > >>  src/qemu/qemu_process.c | 14 ++++++++++----
> > >>  1 file changed, 10 insertions(+), 4 deletions(-)
> > >>
> > >> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> > >> index 6255ba92e7..0869307a90 100644
> > >> --- a/src/qemu/qemu_process.c
> > >> +++ b/src/qemu/qemu_process.c
> > >> @@ -8615,6 +8615,14 @@ void qemuProcessStop(virQEMUDriver *driver,
> 
> 
> So we're still in qemuProcessStop ...
> 
> 
> > >>       * reporting so we don't squash a legit error. */
> > >>      virErrorPreserveLast(&orig_err);
> > >>  
> > >> +    /* By unlocking the domain object the events processing thread is
> > >> +     * allowed to finish its job. */

Based on what this patch is fixing I'd also suggest adding basically the
exact opposite warning than there was originally below. That any
unlocking must happen before resetting vm->def->id as the global domain
object list code depends on it (and it can't actually check
'priv->beingDestroyed as that's private).
Re: [PATCH 2/2] qemu: Use virEventThreadStop() in qemuProcessStop()
Posted by Daniel P. Berrangé 2 months, 3 weeks ago
On Thu, Jul 25, 2024 at 12:57:59PM +0200, Michal Privoznik wrote:
> Currently, qemuProcessStop() unlocks given domain object right in
> the middle of cleanup process. This is dangerous because there
> might be another thread which is executing virDomainObjListAdd().
> And since the domain object is on the list of domain objects AND
> by the time qemuProcessStop() unlocks it the object is also
> marked as inactive, the other thread acquires the lock and
> switches vm->def pointer.
> 
> The unlocking of domain object is needed though, to allow even
> processing thread finish its queue. Well, the processing can be
> done before any cleanup is attempted.
> 
> Therefore, use freshly introduced virEventThreadStop() to join
> the event thread and drop lock/unlock from the middle of
> qemuProcessStop().
> 
> Fixes: 3865410e7f67ca4ec66e9a905e75f452762a97f0
> Resolves: https://issues.redhat.com/browse/RHEL-49607
> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> ---
>  src/qemu/qemu_process.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> index 6255ba92e7..0869307a90 100644
> --- a/src/qemu/qemu_process.c
> +++ b/src/qemu/qemu_process.c
> @@ -8615,6 +8615,14 @@ void qemuProcessStop(virQEMUDriver *driver,
>       * reporting so we don't squash a legit error. */
>      virErrorPreserveLast(&orig_err);
>  
> +    /* By unlocking the domain object the events processing thread is
> +     * allowed to finish its job. */
> +    if (priv->eventThread) {
> +        virObjectUnlock(vm);
> +        virEventThreadStop(priv->eventThread);
> +        virObjectLock(vm);
> +    }
> +

I wonder if this is a little too early.

We don't call qemuMOnitorClose until a short while later,
just after we have done qemuProcessKill.

If we stop the event loop here, then I worry that we're
at risk of missing some final monitor events that might
be desirable ? eg the final lifecycle events indicating
that we're stopping/stopped ?

>      if (asyncJob != VIR_ASYNC_JOB_NONE) {
>          if (virDomainObjBeginNestedJob(vm, asyncJob) < 0)
>              goto cleanup;

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH 2/2] qemu: Use virEventThreadStop() in qemuProcessStop()
Posted by Michal Prívozník 2 months, 3 weeks ago
On 7/25/24 14:18, Daniel P. Berrangé wrote:
> On Thu, Jul 25, 2024 at 12:57:59PM +0200, Michal Privoznik wrote:
>> Currently, qemuProcessStop() unlocks given domain object right in
>> the middle of cleanup process. This is dangerous because there
>> might be another thread which is executing virDomainObjListAdd().
>> And since the domain object is on the list of domain objects AND
>> by the time qemuProcessStop() unlocks it the object is also
>> marked as inactive, the other thread acquires the lock and
>> switches vm->def pointer.
>>
>> The unlocking of domain object is needed though, to allow even
>> processing thread finish its queue. Well, the processing can be
>> done before any cleanup is attempted.
>>
>> Therefore, use freshly introduced virEventThreadStop() to join
>> the event thread and drop lock/unlock from the middle of
>> qemuProcessStop().
>>
>> Fixes: 3865410e7f67ca4ec66e9a905e75f452762a97f0
>> Resolves: https://issues.redhat.com/browse/RHEL-49607
>> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
>> ---
>>  src/qemu/qemu_process.c | 14 ++++++++++----
>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>> index 6255ba92e7..0869307a90 100644
>> --- a/src/qemu/qemu_process.c
>> +++ b/src/qemu/qemu_process.c
>> @@ -8615,6 +8615,14 @@ void qemuProcessStop(virQEMUDriver *driver,
>>       * reporting so we don't squash a legit error. */
>>      virErrorPreserveLast(&orig_err);
>>  
>> +    /* By unlocking the domain object the events processing thread is
>> +     * allowed to finish its job. */
>> +    if (priv->eventThread) {
>> +        virObjectUnlock(vm);
>> +        virEventThreadStop(priv->eventThread);
>> +        virObjectLock(vm);
>> +    }
>> +
> 
> I wonder if this is a little too early.
> 
> We don't call qemuMOnitorClose until a short while later,
> just after we have done qemuProcessKill.
> 
> If we stop the event loop here, then I worry that we're
> at risk of missing some final monitor events that might
> be desirable ? eg the final lifecycle events indicating
> that we're stopping/stopped ?

But can we close the monitor without a job? And IIUC, we shouldn't try
to stop the event loop thread with a job held (as those callbacks try to
acquire _MODIFY job on their own).

Michal
Re: [PATCH 2/2] qemu: Use virEventThreadStop() in qemuProcessStop()
Posted by Daniel P. Berrangé 2 months, 3 weeks ago
On Thu, Jul 25, 2024 at 03:02:33PM +0200, Michal Prívozník wrote:
> On 7/25/24 14:18, Daniel P. Berrangé wrote:
> > On Thu, Jul 25, 2024 at 12:57:59PM +0200, Michal Privoznik wrote:
> >> Currently, qemuProcessStop() unlocks given domain object right in
> >> the middle of cleanup process. This is dangerous because there
> >> might be another thread which is executing virDomainObjListAdd().
> >> And since the domain object is on the list of domain objects AND
> >> by the time qemuProcessStop() unlocks it the object is also
> >> marked as inactive, the other thread acquires the lock and
> >> switches vm->def pointer.
> >>
> >> The unlocking of domain object is needed though, to allow even
> >> processing thread finish its queue. Well, the processing can be
> >> done before any cleanup is attempted.
> >>
> >> Therefore, use freshly introduced virEventThreadStop() to join
> >> the event thread and drop lock/unlock from the middle of
> >> qemuProcessStop().
> >>
> >> Fixes: 3865410e7f67ca4ec66e9a905e75f452762a97f0
> >> Resolves: https://issues.redhat.com/browse/RHEL-49607
> >> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> >> ---
> >>  src/qemu/qemu_process.c | 14 ++++++++++----
> >>  1 file changed, 10 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> >> index 6255ba92e7..0869307a90 100644
> >> --- a/src/qemu/qemu_process.c
> >> +++ b/src/qemu/qemu_process.c
> >> @@ -8615,6 +8615,14 @@ void qemuProcessStop(virQEMUDriver *driver,
> >>       * reporting so we don't squash a legit error. */
> >>      virErrorPreserveLast(&orig_err);
> >>  
> >> +    /* By unlocking the domain object the events processing thread is
> >> +     * allowed to finish its job. */
> >> +    if (priv->eventThread) {
> >> +        virObjectUnlock(vm);
> >> +        virEventThreadStop(priv->eventThread);
> >> +        virObjectLock(vm);
> >> +    }
> >> +
> > 
> > I wonder if this is a little too early.
> > 
> > We don't call qemuMOnitorClose until a short while later,
> > just after we have done qemuProcessKill.
> > 
> > If we stop the event loop here, then I worry that we're
> > at risk of missing some final monitor events that might
> > be desirable ? eg the final lifecycle events indicating
> > that we're stopping/stopped ?
> 
> But can we close the monitor without a job? And IIUC, we shouldn't try
> to stop the event loop thread with a job held (as those callbacks try to
> acquire _MODIFY job on their own).

The monitor will close itself if the QEMU pid dies.

IOW, once we've sent qemuProcessKill(), we need to wait for EOF from
the monitor, at which point we can close it & tear down the event
loop (once idle).

A complication is that the QEMU PID might have already gone away
and seen EOF on monitor, before qemuProcessStop even starts.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH 2/2] qemu: Use virEventThreadStop() in qemuProcessStop()
Posted by Martin Kletzander 2 months, 3 weeks ago
On Thu, Jul 25, 2024 at 02:46:01PM +0100, Daniel P. Berrangé wrote:
>On Thu, Jul 25, 2024 at 03:02:33PM +0200, Michal Prívozník wrote:
>> On 7/25/24 14:18, Daniel P. Berrangé wrote:
>> > On Thu, Jul 25, 2024 at 12:57:59PM +0200, Michal Privoznik wrote:
>> >> Currently, qemuProcessStop() unlocks given domain object right in
>> >> the middle of cleanup process. This is dangerous because there
>> >> might be another thread which is executing virDomainObjListAdd().
>> >> And since the domain object is on the list of domain objects AND
>> >> by the time qemuProcessStop() unlocks it the object is also
>> >> marked as inactive, the other thread acquires the lock and
>> >> switches vm->def pointer.
>> >>
>> >> The unlocking of domain object is needed though, to allow even
>> >> processing thread finish its queue. Well, the processing can be
>> >> done before any cleanup is attempted.
>> >>
>> >> Therefore, use freshly introduced virEventThreadStop() to join
>> >> the event thread and drop lock/unlock from the middle of
>> >> qemuProcessStop().
>> >>
>> >> Fixes: 3865410e7f67ca4ec66e9a905e75f452762a97f0
>> >> Resolves: https://issues.redhat.com/browse/RHEL-49607
>> >> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
>> >> ---
>> >>  src/qemu/qemu_process.c | 14 ++++++++++----
>> >>  1 file changed, 10 insertions(+), 4 deletions(-)
>> >>
>> >> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>> >> index 6255ba92e7..0869307a90 100644
>> >> --- a/src/qemu/qemu_process.c
>> >> +++ b/src/qemu/qemu_process.c
>> >> @@ -8615,6 +8615,14 @@ void qemuProcessStop(virQEMUDriver *driver,
>> >>       * reporting so we don't squash a legit error. */
>> >>      virErrorPreserveLast(&orig_err);
>> >>
>> >> +    /* By unlocking the domain object the events processing thread is
>> >> +     * allowed to finish its job. */
>> >> +    if (priv->eventThread) {
>> >> +        virObjectUnlock(vm);
>> >> +        virEventThreadStop(priv->eventThread);
>> >> +        virObjectLock(vm);
>> >> +    }
>> >> +
>> >
>> > I wonder if this is a little too early.
>> >
>> > We don't call qemuMOnitorClose until a short while later,
>> > just after we have done qemuProcessKill.
>> >
>> > If we stop the event loop here, then I worry that we're
>> > at risk of missing some final monitor events that might
>> > be desirable ? eg the final lifecycle events indicating
>> > that we're stopping/stopped ?
>>
>> But can we close the monitor without a job? And IIUC, we shouldn't try
>> to stop the event loop thread with a job held (as those callbacks try to
>> acquire _MODIFY job on their own).
>
>The monitor will close itself if the QEMU pid dies.
>
>IOW, once we've sent qemuProcessKill(), we need to wait for EOF from
>the monitor, at which point we can close it & tear down the event
>loop (once idle).
>

I thought about that scenario during review, but somehow missed the fact
that qemuProcessStop() is often called without the domain being dead.
Thanks for reminding me of that.  I'll let Michal and you come up with a
better solution.

>A complication is that the QEMU PID might have already gone away
>and seen EOF on monitor, before qemuProcessStop even starts.
>
>With regards,
>Daniel
>-- 
>|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
>|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
>|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Re: [PATCH 2/2] qemu: Use virEventThreadStop() in qemuProcessStop()
Posted by Martin Kletzander 2 months, 3 weeks ago
On Thu, Jul 25, 2024 at 12:57:59PM +0200, Michal Privoznik wrote:
>Currently, qemuProcessStop() unlocks given domain object right in
>the middle of cleanup process. This is dangerous because there
>might be another thread which is executing virDomainObjListAdd().
>And since the domain object is on the list of domain objects AND
>by the time qemuProcessStop() unlocks it the object is also
>marked as inactive, the other thread acquires the lock and
>switches vm->def pointer.
>
>The unlocking of domain object is needed though, to allow even
>processing thread finish its queue. Well, the processing can be
>done before any cleanup is attempted.
>
>Therefore, use freshly introduced virEventThreadStop() to join
>the event thread and drop lock/unlock from the middle of
>qemuProcessStop().
>
>Fixes: 3865410e7f67ca4ec66e9a905e75f452762a97f0
>Resolves: https://issues.redhat.com/browse/RHEL-49607
>Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
>---
> src/qemu/qemu_process.c | 14 ++++++++++----
> 1 file changed, 10 insertions(+), 4 deletions(-)
>
>diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>index 6255ba92e7..0869307a90 100644
>--- a/src/qemu/qemu_process.c
>+++ b/src/qemu/qemu_process.c
>@@ -8615,6 +8615,14 @@ void qemuProcessStop(virQEMUDriver *driver,
>      * reporting so we don't squash a legit error. */
>     virErrorPreserveLast(&orig_err);
>
>+    /* By unlocking the domain object the events processing thread is
>+     * allowed to finish its job. */
>+    if (priv->eventThread) {
>+        virObjectUnlock(vm);
>+        virEventThreadStop(priv->eventThread);
>+        virObjectLock(vm);
>+    }
>+

I think we would be even "safer" if this was done after the job was
acquired.

Other than that I think that's fine.

>     if (asyncJob != VIR_ASYNC_JOB_NONE) {
>         if (virDomainObjBeginNestedJob(vm, asyncJob) < 0)
>             goto cleanup;
>@@ -8681,10 +8689,8 @@ void qemuProcessStop(virQEMUDriver *driver,
>     /* Wake up anything waiting on domain condition */
>     virDomainObjBroadcast(vm);
>
>-    /* IMPORTANT: qemuDomainObjStopWorker() unlocks @vm in order to prevent
>-     * deadlocks with the per-VM event loop thread. This MUST be done after
>-     * marking the VM as dead */
>-    qemuDomainObjStopWorker(vm);
>+    if (priv->eventThread)
>+        g_object_unref(g_steal_pointer(&priv->eventThread));
>

More often we use `g_clear_pointer(ptr, g_object_unref)` instead even though it
does the same thing.

Reviewed-by: Martin Kletzander <mkletzan@redhat.com>

>     if (!!g_atomic_int_dec_and_test(&driver->nactive) && driver->inhibitCallback)
>         driver->inhibitCallback(false, driver->inhibitOpaque);
>-- 
>2.44.2
>