[PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute "query-balloon" after CPR transfer

Zhenzhong Duan posted 6 patches 4 months, 2 weeks ago
Maintainers: Paolo Bonzini <pbonzini@redhat.com>, Steve Sistare <steven.sistare@oracle.com>, Alex Williamson <alex.williamson@redhat.com>, "Cédric Le Goater" <clg@redhat.com>
[PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute "query-balloon" after CPR transfer
Posted by Zhenzhong Duan 4 months, 2 weeks ago
After CPR transfer, source QEMU closes kvm fd and sets kvm_state to NULL,
"query-balloon" will check kvm_state->sync_mmu and trigger NULL pointer
reference.

We don't need to NULL kvm_state as all states in kvm_state aren't released
actually. Just closing kvm fd is enough so we could still query states
through "query_*" qmp command.

Opportunistically drop an unnecessary check in kvm_close().

Fixes: 7ed0919119b0 ("migration: close kvm after cpr")
Suggested-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 accel/kvm/kvm-all.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 23fd491441..b4c717290d 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -639,13 +639,10 @@ void kvm_close(void)
         cpu->kvm_vcpu_stats_fd = -1;
     }
 
-    if (kvm_state && kvm_state->fd != -1) {
-        close(kvm_state->vmfd);
-        kvm_state->vmfd = -1;
-        close(kvm_state->fd);
-        kvm_state->fd = -1;
-    }
-    kvm_state = NULL;
+    close(kvm_state->vmfd);
+    kvm_state->vmfd = -1;
+    close(kvm_state->fd);
+    kvm_state->fd = -1;
 }
 
 /*
-- 
2.47.1
Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute "query-balloon" after CPR transfer
Posted by Steven Sistare 4 months, 1 week ago
On 9/28/2025 4:54 AM, Zhenzhong Duan wrote:
> After CPR transfer, source QEMU closes kvm fd and sets kvm_state to NULL,
> "query-balloon" will check kvm_state->sync_mmu and trigger NULL pointer
> reference.
> 
> We don't need to NULL kvm_state as all states in kvm_state aren't released
> actually. Just closing kvm fd is enough so we could still query states
> through "query_*" qmp command.

IMO this does not make sense.  Much of the state in kvm_state was derived
from ioctl's on the descriptors, and closing them invalidates it.  Asking
historical questions about what used to be makes no sense.

Clearing kvm_state and setting kvm_allowed=false would be a safer fix.

- Steve

> Opportunistically drop an unnecessary check in kvm_close().
> 
> Fixes: 7ed0919119b0 ("migration: close kvm after cpr")
> Suggested-by: Markus Armbruster <armbru@redhat.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   accel/kvm/kvm-all.c | 11 ++++-------
>   1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 23fd491441..b4c717290d 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -639,13 +639,10 @@ void kvm_close(void)
>           cpu->kvm_vcpu_stats_fd = -1;
>       }
>   
> -    if (kvm_state && kvm_state->fd != -1) {
> -        close(kvm_state->vmfd);
> -        kvm_state->vmfd = -1;
> -        close(kvm_state->fd);
> -        kvm_state->fd = -1;
> -    }
> -    kvm_state = NULL;
> +    close(kvm_state->vmfd);
> +    kvm_state->vmfd = -1;
> +    close(kvm_state->fd);
> +    kvm_state->fd = -1;
>   }
>   
>   /*
RE: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute "query-balloon" after CPR transfer
Posted by Duan, Zhenzhong 4 months, 1 week ago

>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>"query-balloon" after CPR transfer
>
>On 9/28/2025 4:54 AM, Zhenzhong Duan wrote:
>> After CPR transfer, source QEMU closes kvm fd and sets kvm_state to
>NULL,
>> "query-balloon" will check kvm_state->sync_mmu and trigger NULL pointer
>> reference.
>>
>> We don't need to NULL kvm_state as all states in kvm_state aren't released
>> actually. Just closing kvm fd is enough so we could still query states
>> through "query_*" qmp command.
>
>IMO this does not make sense.  Much of the state in kvm_state was derived
>from ioctl's on the descriptors, and closing them invalidates it.  Asking
>historical questions about what used to be makes no sense.

You also have your valid point.

>
>Clearing kvm_state and setting kvm_allowed=false would be a safer fix.

But clearing kvm_state will make some qmp commands which dereferencing
kvm_state trigger SIGSEGV. We can expect failure on those qmp, but SIGSEGV
is worse.

E.g., {"execute": "query-cpu-definitions"}, below error print with v2 but SIGSEGV with v1:

KVM get MSR (index=0x10a) feature failed, Bad file descriptor


I also see inconsistence on "query-balloon" result with v1 patch, before cpr-transfer:

{"error": {"class": "DeviceNotActive", "desc": "No balloon device has been activated"}}

after transfer:

{"error": {"class": "KVMMissingCap", "desc": "Using KVM without synchronous MMU, balloon unavailable"}}

It's confusing there is no synchronous MMU support but we do have it.

Thanks
Zhenzhong
Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute "query-balloon" after CPR transfer
Posted by Steven Sistare 4 months, 1 week ago
On 9/30/2025 2:00 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steven Sistare <steven.sistare@oracle.com>
>> Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>> "query-balloon" after CPR transfer
>>
>> On 9/28/2025 4:54 AM, Zhenzhong Duan wrote:
>>> After CPR transfer, source QEMU closes kvm fd and sets kvm_state to
>> NULL,
>>> "query-balloon" will check kvm_state->sync_mmu and trigger NULL pointer
>>> reference.
>>>
>>> We don't need to NULL kvm_state as all states in kvm_state aren't released
>>> actually. Just closing kvm fd is enough so we could still query states
>>> through "query_*" qmp command.
>>
>> IMO this does not make sense.  Much of the state in kvm_state was derived
>>from ioctl's on the descriptors, and closing them invalidates it.  Asking
>> historical questions about what used to be makes no sense.
> 
> You also have your valid point.
> 
>>
>> Clearing kvm_state and setting kvm_allowed=false would be a safer fix.

Setting kvm_allowed=false causes kvm_enabled() to return false which should
prevent kvm_state from being dereferenced anywhere:

     #define kvm_enabled()           (kvm_allowed)

   Eg for the balloon:

     static bool have_balloon(Error **errp)
     {
         if (kvm_enabled() && !kvm_has_sync_mmu()) {

- Steve

> But clearing kvm_state will make some qmp commands which dereferencing
> kvm_state trigger SIGSEGV. We can expect failure on those qmp, but SIGSEGV
> is worse.
> 
> E.g., {"execute": "query-cpu-definitions"}, below error print with v2 but SIGSEGV with v1:
> 
> KVM get MSR (index=0x10a) feature failed, Bad file descriptor
> 
> 
> I also see inconsistence on "query-balloon" result with v1 patch, before cpr-transfer:
> 
> {"error": {"class": "DeviceNotActive", "desc": "No balloon device has been activated"}}
> 
> after transfer:
> 
> {"error": {"class": "KVMMissingCap", "desc": "Using KVM without synchronous MMU, balloon unavailable"}}
> 
> It's confusing there is no synchronous MMU support but we do have it.
> 
> Thanks
> Zhenzhong
RE: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute "query-balloon" after CPR transfer
Posted by Duan, Zhenzhong 4 months ago

>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>"query-balloon" after CPR transfer
>
>On 9/30/2025 2:00 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steven Sistare <steven.sistare@oracle.com>
>>> Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>>> "query-balloon" after CPR transfer
>>>
>>> On 9/28/2025 4:54 AM, Zhenzhong Duan wrote:
>>>> After CPR transfer, source QEMU closes kvm fd and sets kvm_state to
>>> NULL,
>>>> "query-balloon" will check kvm_state->sync_mmu and trigger NULL
>pointer
>>>> reference.
>>>>
>>>> We don't need to NULL kvm_state as all states in kvm_state aren't
>released
>>>> actually. Just closing kvm fd is enough so we could still query states
>>>> through "query_*" qmp command.
>>>
>>> IMO this does not make sense.  Much of the state in kvm_state was
>derived
>>>from ioctl's on the descriptors, and closing them invalidates it.  Asking
>>> historical questions about what used to be makes no sense.
>>
>> You also have your valid point.
>>
>>>
>>> Clearing kvm_state and setting kvm_allowed=false would be a safer fix.
>
>Setting kvm_allowed=false causes kvm_enabled() to return false which should
>prevent kvm_state from being dereferenced anywhere:
>
>     #define kvm_enabled()           (kvm_allowed)
>
>   Eg for the balloon:
>
>     static bool have_balloon(Error **errp)
>     {
>         if (kvm_enabled() && !kvm_has_sync_mmu()) {

OK, will do, clearing kvm_allowed is a good idea.

Currently there are two qmp commands "query-balloon" and "query-cpu-definitions"
causing SIGSEGV after CPR-transfer, clearing kvm_allowed helps for both.
But clearing both kvm_allowed and kvm_state cause SIGSEGV on "query-cpu-definitions".

I'll send a patch to clearing only kvm_allowed if you are fine with it.

Thanks
Zhenzhong
Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute "query-balloon" after CPR transfer
Posted by Steven Sistare 4 months ago
On 10/8/2025 6:22 AM, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Steven Sistare <steven.sistare@oracle.com>
>> Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>> "query-balloon" after CPR transfer
>>
>> On 9/30/2025 2:00 AM, Duan, Zhenzhong wrote:
>>>> -----Original Message-----
>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>> Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>>>> "query-balloon" after CPR transfer
>>>>
>>>> On 9/28/2025 4:54 AM, Zhenzhong Duan wrote:
>>>>> After CPR transfer, source QEMU closes kvm fd and sets kvm_state to
>>>> NULL,
>>>>> "query-balloon" will check kvm_state->sync_mmu and trigger NULL
>> pointer
>>>>> reference.
>>>>>
>>>>> We don't need to NULL kvm_state as all states in kvm_state aren't
>> released
>>>>> actually. Just closing kvm fd is enough so we could still query states
>>>>> through "query_*" qmp command.
>>>>
>>>> IMO this does not make sense.  Much of the state in kvm_state was
>> derived
>>> >from ioctl's on the descriptors, and closing them invalidates it.  Asking
>>>> historical questions about what used to be makes no sense.
>>>
>>> You also have your valid point.
>>>
>>>>
>>>> Clearing kvm_state and setting kvm_allowed=false would be a safer fix.
>>
>> Setting kvm_allowed=false causes kvm_enabled() to return false which should
>> prevent kvm_state from being dereferenced anywhere:
>>
>>      #define kvm_enabled()           (kvm_allowed)
>>
>>    Eg for the balloon:
>>
>>      static bool have_balloon(Error **errp)
>>      {
>>          if (kvm_enabled() && !kvm_has_sync_mmu()) {
> 
> OK, will do, clearing kvm_allowed is a good idea.
> 
> Currently there are two qmp commands "query-balloon" and "query-cpu-definitions"
> causing SIGSEGV after CPR-transfer, clearing kvm_allowed helps for both.
> But clearing both kvm_allowed and kvm_state cause SIGSEGV on "query-cpu-definitions".
> 
> I'll send a patch to clearing only kvm_allowed if you are fine with it.

I still don't love the idea.  kvm_state is no longer valid.

It sounds like "query-cpu-definitions" is missing a check for kvm_enabled().

- Steve
Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute "query-balloon" after CPR transfer
Posted by Daniel P. Berrangé 4 months ago
On Thu, Oct 09, 2025 at 10:11:17AM -0400, Steven Sistare wrote:
> On 10/8/2025 6:22 AM, Duan, Zhenzhong wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: Steven Sistare <steven.sistare@oracle.com>
> > > Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
> > > "query-balloon" after CPR transfer
> > > 
> > > On 9/30/2025 2:00 AM, Duan, Zhenzhong wrote:
> > > > > -----Original Message-----
> > > > > From: Steven Sistare <steven.sistare@oracle.com>
> > > > > Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
> > > > > "query-balloon" after CPR transfer
> > > > > 
> > > > > On 9/28/2025 4:54 AM, Zhenzhong Duan wrote:
> > > > > > After CPR transfer, source QEMU closes kvm fd and sets kvm_state to
> > > > > NULL,
> > > > > > "query-balloon" will check kvm_state->sync_mmu and trigger NULL
> > > pointer
> > > > > > reference.
> > > > > > 
> > > > > > We don't need to NULL kvm_state as all states in kvm_state aren't
> > > released
> > > > > > actually. Just closing kvm fd is enough so we could still query states
> > > > > > through "query_*" qmp command.
> > > > > 
> > > > > IMO this does not make sense.  Much of the state in kvm_state was
> > > derived
> > > > >from ioctl's on the descriptors, and closing them invalidates it.  Asking
> > > > > historical questions about what used to be makes no sense.
> > > > 
> > > > You also have your valid point.
> > > > 
> > > > > 
> > > > > Clearing kvm_state and setting kvm_allowed=false would be a safer fix.
> > > 
> > > Setting kvm_allowed=false causes kvm_enabled() to return false which should
> > > prevent kvm_state from being dereferenced anywhere:
> > > 
> > >      #define kvm_enabled()           (kvm_allowed)
> > > 
> > >    Eg for the balloon:
> > > 
> > >      static bool have_balloon(Error **errp)
> > >      {
> > >          if (kvm_enabled() && !kvm_has_sync_mmu()) {
> > 
> > OK, will do, clearing kvm_allowed is a good idea.
> > 
> > Currently there are two qmp commands "query-balloon" and "query-cpu-definitions"
> > causing SIGSEGV after CPR-transfer, clearing kvm_allowed helps for both.
> > But clearing both kvm_allowed and kvm_state cause SIGSEGV on "query-cpu-definitions".
> > 
> > I'll send a patch to clearing only kvm_allowed if you are fine with it.
> 
> I still don't love the idea.  kvm_state is no longer valid.
> 
> It sounds like "query-cpu-definitions" is missing a check for kvm_enabled().

This patch  / bug feels like it is side-stepping a general conceptual
question.  After cpr-transfer is complete what QMP commands are still
valid to call in general ? This thread mentions two which have caused
bugs, but beyond that it feels like a large subset of QMP comamnds
are conceptually invalid to use.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute "query-balloon" after CPR transfer
Posted by Steven Sistare 4 months ago
On 10/9/2025 10:19 AM, Daniel P. Berrangé wrote:
> On Thu, Oct 09, 2025 at 10:11:17AM -0400, Steven Sistare wrote:
>> On 10/8/2025 6:22 AM, Duan, Zhenzhong wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>> Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>>>> "query-balloon" after CPR transfer
>>>>
>>>> On 9/30/2025 2:00 AM, Duan, Zhenzhong wrote:
>>>>>> -----Original Message-----
>>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>>> Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>>>>>> "query-balloon" after CPR transfer
>>>>>>
>>>>>> On 9/28/2025 4:54 AM, Zhenzhong Duan wrote:
>>>>>>> After CPR transfer, source QEMU closes kvm fd and sets kvm_state to
>>>>>> NULL,
>>>>>>> "query-balloon" will check kvm_state->sync_mmu and trigger NULL
>>>> pointer
>>>>>>> reference.
>>>>>>>
>>>>>>> We don't need to NULL kvm_state as all states in kvm_state aren't
>>>> released
>>>>>>> actually. Just closing kvm fd is enough so we could still query states
>>>>>>> through "query_*" qmp command.
>>>>>>
>>>>>> IMO this does not make sense.  Much of the state in kvm_state was
>>>> derived
>>>>> >from ioctl's on the descriptors, and closing them invalidates it.  Asking
>>>>>> historical questions about what used to be makes no sense.
>>>>>
>>>>> You also have your valid point.
>>>>>
>>>>>>
>>>>>> Clearing kvm_state and setting kvm_allowed=false would be a safer fix.
>>>>
>>>> Setting kvm_allowed=false causes kvm_enabled() to return false which should
>>>> prevent kvm_state from being dereferenced anywhere:
>>>>
>>>>       #define kvm_enabled()           (kvm_allowed)
>>>>
>>>>     Eg for the balloon:
>>>>
>>>>       static bool have_balloon(Error **errp)
>>>>       {
>>>>           if (kvm_enabled() && !kvm_has_sync_mmu()) {
>>>
>>> OK, will do, clearing kvm_allowed is a good idea.
>>>
>>> Currently there are two qmp commands "query-balloon" and "query-cpu-definitions"
>>> causing SIGSEGV after CPR-transfer, clearing kvm_allowed helps for both.
>>> But clearing both kvm_allowed and kvm_state cause SIGSEGV on "query-cpu-definitions".
>>>
>>> I'll send a patch to clearing only kvm_allowed if you are fine with it.
>>
>> I still don't love the idea.  kvm_state is no longer valid.
>>
>> It sounds like "query-cpu-definitions" is missing a check for kvm_enabled().
> 
> This patch  / bug feels like it is side-stepping a general conceptual
> question.  After cpr-transfer is complete what QMP commands are still
> valid to call in general ? This thread mentions two which have caused
> bugs, but beyond that it feels like a large subset of QMP comamnds
> are conceptually invalid to use.

Agreed. I don't see why these commands should be issued to the dead instance.
How about migration status, quit, and nothing else?
Half serious.

- Steve


RE: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute "query-balloon" after CPR transfer
Posted by Duan, Zhenzhong 4 months ago

>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>"query-balloon" after CPR transfer
>
>On 10/9/2025 10:19 AM, Daniel P. Berrangé wrote:
>> On Thu, Oct 09, 2025 at 10:11:17AM -0400, Steven Sistare wrote:
>>> On 10/8/2025 6:22 AM, Duan, Zhenzhong wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>> Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>>>>> "query-balloon" after CPR transfer
>>>>>
>>>>> On 9/30/2025 2:00 AM, Duan, Zhenzhong wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>>>> Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>>>>>>> "query-balloon" after CPR transfer
>>>>>>>
>>>>>>> On 9/28/2025 4:54 AM, Zhenzhong Duan wrote:
>>>>>>>> After CPR transfer, source QEMU closes kvm fd and sets kvm_state
>to
>>>>>>> NULL,
>>>>>>>> "query-balloon" will check kvm_state->sync_mmu and trigger NULL
>>>>> pointer
>>>>>>>> reference.
>>>>>>>>
>>>>>>>> We don't need to NULL kvm_state as all states in kvm_state aren't
>>>>> released
>>>>>>>> actually. Just closing kvm fd is enough so we could still query states
>>>>>>>> through "query_*" qmp command.
>>>>>>>
>>>>>>> IMO this does not make sense.  Much of the state in kvm_state was
>>>>> derived
>>>>>> >from ioctl's on the descriptors, and closing them invalidates it.
>Asking
>>>>>>> historical questions about what used to be makes no sense.
>>>>>>
>>>>>> You also have your valid point.
>>>>>>
>>>>>>>
>>>>>>> Clearing kvm_state and setting kvm_allowed=false would be a safer
>fix.
>>>>>
>>>>> Setting kvm_allowed=false causes kvm_enabled() to return false which
>should
>>>>> prevent kvm_state from being dereferenced anywhere:
>>>>>
>>>>>       #define kvm_enabled()           (kvm_allowed)
>>>>>
>>>>>     Eg for the balloon:
>>>>>
>>>>>       static bool have_balloon(Error **errp)
>>>>>       {
>>>>>           if (kvm_enabled() && !kvm_has_sync_mmu()) {
>>>>
>>>> OK, will do, clearing kvm_allowed is a good idea.
>>>>
>>>> Currently there are two qmp commands "query-balloon" and
>"query-cpu-definitions"
>>>> causing SIGSEGV after CPR-transfer, clearing kvm_allowed helps for both.
>>>> But clearing both kvm_allowed and kvm_state cause SIGSEGV on
>"query-cpu-definitions".
>>>>
>>>> I'll send a patch to clearing only kvm_allowed if you are fine with it.
>>>
>>> I still don't love the idea.  kvm_state is no longer valid.
>>>
>>> It sounds like "query-cpu-definitions" is missing a check for kvm_enabled().
>>
>> This patch  / bug feels like it is side-stepping a general conceptual
>> question.  After cpr-transfer is complete what QMP commands are still
>> valid to call in general ? This thread mentions two which have caused
>> bugs, but beyond that it feels like a large subset of QMP comamnds
>> are conceptually invalid to use.
>
>Agreed. I don't see why these commands should be issued to the dead
>instance.

Uh, yes, that's why I hesitate if it's deserved to fix it.

>How about migration status, quit, and nothing else?
>Half serious.

We only have issues with "query-balloon" and "query-cpu-definitions", all other "query-*" qmp commands work fine.

{"execute": "query-migrate"}
{"timestamp": {"seconds": 1760067694, "microseconds": 815419}, "event": "STOP"}
{"return": {"status": "completed", "setup-time": 3, "downtime": 14, "total-time": 67, "ram": {"total": 0, "postcopy-requests": 0, "dirty-sync-count": 2, "multifd-bytes": 0, "pages-per-second": 0, "downtime-bytes": 0, "page-size": 4096, "remaining": 0, "postcopy-bytes": 0, "mbps": 58.152999999999999, "transferred": 465224, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 0, "duplicate": 0, "dirty-pages-rate": 0, "normal-bytes": 0, "normal": 0}}}

{"execute": "quit"}
{"timestamp": {"seconds": 1760067734, "microseconds": 599830}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
{"return": {}}
Connection closed by foreign host.Connection closed by foreign host.

Thanks
Zhenzhong
Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute "query-balloon" after CPR transfer
Posted by Daniel P. Berrangé 4 months ago
On Thu, Oct 09, 2025 at 10:32:34AM -0400, Steven Sistare wrote:
> On 10/9/2025 10:19 AM, Daniel P. Berrangé wrote:
> > On Thu, Oct 09, 2025 at 10:11:17AM -0400, Steven Sistare wrote:
> > > On 10/8/2025 6:22 AM, Duan, Zhenzhong wrote:
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Steven Sistare <steven.sistare@oracle.com>
> > > > > Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
> > > > > "query-balloon" after CPR transfer
> > > > > 
> > > > > On 9/30/2025 2:00 AM, Duan, Zhenzhong wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Steven Sistare <steven.sistare@oracle.com>
> > > > > > > Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
> > > > > > > "query-balloon" after CPR transfer
> > > > > > > 
> > > > > > > On 9/28/2025 4:54 AM, Zhenzhong Duan wrote:
> > > > > > > > After CPR transfer, source QEMU closes kvm fd and sets kvm_state to
> > > > > > > NULL,
> > > > > > > > "query-balloon" will check kvm_state->sync_mmu and trigger NULL
> > > > > pointer
> > > > > > > > reference.
> > > > > > > > 
> > > > > > > > We don't need to NULL kvm_state as all states in kvm_state aren't
> > > > > released
> > > > > > > > actually. Just closing kvm fd is enough so we could still query states
> > > > > > > > through "query_*" qmp command.
> > > > > > > 
> > > > > > > IMO this does not make sense.  Much of the state in kvm_state was
> > > > > derived
> > > > > > >from ioctl's on the descriptors, and closing them invalidates it.  Asking
> > > > > > > historical questions about what used to be makes no sense.
> > > > > > 
> > > > > > You also have your valid point.
> > > > > > 
> > > > > > > 
> > > > > > > Clearing kvm_state and setting kvm_allowed=false would be a safer fix.
> > > > > 
> > > > > Setting kvm_allowed=false causes kvm_enabled() to return false which should
> > > > > prevent kvm_state from being dereferenced anywhere:
> > > > > 
> > > > >       #define kvm_enabled()           (kvm_allowed)
> > > > > 
> > > > >     Eg for the balloon:
> > > > > 
> > > > >       static bool have_balloon(Error **errp)
> > > > >       {
> > > > >           if (kvm_enabled() && !kvm_has_sync_mmu()) {
> > > > 
> > > > OK, will do, clearing kvm_allowed is a good idea.
> > > > 
> > > > Currently there are two qmp commands "query-balloon" and "query-cpu-definitions"
> > > > causing SIGSEGV after CPR-transfer, clearing kvm_allowed helps for both.
> > > > But clearing both kvm_allowed and kvm_state cause SIGSEGV on "query-cpu-definitions".
> > > > 
> > > > I'll send a patch to clearing only kvm_allowed if you are fine with it.
> > > 
> > > I still don't love the idea.  kvm_state is no longer valid.
> > > 
> > > It sounds like "query-cpu-definitions" is missing a check for kvm_enabled().
> > 
> > This patch  / bug feels like it is side-stepping a general conceptual
> > question.  After cpr-transfer is complete what QMP commands are still
> > valid to call in general ? This thread mentions two which have caused
> > bugs, but beyond that it feels like a large subset of QMP comamnds
> > are conceptually invalid to use.
> 
> Agreed. I don't see why these commands should be issued to the dead instance.
> How about migration status, quit, and nothing else?
> Half serious.

I was hoping you'd suggest such a short list :-)

Essentially a case of calling qmp_for_each_command(), and in the callback
run qmp_disable_command() for everything not in the allow-list.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


RE: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute "query-balloon" after CPR transfer
Posted by Duan, Zhenzhong 4 months ago

>-----Original Message-----
>From: Daniel P. Berrangé <berrange@redhat.com>
>Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>"query-balloon" after CPR transfer
>
>On Thu, Oct 09, 2025 at 10:32:34AM -0400, Steven Sistare wrote:
>> On 10/9/2025 10:19 AM, Daniel P. Berrangé wrote:
>> > On Thu, Oct 09, 2025 at 10:11:17AM -0400, Steven Sistare wrote:
>> > > On 10/8/2025 6:22 AM, Duan, Zhenzhong wrote:
>> > > >
>> > > >
>> > > > > -----Original Message-----
>> > > > > From: Steven Sistare <steven.sistare@oracle.com>
>> > > > > Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>> > > > > "query-balloon" after CPR transfer
>> > > > >
>> > > > > On 9/30/2025 2:00 AM, Duan, Zhenzhong wrote:
>> > > > > > > -----Original Message-----
>> > > > > > > From: Steven Sistare <steven.sistare@oracle.com>
>> > > > > > > Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when
>execute
>> > > > > > > "query-balloon" after CPR transfer
>> > > > > > >
>> > > > > > > On 9/28/2025 4:54 AM, Zhenzhong Duan wrote:
>> > > > > > > > After CPR transfer, source QEMU closes kvm fd and sets
>kvm_state to
>> > > > > > > NULL,
>> > > > > > > > "query-balloon" will check kvm_state->sync_mmu and trigger
>NULL
>> > > > > pointer
>> > > > > > > > reference.
>> > > > > > > >
>> > > > > > > > We don't need to NULL kvm_state as all states in kvm_state
>aren't
>> > > > > released
>> > > > > > > > actually. Just closing kvm fd is enough so we could still query
>states
>> > > > > > > > through "query_*" qmp command.
>> > > > > > >
>> > > > > > > IMO this does not make sense.  Much of the state in kvm_state
>was
>> > > > > derived
>> > > > > > >from ioctl's on the descriptors, and closing them invalidates it.
>Asking
>> > > > > > > historical questions about what used to be makes no sense.
>> > > > > >
>> > > > > > You also have your valid point.
>> > > > > >
>> > > > > > >
>> > > > > > > Clearing kvm_state and setting kvm_allowed=false would be a
>safer fix.
>> > > > >
>> > > > > Setting kvm_allowed=false causes kvm_enabled() to return false
>which should
>> > > > > prevent kvm_state from being dereferenced anywhere:
>> > > > >
>> > > > >       #define kvm_enabled()           (kvm_allowed)
>> > > > >
>> > > > >     Eg for the balloon:
>> > > > >
>> > > > >       static bool have_balloon(Error **errp)
>> > > > >       {
>> > > > >           if (kvm_enabled() && !kvm_has_sync_mmu()) {
>> > > >
>> > > > OK, will do, clearing kvm_allowed is a good idea.
>> > > >
>> > > > Currently there are two qmp commands "query-balloon" and
>"query-cpu-definitions"
>> > > > causing SIGSEGV after CPR-transfer, clearing kvm_allowed helps for
>both.
>> > > > But clearing both kvm_allowed and kvm_state cause SIGSEGV on
>"query-cpu-definitions".
>> > > >
>> > > > I'll send a patch to clearing only kvm_allowed if you are fine with it.
>> > >
>> > > I still don't love the idea.  kvm_state is no longer valid.
>> > >
>> > > It sounds like "query-cpu-definitions" is missing a check for
>kvm_enabled().
>> >
>> > This patch  / bug feels like it is side-stepping a general conceptual
>> > question.  After cpr-transfer is complete what QMP commands are still
>> > valid to call in general ? This thread mentions two which have caused
>> > bugs, but beyond that it feels like a large subset of QMP comamnds
>> > are conceptually invalid to use.
>>
>> Agreed. I don't see why these commands should be issued to the dead
>instance.
>> How about migration status, quit, and nothing else?
>> Half serious.
>
>I was hoping you'd suggest such a short list :-)
>
>Essentially a case of calling qmp_for_each_command(), and in the callback
>run qmp_disable_command() for everything not in the allow-list.

Thanks for your suggestion, I agree this is a perfect scheme, but is also heavy on this corner case😊
Just curious if we need to do same for live migration. E.g., send qmp command on source qemu after live migration.

Thanks
Zhenzhong
Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute "query-balloon" after CPR transfer
Posted by Daniel P. Berrangé 4 months ago
On Fri, Oct 10, 2025 at 03:54:42AM +0000, Duan, Zhenzhong wrote:
> 
> 
> >-----Original Message-----
> >From: Daniel P. Berrangé <berrange@redhat.com>
> >Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
> >"query-balloon" after CPR transfer
> >
> >On Thu, Oct 09, 2025 at 10:32:34AM -0400, Steven Sistare wrote:
> >> On 10/9/2025 10:19 AM, Daniel P. Berrangé wrote:
> >> > On Thu, Oct 09, 2025 at 10:11:17AM -0400, Steven Sistare wrote:
> >> > > On 10/8/2025 6:22 AM, Duan, Zhenzhong wrote:
> >> > > >
> >> > > >
> >> > > > > -----Original Message-----
> >> > > > > From: Steven Sistare <steven.sistare@oracle.com>
> >> > > > > Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
> >> > > > > "query-balloon" after CPR transfer
> >> > > > >
> >> > > > > On 9/30/2025 2:00 AM, Duan, Zhenzhong wrote:
> >> > > > > > > -----Original Message-----
> >> > > > > > > From: Steven Sistare <steven.sistare@oracle.com>
> >> > > > > > > Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when
> >execute
> >> > > > > > > "query-balloon" after CPR transfer
> >> > > > > > >
> >> > > > > > > On 9/28/2025 4:54 AM, Zhenzhong Duan wrote:
> >> > > > > > > > After CPR transfer, source QEMU closes kvm fd and sets
> >kvm_state to
> >> > > > > > > NULL,
> >> > > > > > > > "query-balloon" will check kvm_state->sync_mmu and trigger
> >NULL
> >> > > > > pointer
> >> > > > > > > > reference.
> >> > > > > > > >
> >> > > > > > > > We don't need to NULL kvm_state as all states in kvm_state
> >aren't
> >> > > > > released
> >> > > > > > > > actually. Just closing kvm fd is enough so we could still query
> >states
> >> > > > > > > > through "query_*" qmp command.
> >> > > > > > >
> >> > > > > > > IMO this does not make sense.  Much of the state in kvm_state
> >was
> >> > > > > derived
> >> > > > > > >from ioctl's on the descriptors, and closing them invalidates it.
> >Asking
> >> > > > > > > historical questions about what used to be makes no sense.
> >> > > > > >
> >> > > > > > You also have your valid point.
> >> > > > > >
> >> > > > > > >
> >> > > > > > > Clearing kvm_state and setting kvm_allowed=false would be a
> >safer fix.
> >> > > > >
> >> > > > > Setting kvm_allowed=false causes kvm_enabled() to return false
> >which should
> >> > > > > prevent kvm_state from being dereferenced anywhere:
> >> > > > >
> >> > > > >       #define kvm_enabled()           (kvm_allowed)
> >> > > > >
> >> > > > >     Eg for the balloon:
> >> > > > >
> >> > > > >       static bool have_balloon(Error **errp)
> >> > > > >       {
> >> > > > >           if (kvm_enabled() && !kvm_has_sync_mmu()) {
> >> > > >
> >> > > > OK, will do, clearing kvm_allowed is a good idea.
> >> > > >
> >> > > > Currently there are two qmp commands "query-balloon" and
> >"query-cpu-definitions"
> >> > > > causing SIGSEGV after CPR-transfer, clearing kvm_allowed helps for
> >both.
> >> > > > But clearing both kvm_allowed and kvm_state cause SIGSEGV on
> >"query-cpu-definitions".
> >> > > >
> >> > > > I'll send a patch to clearing only kvm_allowed if you are fine with it.
> >> > >
> >> > > I still don't love the idea.  kvm_state is no longer valid.
> >> > >
> >> > > It sounds like "query-cpu-definitions" is missing a check for
> >kvm_enabled().
> >> >
> >> > This patch  / bug feels like it is side-stepping a general conceptual
> >> > question.  After cpr-transfer is complete what QMP commands are still
> >> > valid to call in general ? This thread mentions two which have caused
> >> > bugs, but beyond that it feels like a large subset of QMP comamnds
> >> > are conceptually invalid to use.
> >>
> >> Agreed. I don't see why these commands should be issued to the dead
> >instance.
> >> How about migration status, quit, and nothing else?
> >> Half serious.
> >
> >I was hoping you'd suggest such a short list :-)
> >
> >Essentially a case of calling qmp_for_each_command(), and in the callback
> >run qmp_disable_command() for everything not in the allow-list.
> 
> Thanks for your suggestion, I agree this is a perfect scheme, but is also heavy on this corner case😊
> Just curious if we need to do same for live migration. E.g., send qmp command on source qemu after live migration.

At the end of a normal migration, there is no functional reason to prevent
use of any QMP command.  Indeed we need to have the ability to carry on
using the orignal source QEMU, in case the final steps on migration on
the target fail and we need to restart the source.

The cpr-transfer is unusual in the restrictions it has after completion.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


RE: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute "query-balloon" after CPR transfer
Posted by Duan, Zhenzhong 4 months ago


>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>"query-balloon" after CPR transfer
>
>On 10/8/2025 6:22 AM, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Steven Sistare <steven.sistare@oracle.com>
>>> Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>>> "query-balloon" after CPR transfer
>>>
>>> On 9/30/2025 2:00 AM, Duan, Zhenzhong wrote:
>>>>> -----Original Message-----
>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>> Subject: Re: [PATCH v2 6/6] accel/kvm: Fix SIGSEGV when execute
>>>>> "query-balloon" after CPR transfer
>>>>>
>>>>> On 9/28/2025 4:54 AM, Zhenzhong Duan wrote:
>>>>>> After CPR transfer, source QEMU closes kvm fd and sets kvm_state to
>>>>> NULL,
>>>>>> "query-balloon" will check kvm_state->sync_mmu and trigger NULL
>>> pointer
>>>>>> reference.
>>>>>>
>>>>>> We don't need to NULL kvm_state as all states in kvm_state aren't
>>> released
>>>>>> actually. Just closing kvm fd is enough so we could still query states
>>>>>> through "query_*" qmp command.
>>>>>
>>>>> IMO this does not make sense.  Much of the state in kvm_state was
>>> derived
>>>> >from ioctl's on the descriptors, and closing them invalidates it.  Asking
>>>>> historical questions about what used to be makes no sense.
>>>>
>>>> You also have your valid point.
>>>>
>>>>>
>>>>> Clearing kvm_state and setting kvm_allowed=false would be a safer fix.
>>>
>>> Setting kvm_allowed=false causes kvm_enabled() to return false which
>should
>>> prevent kvm_state from being dereferenced anywhere:
>>>
>>>      #define kvm_enabled()           (kvm_allowed)
>>>
>>>    Eg for the balloon:
>>>
>>>      static bool have_balloon(Error **errp)
>>>      {
>>>          if (kvm_enabled() && !kvm_has_sync_mmu()) {
>>
>> OK, will do, clearing kvm_allowed is a good idea.
>>
>> Currently there are two qmp commands "query-balloon" and
>"query-cpu-definitions"
>> causing SIGSEGV after CPR-transfer, clearing kvm_allowed helps for both.
>> But clearing both kvm_allowed and kvm_state cause SIGSEGV on
>"query-cpu-definitions".
>>
>> I'll send a patch to clearing only kvm_allowed if you are fine with it.
>
>I still don't love the idea.  kvm_state is no longer valid.

OK, what about another idea to not call vfio_cpr_kvm_close_notifier() for cpr-transfer, like below:

--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -197,7 +197,7 @@ void vfio_cpr_add_kvm_notifier(void)
     if (!kvm_close_notifier.notify) {
         migration_add_notifier_modes(&kvm_close_notifier,
                                      vfio_cpr_kvm_close_notifier,
-                                     MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC,
+                                     MIG_MODE_CPR_EXEC,
                                      -1);
     }
 }

The close_kvm_after_cpr is only for cpr-exec, with this change, we are in same situation as live migration and can run all qmp commands same as after live migration.

>
>It sounds like "query-cpu-definitions" is missing a check for kvm_enabled().

Yes.

Thanks
Zhenzhong