migrate/ram: let ram_save_target_page_legacy() return if qemu file got error

[PATCH] migrate/ram: let ram_save_target_page_legacy() return if qemu file got error

Posted by Guoyi Tu 9 months ago

When the migration of a virtual machine that uses huge pages is cancelled,
QEMU continues processing the current huge page even though the QEMUFile
object already has an error set. This processing, such as compression and
encryption, consumes a lot of CPU and may affect the performance of other
VMs.

To terminate the migration more quickly and minimize unnecessary resource
usage, check the error status of the QEMUFile object at the beginning of
ram_save_target_page_legacy() and return immediately if an error is set.

Signed-off-by: Guoyi Tu <tugy@chinatelecom.cn>
---
  migration/ram.c | 4 ++++
  1 file changed, 4 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index 9040d66e61..3e2ebf3004 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2133,6 +2133,10 @@ static int ram_save_target_page_legacy(RAMState *rs, PageSearchStatus *pss)
      ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
      int res;

+    if (qemu_file_get_error(pss->pss_channel)) {
+        return -1;
+    }
+
      if (control_save_page(pss, block, offset, &res)) {
          return res;
      }
-- 
2.27.0

--
Guoyi
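
To illustrate the effect of the early check, here is a minimal, self-contained
sketch. It is not QEMU code: MockFile, mock_file_get_error(),
save_target_page() and save_host_page() are hypothetical stand-ins for
QEMUFile, qemu_file_get_error(), ram_save_target_page_legacy() and its caller,
and the error value is an arbitrary assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for QEMUFile: only a sticky error field. */
typedef struct {
    int last_error;        /* 0 means no error, negative errno otherwise */
} MockFile;

static int mock_file_get_error(const MockFile *f)
{
    return f->last_error;
}

/* Counts how many target pages actually went through the expensive path. */
static size_t pages_processed;

/* Stand-in for the per-target-page save function, with the early check. */
static int save_target_page(MockFile *f)
{
    if (mock_file_get_error(f)) {
        return -1;         /* skip the per-page work entirely */
    }
    pages_processed++;     /* placeholder for compression/encryption work */
    return 1;
}

/*
 * Stand-in for the caller that walks every target page of one host page.
 * cancel_at simulates an error being set on the file (for example by a
 * cancel) just before the page with index cancel_at is processed.
 */
static int save_host_page(MockFile *f, size_t nr_target_pages, size_t cancel_at)
{
    assert(nr_target_pages > 0);
    for (size_t i = 0; i < nr_target_pages; i++) {
        if (i == cancel_at) {
            f->last_error = -EIO;   /* arbitrary error for the sketch */
        }
        if (save_target_page(f) < 0) {
            return -1;
        }
    }
    return 0;
}
```

With a 1 GiB host page and 4 KiB target pages (262144 target pages), an error
set before page 100 stops the walk after 100 pages instead of processing the
whole huge page.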

Re: [PATCH] migrate/ram: let ram_save_target_page_legacy() return if qemu file got error

Posted by Guoyi Tu 8 months, 2 weeks ago

Hi Juan, what do you think of this patch? Can it be merged
upstream?

On 2023/8/15 15:21, Guoyi Tu wrote:
> When the migration process of a virtual machine using huge pages is 
> cancelled,
> QEMU will continue to complete the processing of the current huge page
> through the qemu file object got an error set. These processing, such as
> compression and encryption, will consume a lot of CPU resources which may
> affact the the performance of the other VMs.
> 
> To terminate the migration process more quickly and minimize unnecessary
> resource occupancy, it's neccessary to add logic to check the error status
> of qemu file object in the beginning of ram_save_target_page_legacy 
> function,
> and make sure the function returns immediately if qemu file got an error.
> 
> Signed-off-by: Guoyi Tu <tugy@chinatelecom.cn>
> ---
>   migration/ram.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 9040d66e61..3e2ebf3004 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2133,6 +2133,10 @@ static int ram_save_target_page_legacy(RAMState 
> *rs, PageSearchStatus *pss)
>       ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
>       int res;
> 
> +    if (qemu_file_get_error(pss->pss_channel)) {
> +        return -1;
> +    }
> +
>       if (control_save_page(pss, block, offset, &res)) {
>           return res;
>       }

Re: [PATCH] migrate/ram: let ram_save_target_page_legacy() return if qemu file got error

Posted by Fabiano Rosas 9 months ago

Guoyi Tu <tugy@chinatelecom.cn> writes:

> When the migration process of a virtual machine using huge pages is 
> cancelled,
> QEMU will continue to complete the processing of the current huge page
> through the qemu file object got an error set. These processing, such as
> compression and encryption, will consume a lot of CPU resources which may
> affact the the performance of the other VMs.
>
> To terminate the migration process more quickly and minimize unnecessary
> resource occupancy, it's neccessary to add logic to check the error status
> of qemu file object in the beginning of ram_save_target_page_legacy 
> function,
> and make sure the function returns immediately if qemu file got an error.
>
> Signed-off-by: Guoyi Tu <tugy@chinatelecom.cn>

Ok, you're off the hook because the qemu_file_*_error situation is a
preexisting mess. We don't need to complicate this further.

Let's go with this patch as it is.

Reviewed-by: Fabiano Rosas <farosas@suse.de>

Re: [PATCH] migrate/ram: let ram_save_target_page_legacy() return if qemu file got error

Posted by Fabiano Rosas 9 months ago

Guoyi Tu <tugy@chinatelecom.cn> writes:

> When the migration process of a virtual machine using huge pages is 
> cancelled,
> QEMU will continue to complete the processing of the current huge page
> through the qemu file object got an error set. These processing, such as
> compression and encryption, will consume a lot of CPU resources which may
> affact the the performance of the other VMs.
>
> To terminate the migration process more quickly and minimize unnecessary
> resource occupancy, it's neccessary to add logic to check the error status
> of qemu file object in the beginning of ram_save_target_page_legacy 
> function,
> and make sure the function returns immediately if qemu file got an error.
>
> Signed-off-by: Guoyi Tu <tugy@chinatelecom.cn>
> ---
>   migration/ram.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 9040d66e61..3e2ebf3004 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2133,6 +2133,10 @@ static int ram_save_target_page_legacy(RAMState 
> *rs, PageSearchStatus *pss)
>       ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
>       int res;
>
> +    if (qemu_file_get_error(pss->pss_channel)) {
> +        return -1;
> +    }

Where was the error set? Is this from cancelling via QMP? Or something
from within ram_save_target_page_legacy? We should probably make the
check closer to where the error happens. At the very least moving the
check into the loop.

Re: [PATCH] migrate/ram: let ram_save_target_page_legacy() return if qemu file got error

Posted by Peter Xu 9 months ago

On Tue, Aug 15, 2023 at 09:35:19AM -0300, Fabiano Rosas wrote:
> Guoyi Tu <tugy@chinatelecom.cn> writes:
> 
> > When the migration process of a virtual machine using huge pages is 
> > cancelled,
> > QEMU will continue to complete the processing of the current huge page
> > through the qemu file object got an error set. These processing, such as
> > compression and encryption, will consume a lot of CPU resources which may
> > affact the the performance of the other VMs.
> >
> > To terminate the migration process more quickly and minimize unnecessary
> > resource occupancy, it's neccessary to add logic to check the error status
> > of qemu file object in the beginning of ram_save_target_page_legacy 
> > function,
> > and make sure the function returns immediately if qemu file got an error.
> >
> > Signed-off-by: Guoyi Tu <tugy@chinatelecom.cn>
> > ---
> >   migration/ram.c | 4 ++++
> >   1 file changed, 4 insertions(+)
> >
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 9040d66e61..3e2ebf3004 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -2133,6 +2133,10 @@ static int ram_save_target_page_legacy(RAMState 
> > *rs, PageSearchStatus *pss)
> >       ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
> >       int res;
> >
> > +    if (qemu_file_get_error(pss->pss_channel)) {
> > +        return -1;
> > +    }
> 
> Where was the error set? Is this from cancelling via QMP? Or something
> from within ram_save_target_page_legacy? We should probably make the
> check closer to where the error happens. At the very least moving the
> check into the loop.

Fabiano - I think it's in the loop (over all target pages within the same
host page), and IIUC Guoyi mentioned it's part of cancelling.

Guoyi, I assume you just saw QEMU cancelling too slowly with e.g. 1G pages?
The patch looks good here.

Thanks,

-- 
Peter Xu

Re: [PATCH] migrate/ram: let ram_save_target_page_legacy() return if qemu file got error

Posted by Guoyi Tu 9 months ago


On 2023/8/16 6:19, Peter Xu wrote:
> On Tue, Aug 15, 2023 at 09:35:19AM -0300, Fabiano Rosas wrote:
>> Guoyi Tu <tugy@chinatelecom.cn> writes:
>>
>>> When the migration process of a virtual machine using huge pages is
>>> cancelled,
>>> QEMU will continue to complete the processing of the current huge page
>>> through the qemu file object got an error set. These processing, such as
>>> compression and encryption, will consume a lot of CPU resources which may
>>> affact the the performance of the other VMs.
>>>
>>> To terminate the migration process more quickly and minimize unnecessary
>>> resource occupancy, it's neccessary to add logic to check the error status
>>> of qemu file object in the beginning of ram_save_target_page_legacy
>>> function,
>>> and make sure the function returns immediately if qemu file got an error.
>>>
>>> Signed-off-by: Guoyi Tu <tugy@chinatelecom.cn>
>>> ---
>>>    migration/ram.c | 4 ++++
>>>    1 file changed, 4 insertions(+)
>>>
>>> diff --git a/migration/ram.c b/migration/ram.c
>>> index 9040d66e61..3e2ebf3004 100644
>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -2133,6 +2133,10 @@ static int ram_save_target_page_legacy(RAMState
>>> *rs, PageSearchStatus *pss)
>>>        ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
>>>        int res;
>>>
>>> +    if (qemu_file_get_error(pss->pss_channel)) {
>>> +        return -1;
>>> +    }
>>
>> Where was the error set? Is this from cancelling via QMP? Or something
>> from within ram_save_target_page_legacy? We should probably make the
>> check closer to where the error happens. At the very least moving the
>> check into the loop.
> 
> Fabiano - I think it's in the loop (of all target pages within a same host
> page), and IIUC Guoyi mentioned it's part of cancelling.
> 
> Guoyi, I assume you just saw qemu cancel too slow over e.g. 1g pages?
> The patch looks good here.

Yes. When the migration is cancelled, I think there is no need to
handle the remaining part of the huge page; we should quit immediately.

> Thanks,
>

Re: [PATCH] migrate/ram: let ram_save_target_page_legacy() return if qemu file got error

Posted by Fabiano Rosas 9 months ago

Peter Xu <peterx@redhat.com> writes:

> On Tue, Aug 15, 2023 at 09:35:19AM -0300, Fabiano Rosas wrote:
>> Guoyi Tu <tugy@chinatelecom.cn> writes:
>> 
>> > When the migration process of a virtual machine using huge pages is 
>> > cancelled,
>> > QEMU will continue to complete the processing of the current huge page
>> > through the qemu file object got an error set. These processing, such as
>> > compression and encryption, will consume a lot of CPU resources which may
>> > affact the the performance of the other VMs.
>> >
>> > To terminate the migration process more quickly and minimize unnecessary
>> > resource occupancy, it's neccessary to add logic to check the error status
>> > of qemu file object in the beginning of ram_save_target_page_legacy 
>> > function,
>> > and make sure the function returns immediately if qemu file got an error.
>> >
>> > Signed-off-by: Guoyi Tu <tugy@chinatelecom.cn>
>> > ---
>> >   migration/ram.c | 4 ++++
>> >   1 file changed, 4 insertions(+)
>> >
>> > diff --git a/migration/ram.c b/migration/ram.c
>> > index 9040d66e61..3e2ebf3004 100644
>> > --- a/migration/ram.c
>> > +++ b/migration/ram.c
>> > @@ -2133,6 +2133,10 @@ static int ram_save_target_page_legacy(RAMState 
>> > *rs, PageSearchStatus *pss)
>> >       ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
>> >       int res;
>> >
>> > +    if (qemu_file_get_error(pss->pss_channel)) {
>> > +        return -1;
>> > +    }
>> 
>> Where was the error set? Is this from cancelling via QMP? Or something
>> from within ram_save_target_page_legacy? We should probably make the
>> check closer to where the error happens. At the very least moving the
>> check into the loop.
>
> Fabiano - I think it's in the loop (of all target pages within a same host
> page), and IIUC Guoyi mentioned it's part of cancelling.

Yep, I see that. I meant explicitly move the code into the loop. Feels a
bit weird to check the QEMUFile for errors first thing inside the
function when nothing around it should have touched the QEMUFile.

About cancelling, QMP is not the only way to cancel. I was trying to
probe whether the cancelling itself is what causes the perceived issue
or if something else went wrong that caused the migration code to cancel
itself. We might be missing an error check somewhere else.

Re: [PATCH] migrate/ram: let ram_save_target_page_legacy() return if qemu file got error

Posted by Peter Xu 9 months ago

On Tue, Aug 15, 2023 at 07:42:24PM -0300, Fabiano Rosas wrote:
> Yep, I see that. I meant explicitly move the code into the loop. Feels a
> bit weird to check the QEMUFile for errors first thing inside the
> function when nothing around it should have touched the QEMUFile.

Valid point.  This reminded me that we now have one indirection into
->ram_save_target_page(), which is a hook.  Putting the check in the caller
will work for all hooks, even though they don't exist yet.

But since we don't have any other hooks yet, it'll be the same for now.
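
As a rough sketch of the "check in the caller" idea, assuming a function
pointer hook: the names MockFile, save_target_page_fn, legacy_hook() and
save_page_via_hook() are hypothetical and do not correspond to real QEMU
symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for QEMUFile: only a sticky error field. */
typedef struct {
    int last_error;
} MockFile;

/* Hypothetical hook type, standing in for ->ram_save_target_page(). */
typedef int (*save_target_page_fn)(MockFile *f);

/* Counts hook invocations so the effect of the check is observable. */
static int hook_calls;

/* One possible hook implementation; it does no error checking itself. */
static int legacy_hook(MockFile *f)
{
    (void)f;
    hook_calls++;
    return 1;
}

/*
 * Caller-side wrapper: the error check lives here, so every hook that is
 * plugged in gets the early exit without repeating the check.
 */
static int save_page_via_hook(MockFile *f, save_target_page_fn hook)
{
    assert(hook != NULL);
    if (f->last_error) {
        return -1;
    }
    return hook(f);
}
```

Any future hook passed to save_page_via_hook() is skipped once the file has an
error, which is the property described above.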

Acked-by: Peter Xu <peterx@redhat.com>

For the long term: there's one more reason to rework qemu_put_byte()/... to
return error codes.  Then things like save_normal_page() can simply
return negatives when they hit an error.

Fabiano - I see that you've done quite a few patches in reworking migration
code.  I had that for a long time in my todo, but if you're interested feel
free to look into it.

IIUC the idea is to introduce another similar layer of API for qemufile (I'd
call it qemu_put_1|2|4|8(), or anything better you can come up with) and then
let migration switch over to it, with the retval reflecting errors.  Then we
should be able to drop this patch along with most of the explicit error
checks for the qemufile spread all over.

-- 
Peter Xu
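
A minimal sketch of what such an error-returning put layer might look like,
under stated assumptions: MockFile is a fixed-size buffer standing in for
QEMUFile, and mock_put_1(), mock_put_4() and save_header() are hypothetical
names chosen for illustration. The big-endian byte order and the use of
-ENOSPC are also assumptions of the sketch, not a description of the real API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMUFile: a tiny buffer plus a sticky error. */
typedef struct {
    uint8_t buf[8];
    size_t pos;
    int last_error;        /* 0 means no error, negative errno otherwise */
} MockFile;

/* Write one byte; return 0 on success or a negative errno on failure. */
static int mock_put_1(MockFile *f, uint8_t v)
{
    assert(f != NULL);
    if (f->last_error) {
        return f->last_error;      /* report the original error unchanged */
    }
    if (f->pos >= sizeof(f->buf)) {
        f->last_error = -ENOSPC;   /* only the code doing IO sets the error */
        return f->last_error;
    }
    f->buf[f->pos++] = v;
    return 0;
}

/* Write four bytes, most significant first, stopping at the first error. */
static int mock_put_4(MockFile *f, uint32_t v)
{
    for (int shift = 24; shift >= 0; shift -= 8) {
        int ret = mock_put_1(f, (uint8_t)(v >> shift));
        if (ret < 0) {
            return ret;
        }
    }
    return 0;
}

/* A caller can now propagate failures without polling the file for errors. */
static int save_header(MockFile *f, uint32_t a, uint32_t b, uint8_t c)
{
    int ret;

    ret = mock_put_4(f, a);
    if (ret < 0) {
        return ret;
    }
    ret = mock_put_4(f, b);
    if (ret < 0) {
        return ret;
    }
    return mock_put_1(f, c);
}
```

Because each helper returns the error, callers such as save_header() need no
separate error query on the file, and the original error code reaches the top
of the call chain intact.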

Re: [PATCH] migrate/ram: let ram_save_target_page_legacy() return if qemu file got error

Posted by Fabiano Rosas 9 months ago

Peter Xu <peterx@redhat.com> writes:

> On Tue, Aug 15, 2023 at 07:42:24PM -0300, Fabiano Rosas wrote:
>> Yep, I see that. I meant explicitly move the code into the loop. Feels a
>> bit weird to check the QEMUFile for errors first thing inside the
>> function when nothing around it should have touched the QEMUFile.
>
> Valid point.  This reminded me that now we have one indirection into
> ->ram_save_target_page() which is a hook now.  Putting in the caller will
> work for all hooks, even though they're not yet exist.
>
> But since we don't have any other hooks yet, it'll be the same for now.
>
> Acked-by: Peter Xu <peterx@redhat.com>
>
> For the long term: there's one more reason to rework qemu_put_byte()/... to
> return error codes.. Then things like save_normal_page() can simply already
> return negatives when hit an error.
>
> Fabiano - I see that you've done quite a few patches in reworking migration
> code.  I had that for a long time in my todo, but if you're interested feel
> free to look into it.
>
> IIUC the idea is introducing another similar layer of API for qemufile (I'd
> call it qemu_put_1|2|4|8(), or anything you can come up better with..) then
> let migration to switch over to it, with retval reflecting errors.  Then we
> should be able to drop this patch along with most of the explicit error
> checks for the qemufile spread all over.

I was just ranting about this situation in another thread! Yes, we need
something like that. QEMUFile errors should only be set by code doing
actual IO and if we want to store the error for other parts of the code
to use, that should be another interface.

While reviewing this patch I noticed we have stuff like this:

pages = ram_find_and_save_block()
...
if (pages < 0) {
    qemu_file_set_error(f, pages);
    break;
}

So the low-level code sets the error, ram_save_target_page_legacy() sees
it and returns -1, and this^ code loses all track of the initial error
and inadvertently turns it into -EPERM!

I'll try to find some time to start cleaning this up.
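
The -EPERM remark follows from EPERM having the value 1 on Linux, so a bare -1
read as a negative errno means "Operation not permitted". A small sketch of
the difference between collapsing and propagating the error; the two function
names are hypothetical and only model the return value, not the QEMUFile
state.

```c
#include <assert.h>
#include <errno.h>

/*
 * Models a save function that notices an error on the file and returns a
 * bare -1. file_error is 0 or a negative errno value.
 */
static int save_page_collapsing(int file_error)
{
    assert(file_error <= 0);
    return file_error ? -1 : 1;          /* original code is lost here */
}

/* Same check, but the original error code is handed back to the caller. */
static int save_page_propagating(int file_error)
{
    assert(file_error <= 0);
    return file_error ? file_error : 1;  /* caller sees e.g. -EIO */
}
```

If a caller passes the collapsed result on as an errno, a low-level -EIO shows
up as -EPERM, whereas the propagating variant keeps -EIO.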

Re: [PATCH] migrate/ram: let ram_save_target_page_legacy() return if qemu file got error

Posted by Guoyi Tu 9 months ago

I apologize for the previous email being cut off. I am resending it here.

It sounds very reasonable. The return value of the QEMUFile interface
cannot accurately reflect the actual situation, and the way these
interfaces are called during migration is also a
little bit weird.

I'm glad to see that you have plans to improve these interfaces. If you
need any assistance, I'd be more than happy to be involved.

On 2023/8/16 23:15, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
>> On Tue, Aug 15, 2023 at 07:42:24PM -0300, Fabiano Rosas wrote:
>>> Yep, I see that. I meant explicitly move the code into the loop. Feels a
>>> bit weird to check the QEMUFile for errors first thing inside the
>>> function when nothing around it should have touched the QEMUFile.
>>
>> Valid point.  This reminded me that now we have one indirection into
>> ->ram_save_target_page() which is a hook now.  Putting in the caller will
>> work for all hooks, even though they're not yet exist.
>>
>> But since we don't have any other hooks yet, it'll be the same for now.
>>
>> Acked-by: Peter Xu <peterx@redhat.com>
>>
>> For the long term: there's one more reason to rework qemu_put_byte()/... to
>> return error codes.. Then things like save_normal_page() can simply already
>> return negatives when hit an error.
>>
>> Fabiano - I see that you've done quite a few patches in reworking migration
>> code.  I had that for a long time in my todo, but if you're interested feel
>> free to look into it.
>>
>> IIUC the idea is introducing another similar layer of API for qemufile (I'd
>> call it qemu_put_1|2|4|8(), or anything you can come up better with..) then
>> let migration to switch over to it, with retval reflecting errors.  Then we
>> should be able to drop this patch along with most of the explicit error
>> checks for the qemufile spread all over.
> 
> I was just ranting about this situation in another thread! Yes, we need
> something like that. QEMUFile errors should only be set by code doing
> actual IO and if we want to store the error for other parts of the code
> to use, that should be another interface.
> 
> While reviewing this patch I noticed we have stuff like this:
> 
> pages = ram_find_and_save_block()
> ...
> if (pages < 0) {
>      qemu_file_set_error(f, pages);
>      break;
> }
> 
> So the low-level code sets the error, ram_save_target_page_legacy() sees
> it and returns -1, and this^ code loses all track of the initial error
> and inadvertently turns it into -EPERM!
> 
> I'll try to find some time to start cleaning this up

Re: [PATCH] migrate/ram: let ram_save_target_page_legacy() return if qemu file got error

Posted by Guoyi Tu 9 months ago


On 2023/8/16 23:15, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
>> On Tue, Aug 15, 2023 at 07:42:24PM -0300, Fabiano Rosas wrote:
>>> Yep, I see that. I meant explicitly move the code into the loop. Feels a
>>> bit weird to check the QEMUFile for errors first thing inside the
>>> function when nothing around it should have touched the QEMUFile.
>>
>> Valid point.  This reminded me that now we have one indirection into
>> ->ram_save_target_page() which is a hook now.  Putting in the caller will
>> work for all hooks, even though they're not yet exist.
>>
>> But since we don't have any other hooks yet, it'll be the same for now.
>>
>> Acked-by: Peter Xu <peterx@redhat.com>
>>
>> For the long term: there's one more reason to rework qemu_put_byte()/... to
>> return error codes.. Then things like save_normal_page() can simply already
>> return negatives when hit an error.
>>
>> Fabiano - I see that you've done quite a few patches in reworking migration
>> code.  I had that for a long time in my todo, but if you're interested feel
>> free to look into it.
>>
>> IIUC the idea is introducing another similar layer of API for qemufile (I'd
>> call it qemu_put_1|2|4|8(), or anything you can come up better with..) then
>> let migration to switch over to it, with retval reflecting errors.  Then we
>> should be able to drop this patch along with most of the explicit error
>> checks for the qemufile spread all over.
> 
> I was just ranting about this situation in another thread! Yes, we need
> something like that. QEMUFile errors should only be set by code doing
> actual IO and if we want to store the error for other parts of the code
> to use, that should be another interface.
> 
> While reviewing this patch I noticed we have stuff like this:
> 
> pages = ram_find_and_save_block()
> ...
> if (pages < 0) {
>      qemu_file_set_error(f, pages);
>      break;
> }
> 
> So the low-level code sets the error, ram_save_target_page_legacy() sees
> it and returns -1, and this^ code loses all track of the initial error
> and inadvertently turns it into -EPERM!
> 
> I'll try to find some time to start cleaning this up

It sounds very reasonable. The return value of the QEMUFile interface
cannot accurately reflect the actual situation, and the way these
interfaces are called during migration is also a
little bit weird.

I'm glad to see that you have plans to improve these interfaces. If you
need any assistance, I'd be more than happy to be involved.

Re: [PATCH] migrate/ram: let ram_save_target_page_legacy() return if qemu file got error

Posted by Peter Xu 9 months ago

On Thu, Aug 17, 2023 at 10:19:19AM +0800, Guoyi Tu wrote:
> 
> 
> On 2023/8/16 23:15, Fabiano Rosas wrote:
> > Peter Xu <peterx@redhat.com> writes:
> > 
> > > On Tue, Aug 15, 2023 at 07:42:24PM -0300, Fabiano Rosas wrote:
> > > > Yep, I see that. I meant explicitly move the code into the loop. Feels a
> > > > bit weird to check the QEMUFile for errors first thing inside the
> > > > function when nothing around it should have touched the QEMUFile.
> > > 
> > > Valid point.  This reminded me that now we have one indirection into
> > > ->ram_save_target_page() which is a hook now.  Putting in the caller will
> > > work for all hooks, even though they're not yet exist.
> > > 
> > > But since we don't have any other hooks yet, it'll be the same for now

Guoyi,

Your email got cut off here.  The same thing happened with emails from Hyman
(also sent from a China Telecom address), so maybe your mail system is doing
something wrong.

-- 
Peter Xu

Re: [PATCH] migrate/ram: let ram_save_target_page_legacy() return if qemu file got error

Posted by Guoyi Tu 9 months ago

Thank you for the reminder. There might be some issues with the 
company's email service. I also noticed this morning that I missed 
receiving an email in response from Fabiano.


On 2023/8/17 21:35, Peter Xu wrote:
> On Thu, Aug 17, 2023 at 10:19:19AM +0800, Guoyi Tu wrote:
>>
>>
>> On 2023/8/16 23:15, Fabiano Rosas wrote:
>>> Peter Xu <peterx@redhat.com> writes:
>>>
>>>> On Tue, Aug 15, 2023 at 07:42:24PM -0300, Fabiano Rosas wrote:
>>>>> Yep, I see that. I meant explicitly move the code into the loop. Feels a
>>>>> bit weird to check the QEMUFile for errors first thing inside the
>>>>> function when nothing around it should have touched the QEMUFile.
>>>>
>>>> Valid point.  This reminded me that now we have one indirection into
>>>> ->ram_save_target_page() which is a hook now.  Putting in the caller will
>>>> work for all hooks, even though they're not yet exist.
>>>>
>>>> But since we don't have any other hooks yet, it'll be the same for now
> 
> Guoyi,
> 
> Your email got cut from here.  Same thing happened on emails from Hyman
> (also sent from China Telecom email address), maybe your mail system did
> something wrong.
>