[PATCH v2 0/1] Qemu crashes on VM migration after an handled memory error

William Roche posted 1 patch 8 months ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20230914202054.3551250-1-william.roche@oracle.com
Maintainers: Paolo Bonzini <pbonzini@redhat.com>, Juan Quintela <quintela@redhat.com>, Peter Xu <peterx@redhat.com>, Leonardo Bras <leobras@redhat.com>
There is a newer version of this series
[PATCH v2 0/1] Qemu crashes on VM migration after an handled memory error
Posted by William Roche 8 months ago
From: William Roche <william.roche@oracle.com>

A QEMU VM can survive a memory error, as QEMU can relay the error to the
VM kernel, which can in turn deal with it by poisoning/off-lining the
impacted page.
This situation creates a hole in the VM memory address space that the VM kernel
knows about (an unreadable page or set of pages).

But the migration of this VM (live migration through the network or
pseudo-migration with the creation of a state file) will crash QEMU when
it sequentially reads the memory address space and stumbles on the
existing hole.

To correct this problem, I suggest treating the poisoned pages as if
they were zero pages for the migration copy.
This fix also works with underlying large pages, taking into account the
RAMBlock segment "page-size".
This fix is scripts/checkpatch.pl clean.
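
The zero-page idea described above can be illustrated with a small
standalone sketch. This is not the code from the patch: ToyBlock,
toy_mark_poisoned(), toy_is_poisoned() and toy_save_page() are hypothetical
names for a toy model, and the fixed-size poison list is an assumption made
only to keep the example short. It shows two things: a poisoned range is
never read during the copy (zeros are emitted instead), and the poison check
is done at the granularity of the block's backing page size, so one error
covers a whole large page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy model; none of these names come from the patch. */
#define TOY_MAX_POISONED 16

typedef struct {
    uint8_t *host;      /* start of the block's memory */
    size_t   size;      /* total size in bytes */
    size_t   page_size; /* backing page size, e.g. 4 KiB or 2 MiB */
} ToyBlock;

static size_t toy_poisoned[TOY_MAX_POISONED]; /* offsets of reported errors */
static size_t toy_nr_poisoned;

/* Record a memory error at the given byte offset inside the block. */
static void toy_mark_poisoned(size_t offset)
{
    if (toy_nr_poisoned < TOY_MAX_POISONED) {
        toy_poisoned[toy_nr_poisoned++] = offset;
    }
}

/* An offset is poisoned if it shares a backing page with a recorded error. */
static bool toy_is_poisoned(const ToyBlock *b, size_t offset)
{
    size_t page = offset / b->page_size;

    for (size_t i = 0; i < toy_nr_poisoned; i++) {
        if (toy_poisoned[i] / b->page_size == page) {
            return true;
        }
    }
    return false;
}

/*
 * Copy len bytes starting at offset into out. Poisoned memory is never
 * read; zeros are written instead. Returns true if zeros were substituted.
 */
static bool toy_save_page(const ToyBlock *b, size_t offset, size_t len,
                          uint8_t *out)
{
    if (toy_is_poisoned(b, offset)) {
        memset(out, 0, len);
        return true;
    }
    memcpy(out, b->host + offset, len);
    return false;
}
```

In this model the destination simply sees a zero-filled page where the hole
was, which is acceptable only because the VM kernel already knows the page
is unusable.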

v2:
  - added handling of poisoned pages for compressed transfers

Testing: I verified that migration now works with a poisoned page
through standard and compressed migration with 4k and large (2M) pages.

RDMA transfer is not handled by this patch.

William Roche (1):
  migration: skip poisoned memory pages on "ram saving" phase

 accel/kvm/kvm-all.c      | 14 ++++++++++++++
 accel/stubs/kvm-stub.c   |  5 +++++
 include/sysemu/kvm.h     | 10 ++++++++++
 migration/ram-compress.c |  3 ++-
 migration/ram.c          | 23 +++++++++++++++++++++--
 migration/ram.h          |  2 ++
 6 files changed, 54 insertions(+), 3 deletions(-)

-- 
2.39.3
Re: [PATCH v2 0/1] Qemu crashes on VM migration after an handled memory error
Posted by Peter Xu 8 months ago
On Thu, Sep 14, 2023 at 08:20:53PM +0000, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> A QEMU VM can survive a memory error, as QEMU can relay the error to the
> VM kernel, which can in turn deal with it by poisoning/off-lining the
> impacted page.
> This situation creates a hole in the VM memory address space that the VM kernel
> knows about (an unreadable page or set of pages).
> 
> But the migration of this VM (live migration through the network or
> pseudo-migration with the creation of a state file) will crash QEMU when
> it sequentially reads the memory address space and stumbles on the
> existing hole.
> 
> To correct this problem, I suggest treating the poisoned pages as if
> they were zero pages for the migration copy.
> This fix also works with underlying large pages, taking into account the
> RAMBlock segment "page-size".
> This fix is scripts/checkpatch.pl clean.
> 
> v2:
>   - added handling of poisoned pages for compressed transfers
> 
> Testing: I verified that migration now works with a poisoned page
> through standard and compressed migration with 4k and large (2M) pages.
> 
> RDMA transfer is not handled by this patch.
> 
> William Roche (1):
>   migration: skip poisoned memory pages on "ram saving" phase

If there's a new version, please consider adding a TODO comment above
control_save_page() noting that poisoned-page handling is probably broken
there, so that we don't forget about it.

Reviewed-by: Peter Xu <peterx@redhat.com>

Copy:

lizhijian@fujitsu.com, lidongchen@tencent.com

Thanks,

-- 
Peter Xu