[PATCH v1 0/1] Qemu crashes on VM migration after an handled memory error

“William Roche posted 1 patch 10 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20240130190640.139364-1-william.roche@oracle.com
Maintainers: Paolo Bonzini <pbonzini@redhat.com>, Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>
There is a newer version of this series
accel/kvm/kvm-all.c    | 10 ++++++++++
accel/stubs/kvm-stub.c |  5 +++++
include/sysemu/kvm.h   |  6 ++++++
migration/migration.c  |  7 +++++++
4 files changed, 28 insertions(+)
[PATCH v1 0/1] Qemu crashes on VM migration after an handled memory error
Posted by “William Roche 10 months ago
From: William Roche <william.roche@oracle.com>

Problem:
--------
A Qemu VM can survive a memory error, as qemu can relay the error to the
VM kernel which could also deal with it -- poisoning/off-lining the impacted
page. This situation creates a hole in the VM memory address space (an
unreadable page or set of pages).

A migration request of this VM (live migration through the network or
pseudo-migration with the creation of a state file) will crash Qemu when
it sequentially reads the memory address space and stumbles on the
existing hole.

New fix proposal:
-----------------
Let's prevent the migration when we know that there is a poison page in
the VM address space.


History:
--------
My first fix proposal for this crash condition (latest version:
https://lore.kernel.org/all/20231106220319.456765-1-william.roche@oracle.com/ )
relied on a well behaving kernel to guaranty that a known poison page is
not accessed. It introduced an ARM platform specificity.
I haven't received any feedback about the ARM specificity to avoid
a possible memory corruption after a migration transforming a poisoned
page into an all zero page.

I also accept that when a memory error leads to memory poisoning, this
platform functionality has to be honored as long as a physical platform
would provide it.

Peter asked for a complete correction of this problem (transfering
the memory holes information with the migration and recreating these
holes on the destination platform).

In the meantime, this is a very small fix to avoid the current crash
situation reading the poisoned memory pages.  I'm simply preventing
the migration when we know that it would crash, when there is a
poisoned page in the VM address space.

This is a generic protection code, avoiding a crash condition and
reporting the following error message:
"Error: Can't migrate this vm with hardware poisoned memory, please reboot the vm and try again"
instead of crashing the VM.

This fix is scripts/checkpatch.pl clean.
Unit tested on ARM and x86.


William Roche (1):
  migration: prevent migration when VM has poisoned memory

 accel/kvm/kvm-all.c    | 10 ++++++++++
 accel/stubs/kvm-stub.c |  5 +++++
 include/sysemu/kvm.h   |  6 ++++++
 migration/migration.c  |  7 +++++++
 4 files changed, 28 insertions(+)

-- 
2.39.3