[PATCH 0/5] CPU: Detect put cpu register errors for migrations

Peter Xu posted 5 patches 1 year, 11 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20220617144857.34189-1-peterx@redhat.com
Maintainers: Cameron Esfahani <dirty@apple.com>, Roman Bolshakov <r.bolshakov@yadro.com>, Paolo Bonzini <pbonzini@redhat.com>, Richard Henderson <richard.henderson@linaro.org>, Eduardo Habkost <eduardo@habkost.net>, Marcel Apfelbaum <marcel.apfelbaum@gmail.com>, "Philippe Mathieu-Daudé" <f4bug@amsat.org>, Yanan Wang <wangyanan55@huawei.com>, Juan Quintela <quintela@redhat.com>, "Dr. David Alan Gilbert" <dgilbert@redhat.com>, Wenchao Wang <wenchao.wang@intel.com>, Kamil Rytarowski <kamil@netbsd.org>, Reinoud Zandijk <reinoud@netbsd.org>, Sunil Muthuswamy <sunilmut@microsoft.com>
accel/hvf/hvf-accel-ops.c     |  2 +-
accel/kvm/kvm-all.c           | 15 +++++++---
accel/kvm/kvm-cpus.h          |  2 +-
cpus-common.c                 | 55 +++++++++++++++++++++++++++++++++--
hw/core/machine.c             |  2 +-
include/hw/core/cpu.h         | 28 ++++++++++++++++++
include/sysemu/accel-ops.h    |  2 +-
include/sysemu/cpus.h         |  2 +-
include/sysemu/hw_accel.h     |  1 +
migration/savevm.c            | 20 +++++++++++--
softmmu/cpus.c                | 23 ++++++++++++---
stubs/cpu-synchronize-state.c |  3 ++
target/i386/hax/hax-all.c     |  2 +-
target/i386/nvmm/nvmm-all.c   |  2 +-
target/i386/whpx/whpx-all.c   |  2 +-
15 files changed, 139 insertions(+), 22 deletions(-)
[PATCH 0/5] CPU: Detect put cpu register errors for migrations
Posted by Peter Xu 1 year, 11 months ago
rfc->v1:
- Rebase to master, drop RFC tag.

This series teaches QEMU to detect errors when e.g. putting registers from
QEMU to KVM, and fail migrations properly.

For the rational of this series and why it was posted, please refer to the
bug report here:

https://lore.kernel.org/all/YppVupW+IWsm7Osr@xz-m1.local/

But I'd rather not go into that if the reviewer doesn't have that context,
because we don't really need that complexity..  It can be simple as we
should fail migration early when we see issues happening already, so:

  1) We fail explicitly, rather than afterward with some weird guest
     errors.  In my bug report, it was a guest double fault.  There's
     another bug report that Sean mentioned in the thread from Mike Tancsa
     that can have other sympotons rather than double fault, but anyway
     they'll be hard to diagnose since the processor state can be corrupted
     (please refer to kvm_arch_put_registers() where we stop putting more
     registers to KVM when we see any error).

  2) For precopy, with this early failure the VM won't crash itself since
     we still have a chance to keep running it on src host, while if
     without this patch we will fail later, and it can crash the VM.

In this specific case, when KVM_SET_XSAVE ioctl failed on dest host before
start running the VM there, we should fail the migration already.

After the patchset applied, the above "double fault" issue will become
migration failures, and...

For precopy, we can see some error dumped for precopy on dest, then the VM
will be kept running on src host:

2022-06-07T22:48:48.804234Z qemu-system-x86_64: kvm_arch_put_registers() failed with retval=-22
2022-06-07T22:48:48.804588Z qemu-system-x86_64: load of migration failed: Invalid argument

For postcopy, currently we'll pause the VM immediately for admin to decide
what to do:

2022-06-07T22:47:49.448192Z qemu-system-x86_64: kvm_arch_put_registers() failed with retval=-22
13072@1654642069.518993:runstate_set current_run_state 1 (inmigrate) new_state 4 (paused)

If something like this series is welcomed, we could do better in the future
by telling the src host about this issue and keep running, because
put-register happens right at the switch-over, so we actually have this
chance (no dirty page on dest host yet).

Comments welcomed.  Thanks,

Peter Xu (5):
  cpus-common: Introduce run_on_cpu_func2 which allows error returns
  cpus-common: Add run_on_cpu2()
  accel: Allow synchronize_post_init() to take an Error**
  cpu: Allow cpu_synchronize_all_post_init() to take an errp
  KVM: Hook kvm_arch_put_registers() errors to the caller

 accel/hvf/hvf-accel-ops.c     |  2 +-
 accel/kvm/kvm-all.c           | 15 +++++++---
 accel/kvm/kvm-cpus.h          |  2 +-
 cpus-common.c                 | 55 +++++++++++++++++++++++++++++++++--
 hw/core/machine.c             |  2 +-
 include/hw/core/cpu.h         | 28 ++++++++++++++++++
 include/sysemu/accel-ops.h    |  2 +-
 include/sysemu/cpus.h         |  2 +-
 include/sysemu/hw_accel.h     |  1 +
 migration/savevm.c            | 20 +++++++++++--
 softmmu/cpus.c                | 23 ++++++++++++---
 stubs/cpu-synchronize-state.c |  3 ++
 target/i386/hax/hax-all.c     |  2 +-
 target/i386/nvmm/nvmm-all.c   |  2 +-
 target/i386/whpx/whpx-all.c   |  2 +-
 15 files changed, 139 insertions(+), 22 deletions(-)

-- 
2.32.0
Re: [PATCH 0/5] CPU: Detect put cpu register errors for migrations
Posted by Peter Xu 1 year, 10 months ago
On Fri, Jun 17, 2022 at 10:48:52AM -0400, Peter Xu wrote:
> rfc->v1:
> - Rebase to master, drop RFC tag.
> 
> This series teaches QEMU to detect errors when e.g. putting registers from
> QEMU to KVM, and fail migrations properly.
> 
> For the rational of this series and why it was posted, please refer to the
> bug report here:
> 
> https://lore.kernel.org/all/YppVupW+IWsm7Osr@xz-m1.local/
> 
> But I'd rather not go into that if the reviewer doesn't have that context,
> because we don't really need that complexity..  It can be simple as we
> should fail migration early when we see issues happening already, so:
> 
>   1) We fail explicitly, rather than afterward with some weird guest
>      errors.  In my bug report, it was a guest double fault.  There's
>      another bug report that Sean mentioned in the thread from Mike Tancsa
>      that can have other sympotons rather than double fault, but anyway
>      they'll be hard to diagnose since the processor state can be corrupted
>      (please refer to kvm_arch_put_registers() where we stop putting more
>      registers to KVM when we see any error).
> 
>   2) For precopy, with this early failure the VM won't crash itself since
>      we still have a chance to keep running it on src host, while if
>      without this patch we will fail later, and it can crash the VM.
> 
> In this specific case, when KVM_SET_XSAVE ioctl failed on dest host before
> start running the VM there, we should fail the migration already.
> 
> After the patchset applied, the above "double fault" issue will become
> migration failures, and...
> 
> For precopy, we can see some error dumped for precopy on dest, then the VM
> will be kept running on src host:
> 
> 2022-06-07T22:48:48.804234Z qemu-system-x86_64: kvm_arch_put_registers() failed with retval=-22
> 2022-06-07T22:48:48.804588Z qemu-system-x86_64: load of migration failed: Invalid argument
> 
> For postcopy, currently we'll pause the VM immediately for admin to decide
> what to do:
> 
> 2022-06-07T22:47:49.448192Z qemu-system-x86_64: kvm_arch_put_registers() failed with retval=-22
> 13072@1654642069.518993:runstate_set current_run_state 1 (inmigrate) new_state 4 (paused)
> 
> If something like this series is welcomed, we could do better in the future
> by telling the src host about this issue and keep running, because
> put-register happens right at the switch-over, so we actually have this
> chance (no dirty page on dest host yet).
> 
> Comments welcomed.  Thanks,

Ping,

We seem to still hit FPU-put issues and it always takes time to debug on
unsynchronized cpu registers because it can cause all kinds of errors. The
real evil should be that not only we ignored errors on e.g. put xsave, but
also we will stop putting all the rest cpu regs after any of such failure,
then qemu continues to run with synced + fail-synced + unsynced registers.

It'll be great if something like this series could still be considered.

Thanks,

-- 
Peter Xu