[RFC 0/3] Mitigation of migration failures across different host kernels

Eric Auger posted 3 patches 5 months ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20250911134324.3702720-1-eric.auger@redhat.com
Maintainers: Peter Maydell <peter.maydell@linaro.org>, Paolo Bonzini <pbonzini@redhat.com>
target/arm/cpu.h        | 15 +++++++
hw/arm/virt.c           | 19 ++++++++
target/arm/kvm.c        | 99 +++++++++++++++++++++++++++++++++++++++--
target/arm/trace-events |  6 +++
4 files changed, 135 insertions(+), 4 deletions(-)
Posted by Eric Auger 5 months ago
When migrating ARM guests across identical machines running different
host kernels, we are likely to encounter failures such as:

"failed to load cpu:cpreg_vmstate_array_len"

This happens because KVM exposes a different number of registers to
QEMU on the source and on the destination. When migrating from a larger
register set to a smaller one, the destination QEMU cannot load the
CPU state.
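
As a rough illustration of why the load fails, here is a simplified
model of the check on the destination. This is a sketch only, not the
actual QEMU code: the function names are hypothetical, and it assumes
that an incoming register list is rejected when it is longer than the
destination's or contains an index the destination does not know.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if `id` appears in `list` (linear search, fine for a sketch). */
static bool reg_list_contains(const uint64_t *list, size_t n, uint64_t id)
{
    for (size_t i = 0; i < n; i++) {
        if (list[i] == id) {
            return true;
        }
    }
    return false;
}

/*
 * Hypothetical model of the destination-side check: the incoming
 * register list must not be longer than the destination's list, and
 * every incoming register must exist on the destination.
 */
static bool incoming_regs_loadable(const uint64_t *incoming, size_t n_incoming,
                                   const uint64_t *dest, size_t n_dest)
{
    if (n_incoming > n_dest) {
        return false; /* models the cpreg_vmstate_array_len failure */
    }
    for (size_t i = 0; i < n_incoming; i++) {
        if (!reg_list_contains(dest, n_dest, incoming[i])) {
            return false; /* register unknown to the destination */
        }
    }
    return true;
}
```

With a three-register source list and a two-register destination list,
incoming_regs_loadable() returns false; swapping the two roles makes it
return true, which matches the asymmetry described above.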

For example, we recently hit this situation with:
- unconditional exposure of the KVM_REG_ARM_VENDOR_HYP_BMAP_2 FW pseudo
  register from v6.16 onwards, which causes backward migration failure.
- removal of the unconditional exposure of TCR2_EL1, PIRE0_EL1, PIR_EL1
  from v6.13 onwards, which causes forward migration failure.

More details can be found in individual patches.

This situation is really problematic for distributions which want to
guarantee forward and backward migration of a given machine type
between different releases.

This small series tries to address that issue by introducing CPU
array properties that list the registers to ignore or to fake, depending
on the situation. An example is given to illustrate how those properties
could be used to apply compats for machine types that are supposed to
"see" the same register set across various host kernels.
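
To make the idea concrete, the effect of the two kinds of lists on the
register set seen by the guest could be sketched as below. This is not
the code from the patches: the function names and the plain uint64_t
register IDs are hypothetical, and the sketch only models list
manipulation, not how a faked register would get its value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if `id` appears in `list`. */
static bool reg_in_list(uint64_t id, const uint64_t *list, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (list[i] == id) {
            return true;
        }
    }
    return false;
}

/*
 * Copy the host-exposed registers into `out`, skipping those listed in
 * `hidden`. Returns the number of registers written (at most `out_cap`).
 */
static size_t filter_hidden_regs(const uint64_t *host, size_t n_host,
                                 const uint64_t *hidden, size_t n_hidden,
                                 uint64_t *out, size_t out_cap)
{
    size_t n_out = 0;

    for (size_t i = 0; i < n_host && n_out < out_cap; i++) {
        if (!reg_in_list(host[i], hidden, n_hidden)) {
            out[n_out++] = host[i];
        }
    }
    return n_out;
}

/*
 * Append the registers listed in `enforced` that the host does not
 * expose, so that the resulting set still contains them. Returns the
 * new number of registers (at most `cap`).
 */
static size_t add_enforced_regs(uint64_t *regs, size_t n_regs, size_t cap,
                                const uint64_t *enforced, size_t n_enforced)
{
    for (size_t i = 0; i < n_enforced && n_regs < cap; i++) {
        if (!reg_in_list(enforced[i], regs, n_regs)) {
            regs[n_regs++] = enforced[i];
        }
    }
    return n_regs;
}
```

A machine type compat would then boil down to choosing the hidden and
enforced lists so that the resulting set is the same on every host
kernel the machine type is expected to run on.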

Obviously this is a last resort, and such situations should be avoided
as much as possible.

Eric Auger (3):
  target/arm/cpu: Add new CPU property for KVM regs to hide
  target/arm/kvm: Add new CPU property for KVM regs to enforce
  hw/arm/virt: [DO NOT UPSTREAM] Enforce compatibility with older
    kernels

 target/arm/cpu.h        | 15 +++++++
 hw/arm/virt.c           | 19 ++++++++
 target/arm/kvm.c        | 99 +++++++++++++++++++++++++++++++++++++++--
 target/arm/trace-events |  6 +++
 4 files changed, 135 insertions(+), 4 deletions(-)

-- 
2.49.0
Re: [RFC 0/3] Mitigation of migration failures across different host kernels
Posted by Eric Auger 4 months, 1 week ago
Hi,

On 9/11/25 3:40 PM, Eric Auger wrote:
> When migrating ARM guests across identical machines running different
> host kernels, we are likely to encounter failures such as:
>
> "failed to load cpu:cpreg_vmstate_array_len"
>
> This happens because KVM exposes a different number of registers to
> QEMU on the source and on the destination. When migrating from a larger
> register set to a smaller one, the destination QEMU cannot load the
> CPU state.
>
> For example, we recently hit this situation with:
> - unconditional exposure of the KVM_REG_ARM_VENDOR_HYP_BMAP_2 FW pseudo
>   register from v6.16 onwards, which causes backward migration failure.
> - removal of the unconditional exposure of TCR2_EL1, PIRE0_EL1, PIR_EL1
>   from v6.13 onwards, which causes forward migration failure.
>
> More details can be found in individual patches.
>
> This situation is really problematic for distributions which want to
> guarantee forward and backward migration of a given machine type
> between different releases.
>
> This small series tries to address that issue by introducing CPU
> array properties that list the registers to ignore or to fake, depending
> on the situation. An example is given to illustrate how those properties
> could be used to apply compats for machine types that are supposed to
> "see" the same register set across various host kernels.
>
> Obviously this is a last resort, and such situations should be avoided
> as much as possible.

Gentle ping. Any other comments or advice on how to mitigate this kind
of issue?

I think I will split the series: although it tries to address the same
"failed to load cpu:cpreg_vmstate_array_len" class of error, hiding and
faking KVM registers induce different side effects and risks, so it may
be better to handle them separately.

I forgot to mention that when registers disappear without notice or a
KVM knob, one could argue that the easiest way is to backport the fix to
older kernels. But there will always be a point where a VM running on an
unfixed host kernel cannot be live migrated to a kernel where the fix
was backported, which defeats the purpose of live migration, I think.

Best Regards

Eric 

>
> Eric Auger (3):
>   target/arm/cpu: Add new CPU property for KVM regs to hide
>   target/arm/kvm: Add new CPU property for KVM regs to enforce
>   hw/arm/virt: [DO NOT UPSTREAM] Enforce compatibility with older
>     kernels
>
>  target/arm/cpu.h        | 15 +++++++
>  hw/arm/virt.c           | 19 ++++++++
>  target/arm/kvm.c        | 99 +++++++++++++++++++++++++++++++++++++++--
>  target/arm/trace-events |  6 +++
>  4 files changed, 135 insertions(+), 4 deletions(-)
>