[RESEND PATCH 0/7] Mitigation of "failed to load cpu:cpreg_vmstate_array_len" migration failures

Eric Auger posted 7 patches 4 weeks, 1 day ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20251016140039.250111-1-eric.auger@redhat.com
Maintainers: Paolo Bonzini <pbonzini@redhat.com>, Peter Maydell <peter.maydell@linaro.org>, Eduardo Habkost <eduardo@habkost.net>, Marcel Apfelbaum <marcel.apfelbaum@gmail.com>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, Yanan Wang <wangyanan55@huawei.com>, Zhao Liu <zhao1.liu@intel.com>
[RESEND PATCH 0/7] Mitigation of "failed to load cpu:cpreg_vmstate_array_len" migration failures
Posted by Eric Auger 4 weeks, 1 day ago
When migrating ARM guests across identical machines running different
host kernels, we are likely to encounter failures such as:

"failed to load cpu:cpreg_vmstate_array_len"

This happens because KVM exposes a different number of registers
to QEMU on the source and on the destination. When migrating a bigger
register set to a smaller one, QEMU cannot load the CPU state on the
destination.
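
A minimal sketch of that mismatch, written as standalone C and not
taken from QEMU: RegDiff, diff_reg_lists() and
migration_would_succeed() are hypothetical names, and the comparison is
a simplified model of checking the incoming register list against the
list the destination obtained from KVM. Both lists are assumed to be
sorted in ascending order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type, not a QEMU structure. */
typedef struct {
    size_t unexpected; /* in the incoming stream, unknown to the destination */
    size_t missing;    /* known to the destination, absent from the stream */
} RegDiff;

/* Walk two ascending register index lists and count the differences. */
static RegDiff diff_reg_lists(const uint64_t *incoming, size_t n_in,
                              const uint64_t *local, size_t n_local)
{
    RegDiff d = { 0, 0 };
    size_t i = 0, j = 0;

    assert(n_in == 0 || incoming != NULL);
    assert(n_local == 0 || local != NULL);

    while (i < n_in && j < n_local) {
        if (incoming[i] == local[j]) {
            i++;
            j++;
        } else if (incoming[i] < local[j]) {
            d.unexpected++;   /* source has it, destination does not */
            i++;
        } else {
            d.missing++;      /* destination has it, source does not */
            j++;
        }
    }
    d.unexpected += n_in - i;  /* leftover incoming registers */
    d.missing += n_local - j;  /* leftover destination registers */
    return d;
}

/* In this model only unexpected registers are fatal. */
static int migration_would_succeed(RegDiff d)
{
    return d.unexpected == 0;
}
```

In this model a missing register is harmless (the destination keeps
its default value), while a single unexpected register is enough to
reject the incoming state.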

For example, we recently hit this situation with:
- unconditional exposure of the KVM_REG_ARM_VENDOR_HYP_BMAP_2 FW pseudo
  register from v6.16 onwards, which causes backward migration failures.
- removal of the unconditional exposure of TCR2_EL1, PIRE0_EL1 and
  PIR_EL1 from v6.13 onwards, which causes forward migration failures.

This situation is really problematic for distributions which want to
guarantee forward and backward migration of a given machine type
between different releases.

This small series tries to address that issue by introducing CPU
array properties that list the registers to ignore or to fake according
to the situation. An example is given to illustrate how those props
could be used to apply compats for machine types supposed to "see" the
same register set across various host kernels.

The first patch improves the tracing so that we can quickly detect
which registers are unexpected and cause the migration failure. Missing
registers are also traced. Those do not fail migration but their default
value is kept on the destination.

Then we introduce the infrastructure to handle 'hidden' registers and
'fake' registers.
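
As a rough illustration of the intent, and not the code from this
series, the sketch below models the migrated register list as "what
the host kernel exposes, minus the hidden registers, plus the fake
ones". The function names and the register IDs are made up for the
example (they are not real KVM register encodings), and the exact
semantics of the two properties are an assumption here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Linear membership test; fine for the short lists of a sketch. */
static bool contains(const uint64_t *set, size_t n, uint64_t id)
{
    for (size_t i = 0; i < n; i++) {
        if (set[i] == id) {
            return true;
        }
    }
    return false;
}

/*
 * Build the list that would be migrated: keep every register the host
 * exposes unless it is hidden, then append each fake register the
 * host does not expose. 'out' must hold n_kvm + n_fake entries.
 * Returns the number of entries written.
 */
static size_t build_migrated_list(const uint64_t *kvm, size_t n_kvm,
                                  const uint64_t *hidden, size_t n_hidden,
                                  const uint64_t *fake, size_t n_fake,
                                  uint64_t *out)
{
    size_t n = 0;

    assert(out != NULL);
    for (size_t i = 0; i < n_kvm; i++) {
        if (!contains(hidden, n_hidden, kvm[i])) {
            out[n++] = kvm[i];
        }
    }
    for (size_t i = 0; i < n_fake; i++) {
        if (!contains(out, n, fake[i])) {
            out[n++] = fake[i];
        }
    }
    return n;
}

/* Set equality, assuming neither list contains duplicates. */
static bool same_set(const uint64_t *a, size_t na,
                     const uint64_t *b, size_t nb)
{
    if (na != nb) {
        return false;
    }
    for (size_t i = 0; i < na; i++) {
        if (!contains(b, nb, a[i])) {
            return false;
        }
    }
    return true;
}
```

With an older host exposing {0x10, 0x11, 0x20}, a newer host exposing
{0x10, 0x11, 0x30}, 0x30 listed as hidden and 0x20 listed as fake, both
hosts end up with the same set {0x10, 0x11, 0x20}, so migration works
in both directions in this model.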

Eric Auger (7):
  target/arm/machine: Improve traces on register mismatch during
    migration
  target/arm/kvm: Introduce the concept of hidden KVM regs
  target/arm/kvm: Introduce the concept of enforced/fake registers
  kvm-all: Add the capability to blacklist some KVM regs
  target/arm/cpu: Implement hide_reg callback()
  target/arm/kvm: Expose kvm-hidden-regs and kvm-fake-regs properties
  hw/arm/virt: [DO NOT UPSTREAM] Enforce compatibility with older
    kernels

 include/hw/core/cpu.h   |  2 ++
 target/arm/cpu.h        | 42 ++++++++++++++++++++++++
 accel/kvm/kvm-all.c     | 12 +++++++
 hw/arm/virt.c           | 19 +++++++++++
 target/arm/cpu.c        | 12 +++++++
 target/arm/kvm.c        | 73 ++++++++++++++++++++++++++++++++++++++++-
 target/arm/machine.c    | 71 +++++++++++++++++++++++++++++++++++----
 target/arm/trace-events | 11 +++++++
 8 files changed, 235 insertions(+), 7 deletions(-)

-- 
2.49.0
Re: [RESEND PATCH 0/7] Mitigation of "failed to load cpu:cpreg_vmstate_array_len" migration failures
Posted by Eric Auger 2 weeks, 3 days ago

On 10/16/25 3:59 PM, Eric Auger wrote:
> When migrating ARM guests across same machines with different host
> kernels we are likely to encounter failures such as:
>
> "failed to load cpu:cpreg_vmstate_array_len"
>
> This is due to the fact KVM exposes a different number of registers
> to qemu on source and destination. When trying to migrate a bigger
> register set to a smaller one, qemu cannot save the CPU state.
>
> For example, recently we faced such kind of situations with:
> - unconditional exposure of KVM_REG_ARM_VENDOR_HYP_BMAP_2 FW pseudo
>   register from v6.16 onwards. Causes backward migration failure.
> - removal of unconditional exposure of TCR2_EL1, PIRE0_EL1, PIR_EL1
>   from v6.13 onwards. Causes forward migration failure.
>
> This situation is really problematic for distributions which want to
> guarantee forward and backward migration of a given machine type
> between different releases.
>
> This small series tries to address that issue by introducing CPU
> array properties that list the registers to ignore or to fake according
> to the situation. An example is given to illustrate how those props
> could be used to apply compats for machine types supposed to "see" the
> same register set across various host kernels.
>
> The first patch improves the tracing so that we can quickly detect
> which registers are unexpected and cause the migration failure. Missing
> registers are also traced. Those do not fail migration but their default
> value is kept on the destination.
>
> Then we introduce the infrastructure to handle 'hidden' registers and
> 'fake' registers.
>
> Eric Auger (7):
>   target/arm/machine: Improve traces on register mismatch during
>     migration
>   target/arm/kvm: Introduce the concept of hidden KVM regs
>   target/arm/kvm: Introduce the concept of enforced/fake registers
>   kvm-all: Add the capability to blacklist some KVM regs
>   target/arm/cpu: Implement hide_reg callback()
>   target/arm/kvm: Expose kvm-hidden-regs and kvm-fake-regs properties
>   hw/arm/virt: [DO NOT UPSTREAM] Enforce compatibility with older
>     kernels

Gentle ping.

Any comments on the approach?

Thanks

Eric
>
>  include/hw/core/cpu.h   |  2 ++
>  target/arm/cpu.h        | 42 ++++++++++++++++++++++++
>  accel/kvm/kvm-all.c     | 12 +++++++
>  hw/arm/virt.c           | 19 +++++++++++
>  target/arm/cpu.c        | 12 +++++++
>  target/arm/kvm.c        | 73 ++++++++++++++++++++++++++++++++++++++++-
>  target/arm/machine.c    | 71 +++++++++++++++++++++++++++++++++++----
>  target/arm/trace-events | 11 +++++++
>  8 files changed, 235 insertions(+), 7 deletions(-)
>
Re: [RESEND PATCH 0/7] Mitigation of "failed to load cpu:cpreg_vmstate_array_len" migration failures
Posted by Peter Maydell 2 weeks, 3 days ago
On Tue, 28 Oct 2025 at 10:05, Eric Auger <eric.auger@redhat.com> wrote:
>
>
>
> On 10/16/25 3:59 PM, Eric Auger wrote:
> > When migrating ARM guests across same machines with different host
> > kernels we are likely to encounter failures such as:
> >
> > "failed to load cpu:cpreg_vmstate_array_len"
> >
> > This is due to the fact KVM exposes a different number of registers
> > to qemu on source and destination. When trying to migrate a bigger
> > register set to a smaller one, qemu cannot save the CPU state.
> >
> > For example, recently we faced such kind of situations with:
> > - unconditional exposure of KVM_REG_ARM_VENDOR_HYP_BMAP_2 FW pseudo
> >   register from v6.16 onwards. Causes backward migration failure.
> > - removal of unconditional exposure of TCR2_EL1, PIRE0_EL1, PIR_EL1
> >   from v6.13 onwards. Causes forward migration failure.

> Gentle ping.
>
> Any comments on the approach?

A couple of general remarks:

(1) This isn't KVM specific -- see e.g. commit 4f2b82f60
where we had to add back a fake cpreg to un-break forward
migration of TCG CPUs. So our handling of this kind of problem
shouldn't be restricted to only working with KVM.

(2) essentially we're re-inventing the migration compat
support that VMStateDescriptions provide. That's kind of
unavoidable because of the way I implemented cpreg migration
years ago, but is there anything we can learn in terms of
(a) required feature set and (b) trying to keep parallels
between the two for the way things work ?

thanks
-- PMM
Re: [RESEND PATCH 0/7] Mitigation of "failed to load cpu:cpreg_vmstate_array_len" migration failures
Posted by Eric Auger 2 weeks, 2 days ago
Hi Peter,

On 10/28/25 11:47 AM, Peter Maydell wrote:
> On Tue, 28 Oct 2025 at 10:05, Eric Auger <eric.auger@redhat.com> wrote:
>>
>>
>> On 10/16/25 3:59 PM, Eric Auger wrote:
>>> When migrating ARM guests across same machines with different host
>>> kernels we are likely to encounter failures such as:
>>>
>>> "failed to load cpu:cpreg_vmstate_array_len"
>>>
>>> This is due to the fact KVM exposes a different number of registers
>>> to qemu on source and destination. When trying to migrate a bigger
>>> register set to a smaller one, qemu cannot save the CPU state.
>>>
>>> For example, recently we faced such kind of situations with:
>>> - unconditional exposure of KVM_REG_ARM_VENDOR_HYP_BMAP_2 FW pseudo
>>>   register from v6.16 onwards. Causes backward migration failure.
>>> - removal of unconditional exposure of TCR2_EL1, PIRE0_EL1, PIR_EL1
>>>   from v6.13 onwards. Causes forward migration failure.
>> Gentle ping.
>>
>> Any comments on the approach?
> A couple of general remarks:
>
> (1) This isn't KVM specific -- see e.g. commit 4f2b82f60
> where we had to add back a fake cpreg to un-break forward
> migration of TCG CPUs. So our handling of this kind of problem
> shouldn't be restricted to only working with KVM.

Interesting. I will see how this can be extended to TCG.
>
> (2) essentially we're re-inventing the migration compat
> support that VMStateDescriptions provide. That's kind of
> unavoidable because of the way I implemented cpreg migration
> years ago, but is there anything we can learn in terms of
> (a) required feature set and (b) trying to keep parallels
> between the two for the way things work ?

OK I will study that further.

Thank you for your suggestions!

Eric
>
> thanks
> -- PMM
>