migration: Fix aarch64 cpregs migration

[RFC PATCH 2/3] tests/qtest/migration: Only test aarch64 on TCG

Posted by Fabiano Rosas 6 months, 1 week ago

Currently our aarch64 tests are only being run using identical QEMU
versions. When running the tests with different QEMU versions, which
is a common use-case for migration, the tests are broken due to the
current choice of the 'max' cpu, which is not stable and is prone to
breaking migration.

This means aarch64 tests are currently only testing about the same
situations as any other arch, i.e. no arm-specific testing is being
done.

To make the aarch64 tests more useful, -cpu max will be changed to
-cpu neoverse-n1 in the next patch. Before doing that, make sure
aarch64 tests only run with TCG, since KVM testing depends on usage of
the -cpu host and we currently don't have code to switch between cpus
according to test runtime environment.

Also, TCG alone should allow us to catch most issues with migration,
since there is no guarantee of a uniform environment as there is with
KVM.

Suggested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 tests/qtest/migration/framework.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/tests/qtest/migration/framework.c b/tests/qtest/migration/framework.c
index 407c9023c0..f09365d122 100644
--- a/tests/qtest/migration/framework.c
+++ b/tests/qtest/migration/framework.c
@@ -353,8 +353,17 @@ int migrate_start(QTestState **from, QTestState **to, const char *uri,
         memory_backend = g_strdup_printf("-m %s ", memory_size);
     }
 
-    if (args->use_dirty_ring) {
-        kvm_opts = ",dirty-ring-size=4096";
+
+    if (g_str_equal(arch, "aarch64")) {
+        /*
+         * aarch64 is only tested with TCG because there is no single
+         * cpu that can be used for both KVM and TCG.
+         */
+        kvm_opts = NULL;
+    } else if (args->use_dirty_ring) {
+        kvm_opts = "-accel kvm,dirty-ring-size=4096";
+    } else {
+        kvm_opts = "-accel kvm";
     }
 
     if (!qtest_has_machine(machine_alias)) {
@@ -368,7 +377,7 @@ int migrate_start(QTestState **from, QTestState **to, const char *uri,
 
     g_test_message("Using machine type: %s", machine);
 
-    cmd_source = g_strdup_printf("-accel kvm%s -accel tcg "
+    cmd_source = g_strdup_printf("%s -accel tcg "
                                  "-machine %s,%s "
                                  "-name source,debug-threads=on "
                                  "%s "
@@ -395,7 +404,7 @@ int migrate_start(QTestState **from, QTestState **to, const char *uri,
      */
     events = args->defer_target_connect ? "-global migration.x-events=on" : "";
 
-    cmd_target = g_strdup_printf("-accel kvm%s -accel tcg "
+    cmd_target = g_strdup_printf("%s -accel tcg "
                                  "-machine %s,%s "
                                  "-name target,debug-threads=on "
                                  "%s "
-- 
2.35.3
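
For readers following the diff, the accelerator selection it introduces can be sketched in isolation as below. This is only an illustration under stated assumptions, not the code in framework.c: the helper names accel_prefix() and build_accel_args() are hypothetical, plain snprintf() stands in for g_strdup_printf(), and an empty string stands in for the NULL kvm_opts used in the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): return the accelerator
 * options that go in front of "-accel tcg" on the QEMU command line.
 */
static const char *accel_prefix(const char *arch, int use_dirty_ring)
{
    assert(arch != NULL);

    if (strcmp(arch, "aarch64") == 0) {
        /* aarch64 is tested with TCG only, so no KVM option is emitted. */
        return "";
    }
    return use_dirty_ring ? "-accel kvm,dirty-ring-size=4096 "
                          : "-accel kvm ";
}

/*
 * Hypothetical helper: format the accelerator part of the command line
 * into buf and return the snprintf() result.
 */
static int build_accel_args(char *buf, size_t len, const char *arch,
                            int use_dirty_ring)
{
    return snprintf(buf, len, "%s-accel tcg",
                    accel_prefix(arch, use_dirty_ring));
}
```

With these helpers, an "aarch64" arch yields only the TCG option, while any other arch keeps KVM first with TCG as the fallback, which is the ordering the original format string already had.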

Re: [RFC PATCH 2/3] tests/qtest/migration: Only test aarch64 on TCG

Posted by Peter Maydell 6 months, 1 week ago

On Wed, 30 Jul 2025 at 21:52, Fabiano Rosas <farosas@suse.de> wrote:
>
> Currently our aarch64 tests are only being run using identical QEMU
> versions. When running the tests with different QEMU versions, which
> is a common use-case for migration, the tests are broken due to the
> current choice of the 'max' cpu, which is not stable and is prone to
> breaking migration.
>
> This means aarch64 tests are currently only testing about the same
> situations as any other arch, i.e. no arm-specific testing is being
> done.
>
> To make the aarch64 tests more useful, -cpu max will be changed to
> -cpu neoverse-n1 in the next patch. Before doing that, make sure
> aarch64 tests only run with TCG, since KVM testing depends on usage of
> the -cpu host and we currently don't have code to switch between cpus
> according to test runtime environment.
>
> Also, TCG alone should allow us to catch most issues with migration,
> since there is no guarantee of a uniform environment as there is with
> KVM.

The difficulty with only testing TCG migration is that now
we're testing the setup that most cross-version migration users
don't care about. At least my assumption is that it's KVM
cross-version migration that is the real use case here.

For instance, this migration bug with the DBGDTR register
isn't a problem for KVM, because with KVM we use the kernel
to tell us what system registers are present, and whether
a register is defined with a cpreg in QEMU or not doesn't
affect what we put on the wire for migration. Conversely
there might be migration compat issues that show up only
with KVM and not TCG (though the most obvious source of those
would be host kernel changes, which is kind of out of scope
for us).

Though of course with our CI jobs we're probably not
doing AArch64 KVM cross-version testing anyway...

thanks
-- PMM

Re: [RFC PATCH 2/3] tests/qtest/migration: Only test aarch64 on TCG

Posted by Mohamed Mediouni 6 months, 1 week ago


> On 31. Jul 2025, at 15:37, Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> On Wed, 30 Jul 2025 at 21:52, Fabiano Rosas <farosas@suse.de> wrote:
>> 
>> Currently our aarch64 tests are only being run using identical QEMU
>> versions. When running the tests with different QEMU versions, which
>> is a common use-case for migration, the tests are broken due to the
>> current choice of the 'max' cpu, which is not stable and is prone to
>> breaking migration.
>> 
>> This means aarch64 tests are currently only testing about the same
>> situations as any other arch, i.e. no arm-specific testing is being
>> done.
>> 
>> To make the aarch64 tests more useful, -cpu max will be changed to
>> -cpu neoverse-n1 in the next patch. Before doing that, make sure
>> aarch64 tests only run with TCG, since KVM testing depends on usage of
>> the -cpu host and we currently don't have code to switch between cpus
>> according to test runtime environment.
>> 
>> Also, TCG alone should allow us to catch most issues with migration,
>> since there is no guarantee of a uniform environment as there is with
>> KVM.
> 
> The difficulty with only testing TCG migration is that now
> we're testing the setup that most cross-version migration users
> don't care about. At least my assumption is that it's KVM
> cross-version migration that is the real use case here.
> 
> For instance, this migration bug with the DBGDTR register
> isn't a problem for KVM, because with KVM we use the kernel
> to tell us what system registers are present, and whether
> a register is defined with a cpreg in QEMU or not doesn't
> affect what we put on the wire for migration. Conversely
> there might be migration compat issues that show up only
> with KVM and not TCG (though the most obvious source of those
> would be host kernel changes, which is kind of out of scope
> for us).
> 
> Though of course with our CI jobs we're probably not
> doing AArch64 KVM cross-version testing anyway...
> 
On the cloud provider side*, we do rely on having rollbacks work.

We rely on staged deployments, rolling back if things go wrong
as we observe progress.

Note that the set of MSRs KVM gives (at least on AArch64) does sometimes
vary between releases, so for rolling back you’ll need to ignore some (new)
sysregs in the VMM. This takes careful planning: you deploy a VMM release
with a point-fix to ignore the new registers, and then the kernel update.

So not dealing with that would make the cloud use case unusable without
downstream patches.

*although we don’t rely on QEMU for Nitro System VMs

Thank you,
> thanks
> -- PMM
>
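
The rollback scenario in the last reply (a VMM that must ignore sysregs a newer kernel exposes) can be sketched as a check on the receiving side. This is a minimal illustration only, assuming each register is identified by a 64-bit ID; the function names and the ignore list are hypothetical and do not describe QEMU's or any other VMM's actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if id appears in list[0..n). */
static bool id_in_list(uint64_t id, const uint64_t *list, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (list[i] == id) {
            return true;
        }
    }
    return false;
}

/*
 * Hypothetical destination-side check: every incoming register ID must
 * either be known locally or appear on an explicit ignore list.
 * Returns 0 if the incoming state is acceptable, -1 otherwise.
 */
static int check_incoming_regs(const uint64_t *incoming, size_t n_in,
                               const uint64_t *known, size_t n_known,
                               const uint64_t *ignore, size_t n_ignore)
{
    assert(n_in == 0 || incoming != NULL);

    for (size_t i = 0; i < n_in; i++) {
        if (id_in_list(incoming[i], known, n_known)) {
            continue;           /* register exists on this side */
        }
        if (id_in_list(incoming[i], ignore, n_ignore)) {
            continue;           /* new register we agreed to drop */
        }
        return -1;              /* unknown register: refuse the state */
    }
    return 0;
}
```

In the staged deployment described above, the ignore list would ship in the VMM point-fix before the kernel update, so that state saved under the newer kernel can still be loaded after a rollback.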