[RFC PATCH] tests/qtest: Don't parallelize migration-test

Peter Maydell posted 1 patch 2 months, 2 weeks ago
[RFC PATCH] tests/qtest: Don't parallelize migration-test
Posted by Peter Maydell 2 months, 2 weeks ago
The migration-test is a long-running test whose subtests all launch
at least two QEMU processes.  This means that if for example the host
has 4 CPUs then 'make check' defaults to a parallelism of 5, and if
we launch 5 migration-tests in parallel then we will be running 10
QEMU instances on a 4 CPU system.  If the system is not very fast
then the test can spuriously time out because the different tests are
all stealing CPU from each other.  This seems to particularly be a
problem on our S390 CI job and the cross-i686-tci CI job.

Force meson to run migration-test non-parallel, so there is never any
other test running at the same time as it.  This will slow down
overall test execution time somewhat, but hopefully make our CI less
flaky.

The downside is that because each migration-test instance runs for
between 2 and 5 minutes and we run it for five architectures this
significantly increases the runtime.  For an all-architectures build
on my local machine 'make check -j8' goes from

 real    8m19.127s
 user    31m47.534s
 sys     19m42.650s

to

 real    20m31.218s
 user    32m48.712s
 sys     19m52.133s

more than doubling the wallclock runtime.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
Also, looking at these figures we spend a *lot* of our overall
'make check' time on migration-test. Do we really need to do
that much for every architecture?
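
To put numbers on the figures above, here is a minimal Python sketch that
converts the two 'time' outputs to seconds and compares them. The helper
names (parse_time, ratios) are made up for illustration and are not part
of any QEMU script.

```python
import re

def parse_time(text):
    """Convert a shell `time` value such as '8m19.127s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if m is None:
        raise ValueError(f"unrecognised time value: {text!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

# Figures copied from the 'make check -j8' runs quoted in the commit message.
before = {"real": "8m19.127s", "user": "31m47.534s", "sys": "19m42.650s"}
after = {"real": "20m31.218s", "user": "32m48.712s", "sys": "19m52.133s"}

def ratios(before, after):
    """Return after/before for each of real, user and sys."""
    return {k: parse_time(after[k]) / parse_time(before[k]) for k in before}

for key, ratio in ratios(before, after).items():
    print(f"{key}: {ratio:.2f}x")
```

Wallclock time grows by roughly 2.5x while user and sys time change by
only a few percent, which is what you would expect if the total work is
about the same and it has merely been serialized.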

It's unfortunate that meson doesn't let us say "parallel is
OK, but not very parallel". One other approach would be
to have mtest2make say "run tests at half the parallelism that
-jN suggests, rather than at that parallelism", I guess...
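
As a rough illustration of the "half the parallelism" idea, here is a
Python sketch of the arithmetic. It is not how mtest2make works today;
the function names are hypothetical, and the "CPUs + 1" default is only
an assumption generalised from the 4-CPU example in the commit message.

```python
# Each migration-test subtest launches at least two QEMU processes.
QEMU_PER_MIGRATION_TEST = 2

def default_jobs(host_cpus):
    # Assumption: generalises "4 CPUs -> parallelism of 5".
    return host_cpus + 1

def halved_jobs(jobs):
    # Hypothetical policy: run at half the -jN parallelism, never below 1.
    return max(1, jobs // 2)

def worst_case_qemu_instances(jobs):
    # Worst case: every job slot is occupied by a migration-test.
    return jobs * QEMU_PER_MIGRATION_TEST

jobs = default_jobs(4)
print(jobs, worst_case_qemu_instances(jobs))   # 5 10
half = halved_jobs(jobs)
print(half, worst_case_qemu_instances(half))   # 2 4
```

Under these assumptions, halving brings the worst case on a 4 CPU host
down from 10 QEMU instances to 4, i.e. no more than one per CPU, without
serializing the test completely.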
---
 tests/qtest/meson.build | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
index fc852f3d8ba..dbf2b8e2be1 100644
--- a/tests/qtest/meson.build
+++ b/tests/qtest/meson.build
@@ -17,6 +17,21 @@ slow_qtests = {
   'vmgenid-test': 610,
 }
 
+# Tests which override the default of "can run in parallel".
+# Don't use this to work around test bugs which prevent parallelism.
+# Do document why we need to make a particular test serialized.
+# Do be sparing with use of this: tests listed here will not be
+# run in parallel with any other test, not merely not with other
+# instances of themselves.
+#
+# The migration-test's subtests will each kick off two QEMU processes,
+# so allowing multiple migration-tests in parallel can overload the
+# host system and result in intermittent timeouts. So we only want to
+# run one migration-test at once.
+qtests_parallelism = {
+  'migration-test': false,
+}
+
 qtests_generic = [
   'cdrom-test',
   'device-introspect-test',
@@ -411,6 +426,7 @@ foreach dir : target_dirs
          protocol: 'tap',
          timeout: slow_qtests.get(test, 60),
          priority: slow_qtests.get(test, 60),
+         is_parallel: qtests_parallelism.get(test, true),
          suite: ['qtest', 'qtest-' + target_base])
   endforeach
 endforeach
-- 
2.34.1
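
For readers less familiar with meson dictionaries, the lookup added in the
second hunk behaves like a Python dict lookup with a default. The sketch
below is only a Python model of that logic, not meson code; the is_parallel
function is a stand-in for the is_parallel keyword argument passed to
test().

```python
# Python model of the meson dictionary added by the patch.
qtests_parallelism = {
    "migration-test": False,
}

def is_parallel(test):
    # Mirrors qtests_parallelism.get(test, true): tests that are not
    # listed keep the default and may run in parallel.
    return qtests_parallelism.get(test, True)

for name in ("migration-test", "cdrom-test", "device-introspect-test"):
    print(name, is_parallel(name))
```

Only tests explicitly listed with a false value are serialized; every other
qtest keeps the default.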
Re: [RFC PATCH] tests/qtest: Don't parallelize migration-test
Posted by Alex Bennée 2 months, 2 weeks ago
Peter Maydell <peter.maydell@linaro.org> writes:

> The migration-test is a long-running test whose subtests all launch
> at least two QEMU processes.  This means that if for example the host
> has 4 CPUs then 'make check' defaults to a parallelism of 5, and if
> we launch 5 migration-tests in parallel then we will be running 10
> QEMU instances on a 4 CPU system.  If the system is not very fast
> then the test can spuriously time out because the different tests are
> all stealing CPU from each other.  This seems to particularly be a
> problem on our S390 CI job and the cross-i686-tci CI job.
>
> Force meson to run migration-test non-parallel, so there is never any
> other test running at the same time as it.  This will slow down
> overall test execution time somewhat, but hopefully make our CI less
> flaky.
>
> The downside is that because each migration-test instance runs for
> between 2 and 5 minutes and we run it for five architectures this
> significantly increases the runtime.  For an all-architectures build
> on my local machine 'make check -j8' goes from
>
>  real    8m19.127s
>  user    31m47.534s
>  sys     19m42.650s
>
> to
>
>  real    20m31.218s
>  user    32m48.712s
>  sys     19m52.133s
>
> more than doubling the wallclock runtime.
>
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> Also, looking at these figures we spend a *lot* of our overall
> 'make check' time on migration-test. Do we really need to do
> that much for every architecture?

I guess one question is: are we getting value from all the extra
migration tests? There certainly seem to be some sub-tests that are
slower than the others and which, I assume, test only a small delta
over the tests before them.

On s390x it seems the native test takes pretty much the same time as
the other TCG guests. Do we exercise any extra migration code by running
tests for every architecture as opposed to one KVM/native hyp and one
TCG one?

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro
Re: [RFC PATCH] tests/qtest: Don't parallelize migration-test
Posted by Peter Maydell 2 months, 2 weeks ago
On Mon, 9 Sept 2024 at 16:23, Alex Bennée <alex.bennee@linaro.org> wrote:
> I guess one question is: are we getting value from all the extra
> migration tests? There certainly seem to be some sub-tests that are
> slower than the others and which, I assume, test only a small delta
> over the tests before them.
>
> On s390x it seems the native test takes pretty much the same time as
> the other TCG guests. Do we exercise any extra migration code by running
> tests for every architecture as opposed to one KVM/native hyp and one
> TCG one?

s390 is an interesting one because Christian pointed out that
although it has "KVM" support, we're actually running on a
VM under z/VM, and so when we run a CI test under "-accel kvm"
that's nested KVM, and its effects on the host CPU's
TLB could be such that it's worse than using TCG...

-- PMM