[PATCH v3 00/16] tests: enable meson test timeouts to improve debuggability

Thomas Huth posted 16 patches 4 months, 2 weeks ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20231215070357.10888-1-thuth@redhat.com
Maintainers: Paolo Bonzini <pbonzini@redhat.com>, "Alex Bennée" <alex.bennee@linaro.org>, Thomas Huth <thuth@redhat.com>, John Snow <jsnow@redhat.com>, Cleber Rosa <crosa@redhat.com>, Aurelien Jarno <aurelien@aurel32.net>, Peter Maydell <peter.maydell@linaro.org>, Laurent Vivier <lvivier@redhat.com>
[PATCH v3 00/16] tests: enable meson test timeouts to improve debuggability
Posted by Thomas Huth 4 months, 2 weeks ago
This is a respin of Daniel's series that re-enables the meson test
runner timeouts. To make sure that we do not get into trouble on
older systems, I ran all the tests with "make check SPEED=slow -j32"
on my laptop that has only 16 SMT threads, so each test was running
quite a bit slower than with a normal "-j$(nproc)" run. I think
that these timeouts should now work in most cases - if not, we still
can adjust them easily later.

Daniel's original patch series description follows:

---------------------------- 8< -------------------------------------

Perhaps the most painful of all the GitLab CI failures we see are
the enforced job timeouts:

   "ERROR: Job failed: execution took longer than 1h15m0s seconds"

   https://gitlab.com/qemu-project/qemu/-/jobs/4387047648

When that hits, the CI log shows what has *already* run, but figuring
out what was currently running (or rather stuck) is horrendously
difficult.

The initial meson port disabled the meson test timeouts, in order to
limit the scope for introducing side effects from the port that would
complicate adoption.

Now that the meson port is basically finished we can take advantage of
more of its improved features. It has the ability to set timeouts for
test programs, defaulting to 30 seconds, but overridable per test. This
is further helped by the fact that we changed the iotests integration so
that each iotest is a distinct meson test, instead of having one
single giant (slow) test.
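
To illustrate the "default plus per-test override" idea, here is a minimal
Python sketch. It is not meson's implementation: the override table, the
test name "slow-test" and the helpers timeout_for() and run_with_timeout()
are hypothetical. Only the 30 second default is taken from the text above.

```python
import subprocess
import sys
import time

DEFAULT_TIMEOUT = 30.0  # seconds; the default mentioned in the text

# Hypothetical per-test overrides in seconds (illustrative values only).
TIMEOUT_OVERRIDES = {
    "slow-test": 120.0,
}


def timeout_for(name, overrides=TIMEOUT_OVERRIDES, default=DEFAULT_TIMEOUT):
    """Return the override for this test if there is one, else the default."""
    return overrides.get(name, default)


def run_with_timeout(name, argv, overrides=TIMEOUT_OVERRIDES):
    """Run argv and report (name, status, elapsed seconds).

    status is "OK", "FAIL" or "TIMEOUT". On timeout the child is asked to
    stop with terminate(), which sends SIGTERM on POSIX systems.
    """
    limit = timeout_for(name, overrides)
    start = time.monotonic()
    proc = subprocess.Popen(argv)
    try:
        rc = proc.wait(timeout=limit)
        status = "OK" if rc == 0 else "FAIL"
    except subprocess.TimeoutExpired:
        proc.terminate()
        proc.wait()
        status = "TIMEOUT"
    return name, status, time.monotonic() - start


if __name__ == "__main__":
    # A child that sleeps far longer than its (deliberately tiny) limit.
    sleeper = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout("demo", sleeper, {"demo": 0.5}))
```

The point of reporting the test name together with the TIMEOUT status is
that a stuck test is identified directly, instead of the whole job being
killed from outside.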

We already set overrides for a bunch of tests, but they've not been
kept up to date since we had timeouts disabled. So this series first
updates the timeout overrides such that all tests pass when run in
my test gitlab CI pipeline. Then it enables use of meson timeouts.

We might still hit timeouts due to non-deterministic performance of
gitlab CI runners. So we'll probably have to increase a few more
timeouts in the short term. Fortunately this is going to be massively
easier to diagnose. For example this job during my testing:

   https://gitlab.com/berrange/qemu/-/jobs/4392029495

we can immediately see the problem tests:

Summary of Failures:
  6/252 qemu:qtest+qtest-i386 / qtest-i386/bios-tables-test                TIMEOUT        120.02s   killed by signal 15 SIGTERM
  7/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/bios-tables-test          TIMEOUT        120.03s   killed by signal 15 SIGTERM
 64/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/qom-test                  TIMEOUT        300.03s   killed by signal 15 SIGTERM

The full meson testlog.txt will show each individual TAP log output,
so we can then see exactly which test case we got stuck on.
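
A summary in this shape can also be filtered mechanically. The following
Python sketch assumes the column layout shown in the excerpt above
(index/total, suite, test name, status, duration); the regular expression
and the function name timed_out_tests() are illustrative and not part of
meson.

```python
import re

# Assumed layout, taken from the excerpt above:
#   <n>/<total> <suite> / <test>   <STATUS>   <seconds>s   <details>
LINE_RE = re.compile(
    r"^\s*(\d+)/(\d+)\s+(\S+)\s+/\s+(\S+)\s+([A-Z]+)\s+([\d.]+)s\b"
)


def timed_out_tests(summary_text):
    """Return (test name, seconds) for every line whose status is TIMEOUT."""
    results = []
    for line in summary_text.splitlines():
        m = LINE_RE.match(line)
        if m and m.group(5) == "TIMEOUT":
            results.append((m.group(4), float(m.group(6))))
    return results


SAMPLE = """\
Summary of Failures:
  6/252 qemu:qtest+qtest-i386 / qtest-i386/bios-tables-test                TIMEOUT        120.02s   killed by signal 15 SIGTERM
  7/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/bios-tables-test          TIMEOUT        120.03s   killed by signal 15 SIGTERM
 64/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/qom-test                  TIMEOUT        300.03s   killed by signal 15 SIGTERM
"""

for name, seconds in timed_out_tests(SAMPLE):
    print(f"{name}: killed after {seconds:.2f}s")
```

The durations sit just above 120 and 300 seconds, which suggests these
tests ran into their configured limits rather than failing on their own.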

---------------------------- 8< -------------------------------------

Daniel P. Berrangé (12):
  qtest: bump min meson timeout to 60 seconds
  qtest: bump migration-test timeout to 8 minutes
  qtest: bump qom-test timeout to 15 minutes
  qtest: bump npcm7xx_pwn-test timeout to 5 minutes
  qtest: bump test-hmp timeout to 4 minutes
  qtest: bump pxe-test timeout to 10 minutes
  qtest: bump prom-env-test timeout to 6 minutes
  qtest: bump boot-serial-test timeout to 3 minutes
  qtest: bump qos-test timeout to 2 minutes
  qtest: bump aspeed_smc-test timeout to 6 minutes
  qtest: bump bios-table-test timeout to 9 minutes
  mtest2make: stop disabling meson test timeouts

Thomas Huth (4):
  tests/qtest: Bump the device-introspect-test timeout to 12 minutes
  tests/unit: Bump test-aio-multithread test timeout to 2 minutes
  tests/unit: Bump test-crypto-block test timeout to 5 minutes
  tests/fp: Bump fp-test-mulAdd test timeout to 3 minutes

 scripts/mtest2make.py   |  3 ++-
 tests/fp/meson.build    |  2 +-
 tests/qtest/meson.build | 25 +++++++++++++------------
 tests/unit/meson.build  |  2 ++
 4 files changed, 18 insertions(+), 14 deletions(-)

-- 
2.43.0


Re: [PATCH v3 00/16] tests: enable meson test timeouts to improve debuggability
Posted by Alex Bennée 4 months, 2 weeks ago
Thomas Huth <thuth@redhat.com> writes:

> This is a respin of Daniel's series that re-enables the meson test
> runner timeouts. To make sure that we do not get into trouble on
> older systems, I ran all the tests with "make check SPEED=slow -j32"
> on my laptop that has only 16 SMT threads, so each test was running
> quite a bit slower than with a normal "-j$(nproc)" run. I think
> that these timeouts should now work in most cases - if not, we still
> can adjust them easily later.

Queued to testing/next, thanks.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro
Re: [PATCH v3 00/16] tests: enable meson test timeouts to improve debuggability
Posted by Michael Tokarev 3 months, 1 week ago
15.12.2023 10:03, Thomas Huth wrote:
> This is a respin of Daniel's series that re-enables the meson test
> runner timeouts. To make sure that we do not get into trouble on
> older systems, I ran all the tests with "make check SPEED=slow -j32"
> on my laptop that has only 16 SMT threads, so each test was running
> quite a bit slower than with a normal "-j$(nproc)" run. I think
> that these timeouts should now work in most cases - if not, we still
> can adjust them easily later.

I'm picking this up for stable branches too, since there we have the same
problems in the CI environment. In particular, bios-tables-test almost always
times out; even hitting retry doesn't help.  Let's see how it goes.

JFYI.

Thanks,

/mjt
Re: [PATCH v3 00/16] tests: enable meson test timeouts to improve debuggability
Posted by Thomas Huth 3 months, 1 week ago
On 23/01/2024 17.50, Michael Tokarev wrote:
> 15.12.2023 10:03, Thomas Huth wrote:
>> This is a respin of Daniel's series that re-enables the meson test
>> runner timeouts. To make sure that we do not get into trouble on
>> older systems, I ran all the tests with "make check SPEED=slow -j32"
>> on my laptop that has only 16 SMT threads, so each test was running
>> quite a bit slower than with a normal "-j$(nproc)" run. I think
>> that these timeouts should now work in most cases - if not, we still
>> can adjust them easily later.
> 
> I'm picking this up for stable branches too, since there we have the same
> problems in CI environment. In particular, bios-tables-test almost always
> times out, even hitting retry doesn't help.  Let's see how it goes..

Uh, wait, that does not make too much sense ... if bios-tables-test already 
times out *without* the additional meson-based timeouts, then adding the 
meson timeouts won't help. bios-tables-test uses the manually coded timeout 
from boot_sector_test() that is currently set to 600 seconds. If you hit 
that timeout, that likely means that something is really broken in your 
branch - or is it sometimes still succeeding?
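
The distinction here is between a timeout coded inside the test and one
enforced by the test runner from outside. As a rough illustration of the
former, an in-test deadline loop might look like the Python sketch below.
This is not the actual boot_sector_test() code (which is C); the function
name wait_for() and its parameters are hypothetical.

```python
import time


def wait_for(predicate, timeout_s, interval_s=0.1):
    """Poll predicate() until it returns true or timeout_s seconds pass.

    Returns True on success and False if the deadline expired first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # A condition that never becomes true: the inner deadline decides.
    if not wait_for(lambda: False, timeout_s=0.3, interval_s=0.05):
        print("inner deadline hit; the test itself reports the failure")
```

An outer runner timeout can only end such a test sooner. If the inner
deadline is already being hit, adding an outer limit does not change the
outcome.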

  Thomas



Re: [PATCH v3 00/16] tests: enable meson test timeouts to improve debuggability
Posted by Michael Tokarev 3 months, 1 week ago
23.01.2024 20:47, Thomas Huth:
> On 23/01/2024 17.50, Michael Tokarev wrote:
..

>> I'm picking this up for stable branches too, since there we have the same
>> problems in CI environment. In particular, bios-tables-test almost always
>> times out, even hitting retry doesn't help.  Let's see how it goes..
> 
> Uh, wait, that does not make too much sense ... if bios-tables-test already times out *without* the additional meson-based timeouts, then adding the 
> meson timeouts won't help. bios-tables-test uses the manually coded timeout from boot_sector_test() that is currently set to 600 seconds. If you hit 
> that timeout, that likely means that something is really broken in your branch - or is it sometimes still succeeding?

I mistyped the test name, as I was dealing with bios-tables-test at that
time in another (unrelated) context.  The actual failing test in this case,
among others, is avocado acpi_smbios_bits, e.g.
  https://gitlab.com/qemu-project/qemu/-/jobs/5991505589#L231 which timed
out on multiple attempts.  In this example it took a bit less than 65s.
A subsequent retry succeeded in 51s:
  https://gitlab.com/qemu-project/qemu/-/jobs/5995055845#L212
but this run was at a much later time, apparently when gitlab had less
load, as the whole run was significantly faster.

So this particular failure has nothing to do with this patchset, and
the patchset does not do anything to it.

(I was a bit distracted the whole day today due to $work issues.)

Thanks,

/mjt

Re: [PATCH v3 00/16] tests: enable meson test timeouts to improve debuggability
Posted by Peter Maydell 3 months, 1 week ago
On Tue, 23 Jan 2024 at 20:52, Michael Tokarev <mjt@tls.msk.ru> wrote:
>
> 23.01.2024 20:47, Thomas Huth:
> > On 23/01/2024 17.50, Michael Tokarev wrote:
> ..
>
> >> I'm picking this up for stable branches too, since there we have the same
> >> problems in CI environment. In particular, bios-tables-test almost always
> >> times out, even hitting retry doesn't help.  Let's see how it goes..
> >
> > Uh, wait, that does not make too much sense ... if bios-tables-test already times out *without* the additional meson-based timeouts, then adding the
> > meson timeouts won't help. bios-tables-test uses the manually coded timeout from boot_sector_test() that is currently set to 600 seconds. If you hit
> > that timeout, that likely means that something is really broken in your branch - or is it sometimes still succeeding?
>
> I mistyped the test name as I was dealing with bios-tables-test at that
> time in another context (unrelated).  Actual failing test in this case,
> among others, is avocado acpi_smbios_bits, eg
>   https://gitlab.com/qemu-project/qemu/-/jobs/5991505589#L231 which timed
> out on multiple attempts.  In this example it took a bit less than 65s.

The fix for that flakiness is commit 7ef4c41e91d59
("acpi/tests/avocado/bits: wait for 200 seconds for SHUTDOWN event
from bits VM").

thanks
-- PMM
Re: [PATCH v3 00/16] tests: enable meson test timeouts to improve debuggability
Posted by Daniel P. Berrangé 3 months, 1 week ago
On Tue, Jan 23, 2024 at 07:50:09PM +0300, Michael Tokarev wrote:
> 15.12.2023 10:03, Thomas Huth wrote:
> > This is a respin of Daniel's series that re-enables the meson test
> > runner timeouts. To make sure that we do not get into trouble on
> > older systems, I ran all the tests with "make check SPEED=slow -j32"
> > on my laptop that has only 16 SMT threads, so each test was running
> > quite a bit slower than with a normal "-j$(nproc)" run. I think
> > that these timeouts should now work in most cases - if not, we still
> > can adjust them easily later.
> 
> I'm picking this up for stable branches too, since there we have the same
> problems in CI environment. In particular, bios-tables-test almost always
> times out, even hitting retry doesn't help.  Let's see how it goes..
> 
> JFYI.

There have been a bunch of followups that Thomas has posted since this
series merged that you should pick up too when they merge.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|