scripts/mtest2make.py | 3 ++- tests/qtest/meson.build | 16 ++++++---------- 2 files changed, 8 insertions(+), 11 deletions(-)
Perhaps the most painful of all the GitLab CI failures we see are the enforced job timeouts: "ERROR: Job failed: execution took longer than 1h15m0s seconds" https://gitlab.com/qemu-project/qemu/-/jobs/4387047648 when that hits the CI log shows what has *already* run, but figuring out what was currently running (or rather stuck) is an horrendously difficult. The initial meson port disabled the meson test timeouts, in order to limit the scope for introducing side effects from the port that would complicate adoption. Now that the meson port is basically finished we can take advantage of more of its improved features. It has the ability to set timeouts for test programs, defaulting to 30 seconds, but overridable per test. This is further helped by fact that we changed the iotests integration so that each iotests was a distinct meson test, instead of having one single giant (slow) test. We already set overrides for a bunch of tests, but they've not been kept up2date since we had timeouts disabled. So this series first updates the timeout overrides such that all tests pass when run in my test gitlab CI pipeline. Then it enables use of meson timeouts. We might still hit timeouts due to non-deterministic performance of gitlab CI runners. So we'll probably have to increase a few more timeouts in the short term. Fortunately this is going to be massively easier to diagnose. For example this job during my testing: https://gitlab.com/berrange/qemu/-/jobs/4392029495 we can immediately see the problem tests Summary of Failures: 6/252 qemu:qtest+qtest-i386 / qtest-i386/bios-tables-test TIMEOUT 120.02s killed by signal 15 SIGTERM 7/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/bios-tables-test TIMEOUT 120.03s killed by signal 15 SIGTERM 64/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/qom-test TIMEOUT 300.03s killed by signal 15 SIGTERM The full meson testlog.txt will show each individual TAP log output, so we can then see exactly which test case we got stuck on. NB, the artifacts are missing on the job links above, until this patch merges: https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg04668.html NB, this series sets the migration-test timeout to 5 minutes, which is only valid if this series is merged to make the migration test not suck: https://lists.gnu.org/archive/html/qemu-devel/2023-06/msg00286.html without that series, we'll need to set the migration-test timeout to 30 minutes instead. Daniel P. Berrangé (6): qtest: bump min meson timeout to 60 seconds qtest: bump migration-test timeout to 5 minutes qtest: bump qom-test timeout to 7 minutes qtest: bump aspeed_smc-test timeout to 2 minutes qtest: bump bios-table-test timeout to 6 minutes mtest2make: stop disabling meson test timeouts scripts/mtest2make.py | 3 ++- tests/qtest/meson.build | 16 ++++++---------- 2 files changed, 8 insertions(+), 11 deletions(-) -- 2.40.1
On 6/1/23 09:31, Daniel P. Berrangé wrote: > Daniel P. Berrangé (6): > qtest: bump min meson timeout to 60 seconds > qtest: bump migration-test timeout to 5 minutes > qtest: bump qom-test timeout to 7 minutes > qtest: bump aspeed_smc-test timeout to 2 minutes > qtest: bump bios-table-test timeout to 6 minutes > mtest2make: stop disabling meson test timeouts Acked-by: Richard Henderson <richard.henderson@linaro.org> r~
On 01/06/2023 18.31, Daniel P. Berrangé wrote: > Perhaps the most painful of all the GitLab CI failures we see are > the enforced job timeouts: > > "ERROR: Job failed: execution took longer than 1h15m0s seconds" > > https://gitlab.com/qemu-project/qemu/-/jobs/4387047648 > > when that hits the CI log shows what has *already* run, but figuring > out what was currently running (or rather stuck) is an horrendously > difficult. > > The initial meson port disabled the meson test timeouts, in order to > limit the scope for introducing side effects from the port that would > complicate adoption. > > Now that the meson port is basically finished we can take advantage of > more of its improved features. It has the ability to set timeouts for > test programs, defaulting to 30 seconds, but overridable per test. This > is further helped by fact that we changed the iotests integration so > that each iotests was a distinct meson test, instead of having one > single giant (slow) test. > > We already set overrides for a bunch of tests, but they've not been > kept up2date since we had timeouts disabled. So this series first > updates the timeout overrides such that all tests pass when run in > my test gitlab CI pipeline. Then it enables use of meson timeouts. > > We might still hit timeouts due to non-deterministic performance of > gitlab CI runners. So we'll probably have to increase a few more > timeouts in the short term. Fortunately this is going to be massively > easier to diagnose. For example this job during my testing: > > https://gitlab.com/berrange/qemu/-/jobs/4392029495 > > we can immediately see the problem tests > > Summary of Failures: > 6/252 qemu:qtest+qtest-i386 / qtest-i386/bios-tables-test TIMEOUT 120.02s killed by signal 15 SIGTERM > 7/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/bios-tables-test TIMEOUT 120.03s killed by signal 15 SIGTERM > 64/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/qom-test TIMEOUT 300.03s killed by signal 15 SIGTERM > > The full meson testlog.txt will show each individual TAP log output, > so we can then see exactly which test case we got stuck on. > > NB, the artifacts are missing on the job links above, until this > patch merges: > > https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg04668.html > > NB, this series sets the migration-test timeout to 5 minutes, which > is only valid if this series is merged to make the migration test > not suck: > > https://lists.gnu.org/archive/html/qemu-devel/2023-06/msg00286.html > > without that series, we'll need to set the migration-test timeout to > 30 minutes instead. > > Daniel P. Berrangé (6): > qtest: bump min meson timeout to 60 seconds > qtest: bump migration-test timeout to 5 minutes > qtest: bump qom-test timeout to 7 minutes > qtest: bump aspeed_smc-test timeout to 2 minutes > qtest: bump bios-table-test timeout to 6 minutes > mtest2make: stop disabling meson test timeouts > > scripts/mtest2make.py | 3 ++- > tests/qtest/meson.build | 16 ++++++---------- > 2 files changed, 8 insertions(+), 11 deletions(-) FWIW, I now ran this on my rather old laptop with an --enable-debug build with "make -j$(nproc) check-qtest" and got these additional failures (beside the expected migration-test that still needs its final speedup): qtest-aarch64/test-hmp TIMEOUT 120.07s killed by signal 15 SIGTERM qtest-aarch64/qom-test TIMEOUT 420.09s killed by signal 15 SIGTERM qtest-arm/qom-test TIMEOUT 420.10s killed by signal 15 SIGTERM qtest-arm/npcm7xx_pwm-test TIMEOUT 150.04s killed by signal 15 SIGTERM qtest-ppc64/pxe-test TIMEOUT 60.01s killed by signal 15 SIGTERM qtest-sparc/prom-env-test TIMEOUT 60.01s killed by signal 15 SIGTERM qtest-sparc/boot-serial-test TIMEOUT 60.01s killed by signal 15 SIGTERM When I run them manually without the timeout patch, I get these values: qtest-aarch64/test-hmp OK 168.66s 95 subtests passed qtest-aarch64/qom-test OK 646.37s 94 subtests passed qtest-arm/qom-test OK 621.64s 89 subtests passed qtest-arm/npcm7xx_pwm-test OK 225.48s 24 subtests passed qtest-ppc64/pxe-test OK 96.95s 2 subtests passed qtest-sparc/prom-env-test OK 95.94s 3 subtests passed qtest-sparc/boot-serial-test OK 92.96s 3 subtests passed HTH, Thomas
On Mon, 5 Jun 2023 at 15:08, Thomas Huth <thuth@redhat.com> wrote: > When I run them manually without the timeout patch, I get these > values: > > qtest-aarch64/test-hmp OK 168.66s 95 subtests passed > qtest-aarch64/qom-test OK 646.37s 94 subtests passed > qtest-arm/qom-test OK 621.64s 89 subtests passed > qtest-arm/npcm7xx_pwm-test OK 225.48s 24 subtests passed > qtest-ppc64/pxe-test OK 96.95s 2 subtests passed > qtest-sparc/prom-env-test OK 95.94s 3 subtests passed > qtest-sparc/boot-serial-test OK 92.96s 3 subtests passed The qom-tests being slow is kind of expected, since they have to go through every board and device, and there are a lot of arm boards and devices. That npcm7xx_pwm-test runtime is a bit OTT though for something testing a single device. Would one of the Nuvoton maintainers like to take a look at that test and make it a bit less of a hog of runtime ? thanks -- PMM
On Mon, Jun 05, 2023 at 04:07:46PM +0200, Thomas Huth wrote: > On 01/06/2023 18.31, Daniel P. Berrangé wrote: > > Perhaps the most painful of all the GitLab CI failures we see are > > the enforced job timeouts: > > > > "ERROR: Job failed: execution took longer than 1h15m0s seconds" > > > > https://gitlab.com/qemu-project/qemu/-/jobs/4387047648 > > > > when that hits the CI log shows what has *already* run, but figuring > > out what was currently running (or rather stuck) is an horrendously > > difficult. > > > > The initial meson port disabled the meson test timeouts, in order to > > limit the scope for introducing side effects from the port that would > > complicate adoption. > > > > Now that the meson port is basically finished we can take advantage of > > more of its improved features. It has the ability to set timeouts for > > test programs, defaulting to 30 seconds, but overridable per test. This > > is further helped by fact that we changed the iotests integration so > > that each iotests was a distinct meson test, instead of having one > > single giant (slow) test. > > > > We already set overrides for a bunch of tests, but they've not been > > kept up2date since we had timeouts disabled. So this series first > > updates the timeout overrides such that all tests pass when run in > > my test gitlab CI pipeline. Then it enables use of meson timeouts. > > > > We might still hit timeouts due to non-deterministic performance of > > gitlab CI runners. So we'll probably have to increase a few more > > timeouts in the short term. Fortunately this is going to be massively > > easier to diagnose. For example this job during my testing: > > > > https://gitlab.com/berrange/qemu/-/jobs/4392029495 > > > > we can immediately see the problem tests > > > > Summary of Failures: > > 6/252 qemu:qtest+qtest-i386 / qtest-i386/bios-tables-test TIMEOUT 120.02s killed by signal 15 SIGTERM > > 7/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/bios-tables-test TIMEOUT 120.03s killed by signal 15 SIGTERM > > 64/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/qom-test TIMEOUT 300.03s killed by signal 15 SIGTERM > > > > The full meson testlog.txt will show each individual TAP log output, > > so we can then see exactly which test case we got stuck on. > > > > NB, the artifacts are missing on the job links above, until this > > patch merges: > > > > https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg04668.html > > > > NB, this series sets the migration-test timeout to 5 minutes, which > > is only valid if this series is merged to make the migration test > > not suck: > > > > https://lists.gnu.org/archive/html/qemu-devel/2023-06/msg00286.html > > > > without that series, we'll need to set the migration-test timeout to > > 30 minutes instead. > > > > Daniel P. Berrangé (6): > > qtest: bump min meson timeout to 60 seconds > > qtest: bump migration-test timeout to 5 minutes > > qtest: bump qom-test timeout to 7 minutes > > qtest: bump aspeed_smc-test timeout to 2 minutes > > qtest: bump bios-table-test timeout to 6 minutes > > mtest2make: stop disabling meson test timeouts > > > > scripts/mtest2make.py | 3 ++- > > tests/qtest/meson.build | 16 ++++++---------- > > 2 files changed, 8 insertions(+), 11 deletions(-) > > FWIW, I now ran this on my rather old laptop with an --enable-debug > build with "make -j$(nproc) check-qtest" and got these additional > failures (beside the expected migration-test that still needs its > final speedup): > > qtest-aarch64/test-hmp TIMEOUT 120.07s killed by signal 15 SIGTERM > qtest-aarch64/qom-test TIMEOUT 420.09s killed by signal 15 SIGTERM > qtest-arm/qom-test TIMEOUT 420.10s killed by signal 15 SIGTERM > qtest-arm/npcm7xx_pwm-test TIMEOUT 150.04s killed by signal 15 SIGTERM > qtest-ppc64/pxe-test TIMEOUT 60.01s killed by signal 15 SIGTERM > qtest-sparc/prom-env-test TIMEOUT 60.01s killed by signal 15 SIGTERM > qtest-sparc/boot-serial-test TIMEOUT 60.01s killed by signal 15 SIGTERM Did you see any others in the 45-60 second time window, as those would be candidates for increases too - don't want to have things right below the 60 second cutoff ? > When I run them manually without the timeout patch, I get these > values: > > qtest-aarch64/test-hmp OK 168.66s 95 subtests passed > qtest-aarch64/qom-test OK 646.37s 94 subtests passed > qtest-arm/qom-test OK 621.64s 89 subtests passed > qtest-arm/npcm7xx_pwm-test OK 225.48s 24 subtests passed > qtest-ppc64/pxe-test OK 96.95s 2 subtests passed > qtest-sparc/prom-env-test OK 95.94s 3 subtests passed > qtest-sparc/boot-serial-test OK 92.96s 3 subtests passed > > HTH, > Thomas > With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On 05/06/2023 16.14, Daniel P. Berrangé wrote: > On Mon, Jun 05, 2023 at 04:07:46PM +0200, Thomas Huth wrote: >> On 01/06/2023 18.31, Daniel P. Berrangé wrote: >>> Perhaps the most painful of all the GitLab CI failures we see are >>> the enforced job timeouts: >>> >>> "ERROR: Job failed: execution took longer than 1h15m0s seconds" >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/4387047648 >>> >>> when that hits the CI log shows what has *already* run, but figuring >>> out what was currently running (or rather stuck) is an horrendously >>> difficult. >>> >>> The initial meson port disabled the meson test timeouts, in order to >>> limit the scope for introducing side effects from the port that would >>> complicate adoption. >>> >>> Now that the meson port is basically finished we can take advantage of >>> more of its improved features. It has the ability to set timeouts for >>> test programs, defaulting to 30 seconds, but overridable per test. This >>> is further helped by fact that we changed the iotests integration so >>> that each iotests was a distinct meson test, instead of having one >>> single giant (slow) test. >>> >>> We already set overrides for a bunch of tests, but they've not been >>> kept up2date since we had timeouts disabled. So this series first >>> updates the timeout overrides such that all tests pass when run in >>> my test gitlab CI pipeline. Then it enables use of meson timeouts. >>> >>> We might still hit timeouts due to non-deterministic performance of >>> gitlab CI runners. So we'll probably have to increase a few more >>> timeouts in the short term. Fortunately this is going to be massively >>> easier to diagnose. For example this job during my testing: >>> >>> https://gitlab.com/berrange/qemu/-/jobs/4392029495 >>> >>> we can immediately see the problem tests >>> >>> Summary of Failures: >>> 6/252 qemu:qtest+qtest-i386 / qtest-i386/bios-tables-test TIMEOUT 120.02s killed by signal 15 SIGTERM >>> 7/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/bios-tables-test TIMEOUT 120.03s killed by signal 15 SIGTERM >>> 64/252 qemu:qtest+qtest-aarch64 / qtest-aarch64/qom-test TIMEOUT 300.03s killed by signal 15 SIGTERM >>> >>> The full meson testlog.txt will show each individual TAP log output, >>> so we can then see exactly which test case we got stuck on. >>> >>> NB, the artifacts are missing on the job links above, until this >>> patch merges: >>> >>> https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg04668.html >>> >>> NB, this series sets the migration-test timeout to 5 minutes, which >>> is only valid if this series is merged to make the migration test >>> not suck: >>> >>> https://lists.gnu.org/archive/html/qemu-devel/2023-06/msg00286.html >>> >>> without that series, we'll need to set the migration-test timeout to >>> 30 minutes instead. >>> >>> Daniel P. Berrangé (6): >>> qtest: bump min meson timeout to 60 seconds >>> qtest: bump migration-test timeout to 5 minutes >>> qtest: bump qom-test timeout to 7 minutes >>> qtest: bump aspeed_smc-test timeout to 2 minutes >>> qtest: bump bios-table-test timeout to 6 minutes >>> mtest2make: stop disabling meson test timeouts >>> >>> scripts/mtest2make.py | 3 ++- >>> tests/qtest/meson.build | 16 ++++++---------- >>> 2 files changed, 8 insertions(+), 11 deletions(-) >> >> FWIW, I now ran this on my rather old laptop with an --enable-debug >> build with "make -j$(nproc) check-qtest" and got these additional >> failures (beside the expected migration-test that still needs its >> final speedup): >> >> qtest-aarch64/test-hmp TIMEOUT 120.07s killed by signal 15 SIGTERM >> qtest-aarch64/qom-test TIMEOUT 420.09s killed by signal 15 SIGTERM >> qtest-arm/qom-test TIMEOUT 420.10s killed by signal 15 SIGTERM >> qtest-arm/npcm7xx_pwm-test TIMEOUT 150.04s killed by signal 15 SIGTERM >> qtest-ppc64/pxe-test TIMEOUT 60.01s killed by signal 15 SIGTERM >> qtest-sparc/prom-env-test TIMEOUT 60.01s killed by signal 15 SIGTERM >> qtest-sparc/boot-serial-test TIMEOUT 60.01s killed by signal 15 SIGTERM > > Did you see any others in the 45-60 second time window, as those would > be candidates for increases too - don't want to have things right below > the 60 second cutoff ? The qos-test on arm likely needs some adjustment, too: qtest-arm/qos-test OK 40.72s 115 subtests passed qtest-aarch64/qos-test OK 54.00s 108 subtests passed Thomas
© 2016 - 2024 Red Hat, Inc.