[PATCH v2] .gitlab-ci.d/crossbuilds.yml: Force 'make check' single-threaded for cross-i686-tci

Peter Maydell posted 1 patch 2 months, 1 week ago
[PATCH v2] .gitlab-ci.d/crossbuilds.yml: Force 'make check' single-threaded for cross-i686-tci
Posted by Peter Maydell 2 months, 1 week ago
The cross-i686-tci CI job is persistently flaky with various tests
hitting timeouts.  One theory for why this is happening is that we're
running too many tests in parallel and so sometimes a test gets
starved of CPU and isn't able to complete within the timeout.

(The environment this CI job runs in seems to cause us to default
to a parallelism of 9 in the main CI.)

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
If this works we might be able to wind this up to -j2 or -j3,
and/or consider whether other CI jobs need something similar.
---
 .gitlab-ci.d/crossbuilds.yml | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/.gitlab-ci.d/crossbuilds.yml b/.gitlab-ci.d/crossbuilds.yml
index 459273f9da5..1e21d082aa4 100644
--- a/.gitlab-ci.d/crossbuilds.yml
+++ b/.gitlab-ci.d/crossbuilds.yml
@@ -62,7 +62,11 @@ cross-i686-tci:
     IMAGE: debian-i686-cross
     ACCEL: tcg-interpreter
     EXTRA_CONFIGURE_OPTS: --target-list=i386-softmmu,i386-linux-user,aarch64-softmmu,aarch64-linux-user,ppc-softmmu,ppc-linux-user --disable-plugins --disable-kvm
-    MAKE_CHECK_ARGS: check check-tcg
+    # Force tests to run in series, to see whether this
+    # reduces the flakiness of this CI job. The CI
+    # environment by default shows us 8 CPUs and so we
+    # would otherwise be using a parallelism of 9.
+    MAKE_CHECK_ARGS: check check-tcg -j1
 
 cross-mipsel-system:
   extends: .cross_system_build_job
-- 
2.34.1
Re: [PATCH v2] .gitlab-ci.d/crossbuilds.yml: Force 'make check' single-threaded for cross-i686-tci
Posted by Peter Maydell 2 months, 1 week ago
On Thu, 12 Sept 2024 at 16:10, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> The cross-i686-tci CI job is persistently flaky with various tests
> hitting timeouts.  One theory for why this is happening is that we're
> running too many tests in parallel and so sometimes a test gets
> starved of CPU and isn't able to complete within the timeout.
>
> (The environment this CI job runs in seems to cause us to default
> to a parallelism of 9 in the main CI.)
>
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> If this works we might be able to wind this up to -j2 or -j3,
> and/or consider whether other CI jobs need something similar.

I gave this a try, but unfortunately the result seems to be
that the whole job times out:
https://gitlab.com/qemu-project/qemu/-/jobs/7818441897

Maybe we could try a compromise of -j3 or thereabouts...

-- PMM
Re: [PATCH v2] .gitlab-ci.d/crossbuilds.yml: Force 'make check' single-threaded for cross-i686-tci
Posted by Peter Maydell 2 months, 1 week ago
On Fri, 13 Sept 2024 at 13:24, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Thu, 12 Sept 2024 at 16:10, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > The cross-i686-tci CI job is persistently flaky with various tests
> > hitting timeouts.  One theory for why this is happening is that we're
> > running too many tests in parallel and so sometimes a test gets
> > starved of CPU and isn't able to complete within the timeout.
> >
> > (The environment this CI job runs in seems to cause us to default
> > to a parallelism of 9 in the main CI.)
> >
> > Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> > ---
> > If this works we might be able to wind this up to -j2 or -j3,
> > and/or consider whether other CI jobs need something similar.
>
> I gave this a try, but unfortunately the result seems to be
> that the whole job times out:
> https://gitlab.com/qemu-project/qemu/-/jobs/7818441897

...but then this simple retry passed with a runtime of 47 mins:

https://gitlab.com/qemu-project/qemu/-/jobs/7819225200

I'm tempted to commit this as-is, and see whether it helps.
If it doesn't I can always back it off to -j2, and if it does
generate a lot of full-job-timeouts it's only me it's annoying.

Looking at the timed-out job it looks like it just took a lot
longer on the compile phase... (Though it's hard to say because
the fact we use "make all check-build" in our gitlab CI config
means gitlab treats this as all one step when it adds time
annotations, and you can't separate time-for-compile from
time-for-tests.)
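
A rough sketch of how the two phases could be timed separately if the
build and the tests were run as separate make invocations. The
time_phase helper is hypothetical and not part of QEMU's CI scripts;
whether the real job template can be split this way is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: run one phase and report its wall-clock duration,
# so that compile time and test time show up as separate numbers.
time_phase() {
    phase_name=$1
    shift
    phase_start=$(date +%s)
    "$@"
    phase_rc=$?
    phase_end=$(date +%s)
    echo "phase=$phase_name seconds=$((phase_end - phase_start)) exit=$phase_rc"
    # Propagate the phase's exit status so a failing phase still fails the job.
    return "$phase_rc"
}

# Intended use in a job script (commented out so the sketch runs anywhere):
#   time_phase build make all check-build
#   time_phase tests make check check-tcg -j1
```

Each phase then gets its own line in the job log, independent of how
gitlab groups the commands when it adds its time annotations.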

-- PMM
Re: [PATCH v2] .gitlab-ci.d/crossbuilds.yml: Force 'make check' single-threaded for cross-i686-tci
Posted by Daniel P. Berrangé 2 months, 1 week ago
On Fri, Sep 13, 2024 at 02:31:34PM +0100, Peter Maydell wrote:
> On Fri, 13 Sept 2024 at 13:24, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On Thu, 12 Sept 2024 at 16:10, Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > > The cross-i686-tci CI job is persistently flaky with various tests
> > > hitting timeouts.  One theory for why this is happening is that we're
> > > running too many tests in parallel and so sometimes a test gets
> > > starved of CPU and isn't able to complete within the timeout.
> > >
> > > (The environment this CI job runs in seems to cause us to default
> > > to a parallelism of 9 in the main CI.)
> > >
> > > Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> > > ---
> > > If this works we might be able to wind this up to -j2 or -j3,
> > > and/or consider whether other CI jobs need something similar.
> >
> > I gave this a try, but unfortunately the result seems to be
> > that the whole job times out:
> > https://gitlab.com/qemu-project/qemu/-/jobs/7818441897
> 
> ...but then this simple retry passed with a runtime of 47 mins:
> 
> https://gitlab.com/qemu-project/qemu/-/jobs/7819225200
> 
> I'm tempted to commit this as-is, and see whether it helps.
> If it doesn't I can always back it off to -j2, and if it does
> generate a lot of full-job-timeouts it's only me it's annoying.

Anyone know how many vCPUs our k8s runners have?

The gitlab runners that contributor forks use will have 2
vCPUs. So our current make -j$(expr $(nproc) + 1) will be effectively
-j3 already in pipelines for forks. IOW, we intentionally
slightly over-commit CPUs right now. Backing off to just
-j$(nproc) may be better than hardcoding -j1/-j2, so that
it takes account of different runner sizes?


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH v2] .gitlab-ci.d/crossbuilds.yml: Force 'make check' single-threaded for cross-i686-tci
Posted by Peter Maydell 2 months, 1 week ago
On Fri, 13 Sept 2024 at 15:05, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Fri, Sep 13, 2024 at 02:31:34PM +0100, Peter Maydell wrote:
> > On Fri, 13 Sept 2024 at 13:24, Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > > On Thu, 12 Sept 2024 at 16:10, Peter Maydell <peter.maydell@linaro.org> wrote:
> > > >
> > > > The cross-i686-tci CI job is persistently flaky with various tests
> > > > hitting timeouts.  One theory for why this is happening is that we're
> > > > running too many tests in parallel and so sometimes a test gets
> > > > starved of CPU and isn't able to complete within the timeout.
> > > >
> > > > (The environment this CI job runs in seems to cause us to default
> > > > to a parallelism of 9 in the main CI.)
> > > >
> > > > Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> > > > ---
> > > > If this works we might be able to wind this up to -j2 or -j3,
> > > > and/or consider whether other CI jobs need something similar.
> > >
> > > I gave this a try, but unfortunately the result seems to be
> > > that the whole job times out:
> > > https://gitlab.com/qemu-project/qemu/-/jobs/7818441897
> >
> > ...but then this simple retry passed with a runtime of 47 mins:
> >
> > https://gitlab.com/qemu-project/qemu/-/jobs/7819225200
> >
> > I'm tempted to commit this as-is, and see whether it helps.
> > If it doesn't I can always back it off to -j2, and if it does
> > generate a lot of full-job-timeouts it's only me it's annoying.
>
> Anyone know how many vCPUs our k8s runners have?

They report as 8, I think, given that in the main CI run this
job gets run as -j9. But we clearly aren't actually getting
a reliable 9 CPUs worth.
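
One possible reason for the mismatch (an assumption, nothing in this
thread confirms it) is that the CPU count a container sees is not
necessarily the CPU time it is allowed to use. On a cgroup v2 system
the quota can be read from /sys/fs/cgroup/cpu.max, which holds either
"max PERIOD" or "QUOTA PERIOD". A minimal sketch of turning that into
a job count; effective_cpus is a hypothetical helper:

```shell
#!/bin/sh
# Hypothetical helper: effective_cpus CPU_MAX_CONTENT VISIBLE_CPUS
# CPU_MAX_CONTENT is the text of a cgroup v2 cpu.max file.
# Prints the smaller of the visible CPU count and the quota, rounded up.
effective_cpus() {
    cpu_max=$1
    visible=$2
    # Split "QUOTA PERIOD" into the positional parameters.
    set -- $cpu_max
    quota=$1
    period=$2
    # No quota configured, or the file was missing or empty.
    if [ "$quota" = "max" ] || [ -z "$period" ]; then
        echo "$visible"
        return
    fi
    limit=$(( (quota + period - 1) / period ))
    if [ "$limit" -lt "$visible" ]; then
        echo "$limit"
    else
        echo "$visible"
    fi
}

# Intended use (commented out, the path only exists with cgroup v2):
#   effective_cpus "$(cat /sys/fs/cgroup/cpu.max 2>/dev/null)" "$(nproc)"
```

This would not detect contention from other jobs sharing the same
host, so it can only rule a quota in or out.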

-- PMM
Re: [PATCH v2] .gitlab-ci.d/crossbuilds.yml: Force 'make check' single-threaded for cross-i686-tci
Posted by Thomas Huth 2 months, 1 week ago
On 13/09/2024 15.31, Peter Maydell wrote:
> On Fri, 13 Sept 2024 at 13:24, Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>> On Thu, 12 Sept 2024 at 16:10, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>
>>> The cross-i686-tci CI job is persistently flaky with various tests
>>> hitting timeouts.  One theory for why this is happening is that we're
>>> running too many tests in parallel and so sometimes a test gets
>>> starved of CPU and isn't able to complete within the timeout.
>>>
>>> (The environment this CI job runs in seems to cause us to default
>>> to a parallelism of 9 in the main CI.)
>>>
>>> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
>>> ---
>>> If this works we might be able to wind this up to -j2 or -j3,
>>> and/or consider whether other CI jobs need something similar.
>>
>> I gave this a try, but unfortunately the result seems to be
>> that the whole job times out:
>> https://gitlab.com/qemu-project/qemu/-/jobs/7818441897
> 
> ...but then this simple retry passed with a runtime of 47 mins:
> 
> https://gitlab.com/qemu-project/qemu/-/jobs/7819225200
> 
> I'm tempted to commit this as-is, and see whether it helps.

FWIW, I just had a try with your patch, too, and it took 53 minutes:

  https://gitlab.com/thuth/qemu/-/jobs/7818945368

Older jobs without your patch seem to take roughly 25 to 30 minutes, so
-j1 has definitely made the runtime much worse.

Considering that we're close to the 60-minute timeout, you might need to
bump the timeout of the job to 70 or 75 minutes now, to be on the safe side?
Or maybe really try -j2 first?

  Thomas
Re: [PATCH v2] .gitlab-ci.d/crossbuilds.yml: Force 'make check' single-threaded for cross-i686-tci
Posted by Thomas Huth 2 months, 1 week ago
On 12/09/2024 17.10, Peter Maydell wrote:
> The cross-i686-tci CI job is persistently flaky with various tests
> hitting timeouts.  One theory for why this is happening is that we're
> running too many tests in parallel and so sometimes a test gets
> starved of CPU and isn't able to complete within the timeout.
> 
> (The environment this CI job runs in seems to cause us to default
> to a parallelism of 9 in the main CI.)
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> If this works we might be able to wind this up to -j2 or -j3,
> and/or consider whether other CI jobs need something similar.

As a start, we could also try replacing the

  JOBS=$(expr $(nproc) + 1)

with

  JOBS=$(nproc)

in the buildtest-template.yml file...?
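
For reference, a small sketch of what the two formulas give for the
runner sizes mentioned in this thread (2 vCPUs for fork runners,
8 reported by the main CI runners). The function names are made up
for illustration only:

```shell
#!/bin/sh
# Current formula: one more job than the number of visible CPUs.
jobs_overcommit() {
    expr "$1" + 1
}

# Proposed formula: exactly one job per visible CPU.
jobs_exact() {
    echo "$1"
}

for cpus in 2 8; do
    echo "cpus=$cpus current=-j$(jobs_overcommit "$cpus") proposed=-j$(jobs_exact "$cpus")"
done
# Prints:
#   cpus=2 current=-j3 proposed=-j2
#   cpus=8 current=-j9 proposed=-j8
```

Either way the job count still scales with the runner, unlike a
hardcoded -j1 or -j2.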

> ---
>   .gitlab-ci.d/crossbuilds.yml | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/.gitlab-ci.d/crossbuilds.yml b/.gitlab-ci.d/crossbuilds.yml
> index 459273f9da5..1e21d082aa4 100644
> --- a/.gitlab-ci.d/crossbuilds.yml
> +++ b/.gitlab-ci.d/crossbuilds.yml
> @@ -62,7 +62,11 @@ cross-i686-tci:
>       IMAGE: debian-i686-cross
>       ACCEL: tcg-interpreter
>       EXTRA_CONFIGURE_OPTS: --target-list=i386-softmmu,i386-linux-user,aarch64-softmmu,aarch64-linux-user,ppc-softmmu,ppc-linux-user --disable-plugins --disable-kvm
> -    MAKE_CHECK_ARGS: check check-tcg
> +    # Force tests to run in series, to see whether this
> +    # reduces the flakiness of this CI job. The CI
> +    # environment by default shows us 8 CPUs and so we
> +    # would otherwise be using a parallelism of 9.
> +    MAKE_CHECK_ARGS: check check-tcg -j1

Reviewed-by: Thomas Huth <thuth@redhat.com>