[libvirt PATCH] ci: Reduce number of stages

Posted by Andrea Bolognani 5 years, 8 months ago

Right now we're dividing the jobs into three stages: prebuild, which
includes DCO checking as well as building artifacts such as the
website, and native_build/cross_build, which do exactly what you'd
expect based on their names.

This organization is nice from the logical point of view, but results
in poor utilization of the available CI resources: in particular, the
fact that cross_build jobs can only start after all native_build jobs
have finished means that if even a single one of the latter takes a
bit longer the pipeline will stall, and with native builds taking
anywhere from less than 10 minutes to more than 20, this happens all
the time.

Building artifacts in a separate pipeline stage also doesn't have any
advantages, and only delays further stages by a couple of minutes.
The only job that really makes sense in its own stage is the DCO
check, because it's extremely fast (less than 1 minute) and, if that
fails, we can avoid kicking off all other jobs.

Reducing the number of stages results in significant speedups:
specifically, going from three stages to two stages reduces the
overall completion time for a full CI pipeline from ~45 minutes[1]
to ~30 minutes[2].

[1] https://gitlab.com/abologna/libvirt/-/pipelines/154751893
[2] https://gitlab.com/abologna/libvirt/-/pipelines/154771173

Signed-off-by: Andrea Bolognani <abologna@redhat.com>
---
 .gitlab-ci.yml | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
index 7a8142b506..8d9313e415 100644
--- a/.gitlab-ci.yml
+++ b/.gitlab-ci.yml
@@ -2,9 +2,8 @@ variables:
   GIT_DEPTH: 100
 
 stages:
-  - prebuild
-  - native_build
-  - cross_build
+  - sanity_checks
+  - builds
 
 .script_variables: &script_variables |
   export MAKEFLAGS="-j$(getconf _NPROCESSORS_ONLN)"
@@ -17,7 +16,7 @@ stages:
 
 # Default native build jobs that are always run
 .native_build_default_job_template: &native_build_default_job_definition
-  stage: native_build
+  stage: builds
   cache:
     paths:
       - ccache/
@@ -42,7 +41,7 @@ stages:
 # system other than Linux. These jobs will only run if the required
 # setup has been performed on the GitLab account (see ci/README.rst).
 .cirrus_build_default_job_template: &cirrus_build_default_job_definition
-  stage: native_build
+  stage: builds
   image: registry.gitlab.com/libvirt/libvirt-ci/cirrus-run:master
   script:
     - cirrus-run ci/cirrus/$NAME.yml.j2
@@ -64,7 +63,7 @@ stages:
 
 # Default cross build jobs that are always run
 .cross_build_default_job_template: &cross_build_default_job_definition
-  stage: cross_build
+  stage: builds
   cache:
     paths:
       - ccache/
@@ -194,7 +193,7 @@ mingw64-fedora-rawhide:
 # be deployed to the web root:
 #    https://gitlab.com/libvirt/libvirt/-/jobs/artifacts/master/download?job=website
 website:
-  stage: prebuild
+  stage: builds
   before_script:
     - *script_variables
   script:
@@ -216,7 +215,7 @@ website:
 
 
 codestyle:
-  stage: prebuild
+  stage: builds
   before_script:
     - *script_variables
   script:
@@ -231,7 +230,7 @@ codestyle:
 # for translation usage:
 #    https://gitlab.com/libvirt/libvirt/-/jobs/artifacts/master/download?job=potfile
 potfile:
-  stage: prebuild
+  stage: builds
   only:
     - master
   before_script:
@@ -259,7 +258,7 @@ potfile:
 # this test on developer's personal forks from which
 # merge requests are submitted
 check-dco:
-  stage: prebuild
+  stage: sanity_checks
   image: registry.gitlab.com/libvirt/libvirt-ci/check-dco:master
   script:
     - /check-dco
-- 
2.25.4

Re: [libvirt PATCH] ci: Reduce number of stages

Posted by Peter Krempa 5 years, 8 months ago

On Wed, Jun 10, 2020 at 13:33:01 +0200, Andrea Bolognani wrote:

[...]

> Building artifacts in a separate pipeline stage also doesn't have any
> advantages, and only delays further stages by a couple of minutes.
> The only job that really makes sense in its own stage is the DCO
> check, because it's extremely fast (less than 1 minute) and, if that
> fails, we can avoid kicking off all other jobs.

On the contrary I think that the DCO check should be made after builds
as that usually forces users to add a sign-off just to bypass that check
if they want to sanity check their series.

The lack of a sign-off can effectively be used as a mark for a patch
that is not ready to be pushed but still needs a build check. Running
the DCO check first adds a pointless hurdle in using the CI and also
removes one of the meaningful uses of having a sign-off checker.

Re: [libvirt PATCH] ci: Reduce number of stages

Posted by Daniel P. Berrangé 5 years, 8 months ago

On Wed, Jun 10, 2020 at 01:51:29PM +0200, Peter Krempa wrote:
> On Wed, Jun 10, 2020 at 13:33:01 +0200, Andrea Bolognani wrote:
> 
> [...]
> 
> > Building artifacts in a separate pipeline stage also doesn't have any
> > advantages, and only delays further stages by a couple of minutes.
> > The only job that really makes sense in its own stage is the DCO
> > check, because it's extremely fast (less than 1 minute) and, if that
> > fails, we can avoid kicking off all other jobs.
> 
> On the contrary I think that the DCO check should be made after builds
> as that usually forces users to add a sign-off just to bypass that check
> if they want to sanity check their series.

Missing signoff is quite common for new contributors, so it was put as
the first check so that they get quick notification of this mistake.

> Since the lack of a sign off can be effectively used as a mark for a
> patch that is not ready to be pushed, but a build-check is still needed.
> This adds a pointless hurdle in using the CI and also removes one of the
> meaningful uses to have a sign off checker.

That kind of usage of signoff is not really required in a merge request
workflow. You won't typically open the merge request in the first place
if code isn't ready, but if you did, there's an explicit "WIP" flag for
merge requests to achieve this. Once libvirt.git uses merge requests, we
will fully block all ability to push directly to git.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt PATCH] ci: Reduce number of stages

Posted by Ján Tomko 5 years, 8 months ago

On a Wednesday in 2020, Daniel P. Berrangé wrote:
>On Wed, Jun 10, 2020 at 01:51:29PM +0200, Peter Krempa wrote:
>> On Wed, Jun 10, 2020 at 13:33:01 +0200, Andrea Bolognani wrote:
>>
>> [...]
>>
>> > Building artifacts in a separate pipeline stage also doesn't have any
>> > advantages, and only delays further stages by a couple of minutes.
>> > The only job that really makes sense in its own stage is the DCO
>> > check, because it's extremely fast (less than 1 minute) and, if that
>> > fails, we can avoid kicking off all other jobs.
>>
>> On the contrary I think that the DCO check should be made after builds
>> as that usually forces users to add a sign-off just to bypass that check
>> if they want to sanity check their series.
>
>Missing signoff is quite common for new contributors, so it was put as
>the first check so that they get quick notification of this mistake.
>
>> Since the lack of a sign off can be effectively used as a mark for a
>> patch that is not ready to be pushed, but a build-check is still needed.
>> This adds a pointless hurdle in using the CI and also removes one of the
>> meaningful uses to have a sign off checker.
>
>That kind of usage of signoff is not really required in a merge request
>workflow. You won't typically open the merge request in the first place
>if code isn't ready, but if you did, then there's explicit "WIP" flag for
>merge requests to achieve this. Once libvirt.git uses merge request, we
>will fully block all ability to push directly to git.

I think we have a long way to go until merge requests are usable without
pushing directly to git.

Jano

>
>Regards,
>Daniel
>-- 
>|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
>|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
>|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

Re: [libvirt PATCH] ci: Reduce number of stages

Posted by Daniel P. Berrangé 5 years, 8 months ago

On Wed, Jun 10, 2020 at 01:33:01PM +0200, Andrea Bolognani wrote:
> Right now we're dividing the jobs into three stages: prebuild, which
> includes DCO checking as well as building artifacts such as the
> website, and native_build/cross_build, which do exactly what you'd
> expect based on their names.
> 
> This organization is nice from the logical point of view, but results
> in poor utilization of the available CI resources: in particular, the
> fact that cross_build jobs can only start after all native_build jobs
> have finished means that if even a single one of the latter takes a
> bit longer the pipeline will stall, and with native builds taking
> anywhere from less than 10 minutes to more than 20, this happens all
> the time.
> 
> Building artifacts in a separate pipeline stage also doesn't have any
> advantages, and only delays further stages by a couple of minutes.
> The only job that really makes sense in its own stage is the DCO
> check, because it's extremely fast (less than 1 minute) and, if that
> fails, we can avoid kicking off all other jobs.
> 
> Reducing the number of stages results in significant speedups:
> specifically, going from three stages to two stages reduces the
> overall completion time for a full CI pipeline from ~45 minutes[1]
> to ~30 minutes[2].
> 
> [1] https://gitlab.com/abologna/libvirt/-/pipelines/154751893
> [2] https://gitlab.com/abologna/libvirt/-/pipelines/154771173
> 
> Signed-off-by: Andrea Bolognani <abologna@redhat.com>
> ---
>  .gitlab-ci.yml | 19 +++++++++----------
>  1 file changed, 9 insertions(+), 10 deletions(-)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt PATCH] ci: Reduce number of stages

Posted by Daniel P. Berrangé 5 years, 8 months ago

On Wed, Jun 10, 2020 at 01:33:01PM +0200, Andrea Bolognani wrote:
> Right now we're dividing the jobs into three stages: prebuild, which
> includes DCO checking as well as building artifacts such as the
> website, and native_build/cross_build, which do exactly what you'd
> expect based on their names.
> 
> This organization is nice from the logical point of view, but results
> in poor utilization of the available CI resources: in particular, the
> fact that cross_build jobs can only start after all native_build jobs
> have finished means that if even a single one of the latter takes a
> bit longer the pipeline will stall, and with native builds taking
> anywhere from less than 10 minutes to more than 20, this happens all
> the time.
> 
> Building artifacts in a separate pipeline stage also doesn't have any
> advantages, and only delays further stages by a couple of minutes.
> The only job that really makes sense in its own stage is the DCO
> check, because it's extremely fast (less than 1 minute) and, if that
> fails, we can avoid kicking off all other jobs.

The advantage of using stages is that it makes it easy to see at a
glance where the pipeline was failing. 

> 
> Reducing the number of stages results in significant speedups:
> specifically, going from three stages to two stages reduces the
> overall completion time for a full CI pipeline from ~45 minutes[1]
> to ~30 minutes[2].
> 
> [1] https://gitlab.com/abologna/libvirt/-/pipelines/154751893
> [2] https://gitlab.com/abologna/libvirt/-/pipelines/154771173

I don't think this time comparison is showing a genuine difference.

If we look at the original staged pipeline, every single individual
job took much longer than every individual job in the simplified
pipeline. I think the difference in job times accounts for most
(possibly all) of the difference in the pipeline times.

If we look at the history of libvirt pipelines:

   https://gitlab.com/libvirt/libvirt/pipelines

the vast majority of the time we're completing in 30 minutes or
less already.

If you want to demonstrate a time improvement from these merged
stages, then run 20 pipelines over a couple of days and show
that they're consistently better than what we see already, and
not just a reflection of the CI infra load at a point in time.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt PATCH] ci: Reduce number of stages

Posted by Daniel P. Berrangé 5 years, 8 months ago

On Wed, Jun 10, 2020 at 01:14:51PM +0100, Daniel P. Berrangé wrote:
> On Wed, Jun 10, 2020 at 01:33:01PM +0200, Andrea Bolognani wrote:
> > Right now we're dividing the jobs into three stages: prebuild, which
> > includes DCO checking as well as building artifacts such as the
> > website, and native_build/cross_build, which do exactly what you'd
> > expect based on their names.
> > 
> > This organization is nice from the logical point of view, but results
> > in poor utilization of the available CI resources: in particular, the
> > fact that cross_build jobs can only start after all native_build jobs
> > have finished means that if even a single one of the latter takes a
> > bit longer the pipeline will stall, and with native builds taking
> > anywhere from less than 10 minutes to more than 20, this happens all
> > the time.
> > 
> > Building artifacts in a separate pipeline stage also doesn't have any
> > advantages, and only delays further stages by a couple of minutes.
> > The only job that really makes sense in its own stage is the DCO
> > check, because it's extremely fast (less than 1 minute) and, if that
> > fails, we can avoid kicking off all other jobs.
> 
> The advantage of using stages is that it makes it easy to see at a
> glance where the pipeline was failing. 
> 
> > 
> > Reducing the number of stages results in significant speedups:
> > specifically, going from three stages to two stages reduces the
> > overall completion time for a full CI pipeline from ~45 minutes[1]
> > to ~30 minutes[2].
> > 
> > [1] https://gitlab.com/abologna/libvirt/-/pipelines/154751893
> > [2] https://gitlab.com/abologna/libvirt/-/pipelines/154771173
> 
> I don't think this time comparison is showing a genuine difference.
> 
> If we look at the original staged pipeline, every single individual
> job took much longer than every individual job in the simplified
> pipeline. I think the difference in job times accounts for most
> (possibly all) of the difference in the pipelines time.
> 
> If we look at the history of libvirt pipelines:
> 
>    https://gitlab.com/libvirt/libvirt/pipelines
> 
> the vast majority of the time we're completing in 30 minutes or
> less already.
> 
> If you want to demonstrate a time improvement from these merged
> stages, then run 20 pipelines over a couple of days and show
> that they're consistently better than what we see already, and
> not just a reflection of the CI infra load at a point in time.

Also remember that we're using ccache, so slower builds may just be a
reflection of the ccache having low hit rate - a sequence of repeated
builds of the same branch should identify if that's the case.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt PATCH] ci: Reduce number of stages

Posted by Andrea Bolognani 5 years, 8 months ago

On Wed, 2020-06-10 at 13:31 +0100, Daniel P. Berrangé wrote:
> On Wed, Jun 10, 2020 at 01:14:51PM +0100, Daniel P. Berrangé wrote:
> > On Wed, Jun 10, 2020 at 01:33:01PM +0200, Andrea Bolognani wrote:
> > > Building artifacts in a separate pipeline stage also doesn't have any
> > > advantages, and only delays further stages by a couple of minutes.
> > > The only job that really makes sense in its own stage is the DCO
> > > check, because it's extremely fast (less than 1 minute) and, if that
> > > fails, we can avoid kicking off all other jobs.
> > 
> > The advantage of using stages is that it makes it easy to see at a
> > glance where the pipeline was failing.

Ultimately you'll need to drill down to the actual failure, though,
so the only situation in which it would really provide value is if
for some reason *all* cross builds failed at once, which is not
something that happens frequently enough to optimize for.

> > > Reducing the number of stages results in significant speedups:
> > > specifically, going from three stages to two stages reduces the
> > > overall completion time for a full CI pipeline from ~45 minutes[1]
> > > to ~30 minutes[2].
> > > 
> > > [1] https://gitlab.com/abologna/libvirt/-/pipelines/154751893
> > > [2] https://gitlab.com/abologna/libvirt/-/pipelines/154771173
> > 
> > I don't think this time comparison is showing a genuine difference.
> > 
> > If we look at the original staged pipeline, every single individual
> > job took much longer than every individual job in the simplified
> > pipeline. I think the difference in job times accounts for most
> > (possibly all) of the difference in the pipelines time.
> > 
> > If we look at the history of libvirt pipelines:
> > 
> >    https://gitlab.com/libvirt/libvirt/pipelines
> > 
> > the vast majority of the time we're completing in 30 minutes or
> > less already.

That was before introducing FreeBSD builds, which for whatever reason
take a significantly longer time: the last couple of jobs both took
50+ minutes. Installing packages is very inefficient, it would seem.

Either way, even looking at earlier jobs, it seems clear that we
leave compute time on the table: for the last 10 jobs before adding
FreeBSD, we have

  Longest job | Shortest job
  ------------+-------------
        21:20 | 12:12
        16:11 | 09:04
        21:31 | 13:40
        16:32 | 08:28
        14:53 | 08:16
        16:01 | 07:59
        16:17 | 08:40
        15:30 | 08:49
        15:12 | 09:11
        16:20 | 08:34

which means the pipeline is stalled for at least 5-8 minutes each
time. That's time that we could use to run builds, but we just sit
idly and wait instead. The difference becomes even bigger with
FreeBSD in the mix.
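
The per-pipeline gap behind that estimate can be recomputed from the
table above. The following is only a sketch of the arithmetic: the
timings are copied from the table, while the helper names (to_seconds,
fmt) are made up for illustration. It assumes that a runner which has
finished the shortest native job has nothing from the next stage to
pick up until the longest one completes.

```python
# (longest, shortest) native job duration per pipeline, "MM:SS",
# copied from the table above.
JOBS = [
    ("21:20", "12:12"), ("16:11", "09:04"), ("21:31", "13:40"),
    ("16:32", "08:28"), ("14:53", "08:16"), ("16:01", "07:59"),
    ("16:17", "08:40"), ("15:30", "08:49"), ("15:12", "09:11"),
    ("16:20", "08:34"),
]


def to_seconds(mmss):
    """Convert an "MM:SS" string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def fmt(seconds):
    """Format a number of seconds back into "MM:SS"."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


# Gap between the longest and the shortest job in each pipeline.
gaps = [to_seconds(longest) - to_seconds(shortest) for longest, shortest in JOBS]

for (longest, shortest), gap in zip(JOBS, gaps):
    print(f"{longest} - {shortest} = {fmt(gap)}")

print("smallest gap:", fmt(min(gaps)))
print("largest gap: ", fmt(max(gaps)))
```

For these ten rows the gap between longest and shortest job ranges
from about 6 to about 9 minutes.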

Even from a more semantic point of view, pipeline stages exist to
implement dependencies between jobs: a good example is our container
build jobs, which of course need to happen *before* the build job
that uses that container can start. There are no dependencies
whatsoever between native builds and cross builds.
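
To illustrate what a stage barrier costs when there is no real
dependency, here is a toy model. Everything in it is hypothetical:
the job durations, the number of runners, and the greedy "next job
goes to whichever runner frees up first" policy are assumptions made
for the sake of the example, not a description of how GitLab's shared
runners actually schedule work.

```python
import heapq


def makespan(durations, runners):
    """Greedy list scheduling: each job, in order, is handed to the
    runner that becomes free first. Returns the finish time of the
    last job."""
    free_at = [0] * runners
    heapq.heapify(free_at)
    finish = 0
    for duration in durations:
        start = heapq.heappop(free_at)
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


def staged(stages, runners):
    """Each stage starts only once the previous one is fully done."""
    return sum(makespan(stage, runners) for stage in stages)


def merged(stages, runners):
    """All jobs share a single stage, so there is no barrier."""
    return makespan([d for stage in stages for d in stage], runners)


# Hypothetical job durations, in minutes.
native = [21, 12, 14, 13]
cross = [10, 9, 11, 8]
RUNNERS = 4  # hypothetical number of concurrent runners

print("staged:", staged([native, cross], RUNNERS))
print("merged:", merged([native, cross], RUNNERS))
```

In this toy case the merged layout finishes earlier because the three
runners that are done with the short native jobs start on cross
builds right away instead of waiting for the 21-minute job.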

> > If you want to demonstrate a time improvement from these merged
> > stages, then run 20 pipelines over a couple of days and show
> > that they're consistently better than what we see already, and
> > not just a reflection of the CI infra load at a point in time.

I could do that, sure, it just seems like a waste of shared runner
CPU time...

> Also remember that we're using ccache, so slower builds may just be a
> reflection of the ccache having low hit rate - a sequence of repeated
> builds of the same branch should identify if that's the case.

I've been running builds pretty much non-stop over the past few days,
and since the cache is keyed off the job's name there should be no
significant skew caused by this.

-- 
Andrea Bolognani / Red Hat / Virtualization

Re: [libvirt PATCH] ci: Reduce number of stages

Posted by Daniel P. Berrangé 5 years, 8 months ago

On Wed, Jun 10, 2020 at 05:15:55PM +0200, Andrea Bolognani wrote:
> On Wed, 2020-06-10 at 13:31 +0100, Daniel P. Berrangé wrote:
> > On Wed, Jun 10, 2020 at 01:14:51PM +0100, Daniel P. Berrangé wrote:
> > > On Wed, Jun 10, 2020 at 01:33:01PM +0200, Andrea Bolognani wrote:
> > > > Building artifacts in a separate pipeline stage also doesn't have any
> > > > advantages, and only delays further stages by a couple of minutes.
> > > > The only job that really makes sense in its own stage is the DCO
> > > > check, because it's extremely fast (less than 1 minute) and, if that
> > > > fails, we can avoid kicking off all other jobs.
> > > 
> > > The advantage of using stages is that it makes it easy to see at a
> > > glance where the pipeline was failing.
> 
> Ultimately you'll need to drill down to the actual failure, though,
> so the only situation in which it would really provide value is if
> for some reason *all* cross builds failed at once, which is not
> something that happens frequently enough to optimize for.
> 
> > > > Reducing the number of stages results in significant speedups:
> > > > specifically, going from three stages to two stages reduces the
> > > > overall completion time for a full CI pipeline from ~45 minutes[1]
> > > > to ~30 minutes[2].
> > > > 
> > > > [1] https://gitlab.com/abologna/libvirt/-/pipelines/154751893
> > > > [2] https://gitlab.com/abologna/libvirt/-/pipelines/154771173
> > > 
> > > I don't think this time comparison is showing a genuine difference.
> > > 
> > > If we look at the original staged pipeline, every single individual
> > > job took much longer than every individual job in the simplified
> > > pipeline. I think the difference in job times accounts for most
> > > (possibly all) of the difference in the pipelines time.
> > > 
> > > If we look at the history of libvirt pipelines:
> > > 
> > >    https://gitlab.com/libvirt/libvirt/pipelines
> > > 
> > > the vast majority of the time we're completing in 30 minutes or
> > > less already.
> 
> That was before introducing FreeBSD builds, which for whatever reason
> take a significantly longer time: the last couple of jobs both took
> 50+ minutes. Installing packages is very inefficient, it would seem.

Oh dear, yeah, I missed that it introduced FreeBSD.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|