[v1] fix build failures from incorrectly skipped container build jobs

[PATCH 0/3] fix build failures from incorrectly skipped container build jobs

Posted by Daniel P. Berrangé 5 years ago

This series fixes a problem with our gitlab CI rules that cause
container builds to be skipped. See the commit description in the
first patch for the details on this problem.

The overall result of this series though is a small increase in overall
pipeline time.

Previously

 - When container jobs are skipped: approx 1hr 5 mins
 - When container jobs are run, cached by docker: approx 1hr 15 mins
 - When container jobs are run, not cached by docker: approx 1hr 30 mins

With this series applied the first scenario no longer exists, so
all piplines are either 1hr 15 or 1hr 30 depending on whether the
container phase is skipped.

On the plus side the builds are more reliable as we're actually
building container images at correct times.

There is still a race condition though where build jobs can run
with the wrong containers. This happens if you push two different
branches to gitlab with different docker file content. If the
container jobs for the 2nd branch finish before the 1st
branch runs its build jobs, the 1st branch can end up using
containers fro the second branch.  The only fix to truely fix
that would be to stop using "latest" docker tag and always
use a tag based on the branch name. This would mean we build
up a growing set of docker images in the gitlab registry.

At least this series is much more correct that what exists in
git currently.

Daniel P. Berrangé (3):
  gitlab: always build container images
  gitlab: add fine grained job deps for all build jobs
  gitlab: fix inconsistent indentation

 .gitlab-ci.d/containers.yml |  7 ----
 .gitlab-ci.yml              | 74 +++++++++++++++++++++++++++++++++----
 2 files changed, 66 insertions(+), 15 deletions(-)

-- 
2.29.2

Re: [PATCH 0/3] fix build failures from incorrectly skipped container build jobs

Posted by Daniel P. Berrangé 5 years ago

On Mon, Feb 08, 2021 at 04:33:36PM +0000, Daniel P. Berrangé wrote:
> This series fixes a problem with our gitlab CI rules that cause
> container builds to be skipped. See the commit description in the
> first patch for the details on this problem.
> 
> The overall result of this series though is a small increase in overall
> pipeline time.
> 
> Previously
> 
>  - When container jobs are skipped: approx 1hr 5 mins
>  - When container jobs are run, cached by docker: approx 1hr 15 mins
>  - When container jobs are run, not cached by docker: approx 1hr 30 mins
> 
> With this series applied the first scenario no longer exists, so
> all piplines are either 1hr 15 or 1hr 30 depending on whether the
> container phase is skipped.

I mean to say the biggest problem I see is the cross-win64-system
job. This consumes 1 hour 5 minutes all on its own. It is at least
15 minutes longer that every other job AFAICT. So no matter how
well we parallelize stuff, 1 hr 5 is a hard lower limit on pipeline
duration right now.

We might want to consider how to split the win64 job or cut down
what it does in some way ?


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH 0/3] fix build failures from incorrectly skipped container build jobs

Posted by Philippe Mathieu-Daudé 4 years, 12 months ago

On 2/8/21 6:22 PM, Daniel P. Berrangé wrote:
> On Mon, Feb 08, 2021 at 04:33:36PM +0000, Daniel P. Berrangé wrote:
>> This series fixes a problem with our gitlab CI rules that cause
>> container builds to be skipped. See the commit description in the
>> first patch for the details on this problem.
>>
>> The overall result of this series though is a small increase in overall
>> pipeline time.
>>
>> Previously
>>
>>  - When container jobs are skipped: approx 1hr 5 mins
>>  - When container jobs are run, cached by docker: approx 1hr 15 mins
>>  - When container jobs are run, not cached by docker: approx 1hr 30 mins
>>
>> With this series applied the first scenario no longer exists, so
>> all piplines are either 1hr 15 or 1hr 30 depending on whether the
>> container phase is skipped.
> 
> I mean to say the biggest problem I see is the cross-win64-system
> job. This consumes 1 hour 5 minutes all on its own. It is at least
> 15 minutes longer that every other job AFAICT. So no matter how
> well we parallelize stuff, 1 hr 5 is a hard lower limit on pipeline
> duration right now.
> 
> We might want to consider how to split the win64 job or cut down
> what it does in some way ?

When the win64 build was added (on Debian), it was to mostly to cover
the WHPX. Later we moved mingw jobs to Fedora. I just checked and
WHPX is no more built, and nobody complained, so it is not relevant
anymore.

I don't mind much what you do with the Gitlab win64 job, as this config
is better covered on Cirrus. I'd like to keep the win32 job because it
has been useful to catch 32-bit issues.

Regards,

Phil.

Re: [PATCH 0/3] fix build failures from incorrectly skipped container build jobs

Posted by Daniel P. Berrangé 4 years, 12 months ago

On Mon, Feb 08, 2021 at 07:08:39PM +0100, Philippe Mathieu-Daudé wrote:
> On 2/8/21 6:22 PM, Daniel P. Berrangé wrote:
> > On Mon, Feb 08, 2021 at 04:33:36PM +0000, Daniel P. Berrangé wrote:
> >> This series fixes a problem with our gitlab CI rules that cause
> >> container builds to be skipped. See the commit description in the
> >> first patch for the details on this problem.
> >>
> >> The overall result of this series though is a small increase in overall
> >> pipeline time.
> >>
> >> Previously
> >>
> >>  - When container jobs are skipped: approx 1hr 5 mins
> >>  - When container jobs are run, cached by docker: approx 1hr 15 mins
> >>  - When container jobs are run, not cached by docker: approx 1hr 30 mins
> >>
> >> With this series applied the first scenario no longer exists, so
> >> all piplines are either 1hr 15 or 1hr 30 depending on whether the
> >> container phase is skipped.
> > 
> > I mean to say the biggest problem I see is the cross-win64-system
> > job. This consumes 1 hour 5 minutes all on its own. It is at least
> > 15 minutes longer that every other job AFAICT. So no matter how
> > well we parallelize stuff, 1 hr 5 is a hard lower limit on pipeline
> > duration right now.
> > 
> > We might want to consider how to split the win64 job or cut down
> > what it does in some way ?
> 
> When the win64 build was added (on Debian), it was to mostly to cover
> the WHPX. Later we moved mingw jobs to Fedora. I just checked and
> WHPX is no more built, and nobody complained, so it is not relevant
> anymore.
> 
> I don't mind much what you do with the Gitlab win64 job, as this config
> is better covered on Cirrus. I'd like to keep the win32 job because it
> has been useful to catch 32-bit issues.

I'm not suggesting we remove it. Most developers won't setup Cirrus CI,
so will only run GitLab CI jobs.  IME it is good to have both win32
and win64 coverage because things do break differently on them - especially
if you use bad printf format strings that are not independant of host
word size


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH 0/3] fix build failures from incorrectly skipped container build jobs

Posted by Stefan Weil 4 years, 12 months ago

Am 08.02.21 um 19:12 schrieb Daniel P. Berrangé:

> On Mon, Feb 08, 2021 at 07:08:39PM +0100, Philippe Mathieu-Daudé wrote:
>> On 2/8/21 6:22 PM, Daniel P. Berrangé wrote:
>>> On Mon, Feb 08, 2021 at 04:33:36PM +0000, Daniel P. Berrangé wrote:
>>>> This series fixes a problem with our gitlab CI rules that cause
>>>> container builds to be skipped. See the commit description in the
>>>> first patch for the details on this problem.
>>>>
>>>> The overall result of this series though is a small increase in overall
>>>> pipeline time.
>>>>
>>>> Previously
>>>>
>>>>   - When container jobs are skipped: approx 1hr 5 mins
>>>>   - When container jobs are run, cached by docker: approx 1hr 15 mins
>>>>   - When container jobs are run, not cached by docker: approx 1hr 30 mins
>>>>
>>>> With this series applied the first scenario no longer exists, so
>>>> all piplines are either 1hr 15 or 1hr 30 depending on whether the
>>>> container phase is skipped.
>>> I mean to say the biggest problem I see is the cross-win64-system
>>> job. This consumes 1 hour 5 minutes all on its own. It is at least
>>> 15 minutes longer that every other job AFAICT. So no matter how
>>> well we parallelize stuff, 1 hr 5 is a hard lower limit on pipeline
>>> duration right now.
>>>
>>> We might want to consider how to split the win64 job or cut down
>>> what it does in some way ?
>> When the win64 build was added (on Debian), it was to mostly to cover
>> the WHPX. Later we moved mingw jobs to Fedora. I just checked and
>> WHPX is no more built, and nobody complained, so it is not relevant
>> anymore.
>>
>> I don't mind much what you do with the Gitlab win64 job, as this config
>> is better covered on Cirrus. I'd like to keep the win32 job because it
>> has been useful to catch 32-bit issues.
> I'm not suggesting we remove it. Most developers won't setup Cirrus CI,
> so will only run GitLab CI jobs.  IME it is good to have both win32
> and win64 coverage because things do break differently on them - especially
> if you use bad printf format strings that are not independant of host
> word size


I would not say that something is not relevant just because nobody 
complains. Nobody would miss any CI if everything were always fine. So 
people would miss WHPX CI as soon as there are changes (which happen 
infrequently) and something breaks. WHPX should be covered by the w64 
build, and as many as possible other features where I currently see a 
"NO" in the configure output as well.

Nevertheless I don't think that each CI job must run frequently. Each 
run not only costs time, but also energy, and contributes to our climate 
change.

I think that for the win32 and win64 jobs it would be sufficient to run 
them weekly, maybe even alternating if that is possible.

Stefan

Re: [PATCH 0/3] fix build failures from incorrectly skipped container build jobs

Posted by Thomas Huth 4 years, 12 months ago

On 09/02/2021 07.01, Stefan Weil wrote:
> Am 08.02.21 um 19:12 schrieb Daniel P. Berrangé:
> 
>> On Mon, Feb 08, 2021 at 07:08:39PM +0100, Philippe Mathieu-Daudé wrote:
>>> On 2/8/21 6:22 PM, Daniel P. Berrangé wrote:
>>>> On Mon, Feb 08, 2021 at 04:33:36PM +0000, Daniel P. Berrangé wrote:
>>>>> This series fixes a problem with our gitlab CI rules that cause
>>>>> container builds to be skipped. See the commit description in the
>>>>> first patch for the details on this problem.
>>>>>
>>>>> The overall result of this series though is a small increase in overall
>>>>> pipeline time.
>>>>>
>>>>> Previously
>>>>>
>>>>>   - When container jobs are skipped: approx 1hr 5 mins
>>>>>   - When container jobs are run, cached by docker: approx 1hr 15 mins
>>>>>   - When container jobs are run, not cached by docker: approx 1hr 30 mins
>>>>>
>>>>> With this series applied the first scenario no longer exists, so
>>>>> all piplines are either 1hr 15 or 1hr 30 depending on whether the
>>>>> container phase is skipped.
>>>> I mean to say the biggest problem I see is the cross-win64-system
>>>> job. This consumes 1 hour 5 minutes all on its own. It is at least
>>>> 15 minutes longer that every other job AFAICT. So no matter how
>>>> well we parallelize stuff, 1 hr 5 is a hard lower limit on pipeline
>>>> duration right now.
>>>>
>>>> We might want to consider how to split the win64 job or cut down
>>>> what it does in some way ?
>>> When the win64 build was added (on Debian), it was to mostly to cover
>>> the WHPX. Later we moved mingw jobs to Fedora. I just checked and
>>> WHPX is no more built, and nobody complained, so it is not relevant
>>> anymore.
>>>
>>> I don't mind much what you do with the Gitlab win64 job, as this config
>>> is better covered on Cirrus. I'd like to keep the win32 job because it
>>> has been useful to catch 32-bit issues.
>> I'm not suggesting we remove it. Most developers won't setup Cirrus CI,
>> so will only run GitLab CI jobs.  IME it is good to have both win32
>> and win64 coverage because things do break differently on them - especially
>> if you use bad printf format strings that are not independant of host
>> word size
> 
> 
> I would not say that something is not relevant just because nobody 
> complains. Nobody would miss any CI if everything were always fine. So 
> people would miss WHPX CI as soon as there are changes (which happen 
> infrequently) and something breaks. WHPX should be covered by the w64 build, 
> and as many as possible other features where I currently see a "NO" in the 
> configure output as well.

Yes, I agree, we should add WHPX there again. Could somebody please check 
whether the headers are already available in the latest Fedora? Then we 
might simply switch the container to use the latest version of Fedora instead.

Otherwise we should install the headers manually there, like we did in 
commit d3dd34a1e5e134 for the now-removed Debian container.

> Nevertheless I don't think that each CI job must run frequently. Each run 
> not only costs time, but also energy, and contributes to our climate change.
> 
> I think that for the win32 and win64 jobs it would be sufficient to run them 
> weekly, maybe even alternating if that is possible.

Maybe it would be sufficient to run those jobs only on tags (so that they 
get checked for pull requests) and on the master and staging branch?

  Thomas

Re: [PATCH 0/3] fix build failures from incorrectly skipped container build jobs

Posted by Daniel P. Berrangé 4 years, 12 months ago

On Tue, Feb 09, 2021 at 07:01:49AM +0100, Stefan Weil wrote:
> Am 08.02.21 um 19:12 schrieb Daniel P. Berrangé:
> 
> > On Mon, Feb 08, 2021 at 07:08:39PM +0100, Philippe Mathieu-Daudé wrote:
> > > On 2/8/21 6:22 PM, Daniel P. Berrangé wrote:
> > > > On Mon, Feb 08, 2021 at 04:33:36PM +0000, Daniel P. Berrangé wrote:
> > > > > This series fixes a problem with our gitlab CI rules that cause
> > > > > container builds to be skipped. See the commit description in the
> > > > > first patch for the details on this problem.
> > > > > 
> > > > > The overall result of this series though is a small increase in overall
> > > > > pipeline time.
> > > > > 
> > > > > Previously
> > > > > 
> > > > >   - When container jobs are skipped: approx 1hr 5 mins
> > > > >   - When container jobs are run, cached by docker: approx 1hr 15 mins
> > > > >   - When container jobs are run, not cached by docker: approx 1hr 30 mins
> > > > > 
> > > > > With this series applied the first scenario no longer exists, so
> > > > > all piplines are either 1hr 15 or 1hr 30 depending on whether the
> > > > > container phase is skipped.
> > > > I mean to say the biggest problem I see is the cross-win64-system
> > > > job. This consumes 1 hour 5 minutes all on its own. It is at least
> > > > 15 minutes longer that every other job AFAICT. So no matter how
> > > > well we parallelize stuff, 1 hr 5 is a hard lower limit on pipeline
> > > > duration right now.
> > > > 
> > > > We might want to consider how to split the win64 job or cut down
> > > > what it does in some way ?
> > > When the win64 build was added (on Debian), it was to mostly to cover
> > > the WHPX. Later we moved mingw jobs to Fedora. I just checked and
> > > WHPX is no more built, and nobody complained, so it is not relevant
> > > anymore.
> > > 
> > > I don't mind much what you do with the Gitlab win64 job, as this config
> > > is better covered on Cirrus. I'd like to keep the win32 job because it
> > > has been useful to catch 32-bit issues.
> > I'm not suggesting we remove it. Most developers won't setup Cirrus CI,
> > so will only run GitLab CI jobs.  IME it is good to have both win32
> > and win64 coverage because things do break differently on them - especially
> > if you use bad printf format strings that are not independant of host
> > word size
> 
> 
> I would not say that something is not relevant just because nobody
> complains. Nobody would miss any CI if everything were always fine. So
> people would miss WHPX CI as soon as there are changes (which happen
> infrequently) and something breaks. WHPX should be covered by the w64 build,
> and as many as possible other features where I currently see a "NO" in the
> configure output as well.
> 
> Nevertheless I don't think that each CI job must run frequently. Each run
> not only costs time, but also energy, and contributes to our climate change.
> 
> I think that for the win32 and win64 jobs it would be sufficient to run them
> weekly, maybe even alternating if that is possible.

I think that would be a major step backwards. Not only do we need these
jobs to be gating for merges into git master to prevent regressions getting
merged, but we want want to stop contributors and subsystem maintainers
sending broken patch series.  Not running them every time will loose these
protections.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH 0/3] fix build failures from incorrectly skipped container build jobs

Posted by Philippe Mathieu-Daudé 4 years, 11 months ago

On 2/8/21 5:33 PM, Daniel P. Berrangé wrote:
> This series fixes a problem with our gitlab CI rules that cause
> container builds to be skipped. See the commit description in the
> first patch for the details on this problem.
...

> There is still a race condition though where build jobs can run
> with the wrong containers. This happens if you push two different
> branches to gitlab with different docker file content. If the
> container jobs for the 2nd branch finish before the 1st
> branch runs its build jobs, the 1st branch can end up using
> containers fro the second branch.  The only fix to truely fix
> that would be to stop using "latest" docker tag and always
> use a tag based on the branch name. This would mean we build
> up a growing set of docker images in the gitlab registry.

OK this indeed describes the problem I'm facing.

> At least this series is much more correct that what exists in
> git currently.

Good, I'll test it then.

Regards,

Phil.