scripts/ci: add gitlab-failure-analysis script

[RFC PATCH] scripts/ci: add gitlab-failure-analysis script

Posted by Alex Bennée 1 day ago

This is a script designed to collect data from multiple pipelines and
analyse the failure modes they have.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
 scripts/ci/gitlab-failure-analysis | 65 ++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)
 create mode 100755 scripts/ci/gitlab-failure-analysis

diff --git a/scripts/ci/gitlab-failure-analysis b/scripts/ci/gitlab-failure-analysis
new file mode 100755
index 00000000000..195db63a0c0
--- /dev/null
+++ b/scripts/ci/gitlab-failure-analysis
@@ -0,0 +1,65 @@
+#!/usr/bin/env python3
+#
+# A script to analyse failures in the gitlab pipelines. It requires an
+# API key from gitlab with the following permissions:
+#  - api
+#  - read_repository
+#  - read_user
+#
+
+import argparse
+import gitlab
+import os
+
+#
+# Arguments
+#
+parser = argparse.ArgumentParser(description="Analyse failed GitLab CI runs.")
+
+parser.add_argument("--gitlab",
+                    default="https://gitlab.com",
+                    help="GitLab instance URL (default: https://gitlab.com).")
+parser.add_argument("--id", default=11167699,
+                    type=int,
+                    help="GitLab project id (default: 11167699 for qemu-project/qemu)")
+parser.add_argument("--token",
+                    default=os.getenv("GITLAB_TOKEN"),
+                    help="Your personal access token with 'api' scope.")
+parser.add_argument("--branch",
+                    default="staging",
+                    help="The name of the branch (default: 'staging')")
+parser.add_argument("--count", type=int,
+                    default=3,
+                    help="The number of failed runs to fetch.")
+
+
+if __name__ == "__main__":
+    args = parser.parse_args()
+
+    gl = gitlab.Gitlab(url=args.gitlab, private_token=args.token)
+    project = gl.projects.get(args.id)
+
+    # Use an iterator to fetch the pipelines
+    pipe_iter = project.pipelines.list(iterator=True,
+                                       status="failed",
+                                       ref=args.branch)
+    pipe_failed = [p for _, p in zip(range(args.count), pipe_iter)]
+
+    # Check each failed pipeline
+    for p in pipe_failed:
+
+        jobs = p.jobs.list(get_all=True)
+        failed_jobs = [j for j in jobs if j.status == "failed"]
+        skipped_jobs = [j for j in jobs if j.status == "skipped"]
+        manual_jobs = [j for j in jobs if j.status == "manual"]
+
+        test_report = p.test_report.get()
+
+        print(f"Failed pipeline {p.id}, total jobs {len(jobs)}, "
+              f"skipped {len(skipped_jobs)}, "
+              f"failed {len(failed_jobs)}, "
+              f"{test_report.total_count} tests, "
+              f"{test_report.failed_count} failed tests")
+
+        for j in failed_jobs:
+            print(f"  Failed {j.id}, {j.name}, {j.web_url}")
-- 
2.47.3

Re: [RFC PATCH] scripts/ci: add gitlab-failure-analysis script

Posted by Thomas Huth 16 hours ago

On 08/09/2025 23.18, Alex Bennée wrote:
> This is a script designed to collect data from multiple pipelines and
> analyse the failure modes they have.
> 
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> ---
>   scripts/ci/gitlab-failure-analysis | 65 ++++++++++++++++++++++++++++++
>   1 file changed, 65 insertions(+)
>   create mode 100755 scripts/ci/gitlab-failure-analysis

You already get a nice overview by visiting a page like 
https://gitlab.com/qemu-project/qemu/-/pipelines/2019002986 ... what's the 
advantage of this script?

  Thomas

Re: [RFC PATCH] scripts/ci: add gitlab-failure-analysis script

Posted by Alex Bennée 12 hours ago

Thomas Huth <thuth@redhat.com> writes:

> On 08/09/2025 23.18, Alex Bennée wrote:
>> This is a script designed to collect data from multiple pipelines and
>> analyse the failure modes they have.
>> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
>> ---
>>   scripts/ci/gitlab-failure-analysis | 65 ++++++++++++++++++++++++++++++
>>   1 file changed, 65 insertions(+)
>>   create mode 100755 scripts/ci/gitlab-failure-analysis
>
> You already get a nice overview by visiting a page like
> https://gitlab.com/qemu-project/qemu/-/pipelines/2019002986 ... what's
> the advantage of this script?

Not having to click every link when I want to see what the pattern of
failures is and what might be a candidate for marking flaky.
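
One way to surface such a pattern is to tally failed job names across the
fetched pipelines. The following is a minimal sketch, not part of the posted
patch: `tally_failed_jobs` is a hypothetical helper, and each job is assumed
to be a plain dict with `name` and `status` keys rather than a python-gitlab
object.

```python
from collections import Counter


def tally_failed_jobs(pipelines):
    """Count how often each job name failed across several pipelines.

    `pipelines` is an iterable of job lists; each job is assumed to be a
    dict with at least 'name' and 'status' keys (an illustrative shape).
    """
    counts = Counter()
    for jobs in pipelines:
        counts.update(j["name"] for j in jobs if j["status"] == "failed")
    return counts


if __name__ == "__main__":
    sample = [
        [{"name": "build-a", "status": "failed"},
         {"name": "build-b", "status": "success"}],
        [{"name": "build-a", "status": "failed"},
         {"name": "check-c", "status": "failed"}],
    ]
    # Jobs that fail in many pipelines are listed first.
    for name, n in tally_failed_jobs(sample).most_common():
        print(f"{n:3d}  {name}")
```

Jobs that recur near the top of the tally are the ones worth a closer look.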

>
>  Thomas

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro

Re: [RFC PATCH] scripts/ci: add gitlab-failure-analysis script

Posted by Peter Maydell 12 hours ago

On Tue, 9 Sept 2025 at 09:39, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Thomas Huth <thuth@redhat.com> writes:
>
> > On 08/09/2025 23.18, Alex Bennée wrote:
> >> This is a script designed to collect data from multiple pipelines and
> >> analyse the failure modes they have.
> >> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> >> ---
> >>   scripts/ci/gitlab-failure-analysis | 65 ++++++++++++++++++++++++++++++
> >>   1 file changed, 65 insertions(+)
> >>   create mode 100755 scripts/ci/gitlab-failure-analysis
> >
> > You already get a nice overview by visiting a page like
> > https://gitlab.com/qemu-project/qemu/-/pipelines/2019002986 ... what's
> > the advantage of this script?
>
> Not having to click every link when I want to see what the pattern of
> failures is and what might be a candidate for marking flaky.

What I would like for finding flaky tests is to find every
case where:
 * a job failed on commit hash X
 * we also have the same job succeeding on the same commit X

Those are the flaky tests, where we hit retry and it just
passed the second time, and it rules out the cases where
we had a genuine "job failed because the code being tested
for merge had a problem".

When we find those jobs that only failed because of a flaky
test then we can analyse their logs to identify what the
exact failures were.

Can we find those with this script ?  (You can't do it with
the gitlab web UI, whose search and filter capabilities
are extremely limited.)
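
The two conditions above can be sketched as a grouping over job records.
This is only a sketch under stated assumptions: `find_flaky` is a
hypothetical helper, and each job is assumed to be a plain dict with `sha`,
`name` and `status` keys. Fetching the retried jobs is left out; the GitLab
pipeline jobs API appears to offer an `include_retried` option for that, but
this should be checked against the API documentation.

```python
from collections import defaultdict


def find_flaky(jobs):
    """Return sorted (sha, name) pairs that both failed and succeeded.

    `jobs` is an iterable of dicts with 'sha', 'name' and 'status' keys
    (an assumed shape, not the python-gitlab object layout).
    """
    seen = defaultdict(set)
    for j in jobs:
        # Collect every status observed for this job on this commit.
        seen[(j["sha"], j["name"])].add(j["status"])
    return sorted(key for key, statuses in seen.items()
                  if {"failed", "success"} <= statuses)


if __name__ == "__main__":
    sample = [
        {"sha": "abc", "name": "build-a", "status": "failed"},
        {"sha": "abc", "name": "build-a", "status": "success"},
        {"sha": "def", "name": "check-c", "status": "failed"},
    ]
    for sha, name in find_flaky(sample):
        print(f"flaky candidate: {name} on {sha}")
```

A job that only ever failed on a commit is excluded, which matches the
"genuine failure" case described above.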

thanks
-- PMM

Re: [RFC PATCH] scripts/ci: add gitlab-failure-analysis script

Posted by Daniel P. Berrangé 12 hours ago

On Tue, Sep 09, 2025 at 10:00:05AM +0100, Peter Maydell wrote:
> On Tue, 9 Sept 2025 at 09:39, Alex Bennée <alex.bennee@linaro.org> wrote:
> >
> > Thomas Huth <thuth@redhat.com> writes:
> >
> > > On 08/09/2025 23.18, Alex Bennée wrote:
> > >> This is a script designed to collect data from multiple pipelines and
> > >> analyse the failure modes they have.
> > >> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> > >> ---
> > >>   scripts/ci/gitlab-failure-analysis | 65 ++++++++++++++++++++++++++++++
> > >>   1 file changed, 65 insertions(+)
> > >>   create mode 100755 scripts/ci/gitlab-failure-analysis
> > >
> > > You already get a nice overview by visiting a page like
> > > https://gitlab.com/qemu-project/qemu/-/pipelines/2019002986 ... what's
> > > the advantage of this script?
> >
> > Not having to click every link when I want to see what the pattern of
> > failures is and what might be a candidate for marking flaky.
> 
> What I would like for finding flaky tests is to find every
> case where:
>  * a job failed on commit hash X
>  * we also have the same job succeeding on the same commit X
> 
> Those are the flaky tests, where we hit retry and it just
> passed the second time, and it rules out the cases where
> we had a genuine "job failed because the code being tested
> for merge had a problem".
> 
> When we find those jobs that only failed because of a flaky
> test then we can analyse their logs to identify what the
> exact failures were.
> 
> Can we find those with this script ?  (You can't do it with
> the gitlab web UI, whose search and filter capabilities
> are extremely limited.)

Downloading data from the GitLab API is painfully slow, so it is not
something you want to do regularly/repeatedly.

If we can have the script download the data and save it locally,
we could then do something like populate an SQLite DB with pipeline
results, which we can efficiently query to extract failure patterns.
I guess this script at least starts us moving in that direction by
giving us the framework to fetch data, and build on that...
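
As a rough illustration of that idea, here is a minimal sketch using the
Python standard library `sqlite3` module. The table layout, column names and
the helpers `open_db`, `store_jobs` and `flaky_jobs` are all assumptions made
for illustration; none of them come from the posted script.

```python
import sqlite3

# Hypothetical schema: one row per job run, keyed by the GitLab job id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id      INTEGER PRIMARY KEY,
    pipeline_id INTEGER NOT NULL,
    sha         TEXT NOT NULL,
    name        TEXT NOT NULL,
    status      TEXT NOT NULL
)
"""

# Jobs that both failed and succeeded on the same commit.
FLAKY_QUERY = """
SELECT sha, name
FROM jobs
GROUP BY sha, name
HAVING SUM(status = 'failed') > 0 AND SUM(status = 'success') > 0
ORDER BY sha, name
"""


def open_db(path=":memory:"):
    """Open (or create) the local cache database."""
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db


def store_jobs(db, rows):
    """Insert (job_id, pipeline_id, sha, name, status) tuples."""
    db.executemany("INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, ?, ?)", rows)
    db.commit()


def flaky_jobs(db):
    """Return (sha, name) pairs seen with both outcomes."""
    return db.execute(FLAKY_QUERY).fetchall()


if __name__ == "__main__":
    db = open_db()
    store_jobs(db, [
        (1, 10, "abc", "build-a", "failed"),
        (2, 10, "abc", "build-a", "success"),
        (3, 11, "def", "check-c", "failed"),
    ])
    print(flaky_jobs(db))
```

Using the job id as the primary key means re-running the download step
updates existing rows instead of duplicating them.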

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|