[PATCH 3/3] tests/functional: handle URLError when fetching assets

Daniel P. Berrangé posted 3 patches 1 month ago
Maintainers: Thomas Huth <thuth@redhat.com>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, "Daniel P. Berrangé" <berrange@redhat.com>
Posted by Daniel P. Berrangé 1 month ago
We treat most HTTP errors as non-fatal when fetching assets,
but forgot to handle network level errors. This adds catching
of URLError so that we retry on failure, and will ultimately
trigger graceful skipping in the pre-cache task.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/functional/qemu_test/asset.py | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/tests/functional/qemu_test/asset.py b/tests/functional/qemu_test/asset.py
index ae2bec3ea5..36f64fe2f6 100644
--- a/tests/functional/qemu_test/asset.py
+++ b/tests/functional/qemu_test/asset.py
@@ -15,7 +15,7 @@
 from time import sleep
 from pathlib import Path
 from shutil import copyfileobj
-from urllib.error import HTTPError
+from urllib.error import HTTPError, URLError
 
 class AssetError(Exception):
     def __init__(self, asset, msg, transient=False):
@@ -171,6 +171,13 @@ def fetch(self):
                     raise AssetError(self, "Unable to download %s: "
                                      "HTTP error %d" % (self.url, e.code))
                 continue
+            except URLError as e:
+                # This is typically a network/service level error
+                # eg urlopen error [Errno 110] Connection timed out>
+                tmp_cache_file.unlink()
+                self.log.error("Unable to download %s: URL error %s",
+                               self.url, e)
+                continue
             except Exception as e:
                 tmp_cache_file.unlink()
                 raise AssetError(self, "Unable to download %s: %s" % (
-- 
2.50.1
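For readers skimming the archive: the hunk above slots into a retry loop in Asset.fetch() that re-attempts transient failures. A minimal standalone sketch of that pattern — the names MAX_ATTEMPTS and TRANSIENT_HTTP are illustrative assumptions, not QEMU's actual values:

```python
from time import sleep
from urllib.error import HTTPError

MAX_ATTEMPTS = 3                                 # illustrative, not QEMU's value
TRANSIENT_HTTP = {408, 429, 500, 502, 503, 504}  # assumed set of retryable codes

def fetch_with_retry(do_download, url):
    """Retry transient HTTP failures; re-raise permanent ones immediately."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return do_download(url)
        except HTTPError as e:
            if e.code not in TRANSIENT_HTTP:
                raise              # permanent error (e.g. 404): give up now
            sleep(0)               # real code would back off between attempts
    raise RuntimeError("Unable to download %s after %d attempts"
                       % (url, MAX_ATTEMPTS))
```

The patch's question is where URLError (a network-level failure with no HTTP code) fits into this scheme.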


Re: [PATCH 3/3] tests/functional: handle URLError when fetching assets
Posted by Thomas Huth 1 month ago
On 29/08/2025 10.31, Daniel P. Berrangé wrote:
> We treat most HTTP errors as non-fatal when fetching assets,
> but forgot to handle network level errors. This adds catching
> of URLError so that we retry on failure, and will ultimately
> trigger graceful skipping in the pre-cache task.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>   tests/functional/qemu_test/asset.py | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/tests/functional/qemu_test/asset.py b/tests/functional/qemu_test/asset.py
> index ae2bec3ea5..36f64fe2f6 100644
> --- a/tests/functional/qemu_test/asset.py
> +++ b/tests/functional/qemu_test/asset.py
> @@ -15,7 +15,7 @@
>   from time import sleep
>   from pathlib import Path
>   from shutil import copyfileobj
> -from urllib.error import HTTPError
> +from urllib.error import HTTPError, URLError
>   
>   class AssetError(Exception):
>       def __init__(self, asset, msg, transient=False):
> @@ -171,6 +171,13 @@ def fetch(self):
>                       raise AssetError(self, "Unable to download %s: "
>                                        "HTTP error %d" % (self.url, e.code))
>                   continue
> +            except URLError as e:
> +                # This is typically a network/service level error
> +                # eg urlopen error [Errno 110] Connection timed out>
> +                tmp_cache_file.unlink()
> +                self.log.error("Unable to download %s: URL error %s",
> +                               self.url, e)
> +                continue

Hmm, I don't think we should retry on each and every URLError. For example 
if you have a typo in the server name, you get a "Name or service not known" 
URLError, and it does not make sense to retry in that case.

Also, in case of the server being down, it takes a minute or two until 
urllib gives up, so if you retry multiple times in that case, you can easily 
extend the test time by > 5 minutes (as you can see by the timestamps in the 
example in your cover letter), which is unfortunate, too (considering that 
there is also a timeout setting for the gitlab CI jobs).

I think I'd maybe rather do something like this instead, without retrying 
the download, just marking certain errors as transient:

        except URLError as e:
            tmp_cache_file.unlink()
            msg = str(e.reason)
            self.log.error("Unable to download %s: URL error %s",
                           self.url, msg)
            raise AssetError(self, msg,
                             transient=("Network is unreachable" in msg))

WDYT?

  Thomas
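Thomas's variant leans on the existing transient flag instead of retrying. A small self-contained sketch of just that classification step — classify_url_error is a hypothetical helper written for illustration; the suggestion above inlines this logic in the except clause:

```python
from urllib.error import URLError

def classify_url_error(e):
    # Extract the human-readable reason and flag the error as transient
    # only when the network itself is unreachable; DNS failures such as
    # "Name or service not known" stay fatal (likely a typo in the URL).
    msg = str(e.reason)
    return msg, ("Network is unreachable" in msg)
```

With this split, a typo in the server name fails the test run immediately, while a CI runner with no network route gets the graceful skip.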


Re: [PATCH 3/3] tests/functional: handle URLError when fetching assets
Posted by Daniel P. Berrangé 1 month ago
On Fri, Aug 29, 2025 at 11:00:12AM +0200, Thomas Huth wrote:
> 
> Hmm, I don't think we should retry on each and every URLError. For example
> if you have a typo in the server name, you get a "Name or service not known"
> URLError, and it does not make sense to retry in that case.
> 
> Also, in case of the server being down, it takes a minute or two until
> urllib gives up, so if you retry multiple times in that case, you can easily
> extend the test time by > 5 minutes (as you can see by the timestamps in the
> example in your cover letter), which is unfortunate, too (considering that
> there is also a timeout setting for the gitlab CI jobs).
> 
> I think I'd maybe rather do something like this instead, without retrying
> the download, just marking certain errors as transient:
> 
>        except URLError as e:
>            tmp_cache_file.unlink()
>            msg = str(e.reason)
>            self.log.error("Unable to download %s: URL error %s",
>                           self.url, msg)
>            raise AssetError(self, msg,
>                             transient=("Network is unreachable" in msg))
> 
> WDYT?

Ok, yes, we can do that instead.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
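Editorial footnote: the "graceful skipping in the pre-cache task" that both messages refer to works because the caller inspects AssetError.transient. A hypothetical sketch of that consumer side — precache() is written for illustration and is not the actual QEMU code; only the AssetError constructor mirrors the patch context shown above:

```python
class AssetError(Exception):
    # Mirrors the constructor shown in the patch context above.
    def __init__(self, asset, msg, transient=False):
        super().__init__(msg)
        self.asset = asset
        self.transient = transient

def precache(assets, log=print):
    # Fetch every asset up front; skip (rather than fail) the ones that
    # hit a transient error, so the harness can later skip the affected
    # tests instead of reporting them as failures.
    skipped = []
    for asset in assets:
        try:
            asset.fetch()
        except AssetError as e:
            if not e.transient:
                raise              # permanent failure: surface it
            log("skipping %s: %s" % (e.asset, e))
            skipped.append(asset)
    return skipped
```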