Fix tests on macOS

[libvirt PATCH 2/2] tests: don't set G_DEBUG=fatal-warnings on macOS

Posted by Daniel P. Berrangé 3 years, 9 months ago

On macOS when BROKEN_POLL is set in GLib, our tests will periodically
trigger a warning:

  (process:50880): GLib-WARNING **: 02:54:15.272: poll(2) failed due to: Bad file descriptor.

Our code is inherantly racy, calling g_source_destroy which
removes the FD from the event thread poll asynchronously but
we close the FD immediately after g_source_destroy returns.

With poll() this results in POLLNVAL which we're ignoring, but
with select() it generates the BADF error on macOS.

We need to ignore the warnings on macOS to avoid abort()ing
our test programs.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/meson.build | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/tests/meson.build b/tests/meson.build
index fd78d8a2fd..08d2ae67cb 100644
--- a/tests/meson.build
+++ b/tests/meson.build
@@ -39,9 +39,24 @@ tests_env = [
   'abs_top_srcdir=@0@'.format(meson.source_root()),
   'LC_ALL=C',
   'LIBVIRT_AUTOSTART=0',
-  'G_DEBUG=fatal-warnings',
 ]
 
+
+# On macOS when BROKEN_POLL is set for GLib, our tests will
+# periodically trigger a warning:
+#
+#  (process:50880): GLib-WARNING **: 02:54:15.272: poll(2) failed due to: Bad file descriptor.
+#
+# Our code is inherantly racy, calling g_source_destroy which
+# removes the FD from the event thread poll asynchronously
+# but we close the FD immediately after g_source_destroy returns.
+#
+# With poll() this results in POLLNVAL, but with select() it
+# generates the BADF error on macOS
+if host_machine.system() != 'darwin'
+  tests_env += ['G_DEBUG=fatal-warnings']
+endif
+
 if use_expensive_tests
   tests_env += 'VIR_TEST_EXPENSIVE=1'
 else
-- 
2.35.1

Re: [libvirt PATCH 2/2] tests: don't set G_DEBUG=fatal-warnings on macOS

Posted by Andrea Bolognani 3 years, 9 months ago

On Thu, Apr 28, 2022 at 12:22:14PM +0100, Daniel P. Berrangé wrote:
> +# On macOS when BROKEN_POLL is set for GLib, our tests will
> +# periodically trigger a warning:
> +#
> +#  (process:50880): GLib-WARNING **: 02:54:15.272: poll(2) failed due to: Bad file descriptor.
> +#
> +# Our code is inherantly racy, calling g_source_destroy which
> +# removes the FD from the event thread poll asynchronously
> +# but we close the FD immediately after g_source_destroy returns.
> +#
> +# With poll() this results in POLLNVAL, but with select() it
> +# generates the BADF error on macOS
> +if host_machine.system() != 'darwin'
> +  tests_env += ['G_DEBUG=fatal-warnings']
> +endif

Doesn't this just sweep the issue under the carpet? Shouldn't GLib be
changed to do something like

  #ifndef BROKEN_POLL
    if (poll() < 0)
      if (!POLLNVAL)
        return -1;
  #else
    if (select() < 0)
      if (!EBADF)
        return -1;
  #endif

instead? If acting on an fd that's already been closed is okay when
using the poll()-based implementation, the same should apply to the
select()-based one as well.

-- 
Andrea Bolognani / Red Hat / Virtualization

Re: [libvirt PATCH 2/2] tests: don't set G_DEBUG=fatal-warnings on macOS

Posted by Daniel P. Berrangé 3 years, 9 months ago

On Thu, Apr 28, 2022 at 06:39:42AM -0700, Andrea Bolognani wrote:
> On Thu, Apr 28, 2022 at 12:22:14PM +0100, Daniel P. Berrangé wrote:
> > +# On macOS when BROKEN_POLL is set for GLib, our tests will
> > +# periodically trigger a warning:
> > +#
> > +#  (process:50880): GLib-WARNING **: 02:54:15.272: poll(2) failed due to: Bad file descriptor.
> > +#
> > +# Our code is inherantly racy, calling g_source_destroy which
> > +# removes the FD from the event thread poll asynchronously
> > +# but we close the FD immediately after g_source_destroy returns.
> > +#
> > +# With poll() this results in POLLNVAL, but with select() it
> > +# generates the BADF error on macOS
> > +if host_machine.system() != 'darwin'
> > +  tests_env += ['G_DEBUG=fatal-warnings']
> > +endif
> 
> Doesn't this just sweep the issue under the carpet? Shouldn't GLib be
> changed to do something like
> 
>   #ifndef BROKEN_POLL
>     if (poll() < 0)
>       if (!POLLNVAL)
>         return -1;

That's not the way POLLNVAL works - this is an even that is set in
each 'struct pollfd', so it is associated with individual file
descriptors. The application code can thus receive and handle the
G_IO_NVAL condition - it isn't fatal from POV of g_poll

>   #else
>     if (select() < 0)
>       if (!EBADF)
>         return -1;
>   #endif

> 
> instead? If acting on an fd that's already been closed is okay when
> using the poll()-based implementation, the same should apply to the
> select()-based one as well.

The only way to get the same semantics for select() would be to see
EBADF and then iterate over struct pollfd, calling fcntl() to see
which  one(s) are BADF. Possible, but its unpleasantly non-scalable,
so I imagine that's why they didn't do this.

I'll still probably inclined to file a bug against glib, if nothing
else to provide a bug that other people can discover to explain why
the behaviour is different.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt PATCH 2/2] tests: don't set G_DEBUG=fatal-warnings on macOS

Posted by Andrea Bolognani 3 years, 9 months ago

On Thu, Apr 28, 2022 at 02:52:45PM +0100, Daniel P. Berrangé wrote:
> On Thu, Apr 28, 2022 at 06:39:42AM -0700, Andrea Bolognani wrote:
> > Shouldn't GLib be
> > changed to do something like
> >
> >   #ifndef BROKEN_POLL
> >     if (poll() < 0)
> >       if (!POLLNVAL)
> >         return -1;
>
> That's not the way POLLNVAL works - this is an even that is set in
> each 'struct pollfd', so it is associated with individual file
> descriptors. The application code can thus receive and handle the
> G_IO_NVAL condition - it isn't fatal from POV of g_poll

Right, I might have oversimplified things :) But your correction made
the situation clearer in my head, thanks!

> >   #else
> >     if (select() < 0)
> >       if (!EBADF)
> >         return -1;
> >   #endif
>
> >
> > instead? If acting on an fd that's already been closed is okay when
> > using the poll()-based implementation, the same should apply to the
> > select()-based one as well.
>
> The only way to get the same semantics for select() would be to see
> EBADF and then iterate over struct pollfd, calling fcntl() to see
> which  one(s) are BADF. Possible, but its unpleasantly non-scalable,
> so I imagine that's why they didn't do this.

IIUC g_poll(), just like any other function used as GPollFunc, is
supposed to have the same semantics as poll()[1], which means that
passing an invalid FD should result in the corresponding flag being
set rather than an outright failure.

In other words, the current implementation of g_poll() on macOS
doesn't follow the contract defined by GLib itself. It seems to me
that this is a (fairly serious) bug in the library, no?

[1] https://docs.gtk.org/glib/callback.PollFunc.html
-- 
Andrea Bolognani / Red Hat / Virtualization

Re: [libvirt PATCH 2/2] tests: don't set G_DEBUG=fatal-warnings on macOS

Posted by Daniel P. Berrangé 3 years, 9 months ago

On Thu, Apr 28, 2022 at 08:33:46AM -0700, Andrea Bolognani wrote:
> On Thu, Apr 28, 2022 at 02:52:45PM +0100, Daniel P. Berrangé wrote:
> > >   #else
> > >     if (select() < 0)
> > >       if (!EBADF)
> > >         return -1;
> > >   #endif
> >
> > >
> > > instead? If acting on an fd that's already been closed is okay when
> > > using the poll()-based implementation, the same should apply to the
> > > select()-based one as well.
> >
> > The only way to get the same semantics for select() would be to see
> > EBADF and then iterate over struct pollfd, calling fcntl() to see
> > which  one(s) are BADF. Possible, but its unpleasantly non-scalable,
> > so I imagine that's why they didn't do this.
> 
> IIUC g_poll(), just like any other function used as GPollFunc, is
> supposed to have the same semantics as poll()[1], which means that
> passing an invalid FD should result in the corresponding flag being
> set rather than an outright failure.
> 
> In other words, the current implementation of g_poll() on macOS
> doesn't follow the contract defined by GLib itself. It seems to me
> that this is a (fairly serious) bug in the library, no?

It is significant, but long standing. GLib actually had this behaviour
forever on macOS, but it regressed when Meson was introduced, until
the recent fix.

The question is whether efficiency trumps API semantics. Normally I'm
heavily biased towards API semantics, but poll is a performance
critical API, so it isn't so easy to declare we must workaround all
the quirks. 

I filed a bug to raise the subject for discussion though

  https://gitlab.gnome.org/GNOME/glib/-/issues/2644

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt PATCH 2/2] tests: don't set G_DEBUG=fatal-warnings on macOS

Posted by Andrea Bolognani 3 years, 9 months ago

On Thu, Apr 28, 2022 at 05:55:41PM +0100, Daniel P. Berrangé wrote:
> On Thu, Apr 28, 2022 at 08:33:46AM -0700, Andrea Bolognani wrote:
> > In other words, the current implementation of g_poll() on macOS
> > doesn't follow the contract defined by GLib itself. It seems to me
> > that this is a (fairly serious) bug in the library, no?
>
> It is significant, but long standing. GLib actually had this behaviour
> forever on macOS, but it regressed when Meson was introduced, until
> the recent fix.
>
> The question is whether efficiency trumps API semantics. Normally I'm
> heavily biased towards API semantics, but poll is a performance
> critical API, so it isn't so easy to declare we must workaround all
> the quirks.
>
> I filed a bug to raise the subject for discussion though
>
>   https://gitlab.gnome.org/GNOME/glib/-/issues/2644

That's an excellent bug report! Thanks for spending time on it, and
let's see where the upstream discussion leads :)

-- 
Andrea Bolognani / Red Hat / Virtualization

Re: [libvirt PATCH 2/2] tests: don't set G_DEBUG=fatal-warnings on macOS

Posted by Martin Kletzander 3 years, 9 months ago

On Fri, Apr 29, 2022 at 01:05:42AM -0700, Andrea Bolognani wrote:
>On Thu, Apr 28, 2022 at 05:55:41PM +0100, Daniel P. Berrangé wrote:
>> On Thu, Apr 28, 2022 at 08:33:46AM -0700, Andrea Bolognani wrote:
>> > In other words, the current implementation of g_poll() on macOS
>> > doesn't follow the contract defined by GLib itself. It seems to me
>> > that this is a (fairly serious) bug in the library, no?
>>
>> It is significant, but long standing. GLib actually had this behaviour
>> forever on macOS, but it regressed when Meson was introduced, until
>> the recent fix.
>>
>> The question is whether efficiency trumps API semantics. Normally I'm
>> heavily biased towards API semantics, but poll is a performance
>> critical API, so it isn't so easy to declare we must workaround all
>> the quirks.
>>
>> I filed a bug to raise the subject for discussion though
>>
>>   https://gitlab.gnome.org/GNOME/glib/-/issues/2644
>
>That's an excellent bug report! Thanks for spending time on it, and
>let's see where the upstream discussion leads :)
>

Just so I don't sway the potential discussion in the issue to the wrong
way, I'll ask here.  Since BROKEN_POLL is used due to poll() on OS/X not
behaving correctly with fds=NULL, nfds=0 [0], wouldn't it be possible to
"fix" at least the internal calls to g_poll() which are done when
running the main loop since they always have at least one FD (the
eventfd used to signal the main loop [1])?  I know it would not fix the
problem, but it would at least workaround the part that is used the
most, I presume.

Martin

[0] https://gitlab.gnome.org/GNOME/glib/-/blob/main/glib/gpoll.c#L59
[1] this is a presumption I did not find in the code, but I've seen in
     gdb and strace

Re: [libvirt PATCH 2/2] tests: don't set G_DEBUG=fatal-warnings on macOS

Posted by Daniel P. Berrangé 3 years, 9 months ago

On Mon, May 02, 2022 at 03:28:50PM +0200, Martin Kletzander wrote:
> On Fri, Apr 29, 2022 at 01:05:42AM -0700, Andrea Bolognani wrote:
> > On Thu, Apr 28, 2022 at 05:55:41PM +0100, Daniel P. Berrangé wrote:
> > > On Thu, Apr 28, 2022 at 08:33:46AM -0700, Andrea Bolognani wrote:
> > > > In other words, the current implementation of g_poll() on macOS
> > > > doesn't follow the contract defined by GLib itself. It seems to me
> > > > that this is a (fairly serious) bug in the library, no?
> > > 
> > > It is significant, but long standing. GLib actually had this behaviour
> > > forever on macOS, but it regressed when Meson was introduced, until
> > > the recent fix.
> > > 
> > > The question is whether efficiency trumps API semantics. Normally I'm
> > > heavily biased towards API semantics, but poll is a performance
> > > critical API, so it isn't so easy to declare we must workaround all
> > > the quirks.
> > > 
> > > I filed a bug to raise the subject for discussion though
> > > 
> > >   https://gitlab.gnome.org/GNOME/glib/-/issues/2644
> > 
> > That's an excellent bug report! Thanks for spending time on it, and
> > let's see where the upstream discussion leads :)
> > 
> 
> Just so I don't sway the potential discussion in the issue to the wrong
> way, I'll ask here.  Since BROKEN_POLL is used due to poll() on OS/X not
> behaving correctly with fds=NULL, nfds=0 [0], wouldn't it be possible to
> "fix" at least the internal calls to g_poll() which are done when
> running the main loop since they always have at least one FD (the
> eventfd used to signal the main loop [1])?  I know it would not fix the
> problem, but it would at least workaround the part that is used the
> most, I presume.

The comment is somewhat misleading - the nfds=0 case is not actually
the main problem they were fixing. The real killer issue is that
poll() doesn't work with block/char devices - ie any /dev/NNN FD.
This is what broke QEMU because it failed with TAP device FDs.

> 
> [0] https://gitlab.gnome.org/GNOME/glib/-/blob/main/glib/gpoll.c#L59

Dunno why the comment mentions nfds=0 because the the original
commit message describes the device node FDs scenario.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|