[PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection

Richard W.M. Jones posted 1 patch 2 years, 7 months ago
Test checkpatch passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20210913151936.392705-1-rjones@redhat.com
Maintainers: Eric Blake <eblake@redhat.com>, Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
nbd/server.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
[PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection
Posted by Richard W.M. Jones 2 years, 7 months ago
$ rm -f /tmp/sock /tmp/pid
$ qemu-img create -f qcow2 /tmp/disk.qcow2 1M
$ qemu-nbd -t --format=qcow2 --socket=/tmp/sock --pid-file=/tmp/pid /tmp/disk.qcow2 &
$ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()'
qemu-nbd: Disconnect client, due to: Failed to send reply: Unable to write to socket: Broken pipe
$ killall qemu-nbd

nbdsh is abruptly dropping the NBD connection here which is a valid
way to close the connection.  It seems unnecessary to print an error
in this case so this commit suppresses it.

Note that if you call the nbdsh h.shutdown() method then the message
was not printed:

$ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()' -c 'h.shutdown()'

Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
---
 nbd/server.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/nbd/server.c b/nbd/server.c
index 3927f7789d..e0a43802b2 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2669,7 +2669,12 @@ static coroutine_fn void nbd_trip(void *opaque)
         ret = nbd_handle_request(client, &request, req->data, &local_err);
     }
     if (ret < 0) {
-        error_prepend(&local_err, "Failed to send reply: ");
+        if (errno != EPIPE) {
+            error_prepend(&local_err, "Failed to send reply: ");
+        } else {
+            error_free(local_err);
+            local_err = NULL;
+        }
         goto disconnect;
     }
 
-- 
2.32.0


Re: [PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection
Posted by Vladimir Sementsov-Ogievskiy 2 years, 7 months ago
13.09.2021 18:19, Richard W.M. Jones wrote:
> $ rm -f /tmp/sock /tmp/pid
> $ qemu-img create -f qcow2 /tmp/disk.qcow2 1M
> $ qemu-nbd -t --format=qcow2 --socket=/tmp/sock --pid-file=/tmp/pid /tmp/disk.qcow2 &
> $ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()'
> qemu-nbd: Disconnect client, due to: Failed to send reply: Unable to write to socket: Broken pipe
> $ killall qemu-nbd
> 
> nbdsh is abruptly dropping the NBD connection here which is a valid
> way to close the connection.  It seems unnecessary to print an error
> in this case so this commit suppresses it.
> 
> Note that if you call the nbdsh h.shutdown() method then the message
> was not printed:
> 
> $ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()' -c 'h.shutdown()'

My personal opinion, is that this warning doesn't hurt in general. I think in production tools should gracefully shutdown any connection, and abrupt shutdown is a sign of something wrong - i.e., worth warning.

Shouldn't nbdsh do graceful shutdown by default?

> 
> Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
> ---
>   nbd/server.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/nbd/server.c b/nbd/server.c
> index 3927f7789d..e0a43802b2 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -2669,7 +2669,12 @@ static coroutine_fn void nbd_trip(void *opaque)
>           ret = nbd_handle_request(client, &request, req->data, &local_err);
>       }
>       if (ret < 0) {
> -        error_prepend(&local_err, "Failed to send reply: ");
> +        if (errno != EPIPE) {

Both nbd_handle_request() and nbd_send_generic_reply() declares that they return -errno on failure in communication with client. I think, you should use ret here: if (ret != -EPIPE). It's safer: who knows, does errno really set on all error paths of called functions? If not, we may see here errno of some another previous operation.

> +            error_prepend(&local_err, "Failed to send reply: ");
> +        } else {
> +            error_free(local_err);
> +            local_err = NULL;
> +        }
>           goto disconnect;
>       }
>   
> 


-- 
Best regards,
Vladimir

Re: [PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection
Posted by Eric Blake 2 years, 7 months ago
On Tue, Sep 14, 2021 at 05:40:59PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 13.09.2021 18:19, Richard W.M. Jones wrote:
> > $ rm -f /tmp/sock /tmp/pid
> > $ qemu-img create -f qcow2 /tmp/disk.qcow2 1M
> > $ qemu-nbd -t --format=qcow2 --socket=/tmp/sock --pid-file=/tmp/pid /tmp/disk.qcow2 &
> > $ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()'
> > qemu-nbd: Disconnect client, due to: Failed to send reply: Unable to write to socket: Broken pipe
> > $ killall qemu-nbd
> > 
> > nbdsh is abruptly dropping the NBD connection here which is a valid
> > way to close the connection.  It seems unnecessary to print an error
> > in this case so this commit suppresses it.
> > 
> > Note that if you call the nbdsh h.shutdown() method then the message
> > was not printed:
> > 
> > $ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()' -c 'h.shutdown()'
> 
> My personal opinion, is that this warning doesn't hurt in general. I think in production tools should gracefully shutdown any connection, and abrupt shutdown is a sign of something wrong - i.e., worth warning.
> 
> Shouldn't nbdsh do graceful shutdown by default?

nbdsh exposes the ability to do graceful shutdown, but does not force
it (it is up to the client software using nbdsh whether it calls the
right APIs for a graceful shutdown).

We might consider a new API (which we'd then expose via a new
command-line option to nbdsh) that requests that libnbd try harder to
send NBD_OPT_ABORT or NBD_CMD_DISC prior to closing, but it would
still be something that end users would have to opt into using, and
not something we can turn on by default.

> > +++ b/nbd/server.c
> > @@ -2669,7 +2669,12 @@ static coroutine_fn void nbd_trip(void *opaque)
> >           ret = nbd_handle_request(client, &request, req->data, &local_err);
> >       }
> >       if (ret < 0) {
> > -        error_prepend(&local_err, "Failed to send reply: ");
> > +        if (errno != EPIPE) {
> 
> Both nbd_handle_request() and nbd_send_generic_reply() declares that they return -errno on failure in communication with client. I think, you should use ret here: if (ret != -EPIPE). It's safer: who knows, does errno really set on all error paths of called functions? If not, we may see here errno of some another previous operation.

Correct - 'errno' is indeterminate at this point; the correct check is
if (-ret != EPIPE).  I can make that tweak while taking this patch, if
we decide it is worthwhile.


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


Re: [PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection
Posted by Richard W.M. Jones 2 years, 7 months ago
On Tue, Sep 14, 2021 at 05:40:59PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 13.09.2021 18:19, Richard W.M. Jones wrote:
> >$ rm -f /tmp/sock /tmp/pid
> >$ qemu-img create -f qcow2 /tmp/disk.qcow2 1M
> >$ qemu-nbd -t --format=qcow2 --socket=/tmp/sock --pid-file=/tmp/pid /tmp/disk.qcow2 &
> >$ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()'
> >qemu-nbd: Disconnect client, due to: Failed to send reply: Unable to write to socket: Broken pipe
> >$ killall qemu-nbd
> >
> >nbdsh is abruptly dropping the NBD connection here which is a valid
> >way to close the connection.  It seems unnecessary to print an error
> >in this case so this commit suppresses it.
> >
> >Note that if you call the nbdsh h.shutdown() method then the message
> >was not printed:
> >
> >$ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()' -c 'h.shutdown()'
>
> My personal opinion, is that this warning doesn't hurt in general. I
> think in production tools should gracefully shutdown any connection,
> and abrupt shutdown is a sign of something wrong - i.e., worth
> warning.
>
> Shouldn't nbdsh do graceful shutdown by default?

On the client side the only difference is that nbd_shutdown sends
NBD_CMD_DISC to the server (versus simply closing the socket).  On the
server side when the server receives NBD_CMD_DISC it must complete any
in-flight requests, but there's no requirement for the server to
commit anything to disk.  IOW you can still lose data even though you
took the time to disconnect.

So I don't think there's any reason for libnbd to always gracefully
shut down (especially in this case where there are no in-flight
requests), and anyway it would break ABI to make that change and slow
down the client in cases when there's nothing to clean up.

> >Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
> >---
> >  nbd/server.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> >diff --git a/nbd/server.c b/nbd/server.c
> >index 3927f7789d..e0a43802b2 100644
> >--- a/nbd/server.c
> >+++ b/nbd/server.c
> >@@ -2669,7 +2669,12 @@ static coroutine_fn void nbd_trip(void *opaque)
> >          ret = nbd_handle_request(client, &request, req->data, &local_err);
> >      }
> >      if (ret < 0) {
> >-        error_prepend(&local_err, "Failed to send reply: ");
> >+        if (errno != EPIPE) {
>
> Both nbd_handle_request() and nbd_send_generic_reply() declares that
> they return -errno on failure in communication with client. I think,
> you should use ret here: if (ret != -EPIPE). It's safer: who knows,
> does errno really set on all error paths of called functions? If
> not, we may see here errno of some another previous operation.

Should we set errno = 0 earlier in nbd_trip()?  I don't really know
how coroutines in qemu interact with thread-local variables though.

Rich.

> >+            error_prepend(&local_err, "Failed to send reply: ");
> >+        } else {
> >+            error_free(local_err);
> >+            local_err = NULL;
> >+        }
> >          goto disconnect;
> >      }
> >
> 
> 
> -- 
> Best regards,
> Vladimir

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v


Re: [PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection
Posted by Vladimir Sementsov-Ogievskiy 2 years, 7 months ago
14.09.2021 17:52, Richard W.M. Jones wrote:
> On Tue, Sep 14, 2021 at 05:40:59PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> 13.09.2021 18:19, Richard W.M. Jones wrote:
>>> $ rm -f /tmp/sock /tmp/pid
>>> $ qemu-img create -f qcow2 /tmp/disk.qcow2 1M
>>> $ qemu-nbd -t --format=qcow2 --socket=/tmp/sock --pid-file=/tmp/pid /tmp/disk.qcow2 &
>>> $ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()'
>>> qemu-nbd: Disconnect client, due to: Failed to send reply: Unable to write to socket: Broken pipe
>>> $ killall qemu-nbd
>>>
>>> nbdsh is abruptly dropping the NBD connection here which is a valid
>>> way to close the connection.  It seems unnecessary to print an error
>>> in this case so this commit suppresses it.
>>>
>>> Note that if you call the nbdsh h.shutdown() method then the message
>>> was not printed:
>>>
>>> $ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()' -c 'h.shutdown()'
>>
>> My personal opinion, is that this warning doesn't hurt in general. I
>> think in production tools should gracefully shutdown any connection,
>> and abrupt shutdown is a sign of something wrong - i.e., worth
>> warning.
>>
>> Shouldn't nbdsh do graceful shutdown by default?
> 
> On the client side the only difference is that nbd_shutdown sends
> NBD_CMD_DISC to the server (versus simply closing the socket).

Qemu as NBD client always send NBD_CMD_DISC on close() (if io channel is alive).

>  On the
> server side when the server receives NBD_CMD_DISC it must complete any
> in-flight requests, but there's no requirement for the server to
> commit anything to disk.  IOW you can still lose data even though you
> took the time to disconnect.
> 
> So I don't think there's any reason for libnbd to always gracefully

Hmm. Me go to NBD spec :)

I think, there is a reason:

"The client and the server MUST NOT initiate any form of disconnect other than in one of the above circumstances."

And the only possibility for client to initiate a hard disconnect listed above is "if it detects a violation by the other party of a mandatory condition within this document".

So at least, nbdsh violates NBD protocol. May be spec should be updated to satisfy your needs.

> shut down (especially in this case where there are no in-flight
> requests), and anyway it would break ABI to make that change and slow
> down the client in cases when there's nothing to clean up.

Which ABI will it break?

> 
>>> Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
>>> ---
>>>   nbd/server.c | 7 ++++++-
>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/nbd/server.c b/nbd/server.c
>>> index 3927f7789d..e0a43802b2 100644
>>> --- a/nbd/server.c
>>> +++ b/nbd/server.c
>>> @@ -2669,7 +2669,12 @@ static coroutine_fn void nbd_trip(void *opaque)
>>>           ret = nbd_handle_request(client, &request, req->data, &local_err);
>>>       }
>>>       if (ret < 0) {
>>> -        error_prepend(&local_err, "Failed to send reply: ");
>>> +        if (errno != EPIPE) {
>>
>> Both nbd_handle_request() and nbd_send_generic_reply() declares that
>> they return -errno on failure in communication with client. I think,
>> you should use ret here: if (ret != -EPIPE). It's safer: who knows,
>> does errno really set on all error paths of called functions? If
>> not, we may see here errno of some another previous operation.
> 
> Should we set errno = 0 earlier in nbd_trip()?  I don't really know
> how coroutines in qemu interact with thread-local variables though.
> 
> Rich.
> 
>>> +            error_prepend(&local_err, "Failed to send reply: ");
>>> +        } else {
>>> +            error_free(local_err);
>>> +            local_err = NULL;
>>> +        }
>>>           goto disconnect;
>>>       }
>>>
>>
>>
>> -- 
>> Best regards,
>> Vladimir
> 


-- 
Best regards,
Vladimir

Re: [PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection
Posted by Richard W.M. Jones 2 years, 7 months ago
On Tue, Sep 14, 2021 at 06:21:58PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 14.09.2021 17:52, Richard W.M. Jones wrote:
> > On the
> >server side when the server receives NBD_CMD_DISC it must complete any
> >in-flight requests, but there's no requirement for the server to
> >commit anything to disk.  IOW you can still lose data even though you
> >took the time to disconnect.
> >
> >So I don't think there's any reason for libnbd to always gracefully
> 
> Hmm. Me go to NBD spec :)
> 
> I think, there is a reason:
> 
> "The client and the server MUST NOT initiate any form of disconnect other than in one of the above circumstances."
> 
> And the only possibility for client to initiate a hard disconnect listed above is "if it detects a violation by the other party of a mandatory condition within this document".
> 
> So at least, nbdsh violates NBD protocol. May be spec should be updated to satisfy your needs.

I would say the spec is at best contradictory, but if you read other
parts of the spec, then it's pretty clear that we're allowed to drop
the connection whenever we like.  This section says as much:

https://github.com/NetworkBlockDevice/nbd/blob/5805b25ad3da96e7c0b3160cda51ea19eb518d5b/doc/proto.md#terminating-the-transmission-phase

  There are two methods of terminating the transmission phase:
  ...
  "The client or the server drops the TCP session (in which case it
  SHOULD shut down the TLS session first). This is referred to as
  'initiating a hard disconnect'."

Anyway we're dropping the TCP connection because sometimes we are just
interrogating an NBD server eg to find out what it supports, and doing
a graceful shutdown is a waste of time and internet.

> >shut down (especially in this case where there are no in-flight
> >requests), and anyway it would break ABI to make that change and slow
> >down the client in cases when there's nothing to clean up.
> 
> Which ABI will it break?

Our contract with callers using nbd_close(3), if nbd_shutdown(3) is
not called beforehand.
https://libguestfs.org/nbd_shutdown.3.html
https://libguestfs.org/nbd_create.3.html (really nbd_close ...)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v


Re: [PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection
Posted by Vladimir Sementsov-Ogievskiy 2 years, 7 months ago
14.09.2021 19:32, Richard W.M. Jones wrote:
> On Tue, Sep 14, 2021 at 06:21:58PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> 14.09.2021 17:52, Richard W.M. Jones wrote:
>>> On the
>>> server side when the server receives NBD_CMD_DISC it must complete any
>>> in-flight requests, but there's no requirement for the server to
>>> commit anything to disk.  IOW you can still lose data even though you
>>> took the time to disconnect.
>>>
>>> So I don't think there's any reason for libnbd to always gracefully
>>
>> Hmm. Me go to NBD spec :)
>>
>> I think, there is a reason:
>>
>> "The client and the server MUST NOT initiate any form of disconnect other than in one of the above circumstances."
>>
>> And the only possibility for client to initiate a hard disconnect listed above is "if it detects a violation by the other party of a mandatory condition within this document".
>>
>> So at least, nbdsh violates NBD protocol. May be spec should be updated to satisfy your needs.
> 
> I would say the spec is at best contradictory, but if you read other
> parts of the spec, then it's pretty clear that we're allowed to drop
> the connection whenever we like.  This section says as much:
> 
> https://github.com/NetworkBlockDevice/nbd/blob/5805b25ad3da96e7c0b3160cda51ea19eb518d5b/doc/proto.md#terminating-the-transmission-phase

Hmm, that was exactly the section I read and quoted :)

> 
>    There are two methods of terminating the transmission phase:
>    ...
>    "The client or the server drops the TCP session (in which case it
>    SHOULD shut down the TLS session first). This is referred to as
>    'initiating a hard disconnect'."

Right. Here the method is defined, no word that client can do it at any time.

Next, in same section:

    Either side MAY initiate a hard disconnect if it detects a violation by the other party of a mandatory condition within this document.

Next
    
    The client MAY issue a soft disconnect at any time

And finally:

    The client and the server MUST NOT initiate any form of disconnect other than in one of the above circumstances.


Or do you mean some other spec section I missed?

> 
> Anyway we're dropping the TCP connection because sometimes we are just
> interrogating an NBD server eg to find out what it supports, and doing
> a graceful shutdown is a waste of time and internet.
> 
>>> shut down (especially in this case where there are no in-flight
>>> requests), and anyway it would break ABI to make that change and slow
>>> down the client in cases when there's nothing to clean up.
>>
>> Which ABI will it break?
> 
> Our contract with callers using nbd_close(3), if nbd_shutdown(3) is
> not called beforehand.
> https://libguestfs.org/nbd_shutdown.3.html
> https://libguestfs.org/nbd_create.3.html (really nbd_close ...)
> 
> Rich.
> 


-- 
Best regards,
Vladimir

Re: [PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection
Posted by Richard W.M. Jones 2 years, 7 months ago
On Wed, Sep 15, 2021 at 10:15:20AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 14.09.2021 19:32, Richard W.M. Jones wrote:
> >On Tue, Sep 14, 2021 at 06:21:58PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> >>14.09.2021 17:52, Richard W.M. Jones wrote:
> >>>On the
> >>>server side when the server receives NBD_CMD_DISC it must complete any
> >>>in-flight requests, but there's no requirement for the server to
> >>>commit anything to disk.  IOW you can still lose data even though you
> >>>took the time to disconnect.
> >>>
> >>>So I don't think there's any reason for libnbd to always gracefully
> >>
> >>Hmm. Me go to NBD spec :)
> >>
> >>I think, there is a reason:
> >>
> >>"The client and the server MUST NOT initiate any form of disconnect other than in one of the above circumstances."
> >>
> >>And the only possibility for client to initiate a hard disconnect listed above is "if it detects a violation by the other party of a mandatory condition within this document".
> >>
> >>So at least, nbdsh violates NBD protocol. May be spec should be updated to satisfy your needs.
> >
> >I would say the spec is at best contradictory, but if you read other
> >parts of the spec, then it's pretty clear that we're allowed to drop
> >the connection whenever we like.  This section says as much:
> >
> >https://github.com/NetworkBlockDevice/nbd/blob/5805b25ad3da96e7c0b3160cda51ea19eb518d5b/doc/proto.md#terminating-the-transmission-phase
> 
> Hmm, that was exactly the section I read and quoted :)
> 
> >
> >   There are two methods of terminating the transmission phase:
> >   ...
> >   "The client or the server drops the TCP session (in which case it
> >   SHOULD shut down the TLS session first). This is referred to as
> >   'initiating a hard disconnect'."
> 
> Right. Here the method is defined, no word that client can do it at any time.

I don't read this as a definition.

But we should probably clarify the spec to note that dropping the
connection is fine, because it is - no one has made any argument so
far that there's an actual reason to waste time on NBD_CMD_DISC.

Rich.

> Next, in same section:
> 
>    Either side MAY initiate a hard disconnect if it detects a violation by the other party of a mandatory condition within this document.
> 
> Next
>    The client MAY issue a soft disconnect at any time
> 
> And finally:
> 
>    The client and the server MUST NOT initiate any form of disconnect other than in one of the above circumstances.
> 
> 
> Or do you mean some other spec section I missed?
> 
> >
> >Anyway we're dropping the TCP connection because sometimes we are just
> >interrogating an NBD server eg to find out what it supports, and doing
> >a graceful shutdown is a waste of time and internet.
> >
> >>>shut down (especially in this case where there are no in-flight
> >>>requests), and anyway it would break ABI to make that change and slow
> >>>down the client in cases when there's nothing to clean up.
> >>
> >>Which ABI will it break?
> >
> >Our contract with callers using nbd_close(3), if nbd_shutdown(3) is
> >not called beforehand.
> >https://libguestfs.org/nbd_shutdown.3.html
> >https://libguestfs.org/nbd_create.3.html (really nbd_close ...)
> >
> >Rich.
> >
> 
> 
> -- 
> Best regards,
> Vladimir

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/


Re: [PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection
Posted by Eric Blake 2 years, 7 months ago
On Wed, Sep 15, 2021 at 10:00:25AM +0100, Richard W.M. Jones wrote:
> > >I would say the spec is at best contradictory, but if you read other
> > >parts of the spec, then it's pretty clear that we're allowed to drop
> > >the connection whenever we like.  This section says as much:
> > >
> > >https://github.com/NetworkBlockDevice/nbd/blob/5805b25ad3da96e7c0b3160cda51ea19eb518d5b/doc/proto.md#terminating-the-transmission-phase
> > 
> > Hmm, that was exactly the section I read and quoted :)
> > 
> > >
> > >   There are two methods of terminating the transmission phase:
> > >   ...
> > >   "The client or the server drops the TCP session (in which case it
> > >   SHOULD shut down the TLS session first). This is referred to as
> > >   'initiating a hard disconnect'."
> > 
> > Right. Here the method is defined, no word that client can do it at any time.
> 
> I don't read this as a definition.
> 
> But we should probably clarify the spec to note that dropping the
> connection is fine, because it is - no one has made any argument so
> far that there's an actual reason to waste time on NBD_CMD_DISC.
> 
> Rich.
> 
> > Next, in same section:
> > 
> >    Either side MAY initiate a hard disconnect if it detects a violation by the other party of a mandatory condition within this document.
> > 
> > Next
> >    The client MAY issue a soft disconnect at any time
> > 
> > And finally:
> > 
> >    The client and the server MUST NOT initiate any form of disconnect other than in one of the above circumstances.

I went and re-read the upstream NBD discussion when this part of the
doc was added...

https://lists.debian.org/nbd/2016/04/msg00420.html
https://lists.debian.org/nbd/2016/04/msg00425.html

The argument back then was that sending CMD_DISC is desirable (clients
want to be nice to the server) but not mandatory (server must be
prepared for clients that aren't nice).  Sending CMD_DISC is what
counts as initiating soft disconnect; terminating abruptly without
CMD_DISC is hard disconnect (servers must not do it without detecting
some other first, but clients doing it when they have nothing further
to send is something we have to live with).

At the time the text was added, there was a question of whether to add
a command NBD_CMD_CLOSE that guaranteed clean server shutdown (the
client had to wait for the server's response):
https://lists.debian.org/nbd/2016/04/msg00236.html

but in the end, it was decided that the semantics of NBD_CMD_DISC were
already good enough for that purpose.  Which in turn means that
clients really do expect servers to gracefully flush things to disk on
NBD_CMD_DISC, and merely disconnecting (a hard shutdown) after
NBD_CMD_FLUSH is a bit riskier than sending a soft shutdown request.

> > 
> > 
> > Or do you mean some other spec section I missed?
> > 
> > >
> > >Anyway we're dropping the TCP connection because sometimes we are just
> > >interrogating an NBD server eg to find out what it supports, and doing
> > >a graceful shutdown is a waste of time and internet.

You're right that it costs another couple of packets, particularly for
things like 'nbdinfo --list', but is it really dominating the time
spent in the normal use of NBD?  Micro-optimizing the shutdown doesn't
have as much payoff as optimizing the speed of NBD_CMD_READ/WRITE.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


Re: [PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection
Posted by Vladimir Sementsov-Ogievskiy 2 years, 7 months ago
15.09.2021 12:00, Richard W.M. Jones wrote:
> On Wed, Sep 15, 2021 at 10:15:20AM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> 14.09.2021 19:32, Richard W.M. Jones wrote:
>>> On Tue, Sep 14, 2021 at 06:21:58PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>>>> 14.09.2021 17:52, Richard W.M. Jones wrote:
>>>>> On the
>>>>> server side when the server receives NBD_CMD_DISC it must complete any
>>>>> in-flight requests, but there's no requirement for the server to
>>>>> commit anything to disk.  IOW you can still lose data even though you
>>>>> took the time to disconnect.
>>>>>
>>>>> So I don't think there's any reason for libnbd to always gracefully
>>>>
>>>> Hmm. Me go to NBD spec :)
>>>>
>>>> I think, there is a reason:
>>>>
>>>> "The client and the server MUST NOT initiate any form of disconnect other than in one of the above circumstances."
>>>>
>>>> And the only possibility for client to initiate a hard disconnect listed above is "if it detects a violation by the other party of a mandatory condition within this document".
>>>>
>>>> So at least, nbdsh violates NBD protocol. May be spec should be updated to satisfy your needs.
>>>
>>> I would say the spec is at best contradictory, but if you read other
>>> parts of the spec, then it's pretty clear that we're allowed to drop
>>> the connection whenever we like.  This section says as much:
>>>
>>> https://github.com/NetworkBlockDevice/nbd/blob/5805b25ad3da96e7c0b3160cda51ea19eb518d5b/doc/proto.md#terminating-the-transmission-phase
>>
>> Hmm, that was exactly the section I read and quoted :)
>>
>>>
>>>    There are two methods of terminating the transmission phase:
>>>    ...
>>>    "The client or the server drops the TCP session (in which case it
>>>    SHOULD shut down the TLS session first). This is referred to as
>>>    'initiating a hard disconnect'."
>>
>> Right. Here the method is defined, no word that client can do it at any time.
> 
> I don't read this as a definition.

If so, next paragraphs that specify client behavior doesn't make sense.

> 
> But we should probably clarify the spec to note that dropping the
> connection is fine, because it is - no one has made any argument so
> far that there's an actual reason to waste time on NBD_CMD_DISC.
> 

I agree that it's safe enough..

Hmm. If we update the spec to clarify that sending DISC is not necessary, is there any reason to send it at all? We can stop sending it in Qemu as well.


> 
>> Next, in same section:
>>
>>     Either side MAY initiate a hard disconnect if it detects a violation by the other party of a mandatory condition within this document.
>>
>> Next
>>     The client MAY issue a soft disconnect at any time
>>
>> And finally:
>>
>>     The client and the server MUST NOT initiate any form of disconnect other than in one of the above circumstances.
>>
>>
>> Or do you mean some other spec section I missed?
>>
>>>
>>> Anyway we're dropping the TCP connection because sometimes we are just
>>> interrogating an NBD server eg to find out what it supports, and doing
>>> a graceful shutdown is a waste of time and internet.
>>>
>>>>> shut down (especially in this case where there are no in-flight
>>>>> requests), and anyway it would break ABI to make that change and slow
>>>>> down the client in cases when there's nothing to clean up.
>>>>
>>>> Which ABI will it break?
>>>
>>> Our contract with callers using nbd_close(3), if nbd_shutdown(3) is
>>> not called beforehand.
>>> https://libguestfs.org/nbd_shutdown.3.html
>>> https://libguestfs.org/nbd_create.3.html (really nbd_close ...)
>>>
>>> Rich.
>>>
>>
>>
>> -- 
>> Best regards,
>> Vladimir
> 


-- 
Best regards,
Vladimir

Re: [PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection
Posted by Eric Blake 2 years, 7 months ago
On Wed, Sep 15, 2021 at 12:11:35PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > > >    There are two methods of terminating the transmission phase:
> > > >    ...
> > > >    "The client or the server drops the TCP session (in which case it
> > > >    SHOULD shut down the TLS session first). This is referred to as
> > > >    'initiating a hard disconnect'."
> > > 
> > > Right. Here the method is defined, no word that client can do it at any time.

Note that right now, most TLS clients are NOT gracefully shutting down
the TLS session, on either hard or soft disconnect.  And this is even
observable in non-deterministic behavior of what message the server
reports when it diagnoses that a TLS client has gone away.  As argued
elsewhere, a hard disconnect does not necessarily lose data, but it
does add non-determinism into the equation, which makes regression
testing a bit harder.

> > 
> > I don't read this as a definition.
> 
> If so, next paragraphs that specify client behavior doesn't make sense.
> 
> > 
> > But we should probably clarify the spec to note that dropping the
> > connection is fine, because it is - no one has made any argument so
> > far that there's an actual reason to waste time on NBD_CMD_DISC.
> > 
> 
> I agree that it's safe enough..
> 
> Hmm. If we update the spec to clarify that sending DISC is not necessary, is there any reason to send it at all? We can stop sending it in Qemu as well.

No, I'd still prefer that qemu send NBD_CMD_DISC where possible.
Although existing servers tolerate hard disconnects, the way I read
the NBD spec is that clients SHOULD prefer soft disconnect, in case it
deals with a server where the difference matters.

When implementing 'qemu-nbd --list', I remember having to rejigger
code to add in a graceful NBD_OPT_ABORT into more places, for the same
reason as sending NBD_CMD_DISC at the end of transmission phase
reduces the chance for non-determinism on how the server may react.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


Re: [PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection
Posted by Eric Blake 2 years, 7 months ago
[IOn Tue, Sep 14, 2021 at 03:52:00PM +0100, Richard W.M. Jones wrote:
> On Tue, Sep 14, 2021 at 05:40:59PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > 13.09.2021 18:19, Richard W.M. Jones wrote:
> > >$ rm -f /tmp/sock /tmp/pid
> > >$ qemu-img create -f qcow2 /tmp/disk.qcow2 1M
> > >$ qemu-nbd -t --format=qcow2 --socket=/tmp/sock --pid-file=/tmp/pid /tmp/disk.qcow2 &
> > >$ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()'
> > >qemu-nbd: Disconnect client, due to: Failed to send reply: Unable to write to socket: Broken pipe
> > >$ killall qemu-nbd
> > >
> > >nbdsh is abruptly dropping the NBD connection here which is a valid
> > >way to close the connection.  It seems unnecessary to print an error
> > >in this case so this commit suppresses it.
> > >
> > >Note that if you call the nbdsh h.shutdown() method then the message
> > >was not printed:
> > >
> > >$ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()' -c 'h.shutdown()'
> >
> > My personal opinion, is that this warning doesn't hurt in general. I
> > think in production tools should gracefully shutdown any connection,
> > and abrupt shutdown is a sign of something wrong - i.e., worth
> > warning.
> >
> > Shouldn't nbdsh do graceful shutdown by default?
> 
> On the client side the only difference is that nbd_shutdown sends
> NBD_CMD_DISC to the server (versus simply closing the socket).  On the
> server side when the server receives NBD_CMD_DISC it must complete any
> in-flight requests, but there's no requirement for the server to
> commit anything to disk.  IOW you can still lose data even though you
> took the time to disconnect.

If you use NBD_CMD_FLUSH as the last command before NBD_CMD_DISC, then
you shouldn't have data loss (but it requires the server to support
flush).  And in general, while a server that does not flush data on
CMD_DISC is compliant, it is poor quality of implementation if it
strands data that easily, for a client that tried hard to exit
gracefully.

> 
> So I don't think there's any reason for libnbd to always gracefully
> shut down (especially in this case where there are no in-flight
> requests), and anyway it would break ABI to make that change and slow
> down the client in cases when there's nothing to clean up.

I agree that we don't want libnbd to always gracefully shut down by
default; end users can already choose a graceful shutdown when they
want.

At the same time, I would not be opposed to improving the libnbd and
nbdkit testsuite usage of libnbd to request graceful shutdown in
places where it is currently getting an abrupt disconnect merely
because we were lazy when writing the test.

> 
> > >Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
> > >---
> > >  nbd/server.c | 7 ++++++-
> > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > >
> > >diff --git a/nbd/server.c b/nbd/server.c
> > >index 3927f7789d..e0a43802b2 100644
> > >--- a/nbd/server.c
> > >+++ b/nbd/server.c
> > >@@ -2669,7 +2669,12 @@ static coroutine_fn void nbd_trip(void *opaque)
> > >          ret = nbd_handle_request(client, &request, req->data, &local_err);
> > >      }
> > >      if (ret < 0) {
> > >-        error_prepend(&local_err, "Failed to send reply: ");
> > >+        if (errno != EPIPE) {
> >
> > Both nbd_handle_request() and nbd_send_generic_reply() declares that
> > they return -errno on failure in communication with client. I think,
> > you should use ret here: if (ret != -EPIPE). It's safer: who knows,
> > does errno really set on all error paths of called functions? If
> > not, we may see here errno of some another previous operation.
> 
> Should we set errno = 0 earlier in nbd_trip()?  I don't really know
> how coroutines in qemu interact with thread-local variables though.

No, we don't need to set errno to 0 prior to a call except at points
where we expect errno to be reliable after the call; but nbd_trip()
does not have any guarantees of reliable errno in the first place
(instead, it captured errno into the return value prior to any point
where errno loses its reliability).


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


Re: [PATCH v2] nbd/server: Suppress Broken pipe errors on abrupt disconnection
Posted by Eric Blake 2 years, 5 months ago
Revisiting an older thread

On Mon, Sep 13, 2021 at 04:19:36PM +0100, Richard W.M. Jones wrote:
> $ rm -f /tmp/sock /tmp/pid
> $ qemu-img create -f qcow2 /tmp/disk.qcow2 1M
> $ qemu-nbd -t --format=qcow2 --socket=/tmp/sock --pid-file=/tmp/pid /tmp/disk.qcow2 &
> $ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()'
> qemu-nbd: Disconnect client, due to: Failed to send reply: Unable to write to socket: Broken pipe
> $ killall qemu-nbd
> 
> nbdsh is abruptly dropping the NBD connection here which is a valid
> way to close the connection.  It seems unnecessary to print an error
> in this case so this commit suppresses it.

I've investigated this a bit further, and found that we have a
regression in 6.0 when the abrupt disconnect occurs over TCP instead
of Unix sockets.  Prior to that point, if a client does a hard
disconnect with no pending message, a 5.2 server reports:

$ qemu-nbd -f raw file &
[1] 1059670
$ nbdsh -u nbd://localhost:10809 -c 'h.trim(1,0)'
qemu-nbd: Disconnect client, due to: Failed to read request: Unexpected end-of-file before all bytes were read
[1]+  Done                    qemu-nbd -f raw file

but in 6.0, we started reporting:

qemu-nbd: Disconnect client, due to: Request handling failed in intermediate state

and looking further, I discovered that if you also use --trace=nbd_\*,
qemu-nbd ends up tracing uninitialized memory; it is now failing
because it sees an unexpected magic number from the uninitialized
memory, rather than its earlier detection of early EOF.

It's interesting that a Unix socket sees EPIPE when a TCP socket does
not, but it also means that your patch in isolation won't solve the
TCP issue, so I'm trying to come up with something today.

> 
> Note that if you call the nbdsh h.shutdown() method then the message
> was not printed:
> 
> $ nbdsh -u 'nbd+unix:///?socket=/tmp/sock' -c 'h.get_size()' -c 'h.shutdown()'
> 
> Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
> ---
>  nbd/server.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/nbd/server.c b/nbd/server.c
> index 3927f7789d..e0a43802b2 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -2669,7 +2669,12 @@ static coroutine_fn void nbd_trip(void *opaque)
>          ret = nbd_handle_request(client, &request, req->data, &local_err);
>      }
>      if (ret < 0) {
> -        error_prepend(&local_err, "Failed to send reply: ");
> +        if (errno != EPIPE) {
> +            error_prepend(&local_err, "Failed to send reply: ");
> +        } else {
> +            error_free(local_err);
> +            local_err = NULL;
> +        }
>          goto disconnect;
>      }
>  
> -- 
> 2.32.0
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org