[Qemu-devel] [PATCH] RFC: qio: Improve corking of TLS sessions

Eric Blake posted 1 patch 4 years, 11 months ago
Maintainers: "Daniel P. Berrangé" <berrange@redhat.com>
[Qemu-devel] [PATCH] RFC: qio: Improve corking of TLS sessions
Posted by Eric Blake 4 years, 11 months ago
Our current implementation of qio_channel_set_cork() is pointless for
TLS sessions: we block the underlying channel, but still hand things
piecemeal to gnutls which then produces multiple encryption packets.
Better is to directly use gnutls corking, which collects multiple
inputs into a single encryption packet.

Signed-off-by: Eric Blake <eblake@redhat.com>

---

RFC because unfortunately, I'm not sure I like the results.  My test
case (using latest nbdkit.git) involves sending 10G of random data
over NBD using parallel writes (and doing nothing on the server end;
this is all about timing the data transfer):

$ dd if=/dev/urandom of=rand bs=1M count=10k
$ time nbdkit -p 10810 --tls=require --tls-verify-peer \
   --tls-psk=/tmp/keys.psk --filter=stats null 10g statsfile=/dev/stderr \
   --run '~/qemu/qemu-img convert -f raw -W -n --target-image-opts \
     --object tls-creds-psk,id=tls0,endpoint=client,dir=/tmp,username=eblake \
     rand driver=nbd,server.type=inet,server.host=localhost,server.port=10810,tls-creds=tls0'

Pre-patch, I measured:
real	0m34.536s
user	0m29.264s
sys	0m4.014s

while post-patch, it changed to:
real	0m36.055s
user	0m27.685s
sys	0m10.138s

Less time spent in user space, but for this particular qemu-img
behavior (writing 2M chunks at a time), gnutls is now uncorking huge
packets and the kernel is doing enough extra work that the overall
program actually takes longer. :(

For smaller I/O patterns, the effects of corking are more likely to be
beneficial, but I don't have a ready test case to produce that pattern
(short of creating a guest running fio on a device backed by nbd).

Ideas for improvements are welcome; see my recent thread on the
libguestfs list about how TCP_CORK is already a painful interface (it
requires additional syscalls), and that we may be better off teaching
qio_channel_writev about taking a flag similar to send(,MSG_MORE),
which can achieve the same effect as setsockopt(TCP_CORK) but in fewer
syscalls:
https://www.redhat.com/archives/libguestfs/2019-June/msg00078.html
https://www.redhat.com/archives/libguestfs/2019-June/msg00081.html
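
For a plain (non-TLS) socket the difference is mostly syscall count; a
minimal sketch of the two styles (Linux-specific TCP_CORK; error
handling omitted, buffer names hypothetical):

#include <sys/socket.h>
#include <netinet/tcp.h>

/* Corked variant: two extra setsockopt() syscalls bracket the writes. */
static void send_batch_corked(int fd, const void *hdr, size_t hlen,
                              const void *payload, size_t plen)
{
    int on = 1, off = 0;

    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
    send(fd, hdr, hlen, 0);
    send(fd, payload, plen, 0);
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}

/* MSG_MORE variant: the same batching hint with no extra syscalls. */
static void send_batch_more(int fd, const void *hdr, size_t hlen,
                            const void *payload, size_t plen)
{
    send(fd, hdr, hlen, MSG_MORE);
    send(fd, payload, plen, 0);
}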

Another idea might be teaching channel-tls.c to be smarter about the
maximum size of data it is willing to cork, as well as to autocork
during writev (it's a shame that gnutls doesn't have a sendmsg
counterpart for sending vectors).  After all, writev already auto-corks
for sockets, which is why we went to the effort of letting clients use
writev-like interfaces to qio channels in the first place, whether or
not we also add the ability to exploit MSG_MORE to further group
back-to-back writevs where it makes sense.
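
As a rough illustration of that autocork-during-writev idea (a sketch
only, assuming it is acceptable to block until the flush completes,
which is not what this patch does):

#include <gnutls/gnutls.h>
#include <sys/uio.h>

/*
 * Sketch: emulate a vectored send by corking, queueing each element in
 * gnutls' holding buffer, then uncorking once, so the data goes out in
 * as few TLS records as the record size limit allows instead of one
 * record per iovec element.
 */
static ssize_t tls_writev_sketch(gnutls_session_t sess,
                                 const struct iovec *iov, size_t niov)
{
    ssize_t total = 0;
    size_t i;

    gnutls_record_cork(sess);
    for (i = 0; i < niov; i++) {
        ssize_t n = gnutls_record_send(sess, iov[i].iov_base,
                                       iov[i].iov_len);
        if (n < 0) {
            return n;
        }
        total += n;
    }
    /* GNUTLS_RECORD_WAIT: keep flushing until everything is on the wire. */
    if (gnutls_record_uncork(sess, GNUTLS_RECORD_WAIT) < 0) {
        return -1;
    }
    return total;
}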
---
 include/crypto/tlssession.h | 15 +++++++++++++++
 include/io/channel.h        |  4 ++--
 crypto/tlssession.c         | 16 ++++++++++++++++
 io/channel-socket.c         |  3 ++-
 io/channel-tls.c            |  9 ++++++---
 io/channel-websock.c        |  3 ++-
 io/channel.c                | 11 ++++++++++-
 7 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/include/crypto/tlssession.h b/include/crypto/tlssession.h
index 1c7414e4ffdd..451f58c2c742 100644
--- a/include/crypto/tlssession.h
+++ b/include/crypto/tlssession.h
@@ -319,4 +319,19 @@ int qcrypto_tls_session_get_key_size(QCryptoTLSSession *sess,
  */
 char *qcrypto_tls_session_get_peer_name(QCryptoTLSSession *sess);

+/**
+ * qcrypto_tls_session_cork:
+ * @sess: the TLS session object
+ * @enabled: the desired cork status
+ *
+ * Update the cork status of the session. If @enabled is true, this is
+ * a hint that the next few writes should be batched together until
+ * the session is uncorked again. If false, then proceed to write
+ * batched data, and it is safe to call this in a loop in case
+ * flushing the queue would block.
+ *
+ * Returns: 0 for success, or -EAGAIN if uncorking is incomplete.
+ */
+int qcrypto_tls_session_cork(QCryptoTLSSession *sess, bool enabled);
+
 #endif /* QCRYPTO_TLSSESSION_H */
diff --git a/include/io/channel.h b/include/io/channel.h
index 59460cb1ec7a..e9565b9f7d65 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -129,8 +129,8 @@ struct QIOChannelClass {
     int (*io_shutdown)(QIOChannel *ioc,
                        QIOChannelShutdown how,
                        Error **errp);
-    void (*io_set_cork)(QIOChannel *ioc,
-                        bool enabled);
+    int (*io_set_cork)(QIOChannel *ioc,
+                       bool enabled);
     void (*io_set_delay)(QIOChannel *ioc,
                          bool enabled);
     off_t (*io_seek)(QIOChannel *ioc,
diff --git a/crypto/tlssession.c b/crypto/tlssession.c
index c3a920dfe80e..6ef5d9001375 100644
--- a/crypto/tlssession.c
+++ b/crypto/tlssession.c
@@ -547,6 +547,17 @@ qcrypto_tls_session_get_peer_name(QCryptoTLSSession *session)
     return NULL;
 }

+int qcrypto_tls_session_cork(QCryptoTLSSession *session, bool enabled)
+{
+    if (enabled) {
+        gnutls_record_cork(session->handle);
+    } else if (gnutls_record_uncork(session->handle, 0) < 0) {
+        return -EAGAIN;
+    }
+    return 0;
+}
+

 #else /* ! CONFIG_GNUTLS */

@@ -639,4 +650,9 @@ qcrypto_tls_session_get_peer_name(QCryptoTLSSession *sess)
     return NULL;
 }

+int qcrypto_tls_session_cork(QCryptoTLSSession *sess, bool enabled)
+{
+    return 0;
+}
+
 #endif
diff --git a/io/channel-socket.c b/io/channel-socket.c
index bc5f80e780eb..44eb85cd2ba4 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -669,7 +669,7 @@ qio_channel_socket_set_delay(QIOChannel *ioc,
 }


-static void
+static int
 qio_channel_socket_set_cork(QIOChannel *ioc,
                             bool enabled)
 {
@@ -677,6 +677,7 @@ qio_channel_socket_set_cork(QIOChannel *ioc,
     int v = enabled ? 1 : 0;

     socket_set_cork(sioc->fd, v);
+    return 0;
 }


diff --git a/io/channel-tls.c b/io/channel-tls.c
index c98ead21b01e..93162d5ecc85 100644
--- a/io/channel-tls.c
+++ b/io/channel-tls.c
@@ -346,12 +346,15 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc,
     qio_channel_set_delay(tioc->master, enabled);
 }

-static void qio_channel_tls_set_cork(QIOChannel *ioc,
-                                     bool enabled)
+static int qio_channel_tls_set_cork(QIOChannel *ioc,
+                                    bool enabled)
 {
     QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);

-    qio_channel_set_cork(tioc->master, enabled);
+    if (qcrypto_tls_session_cork(tioc->session, enabled) == -EAGAIN) {
+        return QIO_CHANNEL_ERR_BLOCK;
+    }
+    return 0;
 }

 static int qio_channel_tls_shutdown(QIOChannel *ioc,
diff --git a/io/channel-websock.c b/io/channel-websock.c
index 77d30f0e4aa4..a288e21a2a75 100644
--- a/io/channel-websock.c
+++ b/io/channel-websock.c
@@ -1186,12 +1186,13 @@ static void qio_channel_websock_set_delay(QIOChannel *ioc,
     qio_channel_set_delay(tioc->master, enabled);
 }

-static void qio_channel_websock_set_cork(QIOChannel *ioc,
+static int qio_channel_websock_set_cork(QIOChannel *ioc,
                                          bool enabled)
 {
     QIOChannelWebsock *tioc = QIO_CHANNEL_WEBSOCK(ioc);

     qio_channel_set_cork(tioc->master, enabled);
+    return 0;
 }

 static int qio_channel_websock_shutdown(QIOChannel *ioc,
diff --git a/io/channel.c b/io/channel.c
index 2a26c2a2c0b9..3510912465ac 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -379,7 +379,16 @@ void qio_channel_set_cork(QIOChannel *ioc,
     QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);

     if (klass->io_set_cork) {
-        klass->io_set_cork(ioc, enabled);
+        int r = klass->io_set_cork(ioc, enabled);
+
+        while (r == QIO_CHANNEL_ERR_BLOCK) {
+            if (qemu_in_coroutine()) {
+                qio_channel_yield(ioc, G_IO_OUT);
+            } else {
+                qio_channel_wait(ioc, G_IO_OUT);
+            }
+            r = klass->io_set_cork(ioc, enabled);
+        }
     }
 }

-- 
2.20.1


Re: [Qemu-devel] [PATCH] RFC: qio: Improve corking of TLS sessions
Posted by Daniel P. Berrangé 4 years, 11 months ago
On Fri, Jun 07, 2019 at 05:14:14PM -0500, Eric Blake wrote:
> Our current implementation of qio_channel_set_cork() is pointless for
> TLS sessions: we block the underlying channel, but still hand things
> piecemeal to gnutls which then produces multiple encryption packets.
> Better is to directly use gnutls corking, which collects multiple
> inputs into a single encryption packet.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> ---
> 
> RFC because unfortunately, I'm not sure I like the results.  My test
> case (using latest nbdkit.git) involves sending 10G of random data
> over NBD using parallel writes (and doing nothing on the server end;
> this is all about timing the data transfer):
> 
> $ dd if=/dev/urandom of=rand bs=1M count=10k
> $ time nbdkit -p 10810 --tls=require --tls-verify-peer \
>    --tls-psk=/tmp/keys.psk --filter=stats null 10g statsfile=/dev/stderr \
>    --run '~/qemu/qemu-img convert -f raw -W -n --target-image-opts \
>      --object tls-creds-psk,id=tls0,endpoint=client,dir=/tmp,username=eblake \
>      rand driver=nbd,server.type=inet,server.host=localhost,server.port=10810,tls-creds=tls0'
> 
> Pre-patch, I measured:
> real	0m34.536s
> user	0m29.264s
> sys	0m4.014s
> 
> while post-patch, it changed to:
> real	0m36.055s
> user	0m27.685s
> sys	0m10.138s
> 
> Less time spent in user space, but for this particular qemu-img
> behavior (writing 2M chunks at a time), gnutls is now uncorking huge
> packets and the kernel is doing enough extra work that the overall
> program actually takes longer. :(

I wonder if the problem is that we're hitting a tradeoff between
time spent in encryption vs time spent in network I/O.

When we don't cork, the kernel has already sent the first packet
during the time gnutls is burning CPU encrypting the second packet.

When we do cork gnutls has to encrypt both packets before the kernel
can send anything.

> For smaller I/O patterns, the effects of corking are more likely to be
> beneficial, but I don't have a ready test case to produce that pattern
> (short of creating a guest running fio on a device backed by nbd).

I think it would probably be useful to see guest kernels with fio
and definitely see results for something closer to sector sized
I/O.

2 MB chunks is pretty unrealistic for a guest workload where there
will be lots of sector sized I/O. Sure QEMU / guest OS can merge
sectors, but it still feels like most I/O is going to be much smaller
than 2 MB.


> diff --git a/io/channel.c b/io/channel.c
> index 2a26c2a2c0b9..3510912465ac 100644
> --- a/io/channel.c
> +++ b/io/channel.c
> @@ -379,7 +379,16 @@ void qio_channel_set_cork(QIOChannel *ioc,
>      QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
> 
>      if (klass->io_set_cork) {
> -        klass->io_set_cork(ioc, enabled);
> +        int r = klass->io_set_cork(ioc, enabled);
> +
> +        while (r == QIO_CHANNEL_ERR_BLOCK) {
> +            if (qemu_in_coroutine()) {
> +                qio_channel_yield(ioc, G_IO_OUT);
> +            } else {
> +                qio_channel_wait(ioc, G_IO_OUT);
> +            }
> +            r = klass->io_set_cork(ioc, enabled);
> +        }
>      }

Interesting - did you actually hit this EAGAIN behaviour ? I wouldn't
have expected anything to be pending in gnutls that needs retry. 


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [Qemu-devel] [PATCH] RFC: qio: Improve corking of TLS sessions
Posted by Eric Blake 4 years, 11 months ago
On 6/10/19 4:08 AM, Daniel P. Berrangé wrote:
> On Fri, Jun 07, 2019 at 05:14:14PM -0500, Eric Blake wrote:
>> Our current implementation of qio_channel_set_cork() is pointless for
>> TLS sessions: we block the underlying channel, but still hand things
>> piecemeal to gnutls which then produces multiple encryption packets.
>> Better is to directly use gnutls corking, which collects multiple
>> inputs into a single encryption packet.
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>
>> ---
>>
>> RFC because unfortunately, I'm not sure I like the results.  My test
>> case (using latest nbdkit.git) involves sending 10G of random data
>> over NBD using parallel writes (and doing nothing on the server end;
>> this is all about timing the data transfer):
>>
>> $ dd if=/dev/urandom of=rand bs=1M count=10k
>> $ time nbdkit -p 10810 --tls=require --tls-verify-peer \
>>    --tls-psk=/tmp/keys.psk --filter=stats null 10g statsfile=/dev/stderr \
>>    --run '~/qemu/qemu-img convert -f raw -W -n --target-image-opts \
>>      --object tls-creds-psk,id=tls0,endpoint=client,dir=/tmp,username=eblake \
>>      rand driver=nbd,server.type=inet,server.host=localhost,server.port=10810,tls-creds=tls0'
>>
>> Pre-patch, I measured:
>> real	0m34.536s
>> user	0m29.264s
>> sys	0m4.014s
>>
>> while post-patch, it changed to:
>> real	0m36.055s
>> user	0m27.685s
>> sys	0m10.138s
>>
>> Less time spent in user space, but for this particular qemu-img
>> behavior (writing 2M chunks at a time), gnutls is now uncorking huge
>> packets and the kernel is doing enough extra work that the overall
>> program actually takes longer. :(
> 
> I wonder if the problem is that we're hitting a tradeoff between
> time spent in encryption vs time spent in network I/O.
> 
> When we don't cork, the kernel has already sent the first packet
> during the time gnutls is burning CPU encrypting the second packet.
> 
> When we do cork gnutls has to encrypt both packets before the kernel
> can send anything.

Exactly. That's why my gut feel is that having only a binary cork is too
coarse, and we instead want something more like MSG_MORE. If the flag is
set, AND the current size is small, then cork; then on the next message,
we see that we are still corked, and decide to either uncork immediately
(current message is large, corking buys us nothing useful as we're going
to split the current message anyway) or uncork after appending the
current message (current message is small enough that keeping things
corked may still be a win).  Visually,

cork
send()
send()
uncork

is too late - you can't tell whether the second send would have
benefitted from staying corked after the first.  But:

send(MSG_MORE)
send()

is ideal; under the hood, we can translate it to:

send(MSG_MORE)
  gnutls_record_cork()
  gnutls_record_send()
send()
  if (size > threshold) {
    gnutls_record_uncork()
    gnutls_record_send()
  } else {
    gnutls_record_send()
    gnutls_record_uncork()
  }

So we really need a way to plumb a MSG_MORE flag for senders to use,
when they KNOW they will be sending back-to-back pieces and where the
first piece is short, but it is not yet obvious whether the second piece
is short or long.

MSG_MORE was lon the next message to go through the stack, if the
previous message next paccork for

> 
>> For smaller I/O patterns, the effects of corking are more likely to be
>> beneficial, but I don't have a ready test case to produce that pattern
>> (short of creating a guest running fio on a device backed by nbd).
> 
> I think it would probably be useful to see guest kernels with fio
> and definitely see results for something closer to sector sized
> I/O.
> 
> 2 MB chunks is pretty unrealistic for a guest workload where there
> will be lots of sector sized I/O. Sure QEMU / guest OS can merge
> sectors, but it still feels like most I/O is going to be much smaller
> than 2 MB.

I'll have to play with other benchmarking setups, and see if I can come
up with a reliable fio test.
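
Something along these lines, perhaps, assuming the NBD export shows up
inside the guest as /dev/vdb (device name and sizes are placeholders):

$ fio --name=randwrite4k --filename=/dev/vdb --rw=randwrite --bs=4k \
      --iodepth=16 --numjobs=4 --ioengine=libaio --direct=1 \
      --runtime=60 --time_based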

> 
> 
>> diff --git a/io/channel.c b/io/channel.c
>> index 2a26c2a2c0b9..3510912465ac 100644
>> --- a/io/channel.c
>> +++ b/io/channel.c
>> @@ -379,7 +379,16 @@ void qio_channel_set_cork(QIOChannel *ioc,
>>      QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
>>
>>      if (klass->io_set_cork) {
>> -        klass->io_set_cork(ioc, enabled);
>> +        int r = klass->io_set_cork(ioc, enabled);
>> +
>> +        while (r == QIO_CHANNEL_ERR_BLOCK) {
>> +            if (qemu_in_coroutine()) {
>> +                qio_channel_yield(ioc, G_IO_OUT);
>> +            } else {
>> +                qio_channel_wait(ioc, G_IO_OUT);
>> +            }
>> +            r = klass->io_set_cork(ioc, enabled);
>> +        }
>>      }
> 
> Interesting - did you actually hit this EAGAIN behaviour ? I wouldn't
> have expected anything to be pending in gnutls that needs retry. 

Yes. When gnutls_record_cork() is called, NOTHING goes over the wire,
but instead everything gets appended to a realloc'd buffer; when you
finally gnutls_record_uncork() and are using a non-blocking socket, you
WILL hit EAGAIN if the amount of data being uncorked is larger than TCP
max segment size.
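
Roughly, the flush step has to look something like this (a sketch, not
the patch code; the error codes are gnutls' own):

#include <errno.h>
#include <gnutls/gnutls.h>

/*
 * Sketch: flush whatever gnutls_record_cork() has buffered.  On a
 * non-blocking socket gnutls_record_uncork(sess, 0) may only send part
 * of the buffer; the caller must wait for the socket to become writable
 * and call this again until it returns 0.
 */
static int tls_flush_corked(gnutls_session_t sess)
{
    int ret = gnutls_record_uncork(sess, 0);

    if (ret == GNUTLS_E_AGAIN || ret == GNUTLS_E_INTERRUPTED) {
        return -EAGAIN;
    }
    return ret < 0 ? ret : 0;
}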

But again, I think that having qio_channel_set_cork() do the blocking
wait is too late; we really want a nicer interface where you can request
a per-message MSG_MORE, at which point crypto/tlssession.c handles the
cork/uncork logic as part of send()ing rather than making qio_uncork
have to block.  It's just that it's a lot more work (either all existing
callers need to adjust to a flags parameter, or we need new entry points
to expose a flag parameter to the few callers that care - right now, nbd
and sheepdog, although we may find others).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH] RFC: qio: Improve corking of TLS sessions
Posted by Eric Blake 4 years, 11 months ago
On 6/10/19 9:02 AM, Eric Blake wrote:

> 
> send(MSG_MORE)
> send()
> 
> is ideal; under the hood, we can translate it to:
> 
> send(MSG_MORE)
>   gnutls_record_cork()
>   gnutls_record_send()
> send()
>   if (size > threshold) {
>     gnutls_record_uncork()
>     gnutls_record_send()
>   } else {
>     gnutls_record_send()
>     gnutls_record_uncork()
>   }
> 
> So we really need a way to plumb a MSG_MORE flag for senders to use,
> when they KNOW they will be sending back-to-back pieces and where the
> first piece is short, but it is not yet obvious whether the second piece
> is short or long.

This is what I meant to say,

> 
> MSG_MORE was lon the next message to go through the stack, if the
> previous message next paccork for

this was an editing accident on incomplete thoughts.  But I wanted to add:

Setting up the ability to pass MSG_MORE through the qio stack will
require either updating ALL callers of qio_write to pass a flags
argument (usually 0), or adding a set of new entry points to qio for the
few callers that want to pass a non-zero flags argument (for now, nbd
and sheepdog).
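
The shape of the second option might be roughly (hypothetical name and
flag, not an existing QEMU interface):

#include "io/channel.h"

/* Hypothetical flag and entry point for callers that know more data
 * follows immediately; everyone else keeps using the flag-less calls. */
#define QIO_CHANNEL_WRITE_FLAG_MORE (1 << 0)

ssize_t qio_channel_writev_flags(QIOChannel *ioc,
                                 const struct iovec *iov,
                                 size_t niov,
                                 int flags,
                                 Error **errp);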

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH] RFC: qio: Improve corking of TLS sessions
Posted by Eric Blake 4 years, 11 months ago
On 6/7/19 5:14 PM, Eric Blake wrote:
> Our current implementation of qio_channel_set_cork() is pointless for
> TLS sessions: we block the underlying channel, but still hand things
> piecemeal to gnutls which then produces multiple encryption packets.
> Better is to directly use gnutls corking, which collects multiple
> inputs into a single encryption packet.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> ---
> 
> RFC because unfortunately, I'm not sure I like the results.  My test
> case (using latest nbdkit.git) involves sending 10G of random data
> over NBD using parallel writes (and doing nothing on the server end;
> this is all about timing the data transfer):
> 
> $ dd if=/dev/urandom of=rand bs=1M count=10k
> $ time nbdkit -p 10810 --tls=require --tls-verify-peer \
>    --tls-psk=/tmp/keys.psk --filter=stats null 10g statsfile=/dev/stderr \
>    --run '~/qemu/qemu-img convert -f raw -W -n --target-image-opts \
>      --object tls-creds-psk,id=tls0,endpoint=client,dir=/tmp,username=eblake \
>      rand driver=nbd,server.type=inet,server.host=localhost,server.port=10810,tls-creds=tls0'
> 
> Pre-patch, I measured:
> real	0m34.536s
> user	0m29.264s
> sys	0m4.014s
> 
> while post-patch, it changed to:
> real	0m36.055s
> user	0m27.685s
> sys	0m10.138s

For grins, I also tried compiling with channel-tls.c doing absolutely
nothing for cork requests (neither TCP_CORKing the underlying socket,
nor using gnutls_record_cork); the results were:

real	0m35.904s
user	0m27.480s
sys	0m10.373s

which is actually faster than using gnutls_record_cork, but slower than
using TCP_CORK.

> 
> Less time spent in user space, but for this particular qemu-img
> behavior (writing 2M chunks at a time), gnutls is now uncorking huge
> packets and the kernel is doing enough extra work that the overall
> program actually takes longer. :(
> 
> For smaller I/O patterns, the effects of corking are more likely to be
> beneficial, but I don't have a ready test case to produce that pattern
> (short of creating a guest running fio on a device backed by nbd).
> 
> Ideas for improvements are welcome; see my recent thread on the
> libguestfs list about how TCP_CORK is already a painful interface (it
> requires additional syscalls), and that we may be better off teaching
> qio_channel_writev about taking a flag similar to send(,MSG_MORE),
> which can achieve the same effect as setsockopt(TCP_CORK) but in fewer
> syscalls:
> https://www.redhat.com/archives/libguestfs/2019-June/msg00078.html
> https://www.redhat.com/archives/libguestfs/2019-June/msg00081.html

My thought process at the moment is that there is probably some ideal
size (whether 1500 bytes to match a traditional Ethernet frame, 9000
bytes for jumbo frames, or even 64k as a nice round number).  If we get
a channel_write[v]() call with a new QIO_MORE flag set, which
adds less than our magic number of bytes to any already-queued data,
then add it to our queue. If we get a call without the QIO_MORE flag
set, or if the amount of data is larger than the magic size, then it's
time to pass on our buffered data and the rest of the request to the
real channel.  And within channel_writev, treat all but the last
vectored element as if they had the QIO_MORE flag set.

Or put another way, after staring a bit at the current gnutls code, the
implementation of gnutls_record_cork is not very smart - it really
blocks ALL traffic, and memcpy()s data into a holding buffer, then when
you finally gnutls_record_uncork, it tries to flush the entire holding
buffer at once.  The kernel's MSG_MORE algorithm is smarter - it will
only buffer data for up to 200ms or until it has a full packet, at which
point the buffering ends without further user input.  But even when
using TLS, we may be able to emulate some of the kernel's behaviors, by
ignoring QIO_MORE for large packets, while calling gnutls_record_cork
for small packets (no need to duplicate a buffer ourselves if gnutls
already has one) and then uncork it as soon as we've hit our magic
limit, rather than blindly waiting to uncork it after megabytes of data
when the user finally uncorks their channel.
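
A sketch of that decision logic (threshold and names invented for
illustration); within a writev, every element but the last would be
treated as carrying the hint:

#include <stdbool.h>
#include <stddef.h>

/* Invented "magic size"; could equally be 1500, 9000, or a TLS record. */
#define TLS_MORE_THRESHOLD (64 * 1024)

/*
 * Decide whether a write carrying a hypothetical QIO_MORE hint should be
 * appended to gnutls' cork buffer or force a flush: small hinted writes
 * stay corked, large ones (or an overflowing queue) go out immediately.
 */
static bool tls_keep_corked(size_t already_queued, size_t len, bool more)
{
    if (!more || len >= TLS_MORE_THRESHOLD) {
        return false;
    }
    return already_queued + len < TLS_MORE_THRESHOLD;
}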

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org