[PATCH nbd 0/4] Enable multi-conn NBD [for discussion only]
Posted by Richard W.M. Jones 1 year, 1 month ago
[ Patch series also available here, along with this cover letter and the
  script used to generate test results:
  https://gitlab.com/rwmjones/qemu/-/commits/2023-nbd-multi-conn-v1 ]

This patch series adds multi-conn support to the NBD block driver in
qemu.  It is only meant for discussion and testing because it has a
number of obvious shortcomings (see "XXX" in commit messages and
code).  If we decide this is a good idea, we can work on a better
patch.

The Network Block Device (NBD) protocol allows servers to advertise
that they are capable of multi-conn.  This means they obey certain
requirements around how data is cached, allowing multiple connections
to be safely opened to the NBD server.  For example, a flush or FUA
operation on one connection is guaranteed to flush the cache on all
connections.
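
For reference, the client-side decision is roughly the following.
This is only an illustrative sketch, not qemu code: the only thing
taken from the protocol is NBD_FLAG_CAN_MULTI_CONN (bit 8 of the
transmission flags); the function and its "requested" parameter are
invented for the example.

  #include <stdint.h>

  /* Matches the NBD protocol specification: bit 8 of the 16-bit
   * transmission flags returned by the server during the handshake. */
  #define NBD_FLAG_CAN_MULTI_CONN  (1 << 8)

  /* "requested" stands in for a hypothetical user setting, e.g.
   * multi-conn=4.  Without the flag it is never safe to open more
   * than one connection. */
  static unsigned effective_connections(uint16_t transmission_flags,
                                        unsigned requested)
  {
      if (!(transmission_flags & NBD_FLAG_CAN_MULTI_CONN)) {
          return 1;   /* no cross-connection cache coherency guarantee */
      }
      return requested > 0 ? requested : 1;
  }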

Clients that use multi-conn can achieve better performance.  This
seems to be down to at least two factors:

 - Avoids "head of line blocking" of large requests.

 - With NBD over Unix domain sockets, more cores can be used.

qemu-nbd, nbdkit and libnbd have all supported multi-conn for ages,
but qemu's built-in NBD client does not, which is what this patch
series fixes.

Below I've produced a few benchmarks.

Note these are mostly concocted to test NBD performance and may not
make sense on their own terms (eg. qemu's disk image layer already has
a curl client, so you wouldn't normally need to run one separately).
In the real world we use long pipelines of NBD operations where
different tools are mixed together to achieve efficient downloads,
conversions, disk modifications and sparsification, and those
pipelines would exercise different aspects of this change.

I've also included nbdcopy as a client for comparison in some tests.

All tests were run 4 times, the first result discarded, and the last 3
averaged.  If any of the last 3 were > 10% away from the average then
the test was stopped.
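
For concreteness, the rule works like this (a small C sketch of the
arithmetic only, not the actual multi-conn.pl logic; the timings in it
are hypothetical):

  #include <math.h>
  #include <stdbool.h>
  #include <stdio.h>

  /* Average the last 3 runs (the first run was already discarded as
   * warm-up); reject the result if any run is more than 10% away from
   * that average. */
  static bool stable_average(const double t[3], double *avg)
  {
      *avg = (t[0] + t[1] + t[2]) / 3.0;
      for (int i = 0; i < 3; i++) {
          if (fabs(t[i] - *avg) > 0.10 * *avg) {
              return false;         /* too noisy: stop and rerun */
          }
      }
      return true;
  }

  int main(void)
  {
      double t[3] = { 9.1, 9.0, 8.9 };   /* hypothetical timings */
      double avg;

      if (stable_average(t, &avg)) {
          printf("result: %.2fs\n", avg);
      } else {
          printf("unstable run, discarded\n");
      }
      return 0;
  }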

My summary:

 - It works effectively for qemu client & nbdkit server, especially in
   cases where the server does large, heavyweight requests.  This is
   important for us because virt-v2v uses an nbdkit Python plugin and
   various other heavyweight plugins (eg. plugins that access remote
   servers for each request).

 - It seems to make little or no difference with qemu + qemu-nbd
   server.  I speculate that's because qemu-nbd doesn't support system
   threads, so networking is bottlenecked through a single core.  Even
   though there are coroutines handling different sockets, they must
   all wait in turn to issue send(3) or recv(3) calls on the same
   core.

 - qemu-img unfortunately uses a single thread for all coroutines so
   it suffers from a similar problem to qemu-nbd.  This change would
   be much more effective if we could distribute coroutines across
   threads.

 - For tests which are highly bottlenecked on disk I/O (eg. the large
   local file test and null test) multi-conn doesn't make much
   difference.

 - Multi-conn even with only 2 connections can make up for the
   overhead of range requests, exceeding the performance of wget.

 - In the curlremote test, qemu-nbd is especially slow, for unknown
   reasons.


Integrity test (./multi-conn.pl integrity)
==========================================

nbdkit-sparse-random-plugin
  |                 ^
  | nbd+unix        | nbd+unix
  v                 |
   qemu-img convert

Reading from and writing the same data back to nbdkit sparse-random
plugin checks that the data written is the same as the data read.
This uses two Unix domain sockets, with or without multi-conn.  This
test is mainly here to check we don't crash or corrupt data with this
patch.

  server          client        multi-conn      time
  ---------------------------------------------------------------
    nbdkit	  qemu-img	[u/s]	9.07s	
    nbdkit	  qemu-img	1	9.05s	
    nbdkit	  qemu-img	2	9.02s	
    nbdkit	  qemu-img	4	8.98s	

[u/s] = upstream qemu 7.2.0


Curl local server test (./multi-conn.pl curlhttp)
=================================================

Localhost Apache serving a file over http
                  |
                  | http
                  v
nbdkit-curl-plugin   or   qemu-nbd
                  |
                  | nbd+unix
                  v
qemu-img convert   or   nbdcopy

We download an image from a local web server through
nbdkit-curl-plugin or qemu-nbd using the curl block driver, over NBD.
The image is copied to /dev/null.

  server          client        multi-conn      time
  ---------------------------------------------------------------
  qemu-nbd	   nbdcopy	1	8.88s	
  qemu-nbd	   nbdcopy	2	8.64s	
  qemu-nbd	   nbdcopy	4	8.37s	
  qemu-nbd	  qemu-img	[u/s]	6.47s	
  qemu-nbd	  qemu-img	1	6.56s	
  qemu-nbd	  qemu-img	2	6.63s	
  qemu-nbd	  qemu-img	4	6.50s	
    nbdkit	   nbdcopy	1	12.15s	
    nbdkit	   nbdcopy	2	7.05s	(72.36% better)
    nbdkit	   nbdcopy	4	3.54s	(242.90% better)
    nbdkit	  qemu-img	[u/s]	6.90s	
    nbdkit	  qemu-img	1	7.00s	
    nbdkit	  qemu-img	2	3.85s	(79.15% better)
    nbdkit	  qemu-img	4	3.85s	(79.15% better)


Curl local file test (./multi-conn.pl curlfile)
===============================================

nbdkit-curl-plugin   using file:/// URI
                  |
                  | nbd+unix
                  v
qemu-img convert   or   nbdcopy

We download from a file:/// URI.  This test is designed to exercise
NBD and some curl internal paths without the overhead from an external
server.  qemu-nbd doesn't support file:/// URIs so we cannot duplicate
the test for qemu as server.

  server          client        multi-conn      time
  ---------------------------------------------------------------
    nbdkit	   nbdcopy	1	31.32s	
    nbdkit	   nbdcopy	2	20.29s	(54.38% better)
    nbdkit	   nbdcopy	4	13.22s	(136.91% better)
    nbdkit	  qemu-img	[u/s]	31.55s	
    nbdkit	  qemu-img	1	31.70s	
    nbdkit	  qemu-img	2	21.60s	(46.07% better)
    nbdkit	  qemu-img	4	13.88s	(127.25% better)


Curl remote server test (./multi-conn.pl curlremote)
====================================================

nbdkit-curl-plugin   using http://remote/*.qcow2 URI
         |
         | nbd+unix
         v
qemu-img convert

We download from a remote qcow2 file to a local raw file, converting
between formats during copying.

qemu-nbd   using http://remote/*.qcow2 URI
    |
    | nbd+unix
    v
qemu-img convert

Similarly, replacing nbdkit with qemu-nbd (treating the remote file as
if it is raw, so the conversion is still done by qemu-img).

Additionally we compare downloading the file with wget (note this
doesn't include the time for conversion, but that should only be a few
seconds).

  server          client        multi-conn      time
  ---------------------------------------------------------------
         -	      wget	1	58.19s	
    nbdkit	  qemu-img	[u/s]	68.29s	(17.36% worse)
    nbdkit	  qemu-img	1	67.85s	(16.60% worse)
    nbdkit	  qemu-img	2	58.17s	
    nbdkit	  qemu-img	4	59.80s	
    nbdkit	  qemu-img	6	59.15s	
    nbdkit	  qemu-img	8	59.52s	

  qemu-nbd	  qemu-img	[u/s]	202.55s
  qemu-nbd	  qemu-img	1	204.61s	
  qemu-nbd	  qemu-img	2	196.73s	
  qemu-nbd	  qemu-img	4	179.53s	(12.83% better)
  qemu-nbd	  qemu-img	6	181.70s	(11.48% better)
  qemu-nbd	  qemu-img	8	181.05s	(11.88% better)


Local file test (./multi-conn.pl file)
======================================

qemu-nbd or nbdkit serving a large local file
                  |
                  | nbd+unix
                  v
qemu-img convert   or   nbdcopy

We download a local file over NBD.  The image is copied to /dev/null.

  server          client        multi-conn      time
  ---------------------------------------------------------------
  qemu-nbd	   nbdcopy	1	15.50s	
  qemu-nbd	   nbdcopy	2	14.36s	
  qemu-nbd	   nbdcopy	4	14.32s	
  qemu-nbd	  qemu-img	[u/s]	10.16s	
  qemu-nbd	  qemu-img	1	11.17s	(10.01% worse)
  qemu-nbd	  qemu-img	2	10.35s	
  qemu-nbd	  qemu-img	4	10.39s	
    nbdkit	   nbdcopy	1	9.10s	
    nbdkit	   nbdcopy	2	8.25s	
    nbdkit	   nbdcopy	4	8.60s	
    nbdkit	  qemu-img	[u/s]	8.64s	
    nbdkit	  qemu-img	1	9.38s	
    nbdkit	  qemu-img	2	8.69s	
    nbdkit	  qemu-img	4	8.87s	


Null test (./multi-conn.pl null)
================================

qemu-nbd with null-co driver  or  nbdkit-null-plugin + noextents filter
                  |
                  | nbd+unix
                  v
qemu-img convert   or   nbdcopy

This is like the local file test above, but without needing a file.
Instead all zeroes (fully allocated) are downloaded over NBD.

  server          client        multi-conn      time
  ---------------------------------------------------------------
  qemu-nbd	   nbdcopy	1	14.86s	
  qemu-nbd	   nbdcopy	2	17.08s	(14.90% worse)
  qemu-nbd	   nbdcopy	4	17.89s	(20.37% worse)
  qemu-nbd	  qemu-img	[u/s]	13.29s	
  qemu-nbd	  qemu-img	1	13.31s	
  qemu-nbd	  qemu-img	2	13.00s	
  qemu-nbd	  qemu-img	4	12.62s	
    nbdkit	   nbdcopy	1	15.06s	
    nbdkit	   nbdcopy	2	12.21s	(23.32% better)
    nbdkit	   nbdcopy	4	11.67s	(29.10% better)
    nbdkit	  qemu-img	[u/s]	17.13s	
    nbdkit	  qemu-img	1	17.11s	
    nbdkit	  qemu-img	2	16.82s	
    nbdkit	  qemu-img	4	18.81s

Re: [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only]
Posted by Eric Blake 1 year, 1 month ago
On Thu, Mar 09, 2023 at 11:39:42AM +0000, Richard W.M. Jones wrote:
> [ Patch series also available here, along with this cover letter and the
>   script used to generate test results:
>   https://gitlab.com/rwmjones/qemu/-/commits/2023-nbd-multi-conn-v1 ]
> 
> This patch series adds multi-conn support to the NBD block driver in
> qemu.  It is only meant for discussion and testing because it has a
> number of obvious shortcomings (see "XXX" in commit messages and
> code).  If we decide this is a good idea, we can work on a better
> patch.

Overall, I'm in favor of this.  A longer term project might be to have
qemu's NBD client code call into libnbd instead of reimplementing
things itself, at which point having libnbd manage multi-conn under
the hood would be awesome, but as that's a much bigger effort, a
shorter-term task of having qemu itself handle parallel sockets seems
worthwhile.

> 
>  - It works effectively for qemu client & nbdkit server, especially in
>    cases where the server does large, heavyweight requests.  This is
>    important for us because virt-v2v uses an nbdkit Python plugin and
>    various other heavyweight plugins (eg. plugins that access remote
>    servers for each request).
> 
>  - It seems to make little or no difference with qemu + qemu-nbd
>    server.  I speculate that's because qemu-nbd doesn't support system
>    threads, so networking is bottlenecked through a single core.  Even
>    though there are coroutines handling different sockets, they must
>    all wait in turn to issue send(3) or recv(3) calls on the same
>    core.

Is the current work to teach qemu to do multi-queue (that is, spread
the I/O load for a single block device across multiple cores) going to
help here?  I haven't been following the multi-queue efforts closely
enough to know if the approach used in this series will play nicely,
or need even further overhaul.

> 
>  - qemu-img unfortunately uses a single thread for all coroutines so
>    it suffers from a similar problem to qemu-nbd.  This change would
>    be much more effective if we could distribute coroutines across
>    threads.

qemu-img uses the same client code as qemu-nbd; any multi-queue
improvements that can spread the send()/recv() load of multiple
sockets across multiple cores will benefit both programs
simultaneously.

> 
>  - For tests which are highly bottlenecked on disk I/O (eg. the large
>    local file test and null test) multi-conn doesn't make much
>    difference.

As long as it isn't adding too much penalty, that's okay.  If the
saturation is truly at the point of how fast disk requests can be
served, it doesn't matter if we can queue up more of those requests in
parallel across multiple NBD sockets.

> 
>  - Multi-conn even with only 2 connections can make up for the
>    overhead of range requests, exceeding the performance of wget.

That alone is a rather cool result, and an argument in favor of
further developing this.

> 
>  - In the curlremote test, qemu-nbd is especially slow, for unknown
>    reasons.
> 
> 
> Integrity test (./multi-conn.pl integrity)
> ==========================================
> 
> nbdkit-sparse-random-plugin
>   |                 ^
>   | nbd+unix        | nbd+unix
>   v                 |
>    qemu-img convert
> 
> Reading from and writing the same data back to nbdkit sparse-random
> plugin checks that the data written is the same as the data read.
> This uses two Unix domain sockets, with or without multi-conn.  This
> test is mainly here to check we don't crash or corrupt data with this
> patch.
> 
>   server          client        multi-conn
>   ---------------------------------------------------------------
>     nbdkit	  qemu-img	[u/s]	9.07s	
>     nbdkit	  qemu-img	1	9.05s	
>     nbdkit	  qemu-img	2	9.02s	
>     nbdkit	  qemu-img	4	8.98s	
> 
> [u/s] = upstream qemu 7.2.0

How many of these timing numbers can be repeated with TLS in the mix?

> 
> 
> Curl local server test (./multi-conn.pl curlhttp)
> =================================================
> 
> Localhost Apache serving a file over http
>                   |
>                   | http
>                   v
> nbdkit-curl-plugin   or   qemu-nbd
>                   |
>                   | nbd+unix
>                   v
> qemu-img convert   or   nbdcopy
> 
> We download an image from a local web server through
> nbdkit-curl-plugin or qemu-nbd using the curl block driver, over NBD.
> The image is copied to /dev/null.
> 
>   server          client        multi-conn
>   ---------------------------------------------------------------
>   qemu-nbd	   nbdcopy	1	8.88s	
>   qemu-nbd	   nbdcopy	2	8.64s	
>   qemu-nbd	   nbdcopy	4	8.37s	
>   qemu-nbd	  qemu-img	[u/s]	6.47s

Do we have any good feel for why qemu-img is faster than nbdcopy in
the baseline?  But improving that is orthogonal to this series.

>   qemu-nbd	  qemu-img	1	6.56s	
>   qemu-nbd	  qemu-img	2	6.63s	
>   qemu-nbd	  qemu-img	4	6.50s	
>     nbdkit	   nbdcopy	1	12.15s  

I'm assuming this is nbdkit with your recent in-progress patches to
have the curl plugin serve parallel requests.  But another place where
we can investigate why nbdkit is not as performant as qemu-nbd at
utilizing curl.

>     nbdkit	   nbdcopy	2	7.05s	(72.36% better)
>     nbdkit	   nbdcopy	4	3.54s	(242.90% better)

That one is impressive!

>     nbdkit	  qemu-img	[u/s]	6.90s	
>     nbdkit	  qemu-img	1	7.00s   

Minimal penalty for adding the code but not utilizing it...

>     nbdkit	  qemu-img	2	3.85s	(79.15% better)
>     nbdkit	  qemu-img	4	3.85s	(79.15% better)

...and definitely shows its worth.

> 
> 
> Curl local file test (./multi-conn.pl curlfile)
> ===============================================
> 
> nbdkit-curl-plugin   using file:/// URI
>                   |
>                   | nbd+unix
>                   v
> qemu-img convert   or   nbdcopy
> 
> We download from a file:/// URI.  This test is designed to exercise
> NBD and some curl internal paths without the overhead from an external
> server.  qemu-nbd doesn't support file:/// URIs so we cannot duplicate
> the test for qemu as server.
> 
>   server          client        multi-conn
>   ---------------------------------------------------------------
>     nbdkit	   nbdcopy	1	31.32s	
>     nbdkit	   nbdcopy	2	20.29s	(54.38% better)
>     nbdkit	   nbdcopy	4	13.22s	(136.91% better)
>     nbdkit	  qemu-img	[u/s]	31.55s	

Here, the baseline is already comparable; both nbdcopy and qemu-img
are parsing the image off nbdkit in about the same amount of time.

>     nbdkit	  qemu-img	1	31.70s	

And again, minimal penalty for having the new code in place but not
exploiting it.

>     nbdkit	  qemu-img	2	21.60s	(46.07% better)
>     nbdkit	  qemu-img	4	13.88s	(127.25% better)

Plus an obvious benefit when the parallel sockets matter.

> 
> 
> Curl remote server test (./multi-conn.pl curlremote)
> ====================================================
> 
> nbdkit-curl-plugin   using http://remote/*.qcow2 URI
>          |
>          | nbd+unix
>          v
> qemu-img convert
> 
> We download from a remote qcow2 file to a local raw file, converting
> between formats during copying.
> 
> qemu-nbd   using http://remote/*.qcow2 URI
>     |
>     | nbd+unix
>     v
> qemu-img convert
> 
> Similarly, replacing nbdkit with qemu-nbd (treating the remote file as
> if it is raw, so the conversion is still done by qemu-img).
> 
> Additionally we compare downloading the file with wget (note this
> doesn't include the time for conversion, but that should only be a few
> seconds).
> 
>   server          client        multi-conn
>   ---------------------------------------------------------------
>          -	      wget	1	58.19s	
>     nbdkit	  qemu-img	[u/s]	68.29s	(17.36% worse)
>     nbdkit	  qemu-img	1	67.85s	(16.60% worse)
>     nbdkit	  qemu-img	2	58.17s	

Comparable to wget on paper, but a win in practice (since the wget
step also has to add a post-download qemu-img local conversion step).

>     nbdkit	  qemu-img	4	59.80s	
>     nbdkit	  qemu-img	6	59.15s	
>     nbdkit	  qemu-img	8	59.52s	
> 
>   qemu-nbd	  qemu-img	[u/s]	202.55s
>   qemu-nbd	  qemu-img	1	204.61s	
>   qemu-nbd	  qemu-img	2	196.73s	
>   qemu-nbd	  qemu-img	4	179.53s	(12.83% better)
>   qemu-nbd	  qemu-img	6	181.70s	(11.48% better)
>   qemu-nbd	  qemu-img	8	181.05s	(11.88% better)
>

Less dramatic results here, but still nothing horrible.

> 
> Local file test (./multi-conn.pl file)
> ======================================
> 
> qemu-nbd or nbdkit serving a large local file
>                   |
>                   | nbd+unix
>                   v
> qemu-img convert   or   nbdcopy
> 
> We download a local file over NBD.  The image is copied to /dev/null.
> 
>   server          client        multi-conn
>   ---------------------------------------------------------------
>   qemu-nbd	   nbdcopy	1	15.50s	
>   qemu-nbd	   nbdcopy	2	14.36s	
>   qemu-nbd	   nbdcopy	4	14.32s	
>   qemu-nbd	  qemu-img	[u/s]	10.16s	

Once again, we're seeing qemu-img baseline faster than nbdcopy as
client.  But throwing more sockets at either client does improve
performance, except for...

>   qemu-nbd	  qemu-img	1	11.17s	(10.01% worse)

...this one looks bad.  Is it a case of this series adding more mutex
work (qemu-img is making parallel requests; each request then contends
for the mutex only to learn that it will be using the same NBD
connection)?  And your comments about smarter round-robin schemes mean
there may still be room to avoid this much of a penalty.

>   qemu-nbd	  qemu-img	2	10.35s	
>   qemu-nbd	  qemu-img	4	10.39s	
>     nbdkit	   nbdcopy	1	9.10s	

This one is interesting: nbdkit as server performs better than
qemu-nbd.

>     nbdkit	   nbdcopy	2	8.25s	
>     nbdkit	   nbdcopy	4	8.60s	
>     nbdkit	  qemu-img	[u/s]	8.64s	
>     nbdkit	  qemu-img	1	9.38s	
>     nbdkit	  qemu-img	2	8.69s	
>     nbdkit	  qemu-img	4	8.87s	
> 
> 
> Null test (./multi-conn.pl null)
> ================================
> 
> qemu-nbd with null-co driver  or  nbdkit-null-plugin + noextents filter
>                   |
>                   | nbd+unix
>                   v
> qemu-img convert   or   nbdcopy
> 
> This is like the local file test above, but without needing a file.
> Instead all zeroes (fully allocated) are downloaded over NBD.

And I'm sure that if you allowed block status to show the holes, the
performance would be a lot faster, but that would be testing something
completely differently ;)

> 
>   server          client        multi-conn
>   ---------------------------------------------------------------
>   qemu-nbd	   nbdcopy	1	14.86s	
>   qemu-nbd	   nbdcopy	2	17.08s	(14.90% worse)
>   qemu-nbd	   nbdcopy	4	17.89s	(20.37% worse)

Oh, that's weird.  I wonder if qemu's null-co driver has some poor
mutex behavior when being hit by parallel I/O.  Seems like
investigating that can be separate from this series, though.

>   qemu-nbd	  qemu-img	[u/s]	13.29s	

And another point where qemu-img is faster than nbdcopy as a
single-client baseline.

>   qemu-nbd	  qemu-img	1	13.31s	
>   qemu-nbd	  qemu-img	2	13.00s	
>   qemu-nbd	  qemu-img	4	12.62s	
>     nbdkit	   nbdcopy	1	15.06s	
>     nbdkit	   nbdcopy	2	12.21s	(23.32% better)
>     nbdkit	   nbdcopy	4	11.67s	(29.10% better)
>     nbdkit	  qemu-img	[u/s]	17.13s	
>     nbdkit	  qemu-img	1	17.11s	
>     nbdkit	  qemu-img	2	16.82s	
>     nbdkit	  qemu-img	4	18.81s	

Overall, I'm looking forward to seeing this go in (8.1 material; we're
too close to 8.0)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only]
Posted by Richard W.M. Jones 1 year, 1 month ago
On Fri, Mar 10, 2023 at 01:04:12PM -0600, Eric Blake wrote:
> How many of these timing numbers can be repeated with TLS in the mix?

While I have been playing with TLS and kTLS recently, it's not
something that is especially important to v2v since all NBD traffic
goes over Unix domain sockets only (ie. it's used as a kind of
interprocess communication).

I could certainly provide benchmarks, although as I'm going on holiday
shortly it may be a little while.

> > Curl local server test (./multi-conn.pl curlhttp)
> > =================================================
> > 
> > Localhost Apache serving a file over http
> >                   |
> >                   | http
> >                   v
> > nbdkit-curl-plugin   or   qemu-nbd
> >                   |
> >                   | nbd+unix
> >                   v
> > qemu-img convert   or   nbdcopy
> > 
> > We download an image from a local web server through
> > nbdkit-curl-plugin or qemu-nbd using the curl block driver, over NBD.
> > The image is copied to /dev/null.
> > 
> >   server          client        multi-conn
> >   ---------------------------------------------------------------
> >   qemu-nbd	   nbdcopy	1	8.88s	
> >   qemu-nbd	   nbdcopy	2	8.64s	
> >   qemu-nbd	   nbdcopy	4	8.37s	
> >   qemu-nbd	  qemu-img	[u/s]	6.47s
> 
> Do we have any good feel for why qemu-img is faster than nbdcopy in
> the baseline?  But improving that is orthogonal to this series.

I do not, but we have in the past found that results can be very
sensitive to request size.  By default (and also in all of these
tests) nbdcopy is using a request size of 256K, and qemu-img is using
a request size of 2M.

> >   qemu-nbd	  qemu-img	1	6.56s	
> >   qemu-nbd	  qemu-img	2	6.63s	
> >   qemu-nbd	  qemu-img	4	6.50s	
> >     nbdkit	   nbdcopy	1	12.15s  
> 
> I'm assuming this is nbdkit with your recent in-progress patches to
> have the curl plugin serve parallel requests.  But another place where
> we can investigate why nbdkit is not as performant as qemu-nbd at
> utilizing curl.
> 
> >     nbdkit	   nbdcopy	2	7.05s	(72.36% better)
> >     nbdkit	   nbdcopy	4	3.54s	(242.90% better)
> 
> That one is impressive!
> 
> >     nbdkit	  qemu-img	[u/s]	6.90s	
> >     nbdkit	  qemu-img	1	7.00s   
> 
> Minimal penalty for adding the code but not utilizing it...

[u/s] and qemu-img with multi-conn:1 ought to be identical actually.
After all, the only difference should be the restructuring of the code
to add the intermediate NBDConnState struct.  In this case it's
probably just measurement error.

> >     nbdkit	  qemu-img	2	3.85s	(79.15% better)
> >     nbdkit	  qemu-img	4	3.85s	(79.15% better)
> 
> ...and definitely shows its worth.
> 
> > 
> > 
> > Curl local file test (./multi-conn.pl curlfile)
> > ===============================================
> > 
> > nbdkit-curl-plugin   using file:/// URI
> >                   |
> >                   | nbd+unix
> >                   v
> > qemu-img convert   or   nbdcopy
> > 
> > We download from a file:/// URI.  This test is designed to exercise
> > NBD and some curl internal paths without the overhead from an external
> > server.  qemu-nbd doesn't support file:/// URIs so we cannot duplicate
> > the test for qemu as server.
> > 
> >   server          client        multi-conn
> >   ---------------------------------------------------------------
> >     nbdkit	   nbdcopy	1	31.32s	
> >     nbdkit	   nbdcopy	2	20.29s	(54.38% better)
> >     nbdkit	   nbdcopy	4	13.22s	(136.91% better)
> >     nbdkit	  qemu-img	[u/s]	31.55s	
> 
> Here, the baseline is already comparable; both nbdcopy and qemu-img
> are parsing the image off nbdkit in about the same amount of time.
> 
> >     nbdkit	  qemu-img	1	31.70s	
> 
> And again, minimal penalty for having the new code in place but not
> exploiting it.
> 
> >     nbdkit	  qemu-img	2	21.60s	(46.07% better)
> >     nbdkit	  qemu-img	4	13.88s	(127.25% better)
> 
> Plus an obvious benefit when the parallel sockets matter.
> 
> > 
> > 
> > Curl remote server test (./multi-conn.pl curlremote)
> > ====================================================
> > 
> > nbdkit-curl-plugin   using http://remote/*.qcow2 URI
> >          |
> >          | nbd+unix
> >          v
> > qemu-img convert
> > 
> > We download from a remote qcow2 file to a local raw file, converting
> > between formats during copying.
> > 
> > qemu-nbd   using http://remote/*.qcow2 URI
> >     |
> >     | nbd+unix
> >     v
> > qemu-img convert
> > 
> > Similarly, replacing nbdkit with qemu-nbd (treating the remote file as
> > if it is raw, so the conversion is still done by qemu-img).
> > 
> > Additionally we compare downloading the file with wget (note this
> > doesn't include the time for conversion, but that should only be a few
> > seconds).
> > 
> >   server          client        multi-conn
> >   ---------------------------------------------------------------
> >          -	      wget	1	58.19s	
> >     nbdkit	  qemu-img	[u/s]	68.29s	(17.36% worse)
> >     nbdkit	  qemu-img	1	67.85s	(16.60% worse)
> >     nbdkit	  qemu-img	2	58.17s	
> 
> Comparable to wget on paper, but a win in practice (since the wget
> step also has to add a post-download qemu-img local conversion step).

Yes, correct.  Best case that would be another ~ 2-3 seconds on this
machine.

> >     nbdkit	  qemu-img	4	59.80s	
> >     nbdkit	  qemu-img	6	59.15s	
> >     nbdkit	  qemu-img	8	59.52s	
> > 
> >   qemu-nbd	  qemu-img	[u/s]	202.55s
> >   qemu-nbd	  qemu-img	1	204.61s	
> >   qemu-nbd	  qemu-img	2	196.73s	
> >   qemu-nbd	  qemu-img	4	179.53s	(12.83% better)
> >   qemu-nbd	  qemu-img	6	181.70s	(11.48% better)
> >   qemu-nbd	  qemu-img	8	181.05s	(11.88% better)
> >
> 
> Less dramatic results here, but still nothing horrible.
> 
> > 
> > Local file test (./multi-conn.pl file)
> > ======================================
> > 
> > qemu-nbd or nbdkit serving a large local file
> >                   |
> >                   | nbd+unix
> >                   v
> > qemu-img convert   or   nbdcopy
> > 
> > We download a local file over NBD.  The image is copied to /dev/null.
> > 
> >   server          client        multi-conn
> >   ---------------------------------------------------------------
> >   qemu-nbd	   nbdcopy	1	15.50s	
> >   qemu-nbd	   nbdcopy	2	14.36s	
> >   qemu-nbd	   nbdcopy	4	14.32s	
> >   qemu-nbd	  qemu-img	[u/s]	10.16s	
> 
> Once again, we're seeing qemu-img baseline faster than nbdcopy as
> client.  But throwing more sockets at either client does improve
> performance, except for...
> 
> >   qemu-nbd	  qemu-img	1	11.17s	(10.01% worse)
> 
> ...this one looks bad.  Is it a case of this series adding more mutex
> work (qemu-img is making parallel requests; each request then contends
> for the mutex only to learn that it will be using the same NBD
> connection)?  And your comments about smarter round-robin schemes mean
> there may still be room to avoid this much of a penalty.

This was reproducible and I don't have a good explanation for it.  As
far as I know, just adding the NBDConnState struct should not add any
overhead.  The only locking is the call to choose_connection, and
that's just an access to an atomic variable, which I can't imagine
could cause such a difference.
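
Conceptually it is no more than the following (a simplified sketch of
the idea, not the literal code from the series; the struct layout and
names are invented for illustration):

  #include <stdatomic.h>

  #define MAX_MULTI_CONN 16

  /* Hypothetical per-connection and per-client state, standing in for
   * the intermediate NBDConnState restructuring mentioned above. */
  struct conn_state {
      int fd;                          /* socket for this connection */
  };

  struct client_state {
      unsigned n_conns;                /* 1 if multi-conn is unavailable */
      struct conn_state conns[MAX_MULTI_CONN];
      atomic_uint next;                /* round-robin cursor */
  };

  /* The only shared state touched per request is one atomic counter,
   * which is why this should add essentially no overhead. */
  static struct conn_state *choose_connection(struct client_state *s)
  {
      unsigned i = atomic_fetch_add(&s->next, 1);
      return &s->conns[i % s->n_conns];
  }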

> >   qemu-nbd	  qemu-img	2	10.35s	
> >   qemu-nbd	  qemu-img	4	10.39s	
> >     nbdkit	   nbdcopy	1	9.10s	
> 
> This one is interesting: nbdkit as server performs better than
> qemu-nbd.
> 
> >     nbdkit	   nbdcopy	2	8.25s	
> >     nbdkit	   nbdcopy	4	8.60s	
> >     nbdkit	  qemu-img	[u/s]	8.64s	
> >     nbdkit	  qemu-img	1	9.38s	
> >     nbdkit	  qemu-img	2	8.69s	
> >     nbdkit	  qemu-img	4	8.87s	
> > 
> > 
> > Null test (./multi-conn.pl null)
> > ================================
> > 
> > qemu-nbd with null-co driver  or  nbdkit-null-plugin + noextents filter
> >                   |
> >                   | nbd+unix
> >                   v
> > qemu-img convert   or   nbdcopy
> > 
> > This is like the local file test above, but without needing a file.
> > Instead all zeroes (fully allocated) are downloaded over NBD.
> 
> And I'm sure that if you allowed block status to show the holes, the
> performance would be a lot faster, but that would be testing something
> completely differently ;)
> 
> > 
> >   server          client        multi-conn
> >   ---------------------------------------------------------------
> >   qemu-nbd	   nbdcopy	1	14.86s	
> >   qemu-nbd	   nbdcopy	2	17.08s	(14.90% worse)
> >   qemu-nbd	   nbdcopy	4	17.89s	(20.37% worse)
> 
> Oh, that's weird.  I wonder if qemu's null-co driver has some poor
> mutex behavior when being hit by parallel I/O.  Seems like
> investigating that can be separate from this series, though.

Yes, I noticed in other tests that null-co has some odd behaviour, but
I couldn't understand it from looking at the code, which seems very
simple.  It does a memset; maybe that is expensive because it uses
newly allocated buffers every time, or something like that?

> >   qemu-nbd	  qemu-img	[u/s]	13.29s	
> 
> And another point where qemu-img is faster than nbdcopy as a
> single-client baseline.
> 
> >   qemu-nbd	  qemu-img	1	13.31s	
> >   qemu-nbd	  qemu-img	2	13.00s	
> >   qemu-nbd	  qemu-img	4	12.62s	
> >     nbdkit	   nbdcopy	1	15.06s	
> >     nbdkit	   nbdcopy	2	12.21s	(23.32% better)
> >     nbdkit	   nbdcopy	4	11.67s	(29.10% better)
> >     nbdkit	  qemu-img	[u/s]	17.13s	
> >     nbdkit	  qemu-img	1	17.11s	
> >     nbdkit	  qemu-img	2	16.82s	
> >     nbdkit	  qemu-img	4	18.81s	
> 
> Overall, I'm looking forward to seeing this go in (8.1 material; we're
> too close to 8.0)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

Re: [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only]
Posted by Vladimir Sementsov-Ogievskiy 1 year, 1 month ago
On 09.03.23 14:39, Richard W.M. Jones wrote:
> [ Patch series also available here, along with this cover letter and the
>    script used to generate test results:
>    https://gitlab.com/rwmjones/qemu/-/commits/2023-nbd-multi-conn-v1  ]
> 
> This patch series adds multi-conn support to the NBD block driver in
> qemu.  It is only meant for discussion and testing because it has a
> number of obvious shortcomings (see "XXX" in commit messages and
> code).  If we decide this is a good idea, we can work on a better
> patch.

I looked through the results and the code, and I think that's of course a good idea!

We still need smarter integration with reconnect logic.

At least, we shouldn't create several open_timer instances.


Currently, on open() we have open-timeout. That's just a limit for the whole nbd_open(); we can make several connection attempts during this time.

It seems we should proceed with success if we succeed with at least one connection. Postponing the additional connections until after open() seems good too [*].


Next, we have reconnect-delay. When a connection is lost, the NBD client tries to reconnect with no limit on the number of attempts, but after reconnect-delay seconds of reconnecting, all in-flight requests that are waiting for a connection are simply failed.

When we have several connections and one is broken, I think we shouldn't wait, but instead retry the requests on the other working connections. That way we don't need several reconnect_delay_timer objects: we need only one, for when all connections are lost.
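
Roughly, I mean a control flow like this (a sketch under my own assumptions; the types and helpers are hypothetical, not taken from the series):

  #include <stdbool.h>
  #include <stddef.h>

  struct conn { bool connected; };

  struct client {
      size_t n_conns;
      struct conn *conns;
  };

  /* Assumed helpers, not real qemu APIs: issue one request on one
   * connection, and wait up to reconnect-delay for any connection to
   * come back. */
  bool send_request_on(struct conn *c, const void *req);
  bool wait_for_reconnect(struct client *cl);

  static bool do_request(struct client *cl, size_t preferred,
                         const void *req)
  {
      for (int round = 0; round < 2; round++) {
          /* Try the preferred connection, then any other live one. */
          for (size_t i = 0; i < cl->n_conns; i++) {
              struct conn *c = &cl->conns[(preferred + i) % cl->n_conns];
              if (c->connected && send_request_on(c, req)) {
                  return true;
              }
          }
          /* Only when every connection is down do we start the single
           * reconnect-delay window, then retry once. */
          if (round == 1 || !wait_for_reconnect(cl)) {
              return false;    /* reconnect-delay expired: fail request */
          }
      }
      return false;
  }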


Reestablishing additional connections is better done in the background, without blocking in-flight requests. And that's the same way that postponing additional connections until after open() should work ([*]).

-- 
Best regards,
Vladimir