[Qemu-devel] [PATCH v4 00/21] nbd: add qemu-nbd --list

Eric Blake posted 21 patches 5 years, 3 months ago
Test docker-mingw@fedora passed
Test docker-clang@ubuntu passed
Test asan passed
Test checkpatch passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20190117193658.16413-1-eblake@redhat.com
Maintainers: Eric Blake <eblake@redhat.com>, Max Reitz <mreitz@redhat.com>, Kevin Wolf <kwolf@redhat.com>
qemu-nbd.texi              | 116 +++++-
Makefile                   |   2 +
include/block/nbd.h        |  31 +-
block/nbd-client.c         |   9 +-
blockdev-nbd.c             |  10 +-
nbd/client.c               | 775 ++++++++++++++++++++++++++-----------
nbd/server.c               |  24 +-
qemu-nbd.c                 | 224 ++++++++---
nbd/trace-events           |  17 +-
scripts/texi2pod.pl        |   2 +-
tests/qemu-iotests/223     |   2 +
tests/qemu-iotests/223.out |  20 +
tests/qemu-iotests/233     |  30 +-
tests/qemu-iotests/233.out |  19 +-
14 files changed, 942 insertions(+), 339 deletions(-)
[Qemu-devel] [PATCH v4 00/21] nbd: add qemu-nbd --list
Posted by Eric Blake 5 years, 3 months ago
I got tired of debugging whether a server was advertising the
correct things during negotiation by inspecting the trace
logs of qemu-io as client - not to mention that without SOME
sort of client tracing particular commands, we can't easily
regression test the server for correct behavior.  The final
straw was at KVM Forum, when Nir asked me to make sure there
was a way to easily determine if an NBD server is exposing what
we really want (and fixing x-dirty-bitmap to behave saner fell
out as a result of answering that question).

I note that upstream NBD has 'nbd-client -l $host' for querying
just export names (with no quoting, so you have to know that
a blank line means the default export), but it wasn't powerful
enough, so I implemented 'qemu-nbd -L' to document everything.
Upstream NBD has separate 'nbd-client' and 'nbd-server' binaries,
while we only have 'qemu-nbd' (which is normally just a server,
but 'qemu-nbd -c' also operates a second thread as a client).
Our other uses of qemu as NBD client are for consuming a block
device (as in qemu-io, qemu-img, or a drive to qemu) - but those
binaries are less suited to something so specific to the NBD
protocol.

Bonus: As a result of my work on this series, nbdkit now supports
NBD_OPT_INFO (my interoperability testing between server
implementations has been paying off, both at fixing server bugs,
and at making this code more reliable across difference in valid
servers).

Also available at:
https://repo.or.cz/qemu/ericb.git qemu-nbd-list-v4

Currently based on master.

Since v3:
- 1 new patch (1/21)
- split old patch 15/19 into two (16,17/21)
- retitle two patches (git backport-diff doesn't do too well at showing
the diff on a retitled patch; 5,11/21)
- fix review comments from Rich, Vladimir

001/21:[down] 'iotests: Make 233 output more reliable'
002/21:[----] [--] 'maint: Allow for EXAMPLES in texi2pod'
003/21:[----] [--] 'qemu-nbd: Enhance man page'
004/21:[0006] [FC] 'qemu-nbd: Sanity check partition bounds'
005/21:[down] 'nbd/server: Hoist length check to qmp_nbd_server_add'
006/21:[0025] [FC] 'nbd/server: Favor [u]int64_t over off_t'
007/21:[----] [-C] 'qemu-nbd: Avoid strtol open-coding'
008/21:[----] [--] 'nbd/client: Refactor nbd_receive_list()'
009/21:[----] [--] 'nbd/client: Move export name into NBDExportInfo'
010/21:[----] [--] 'nbd/client: Change signature of nbd_negotiate_simple_meta_context()'
011/21:[down] 'nbd/client: Split out nbd_send_meta_query()'
012/21:[0003] [FC] 'nbd/client: Split out nbd_receive_one_meta_context()'
013/21:[----] [--] 'nbd/client: Refactor return of nbd_receive_negotiate()'
014/21:[0003] [FC] 'nbd/client: Split handshake into two functions'
015/21:[----] [-C] 'nbd/client: Pull out oldstyle size determination'
016/21:[down] 'nbd/client: Refactor nbd_opt_go() to support NBD_OPT_INFO'
017/21:[0032] [FC] 'nbd/client: Add nbd_receive_export_list()'
018/21:[0006] [FC] 'nbd/client: Add meta contexts to nbd_receive_export_list()'
019/21:[0014] [FC] 'qemu-nbd: Add --list option'
020/21:[0004] [FC] 'nbd/client: Work around 3.0 bug for listing meta contexts'
021/21:[0002] [FC] 'iotests: Enhance 223, 233 to cover 'qemu-nbd --list''

Eric Blake (21):
  iotests: Make 233 output more reliable
  maint: Allow for EXAMPLES in texi2pod
  qemu-nbd: Enhance man page
  qemu-nbd: Sanity check partition bounds
  nbd/server: Hoist length check to qmp_nbd_server_add
  nbd/server: Favor [u]int64_t over off_t
  qemu-nbd: Avoid strtol open-coding
  nbd/client: Refactor nbd_receive_list()
  nbd/client: Move export name into NBDExportInfo
  nbd/client: Change signature of nbd_negotiate_simple_meta_context()
  nbd/client: Split out nbd_send_meta_query()
  nbd/client: Split out nbd_receive_one_meta_context()
  nbd/client: Refactor return of nbd_receive_negotiate()
  nbd/client: Split handshake into two functions
  nbd/client: Pull out oldstyle size determination
  nbd/client: Refactor nbd_opt_go() to support NBD_OPT_INFO
  nbd/client: Add nbd_receive_export_list()
  nbd/client: Add meta contexts to nbd_receive_export_list()
  qemu-nbd: Add --list option
  nbd/client: Work around 3.0 bug for listing meta contexts
  iotests: Enhance 223, 233 to cover 'qemu-nbd --list'

 qemu-nbd.texi              | 116 +++++-
 Makefile                   |   2 +
 include/block/nbd.h        |  31 +-
 block/nbd-client.c         |   9 +-
 blockdev-nbd.c             |  10 +-
 nbd/client.c               | 775 ++++++++++++++++++++++++++-----------
 nbd/server.c               |  24 +-
 qemu-nbd.c                 | 224 ++++++++---
 nbd/trace-events           |  17 +-
 scripts/texi2pod.pl        |   2 +-
 tests/qemu-iotests/223     |   2 +
 tests/qemu-iotests/223.out |  20 +
 tests/qemu-iotests/233     |  30 +-
 tests/qemu-iotests/233.out |  19 +-
 14 files changed, 942 insertions(+), 339 deletions(-)

-- 
2.20.1


Re: [Qemu-devel] [PATCH v4 00/21] nbd: add qemu-nbd --list
Posted by Vladimir Sementsov-Ogievskiy 5 years, 3 months ago
17.01.2019 22:36, Eric Blake wrote:
> I got tired of debugging whether a server was advertising the
> correct things during negotiation by inspecting the trace
> logs of qemu-io as client - not to mention that without SOME
> sort of client tracing particular commands, we can't easily
> regression test the server for correct behavior.  The final
> straw was at KVM Forum, when Nir asked me to make sure there
> was a way to easily determine if an NBD server is exposing what
> we really want (and fixing x-dirty-bitmap to behave saner fell
> out as a result of answering that question).
> 
> I note that upstream NBD has 'nbd-client -l $host' for querying
> just export names (with no quoting, so you have to know that
> a blank line means the default export), but it wasn't powerful
> enough, so I implemented 'qemu-nbd -L' to document everything.
> Upstream NBD has separate 'nbd-client' and 'nbd-server' binaries,
> while we only have 'qemu-nbd' (which is normally just a server,
> but 'qemu-nbd -c' also operates a second thread as a client).
> Our other uses of qemu as NBD client are for consuming a block
> device (as in qemu-io, qemu-img, or a drive to qemu) - but those
> binaries are less suited to something so specific to the NBD
> protocol.
> 
> Bonus: As a result of my work on this series, nbdkit now supports
> NBD_OPT_INFO (my interoperability testing between server
> implementations has been paying off, both at fixing server bugs,
> and at making this code more reliable across difference in valid
> servers).
> 
> Also available at:
> https://repo.or.cz/qemu/ericb.git qemu-nbd-list-v4
> 
> Currently based on master.
> 
> Since v3:
> - 1 new patch (1/21)
> - split old patch 15/19 into two (16,17/21)
> - retitle two patches (git backport-diff doesn't do too well at showing
> the diff on a retitled patch; 5,11/21)
> - fix review comments from Rich, Vladimir
> 
> 001/21:[down] 'iotests: Make 233 output more reliable'
> 002/21:[----] [--] 'maint: Allow for EXAMPLES in texi2pod'
> 003/21:[----] [--] 'qemu-nbd: Enhance man page'

Interesting, I don't get it again. Searched in outlook online, I found only v2 of it,
checked spam folder and filters. Magic.

anyway, I can look at https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg04328.html


-- 
Best regards,
Vladimir
Re: [Qemu-devel] [PATCH v4 00/21] nbd: add qemu-nbd --list
Posted by Vladimir Sementsov-Ogievskiy 5 years, 3 months ago
18.01.2019 11:15, Vladimir Sementsov-Ogievskiy wrote:
> 17.01.2019 22:36, Eric Blake wrote:
>> I got tired of debugging whether a server was advertising the
>> correct things during negotiation by inspecting the trace
>> logs of qemu-io as client - not to mention that without SOME
>> sort of client tracing particular commands, we can't easily
>> regression test the server for correct behavior.  The final
>> straw was at KVM Forum, when Nir asked me to make sure there
>> was a way to easily determine if an NBD server is exposing what
>> we really want (and fixing x-dirty-bitmap to behave saner fell
>> out as a result of answering that question).
>>
>> I note that upstream NBD has 'nbd-client -l $host' for querying
>> just export names (with no quoting, so you have to know that
>> a blank line means the default export), but it wasn't powerful
>> enough, so I implemented 'qemu-nbd -L' to document everything.
>> Upstream NBD has separate 'nbd-client' and 'nbd-server' binaries,
>> while we only have 'qemu-nbd' (which is normally just a server,
>> but 'qemu-nbd -c' also operates a second thread as a client).
>> Our other uses of qemu as NBD client are for consuming a block
>> device (as in qemu-io, qemu-img, or a drive to qemu) - but those
>> binaries are less suited to something so specific to the NBD
>> protocol.
>>
>> Bonus: As a result of my work on this series, nbdkit now supports
>> NBD_OPT_INFO (my interoperability testing between server
>> implementations has been paying off, both at fixing server bugs,
>> and at making this code more reliable across difference in valid
>> servers).
>>
>> Also available at:
>> https://repo.or.cz/qemu/ericb.git qemu-nbd-list-v4
>>
>> Currently based on master.
>>
>> Since v3:
>> - 1 new patch (1/21)
>> - split old patch 15/19 into two (16,17/21)
>> - retitle two patches (git backport-diff doesn't do too well at showing
>> the diff on a retitled patch; 5,11/21)
>> - fix review comments from Rich, Vladimir
>>
>> 001/21:[down] 'iotests: Make 233 output more reliable'
>> 002/21:[----] [--] 'maint: Allow for EXAMPLES in texi2pod'
>> 003/21:[----] [--] 'qemu-nbd: Enhance man page'
> 
> Interesting, I don't get it again. Searched in outlook online, I found only v2 of it,
> checked spam folder and filters. Magic.
> 
> anyway, I can look at https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg04328.html
> 
> 

just look at it in your qemu-nbd-list-v4 tag:

 > Only expose MBR partition @var{num}.  Understands physical partitions
 > 1-4 and logical partitions 5-8.

I'm afraid, I'm too lazy to sort out these things I don't know, so just believe. It at least
corresponds to limits in code that it should be 1 <= partition <= 8 ;)

 > @item -c, --connect=@var{dev}
 > Connect @var{filename} to NBD device @var{dev} (Linux only).
 > @item -d, --disconnect
 > Disconnect the device @var{dev} (Linux only).

Yes, and we now have clean code which establish this limitation

 > @item -e, --shared=@var{num}
 > Allow up to @var{num} clients to share the device (default
 > @samp{1}). Safe for readers, but for now, consistency is not
 > guaranteed between multiple writers.

Hmm, don't understand, why you decided to move @samp{1}). on the following line, it looks
better as it was. And with a period it's only 69 symbols, the file have a lot longer lines.

 > Start a server listening on port 10809 that exposes only the
 > guest-visible contents of a qcow2 file, with no TLS encryption, and
 > with the default export name (an empty string). The command is
 > one-shot, and will block until the first successful client
 > disconnects:
 >
 > @example
 > qemu-nbd -f qcow2 file.qcow2
 > @end example
 >
 > Start a long-running server listening with encryption on port 10810,
 > and require clients to have a correct X.509 certificate to connect to
 > a 1 megabyte subset of a raw file, using the export name 'subset':
 >
 > @example
 > qemu-nbd \
 >   --object tls-creds-x509,id=tls0,endpoint=server,dir=/path/to/qemutls \
 >   --tls-creds tls0 -t -x subset -p 10810 \
 >   --image-opts driver=raw,offset=1M,length=1M,file.driver=file,file.filename=file.raw
 > @end example

I don't know tls-related, the other options looks fine.

upd (decided to run the command, with dropped tls):
looks fine but not work, as s/length/size.

 >
 > Serve a read-only copy of just the first MBR partition of a guest
 > image over a Unix socket with as many as 5 simultaneous readers, with
 > a persistent process forked as a daemon:
 >
 > @example
 > qemu-nbd --fork -t -e 5 -s /path/to/sock -p 1 -r -f qcow2 file.qcow2
 > @end example

Oops. s/-s/-k/, s/-p/-P/, and idea: may be, use self-descriptive long option names in
examples?

 >
 > Expose the guest-visible contents of a qcow2 file via a block device
 > /dev/nbd0 (and possibly creating /dev/nbd0p1 and friends for
 > partitions found within), then disconnect the device when done.
 > @emph{CAUTION}: Do not use this method to mount filesystems from an
 > untrusted guest image - a malicious guest may have prepared the image
 > to attempt to trigger kernel bugs in partition probing or file system
 > mounting.
 >
 > @example

should we note, that nbd kernel-module should be loaded for this to work?

 > qemu-nbd -c /dev/nbd0 -f qcow2 file.qcow2
 > qemu-nbd -d /dev/nbd0
 > @end example

With fixed wrong option names and s/length/size:
for 03/21:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>


-- 
Best regards,
Vladimir
Re: [Qemu-devel] [PATCH v4 00/21] nbd: add qemu-nbd --list
Posted by Eric Blake 5 years, 3 months ago
On 1/18/19 7:47 AM, Vladimir Sementsov-Ogievskiy wrote:

>>
>> Interesting, I don't get it again. Searched in outlook online, I found only v2 of it,
>> checked spam folder and filters. Magic.

Weird indeed.  Maybe there's enough @ signs due to .texi content in
there that a spam filter somewhere along the line ate it silently?

>>
>> anyway, I can look at https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg04328.html
>>
>>
> 
> just look at it in your qemu-nbd-list-v4 tag:

Thanks for that extra effort.

> 
>  > Only expose MBR partition @var{num}.  Understands physical partitions
>  > 1-4 and logical partitions 5-8.
> 
> I'm afraid, I'm too lazy to sort out these things I don't know, so just believe. It at least
> corresponds to limits in code that it should be 1 <= partition <= 8 ;)
> 
>  > @item -c, --connect=@var{dev}
>  > Connect @var{filename} to NBD device @var{dev} (Linux only).
>  > @item -d, --disconnect
>  > Disconnect the device @var{dev} (Linux only).
> 
> Yes, and we now have clean code which establish this limitation
> 
>  > @item -e, --shared=@var{num}
>  > Allow up to @var{num} clients to share the device (default
>  > @samp{1}). Safe for readers, but for now, consistency is not
>  > guaranteed between multiple writers.
> 
> Hmm, don't understand, why you decided to move @samp{1}). on the following line, it looks
> better as it was. And with a period it's only 69 symbols, the file have a lot longer lines.

Emacs auto-reflowed the line as I edited the paragraph.  End-of-lines
doesn't really matter in .texi files (it gets reflowed again in the
creation of output documentation from the .texi input), so if my editor
tries to stick to certain column lengths, I don't really fight it.


>  > @example
>  > qemu-nbd \
>  >   --object tls-creds-x509,id=tls0,endpoint=server,dir=/path/to/qemutls \
>  >   --tls-creds tls0 -t -x subset -p 10810 \
>  >   --image-opts driver=raw,offset=1M,length=1M,file.driver=file,file.filename=file.raw
>  > @end example
> 
> I don't know tls-related, the other options looks fine.

I looked through past git logs to see the command lines that Dan used
when initially implementing things; as well as comparison to iotest 233.

> 
> upd (decided to run the command, with dropped tls):
> looks fine but not work, as s/length/size.

Oh, good catch!  Will fix.

> 
>  >
>  > Serve a read-only copy of just the first MBR partition of a guest
>  > image over a Unix socket with as many as 5 simultaneous readers, with
>  > a persistent process forked as a daemon:
>  >
>  > @example
>  > qemu-nbd --fork -t -e 5 -s /path/to/sock -p 1 -r -f qcow2 file.qcow2
>  > @end example
> 
> Oops. s/-s/-k/, s/-p/-P/,

Ouch. Yes, thanks for catching that.

 and idea: may be, use self-descriptive long option names in
> examples?

Hmm. The nice part about short options is that they are less typing and
shorter lines; but verbose options are also useful.  Maybe listing the
same example twice, in both the short and long form?

> 
>  >
>  > Expose the guest-visible contents of a qcow2 file via a block device
>  > /dev/nbd0 (and possibly creating /dev/nbd0p1 and friends for
>  > partitions found within), then disconnect the device when done.
>  > @emph{CAUTION}: Do not use this method to mount filesystems from an
>  > untrusted guest image - a malicious guest may have prepared the image
>  > to attempt to trigger kernel bugs in partition probing or file system
>  > mounting.
>  >
>  > @example
> 
> should we note, that nbd kernel-module should be loaded for this to work?

Yes, that's worth mentioning (some distros include that module to be
running or autoload already, but it's not universal, so such a note is
indeed warranted).

> 
>  > qemu-nbd -c /dev/nbd0 -f qcow2 file.qcow2
>  > qemu-nbd -d /dev/nbd0
>  > @end example
> 
> With fixed wrong option names and s/length/size:
> for 03/21:
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH v4 00/21] nbd: add qemu-nbd --list
Posted by Eric Blake 5 years, 3 months ago
On 1/18/19 7:47 AM, Vladimir Sementsov-Ogievskiy wrote:

> 
>  > Only expose MBR partition @var{num}.  Understands physical partitions
>  > 1-4 and logical partitions 5-8.
> 
> I'm afraid, I'm too lazy to sort out these things I don't know, so just believe. It at least
> corresponds to limits in code that it should be 1 <= partition <= 8 ;)

It matches the code, but I just learned the code is buggy for anything
larger than 5.  According to
http://tldp.org/HOWTO/Large-Disk-HOWTO-13.html, MBR Extended/Logical
partitions form a linked-list, something like:

MBR:              EBR1              EBR2
+----------+      +----------+      +----------+
+ Part 1   +   |->+ Part 5   +   |->+ Part 6   +
+ Part 2   +   |  + Next --------|  + 0        +
+ Part 3   +   |  + 0        +      + 0        +
+ Part 4 -------  + 0        +      + 0        +
+----------+      +----------+      +----------+

In reality, ANY of part 1-4 can point to the first EBR, as long as there
is at most one primary partition pointing to the extended partition -
even crazier is that there are different magic numbers in historical
use, such as type 5 for Part 3 and type 85 for Part 4, where MS-DOS
would treat Partition 3 as the extended partition and ignore Part 4, but
Linux would follow both chains (where the second extended partition thus
allowed Linux access to logical partitions residing in space beyond what
DOS could access).  But once you start the extended partition chain, all
logical partitions within the chain each have their own EBR table with
one entry for the partition and the next entry pointing to the next EBR,
meaning you can have more than 8 logical partitions (provided your disk
is big enough).

But our find_partition() code stupidly assumes that if there is any
extended partition type in the MBR (recognizing only type 5 or type F,
but not Linux' type 85), then all four entries in that EBR are logical
partitions 5-8 (and no more, rather than a linked list chain, as in:

MBR:              EBR
+----------+      +----------+
+ Part 1   +   |->+ Part 5   +
+ Part 2   +   |  + Part 6   +
+ Part 3   +   |  + Part 7   +
+ Part 4 -------  + Part 8   +
+----------+      +----------+

It's highly unlikely that there are any BIOS implementations that would
actually recognize such a partition beyond 5 as being bootable (there
might be OSs which are a bit more tolerant, since MBR doesn't seem to
have any one hard canonical specification).  I'd have to compare what
the Linux kernel MBR code does, to see if we even stand a chance of
being interoperable in any manner.  It doesn't help that nbdkit does not
yet support Logical MBR partitions.

Oh well, another project for another day; this documentation change is
going in as-is because it at least matches the code, even if the code is
buggy.  (I'm tempted to fix nbdkit to fully support MBR logical
partitions, then rip out the partitioning code in qemu as redundant as
you could get it via nbdkit if you really need it - qemu-nbd's -P 8 has
unchanged, and thus buggy, since at least commit 7a5ca8648 in May 2008,
with no user complaining of a bug for 11 years)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH v4 00/21] nbd: add qemu-nbd --list
Posted by Richard W.M. Jones 5 years, 3 months ago
On Fri, Jan 18, 2019 at 04:47:42PM -0600, Eric Blake wrote:
> It matches the code, but I just learned the code is buggy for anything
> larger than 5.  According to
> http://tldp.org/HOWTO/Large-Disk-HOWTO-13.html, MBR Extended/Logical
> partitions form a linked-list, something like:
> 
> MBR:              EBR1              EBR2
> +----------+      +----------+      +----------+
> + Part 1   +   |->+ Part 5   +   |->+ Part 6   +
> + Part 2   +   |  + Next --------|  + 0        +
> + Part 3   +   |  + 0        +      + 0        +
> + Part 4 -------  + 0        +      + 0        +
> +----------+      +----------+      +----------+

Wikipedia's description is a bit better:

https://en.wikipedia.org/wiki/Extended_boot_record

In fact it's not strictly a linked list because the start record of
each entry is relative to the extended partition start.  It's all a
bit weird.

But you're right, nbdkit can neither read (nbdkit-partition-filter)
nor write (nbdkit-partitioning-plugin) MBR logical partitions so we
can't test interop or tell people to use nbdkit for this corner case.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

Re: [Qemu-devel] [PATCH v4 00/21] nbd: add qemu-nbd --list
Posted by Richard W.M. Jones 5 years, 3 months ago
Attached is a NON-working patch to nbdkit-partitioning-plugin which
adds logical partition support.  I don't think I've fully understood
how the EBR fields are supposed to be initialized (or else they don't
work how is described in online documentation).  This actually causes
parted to print an internal error, while fdisk gives a warning and the
size of the logical partition is wrong.

Anyway I've run out of time to work on it this weekend, but I may be
able to have another look next week.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
From 8d6dd4288064e9b880eccb1a6b1b7a6b03f2ba96 Mon Sep 17 00:00:00 2001
From: "Richard W.M. Jones" <rjones@redhat.com>
Date: Sat, 19 Jan 2019 10:30:28 +0000
Subject: [PATCH] partitioning: Support MBR logical partitions.

XXX NEEDS TESTS
---
 .../nbdkit-partitioning-plugin.pod            |  25 +---
 plugins/partitioning/virtual-disk.h           |  15 +-
 plugins/partitioning/partition-gpt.c          |   2 +-
 plugins/partitioning/partition-mbr.c          | 134 +++++++++++++++---
 plugins/partitioning/partitioning.c           |  31 ++--
 plugins/partitioning/virtual-disk.c           |  42 +++++-
 6 files changed, 184 insertions(+), 65 deletions(-)

diff --git a/plugins/partitioning/nbdkit-partitioning-plugin.pod b/plugins/partitioning/nbdkit-partitioning-plugin.pod
index 57a1133..edb0776 100644
--- a/plugins/partitioning/nbdkit-partitioning-plugin.pod
+++ b/plugins/partitioning/nbdkit-partitioning-plugin.pod
@@ -27,23 +27,12 @@ access use the I<-r> flag.
 
 =head2 Partition table type
 
-You can choose either an MBR partition table, which is limited to 4
-partitions, or a GPT partition table.  In theory GPT supports an
-unlimited number of partitions.
-
-The rule for selecting the partition table type is:
+Using the C<partition-type> parameter you can choose either an MBR or
+a GPT partition table.  If this parameter is I<not> present then:
 
 =over 4
 
-=item C<partition-type=mbr> parameter on the command line
-
-⇒ MBR is selected
-
-=item C<partition-type=gpt> parameter on the command line
-
-⇒ GPT is selected
-
-=item else, number of files E<gt> 4
+=item number of files E<gt> 4
 
 ⇒ GPT
 
@@ -131,7 +120,9 @@ C<./> (absolute paths do not need modification).
 =item B<partition-type=mbr>
 
 Add an MBR (DOS-style) partition table.  The MBR format is maximally
-compatible with all clients, but only supports up to 4 partitions.
+compatible with all clients.  If there are E<gt> 4 partitions then an
+extended partition is created
+(L<https://en.wikipedia.org/wiki/Extended_boot_record>).
 
 =item B<partition-type=gpt>
 
@@ -163,10 +154,6 @@ a Linux filesystem.
 
 =head1 LIMITS
 
-This plugin only supports B<primary> MBR partitions, hence the limit
-of 4 partitions with MBR.  This might be increased in future if we
-implement support for logical/extended MBR partitions.
-
 Although this plugin can create GPT partition tables containing more
 than 128 GPT partitions (in fact, unlimited numbers of partitions),
 some clients will not be able to handle this.
diff --git a/plugins/partitioning/virtual-disk.h b/plugins/partitioning/virtual-disk.h
index 3860f46..f59df70 100644
--- a/plugins/partitioning/virtual-disk.h
+++ b/plugins/partitioning/virtual-disk.h
@@ -34,6 +34,7 @@
 #ifndef NBDKIT_VIRTUAL_DISK_H
 #define NBDKIT_VIRTUAL_DISK_H
 
+#include <stdbool.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 
@@ -103,7 +104,7 @@ extern struct file *files;
 extern size_t nr_files;
 
 extern struct regions regions;
-extern unsigned char *primary, *secondary;
+extern unsigned char *primary, *secondary, **ebr;
 
 /* Main entry point called after files array has been populated. */
 extern int create_virtual_disk_layout (void);
@@ -114,16 +115,16 @@ extern int create_virtual_disk_layout (void);
 extern int parse_guid (const char *str, char *out)
   __attribute__((__nonnull__ (1, 2)));
 
-/* Internal functions for creating MBR and GPT layouts.  These are
- * published here because the GPT code calls into the MBR code, but
- * are not meant to be called from the main plugin.
+/* Internal function for creating a single MBR PTE.  The GPT code
+ * calls this for creating the protective MBR.
  */
-extern void create_mbr_partition_table (unsigned char *out)
-  __attribute__((__nonnull__ (1)));
 extern void create_mbr_partition_table_entry (const struct region *,
-                                              int bootable, int partition_id,
+                                              bool bootable, int partition_id,
                                               unsigned char *)
   __attribute__((__nonnull__ (1, 4)));
+
+/* Create MBR or GPT layout. */
+extern void create_mbr_layout (void);
 extern void create_gpt_layout (void);
 
 #endif /* NBDKIT_VIRTUAL_DISK_H */
diff --git a/plugins/partitioning/partition-gpt.c b/plugins/partitioning/partition-gpt.c
index 5fb7602..be52e72 100644
--- a/plugins/partitioning/partition-gpt.c
+++ b/plugins/partitioning/partition-gpt.c
@@ -210,7 +210,7 @@ create_gpt_protective_mbr (unsigned char *out)
   region.end = end;
   region.len = region.end - region.start + 1;
 
-  create_mbr_partition_table_entry (&region, 0, 0xee, &out[0x1be]);
+  create_mbr_partition_table_entry (&region, false, 0xee, &out[0x1be]);
 
   /* Boot signature. */
   out[0x1fe] = 0x55;
diff --git a/plugins/partitioning/partition-mbr.c b/plugins/partitioning/partition-mbr.c
index d3a0d78..6f76432 100644
--- a/plugins/partitioning/partition-mbr.c
+++ b/plugins/partitioning/partition-mbr.c
@@ -49,27 +49,125 @@
 #include "regions.h"
 #include "virtual-disk.h"
 
-/* Create the partition table. */
+static const struct region *find_file_region (size_t i, size_t *j);
+static const struct region *find_ebr_region (size_t i, size_t *j);
+
+/* Create the MBR and optionally EBRs. */
 void
-create_mbr_partition_table (unsigned char *out)
+create_mbr_layout (void)
 {
-  size_t i, j;
-
-  for (j = 0; j < nr_regions (&regions); ++j) {
-    const struct region *region = get_region (&regions, j);
-
-    if (region->type == region_file) {
-      i = region->u.i;
-      assert (i < 4);
-      create_mbr_partition_table_entry (region, i == 0,
-                                        files[i].mbr_id,
-                                        &out[0x1be + 16*i]);
-    }
-  }
+  size_t i, j = 0;
 
   /* Boot signature. */
-  out[0x1fe] = 0x55;
-  out[0x1ff] = 0xaa;
+  primary[0x1fe] = 0x55;
+  primary[0x1ff] = 0xaa;
+
+  if (nr_files <= 4) {
+    /* Basic MBR with no extended partition. */
+    for (i = 0; i < nr_files; ++i) {
+      const struct region *region = find_file_region (i, &j);
+
+      create_mbr_partition_table_entry (region, i == 0, files[i].mbr_id,
+                                        &primary[0x1be + 16*i]);
+    }
+  }
+  else {
+    struct region region;
+    const struct region *rptr, *eptr0, *eptr;
+
+    /* The first three primary partitions correspond to the first
+     * three files.
+     */
+    for (i = 0; i < 3; ++i) {
+      rptr = find_file_region (i, &j);
+      create_mbr_partition_table_entry (rptr, i == 0, files[i].mbr_id,
+                                        &primary[0x1be + 16*i]);
+    }
+
+    /* The fourth partition is an extended PTE and does not correspond
+     * to any file.  This partition starts with the first EBR, so find
+     * it.  The partition extends to the end of the disk.
+     */
+    eptr0 = find_ebr_region (3, &j);
+    region.start = eptr0->start;
+    region.end = virtual_size (&regions) - 1; /* to end of disk */
+    region.len = region.end - region.start + 1;
+    create_mbr_partition_table_entry (&region, false, 0xf, &primary[0x1ee]);
+
+    /* The remaining files are mapped to logical partitions living in
+     * the fourth extended partition.
+     */
+    for (i = 3; i < nr_files; ++i) {
+      if (i == 3)
+        eptr = eptr0;
+      else
+        eptr = find_ebr_region (i, &j);
+      rptr = find_file_region (i, &j);
+
+      /* Signature. */
+      ebr[i-3][0x1fe] = 0x55;
+      ebr[i-3][0x1ff] = 0xaa;
+
+      /* First entry in EBR contains:
+       * offset from EBR sector to the first sector of the logical partition
+       * total count of sectors in the logical partition
+       */
+      region.start = rptr->start - eptr->start;
+      region.len = rptr->len;
+      create_mbr_partition_table_entry (&region, false, files[i].mbr_id,
+                                        &ebr[i-3][0x1be]);
+
+      if (i < nr_files-1) {
+        size_t j2 = j;
+        const struct region *enext = find_ebr_region (i+1, &j2);
+        const struct region *rnext = find_file_region (i+1, &j2);
+
+        /* Second entry in the EBR contains:
+         * address of next EBR relative to extended partition
+         * total count of sectors in the next logical partition including
+         * next EBR
+         */
+        region.start = enext->start - eptr0->start;
+        region.len = rnext->end - enext->start + 1;
+        create_mbr_partition_table_entry (&region, false, files[i+1].mbr_id,
+                                          &ebr[i-3][0x1ce]);
+      }
+    }
+  }
+}
+
+/* Find the region corresponding to file[i].
+ * j is a scratch register ensuring we only do a linear scan.
+ */
+static const struct region *
+find_file_region (size_t i, size_t *j)
+{
+  const struct region *region;
+
+  for (; *j < nr_regions (&regions); ++(*j)) {
+    region = get_region (&regions, *j);
+    if (region->type == region_file && region->u.i == i)
+      return region;
+  }
+  abort ();
+}
+
+/* Find the region corresponding to EBR of file[i] (i >= 3).
+ * j is a scratch register ensuring we only do a linear scan.
+ */
+static const struct region *
+find_ebr_region (size_t i, size_t *j)
+{
+  const struct region *region;
+
+  assert (i >= 3);
+
+  for (; *j < nr_regions (&regions); ++(*j)) {
+    region = get_region (&regions, *j);
+    if (region->type == region_data && region->u.data == ebr[i-3])
+      return region;
+  }
+  abort ();
 }
 
 static void
@@ -84,7 +182,7 @@ chs_too_large (unsigned char *out)
 
 void
 create_mbr_partition_table_entry (const struct region *region,
-                                  int bootable, int partition_id,
+                                  bool bootable, int partition_id,
                                   unsigned char *out)
 {
   uint64_t start_sector, nr_sectors;
diff --git a/plugins/partitioning/partitioning.c b/plugins/partitioning/partitioning.c
index 205c059..689cb24 100644
--- a/plugins/partitioning/partitioning.c
+++ b/plugins/partitioning/partitioning.c
@@ -35,6 +35,7 @@
 
 #include <stdio.h>
 #include <stdlib.h>
+#include <stdbool.h>
 #include <stdint.h>
 #include <inttypes.h>
 #include <string.h>
@@ -81,8 +82,11 @@ size_t nr_files = 0;
 /* Virtual disk layout. */
 struct regions regions;
 
-/* Primary and secondary partition tables (secondary is only used for GPT). */
-unsigned char *primary = NULL, *secondary = NULL;
+/* Primary and secondary partition tables and extended boot records.
+ * Secondary PT is only used for GPT.  EBR array of sectors is only
+ * used for MBR with > 4 partitions and has length equal to nr_files-3.
+ */
+unsigned char *primary = NULL, *secondary = NULL, **ebr = NULL;
 
 /* Used to generate random unique partition GUIDs for GPT. */
 static struct random_state random_state;
@@ -105,12 +109,17 @@ partitioning_unload (void)
   free (files);
 
   /* We don't need to free regions.regions[].u.data because it points
-   * to either primary or secondary which we free here.
+   * to primary, secondary or ebr which we free here.
    */
   free_regions (&regions);
 
   free (primary);
   free (secondary);
+  if (ebr) {
+    for (i = 0; i < nr_files-3; ++i)
+      free (ebr[i]);
+    free (ebr);
+  }
 }
 
 static int
@@ -225,7 +234,7 @@ partitioning_config_complete (void)
 {
   size_t i;
   uint64_t total_size;
-  int needs_gpt;
+  bool needs_gpt;
 
   /* Not enough / too many files? */
   if (nr_files == 0) {
@@ -236,17 +245,11 @@ partitioning_config_complete (void)
   total_size = 0;
   for (i = 0; i < nr_files; ++i)
     total_size += files[i].statbuf.st_size;
-
-  if (nr_files > 4)
-    needs_gpt = 1;
-  else if (total_size > MAX_MBR_DISK_SIZE)
-    needs_gpt = 1;
-  else
-    needs_gpt = 0;
+  needs_gpt = total_size > MAX_MBR_DISK_SIZE;
 
   /* Choose default parttype if not set. */
   if (parttype == PARTTYPE_UNSET) {
-    if (needs_gpt) {
+    if (needs_gpt || nr_files > 4) {
       parttype = PARTTYPE_GPT;
       nbdkit_debug ("picking partition type GPT");
     }
@@ -256,8 +259,8 @@ partitioning_config_complete (void)
     }
   }
   else if (parttype == PARTTYPE_MBR && needs_gpt) {
-    nbdkit_error ("MBR partition table type supports a maximum of 4 partitions "
-                  "and a maximum virtual disk size of about 2 TB, "
+    nbdkit_error ("MBR partition table type supports "
+                  "a maximum virtual disk size of about 2 TB, "
                   "but you requested %zu partition(s) "
                   "and a total size of %" PRIu64 " bytes (> %" PRIu64 ").  "
                   "Try using: partition-type=gpt",
diff --git a/plugins/partitioning/virtual-disk.c b/plugins/partitioning/virtual-disk.c
index e2d72bc..4fe186e 100644
--- a/plugins/partitioning/virtual-disk.c
+++ b/plugins/partitioning/virtual-disk.c
@@ -69,6 +69,25 @@ create_virtual_disk_layout (void)
       nbdkit_error ("malloc: %m");
       return -1;
     }
+
+    if (nr_files > 4) {
+      /* The first 3 primary partitions will be real partitions, the
+       * 4th will be an extended partition, and so we need to store
+       * EBRs for nr_files-3 logical partitions.
+       */
+      ebr = malloc (sizeof (unsigned char *) * (nr_files-3));
+      if (ebr == NULL) {
+        nbdkit_error ("malloc: %m");
+        return -1;
+      }
+      for (i = 0; i < nr_files-3; ++i) {
+        ebr[i] = calloc (1, SECTOR_SIZE);
+        if (ebr[i] == NULL) {
+          nbdkit_error ("malloc: %m");
+          return -1;
+        }
+      }
+    }
   }
   else /* PARTTYPE_GPT */ {
     /* Protective MBR + PT header + PTA = 2 + GPT_PTA_LBAs */
@@ -117,6 +136,20 @@ create_virtual_disk_layout (void)
      */
     assert (IS_ALIGNED (offset, SECTOR_SIZE));
 
+    /* Logical partitions are preceeded by an EBR. */
+    if (parttype == PARTTYPE_MBR && nr_files > 4 && i >= 3) {
+      region.start = offset;
+      region.len = SECTOR_SIZE;
+      region.end = region.start + region.len - 1;
+      region.type = region_data;
+      region.u.data = ebr[i-3];
+      region.description = "EBR";
+      if (append_region (&regions, region) == -1)
+        return -1;
+
+      offset = virtual_size (&regions);
+    }
+
     /* Make sure each partition is aligned for best performance. */
     if (!IS_ALIGNED (offset, files[i].alignment)) {
       region.start = offset;
@@ -207,13 +240,10 @@ create_partition_table (void)
   if (parttype == PARTTYPE_GPT)
     assert (secondary != NULL);
 
-  if (parttype == PARTTYPE_MBR) {
-    assert (nr_files <= 4);
-    create_mbr_partition_table (primary);
-  }
-  else /* parttype == PARTTYPE_GPT */ {
+  if (parttype == PARTTYPE_MBR)
+    create_mbr_layout ();
+  else /* parttype == PARTTYPE_GPT */
     create_gpt_layout ();
-  }
 
   return 0;
 }
-- 
2.20.1

Re: [Qemu-devel] [PATCH v4 00/21] nbd: add qemu-nbd --list
Posted by Richard W.M. Jones 5 years, 3 months ago
On Sat, Jan 19, 2019 at 11:39:41AM +0000, Richard W.M. Jones wrote:
> 
> Attached is a NON-working patch to nbdkit-partitioning-plugin which
> adds logical partition support.  I don't think I've fully understood
> how the EBR fields are supposed to be initialized (or else they don't
> work how is described in online documentation).  This actually causes
> parted to print an internal error, while fdisk gives a warning and the
> size of the logical partition is wrong.
> 
> Anyway I've run out of time to work on it this weekend, but I may be
> able to have another look next week.

Sheesh.  Apparently the type byte in the link field must
be the extended partition type byte (eg. 0x5 or 0xf), NOT
the type of the next partition.

This fixes it:

$ git diff
diff --git a/plugins/partitioning/partition-mbr.c b/plugins/partitioning/partition-mbr.c
index 6f76432..6b256d1 100644
--- a/plugins/partitioning/partition-mbr.c
+++ b/plugins/partitioning/partition-mbr.c
@@ -129,7 +129,7 @@ create_mbr_layout (void)
          */
         region.start = enext->start - eptr0->start;
         region.len = rnext->end - enext->start + 1;
-        create_mbr_partition_table_entry (&region, false, files[i+1].mbr_id,
+        create_mbr_partition_table_entry (&region, false, 0xf,
                                           &ebr[i-3][0x1ce]);
       }
     }


I'll try to come up with a better patch with tests etc.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

Re: [Qemu-devel] [PATCH v4 00/21] nbd: add qemu-nbd --list
Posted by Eric Blake 5 years, 3 months ago
On 1/17/19 1:36 PM, Eric Blake wrote:
> I got tired of debugging whether a server was advertising the
> correct things during negotiation by inspecting the trace
> logs of qemu-io as client - not to mention that without SOME
> sort of client tracing particular commands, we can't easily
> regression test the server for correct behavior.  The final
> straw was at KVM Forum, when Nir asked me to make sure there
> was a way to easily determine if an NBD server is exposing what
> we really want (and fixing x-dirty-bitmap to behave saner fell
> out as a result of answering that question).
> 
> I note that upstream NBD has 'nbd-client -l $host' for querying
> just export names (with no quoting, so you have to know that
> a blank line means the default export), but it wasn't powerful
> enough, so I implemented 'qemu-nbd -L' to document everything.
> Upstream NBD has separate 'nbd-client' and 'nbd-server' binaries,
> while we only have 'qemu-nbd' (which is normally just a server,
> but 'qemu-nbd -c' also operates a second thread as a client).
> Our other uses of qemu as NBD client are for consuming a block
> device (as in qemu-io, qemu-img, or a drive to qemu) - but those
> binaries are less suited to something so specific to the NBD
> protocol.
> 
> Bonus: As a result of my work on this series, nbdkit now supports
> NBD_OPT_INFO (my interoperability testing between server
> implementations has been paying off, both at fixing server bugs,
> and at making this code more reliable across difference in valid
> servers).
> 
> Also available at:
> https://repo.or.cz/qemu/ericb.git qemu-nbd-list-v4

Vladimir spotted a few more tweaks to make, but they all look
sufficiently minor that I'll go ahead and queue this on my NBD tree for
a pull request on Monday, rather than posting a v5, if there aren't any
other major comments in the meantime.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org