Linux block devices, even in O_DIRECT mode, don't have any user-visible
limit on transfer size / number of segments, even though the underlying kernel
block device can have such limits. The kernel block layer takes care of enforcing
these limits by splitting the bios.

By limiting the transfer sizes, we force qemu to do the splitting itself, which
introduces various overheads.
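For illustration only, a minimal C sketch of what splitting above the kernel amounts to: a request of len bytes that is capped at max_transfer becomes ceil(len / max_transfer) separate submissions, each paying its own per-request cost. The function names and the submit() callback are hypothetical; this is not the actual QEMU code.

```c
#include <stdint.h>

/* Number of I/O submissions needed when a request of len bytes must be
 * split into pieces of at most max_transfer bytes (max_transfer > 0).
 * A partial last piece still costs a whole submission, hence the
 * round-up; written this way to avoid overflow for very large len. */
uint64_t split_count(uint64_t len, uint64_t max_transfer)
{
    return len / max_transfer + (len % max_transfer != 0);
}

/* Walk [offset, offset + len) in pieces of at most max_transfer bytes.
 * submit() is a hypothetical callback standing in for whatever actually
 * issues each piece; it is called once per piece, in order. */
void for_each_piece(uint64_t offset, uint64_t len, uint64_t max_transfer,
                    void (*submit)(uint64_t off, uint64_t n, void *opaque),
                    void *opaque)
{
    while (len > 0) {
        uint64_t n = len < max_transfer ? len : max_transfer;

        submit(offset, n, opaque);
        offset += n;
        len -= n;
    }
}

/* Example callback: counts the pieces into *(uint64_t *)opaque. */
void count_piece(uint64_t off, uint64_t n, void *opaque)
{
    (void)off;
    (void)n;
    ++*(uint64_t *)opaque;
}
```

With a 128k limit, a single 1M request turns into 8 submissions instead of 1.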
This is especially visible in the nbd server, where the low max transfer size of
the underlying device forces us to advertise that limit over NBD, thus increasing
the traffic overhead for image conversion, which benefits from large blocks.
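A rough, back-of-the-envelope sketch of the NBD side, assuming the client never sends a request larger than the maximum block size the server advertises, and that simple (non-structured) replies are used. The 32M cap used for comparison is an arbitrary illustrative value, and the header-size constant names are made up for this sketch.

```c
#include <stdint.h>

/* Fixed NBD header sizes, assuming simple (non-structured) replies:
 * request = magic(4) + flags(2) + type(2) + handle(8) + offset(8) + length(4)
 * simple reply = magic(4) + error(4) + handle(8)
 * The macro names are illustrative, not taken from any implementation. */
#define NBD_REQUEST_HDR      28u
#define NBD_SIMPLE_REPLY_HDR 16u

/* Requests needed to transfer size bytes when each request may carry at
 * most max_block bytes (max_block > 0). */
uint64_t nbd_requests(uint64_t size, uint64_t max_block)
{
    return size / max_block + (size % max_block != 0);
}

/* Header bytes exchanged for those requests (payload not counted). */
uint64_t nbd_header_bytes(uint64_t size, uint64_t max_block)
{
    return nbd_requests(size, max_block) *
           (NBD_REQUEST_HDR + NBD_SIMPLE_REPLY_HDR);
}
```

For a 100G image this gives 819200 requests (about 34 MiB of headers alone) at a 128k limit, versus 3200 requests at 32M, i.e. 256 times fewer round trips.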
More information can be found here:
https://bugzilla.redhat.com/show_bug.cgi?id=1647104
Tested this with qemu-img convert, both over nbd and natively, and to my surprise
even native IO performance improved a bit.
(The device on which it was tested is an Intel Optane DC P4800X,
which has a 128k max transfer size reported by the kernel.)
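One way to look at the limits the kernel reports for a disk is its sysfs queue directory (for example max_sectors_kb, max_hw_sectors_kb and max_segments under /sys/block/nvme0n1/queue/). A partition such as nvme0n1p3 has no queue directory of its own, so the parent disk's values apply. The helpers below are hypothetical, a minimal sketch rather than how QEMU itself queries these limits.

```c
#include <stdio.h>

/* Build the sysfs path of a queue attribute for a whole-disk device,
 * e.g. ("nvme0n1", "max_sectors_kb") -> /sys/block/nvme0n1/queue/max_sectors_kb.
 * Returns 0 on success, -1 if the buffer is too small. */
int queue_attr_path(char *buf, size_t len, const char *disk, const char *attr)
{
    int n = snprintf(buf, len, "/sys/block/%s/queue/%s", disk, attr);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read a single unsigned decimal value from a sysfs-style file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
int read_queue_limit(const char *path, unsigned long *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL) {
        return -1;
    }
    ok = fscanf(f, "%lu", value) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

Usage would be queue_attr_path(buf, sizeof buf, "nvme0n1", "max_sectors_kb") followed by read_queue_limit(buf, &kb).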
The benchmark:

Images were created using:

  Sparse image:    qemu-img create -f qcow2 /dev/nvme0n1p3 1G / 10G / 100G
  Allocated image: qemu-img create -f qcow2 /dev/nvme0n1p3 -o preallocation=metadata 1G / 10G / 100G

The test was ($FILE is the qcow2 image created above):

  echo "convert native:"
  rm -rf /dev/shm/disk.img
  time qemu-img convert -p -f qcow2 -O raw -T none $FILE /dev/shm/disk.img > /dev/zero

  echo "convert via nbd:"
  qemu-nbd -k /tmp/nbd.sock -v -f qcow2 $FILE -x export --cache=none --aio=native --fork
  rm -rf /dev/shm/disk.img
  time qemu-img convert -p -f raw -O raw nbd:unix:/tmp/nbd.sock:exportname=export /dev/shm/disk.img > /dev/zero
The results:

  image            native before  native after  nbd before  nbd after
  1G sparse        0.027s         0.027s        0.287s      0.035s
  100G sparse      0.028s         0.028s        23.796s     0.109s
  1G preallocated  0.454s         0.427s        0.649s      0.546s
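To put the numbers above in perspective, a small C sketch that records the timings and computes the speedup ratio (before divided by after). The data are copied from the results; the struct and function names are illustrative only.

```c
/* Benchmark timings from the results above, in seconds. */
struct result {
    const char *image;
    double native_before, native_after;
    double nbd_before, nbd_after;
};

static const struct result results[] = {
    { "1G sparse",       0.027, 0.027,  0.287, 0.035 },
    { "100G sparse",     0.028, 0.028, 23.796, 0.109 },
    { "1G preallocated", 0.454, 0.427,  0.649, 0.546 },
};

/* Speedup factor: how many times faster "after" is than "before". */
double speedup(double before, double after)
{
    return before / after;
}
```

This works out to roughly 8.2x (1G sparse) and 218x (100G sparse) over NBD, about 1.19x for the preallocated image over NBD, and about 1.06x natively for the preallocated image.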
The block limits of max transfer size / max segment count are retained
for SCSI passthrough, because in that case the kernel passes the userspace
request directly to the kernel SCSI driver, bypassing the block layer, so there
is no code to split such requests.
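A simplified sketch of the rule described above: the device limits are passed up only when the device is used for SCSI generic (SG) passthrough, and are otherwise left unlimited so the kernel can split bios itself. The struct and function are hypothetical, modelled on the idea rather than on the real block limits code in block/file-posix.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the limits a block driver reports;
 * 0 means "no limit". Not QEMU's real block limits structure. */
struct io_limits {
    uint64_t max_transfer; /* bytes per request */
    unsigned max_iov;      /* scatter-gather segments per request */
};

/* Propagate the kernel device limits only when requests bypass the
 * kernel block layer (SCSI generic passthrough); otherwise leave them
 * unlimited and let the kernel split bios as needed. */
void apply_device_limits(struct io_limits *out, bool is_sg,
                         uint64_t dev_max_transfer, unsigned dev_max_segments)
{
    out->max_transfer = 0;
    out->max_iov = 0;
    if (is_sg) {
        out->max_transfer = dev_max_transfer;
        out->max_iov = dev_max_segments;
    }
}
```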
Fam, since you were the original author of the code that added
these limits, could you share your opinion on this?
Was there a reason for them besides SCSI passthrough?
V2:
* Manually tested that SCSI passthrough still works, using a nested VM
* As Eric suggested, refactored the code around the fstat() call
* Spelling/grammar fixes
Best regards,
Maxim Levitsky
Maxim Levitsky (1):
  raw-posix.c - use max transfer length / max segment count only for
    SCSI passthrough
block/file-posix.c | 54 ++++++++++++++++++++++++----------------------
1 file changed, 28 insertions(+), 26 deletions(-)
--
2.17.2
On 04.07.2019 at 14:43, Maxim Levitsky wrote:
> Linux block devices, even in O_DIRECT mode don't have any user visible
> limit on transfer size / number of segments, which underlying kernel block device can have.
> [...]

Thanks, applied to the block branch.

Kevin
> Linux block devices, even in O_DIRECT mode don't have any user visible
> limit on transfer size / number of segments, which underlying kernel block
> device can have.
> [...]

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>

> Linux block devices, even in O_DIRECT mode don't have any user visible
> [...]

I am not familiar with the SCSI passthrough special case, but overall it looks
good to me. Feel free to add:
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
On Thu, Jul 04, 2019 at 03:43:41PM +0300, Maxim Levitsky wrote:
> Linux block devices, even in O_DIRECT mode don't have any user visible
> limit on transfer size / number of segments, which underlying kernel block device can have.
> [...]

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>