block/io: avoid failure caused by misaligned BLKZEROOUT ioctl

[RFC v2 0/6] block/io: avoid failure caused by misaligned BLKZEROOUT ioctl

Posted by Fiona Ebner 1 month ago

Previous discussion here:
https://lore.kernel.org/qemu-devel/20260105143416.737482-1-f.ebner@proxmox.com/

Commit 5634622bcb ("file-posix: allow BLKZEROOUT with -t writeback")
enables the BLKZEROOUT ioctl when using 'writeback' cache, regressing
certain 'qemu-img convert' invocations, because of a pre-existing
issue. Namely, the BLKZEROOUT ioctl might fail with errno EINVAL when
the request is shorter than the block size of the block device.
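
To make the failure condition concrete, here is a minimal sketch of the
kind of check that decides whether a request can safely be handed to
BLKZEROOUT. The helper name is hypothetical and not part of QEMU or this
series, and 'block_size' is assumed to be the logical block size of the
device (for example as reported by the BLKSSZGET ioctl):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of QEMU: returns true if a zero-out
 * request of 'bytes' at 'offset' is non-empty and both its start and
 * its length are multiples of 'block_size'.
 *
 * Assumption: 'block_size' is a non-zero power of two.
 */
static inline bool zeroout_request_is_aligned(uint64_t offset,
                                              uint64_t bytes,
                                              uint64_t block_size)
{
    uint64_t mask = block_size - 1;

    return bytes != 0 && (offset & mask) == 0 && (bytes & mask) == 0;
}
```

With a 4096-byte block size, a 512-byte request at offset 0 is rejected
by this check, which corresponds to the "shorter than the block size"
case described above.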

Stefan suggested prioritizing bl.pwrite_zeroes_alignment in
bdrv_co_do_zero_pwritev(). This RFC explores that approach and the
issues with qcow2 I encountered, where
bl.pwrite_zeroes_alignment = s->subcluster_size;
I would be happy to discuss potential solutions and whether we should
use this approach after all.

For example, in iotests 154 and 271, there are assertion failures,
because the padded request extends beyond the end of the image:
Assertion `offset + bytes <= bs->total_sectors * BDRV_SECTOR_SIZE ||
child->perm & BLK_PERM_RESIZE' failed.
The total image length is not necessarily aligned to the cluster size.
This could be solved by shortening the relevant requests in
bdrv_co_do_zero_pwritev() and submitting them without the
BDRV_REQ_ZERO_WRITE flag and with bl.request_alignment as the
alignment, see patch 5/6.
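
The arithmetic behind the assertion can be sketched as follows. This is
only an illustration with hypothetical names (PaddedRange,
pad_and_clamp), not the code from patch 5/6. It assumes the unpadded
request already lies within the image and that 'align' is non-zero:

```c
#include <stdint.h>

/* Hypothetical description of a padded zero-write range. */
typedef struct {
    uint64_t start;  /* request start, aligned down to 'align' */
    uint64_t end;    /* request end, aligned up, then clamped to image_len */
    int clamped;     /* non-zero if aligning up crossed the image end */
} PaddedRange;

/*
 * Pad [offset, offset + bytes) to 'align' and clamp the result to the
 * image length. Assumes offset + bytes <= image_len and align != 0.
 */
static inline PaddedRange pad_and_clamp(uint64_t offset, uint64_t bytes,
                                        uint64_t align, uint64_t image_len)
{
    PaddedRange r;
    uint64_t end = offset + bytes;

    r.start = offset / align * align;
    r.end = (end + align - 1) / align * align;
    r.clamped = r.end > image_len;
    if (r.clamped) {
        /* The padded tail would extend beyond the end of the image. */
        r.end = image_len;
    }
    return r;
}
```

For an image of 1 MiB + 512 bytes and a 64 KiB alignment, a 512-byte
request at offset 1 MiB pads up to 1 MiB + 64 KiB, which is past the
end of the image, so the tail has to be shortened to 1 MiB + 512.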

For iotest 179, I would need to avoid clearing BDRV_REQ_ZERO_WRITE for
the head and tail parts as long as the buffer is fully zero.
Otherwise, we end up with more 'data' sectors in the target map. See
patch 6/6. With or without that, iotests 154 and 271 produce
different output (I think it might be expected, but haven't checked in
detail yet).

Another issue is exposed by iotest 177, where the (sub-)cluster size
is 1MiB, but max-transfer is only 64KiB, leading to assertion failures,
because max_transfer =
QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_transfer, INT_MAX), align);
evaluates to 0 (because align > bs->bl.max_transfer). This could be
fixed by doing the QEMU_ALIGN_DOWN only if the value is bigger than
align, see patch 4/6.
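
The numbers from iotest 177 can be reproduced with a small standalone
sketch. The EX_ALIGN_DOWN macro and ex_min_non_zero() below are local
stand-ins written for this example, assumed to behave like
QEMU_ALIGN_DOWN and MIN_NON_ZERO for these inputs. The guarded variant
only illustrates the idea and is not the code from patch 4/6:

```c
#include <limits.h>
#include <stdint.h>

/* Local stand-in: round n down to a multiple of m (m must be non-zero). */
#define EX_ALIGN_DOWN(n, m) ((n) / (m) * (m))

/* Local stand-in: minimum of a and b, treating 0 as "no limit". */
static inline uint64_t ex_min_non_zero(uint64_t a, uint64_t b)
{
    if (a == 0) {
        return b;
    }
    if (b == 0) {
        return a;
    }
    return a < b ? a : b;
}

/* Existing calculation: yields 0 when align > bl_max_transfer. */
static inline uint64_t max_transfer_unguarded(uint64_t bl_max_transfer,
                                              uint64_t align)
{
    return EX_ALIGN_DOWN(ex_min_non_zero(bl_max_transfer, INT_MAX), align);
}

/* Guarded variant: only align down if the value is bigger than align. */
static inline uint64_t max_transfer_guarded(uint64_t bl_max_transfer,
                                            uint64_t align)
{
    uint64_t max_transfer = ex_min_non_zero(bl_max_transfer, INT_MAX);

    if (max_transfer > align) {
        max_transfer = EX_ALIGN_DOWN(max_transfer, align);
    }
    return max_transfer;
}
```

With bl_max_transfer = 64KiB and align = 1MiB, the unguarded version
returns 0, while the guarded one keeps 64KiB.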

I'm also not sure what to do about iotests 204 and 177, which use
'opt-write-zero=15M' for the blkdebug driver (which assigns that value
to pwrite_zeroes_alignment), making an is_power_of_2(align) assertion
fail.
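
For reference, a minimal sketch of why 15M trips such an assertion.
ex_is_power_of_2() is a local stand-in written for this example and is
only assumed to match the semantics of QEMU's is_power_of_2():

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Local stand-in: true if exactly one bit is set in 'value'.
 * Clearing the lowest set bit via value & (value - 1) leaves 0
 * only for powers of two.
 */
static inline bool ex_is_power_of_2(uint64_t value)
{
    return value != 0 && (value & (value - 1)) == 0;
}
```

15M is 15 * 1024 * 1024 and has four bits set, so the check returns
false, whereas 16M would pass.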

Yet another issue is the 'detect_zeroes' option. If the option is set,
bdrv_aligned_pwritev() might set the BDRV_REQ_ZERO_WRITE flag even if
the request is not aligned to pwrite_zeroes_alignment, and the original
bug could resurface.

Best Regards,
Fiona


Fiona Ebner (6):
  block/io: pass alignment to bdrv_init_padding()
  block/io: add 'bytes' parameter to bdrv_padding_rmw_read()
  block/io: honor pwrite_zeroes_alignment in bdrv_co_do_zero_pwritev()
  block/io: safeguard max transfer calculation in bdrv_aligned_pwritev()
  block/io: handle image length not aligned to write zeroes alignment in
    bdrv_co_do_zero_pwritev()
  block/io: keep zero flag for head/tail parts of misaligned zero write
    when possible

 block/io.c | 78 ++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 55 insertions(+), 23 deletions(-)

-- 
2.47.3

Re: [RFC v2 0/6] block/io: avoid failure caused by misaligned BLKZEROOUT ioctl

Posted by Stefan Hajnoczi 5 days, 21 hours ago

On Fri, Jan 09, 2026 at 01:08:27PM +0100, Fiona Ebner wrote:
> Previous discussion here:
> https://lore.kernel.org/qemu-devel/20260105143416.737482-1-f.ebner@proxmox.com/
> 
> Commit 5634622bcb ("file-posix: allow BLKZEROOUT with -t writeback")
> enables the BLKZEROOUT ioctl when using 'writeback' cache, regressing
> certain 'qemu-img convert' invocations, because of a pre-existing
> issue. Namely, the BLKZEROOUT ioctl might fail with errno EINVAL when
> the request is shorter than the block size of the block device.
> 
> Stefan suggested prioritizing bl.pwrite_zeroes_alignment in
> bdrv_co_do_zero_pwritev(). This RFC explores that approach and the
> issues with qcow2 I encountered, where
> bl.pwrite_zeroes_alignment = s->subcluster_size;
> I would be happy to discuss potential solutions and whether we should
> use this approach after all.

Hi Fiona,
I wanted to continue this discussion. My thoughts are that making
bdrv_co_do_zero_pwritev() use bl.pwrite_zeroes_alignment is the right
long-term solution to keep all the padding logic in one place.

On the other hand, your series shows it involves fixing a bunch of test
failures and that's not fun. The original bug that is being solved here
is my doing, so feel free to hand this over to me if you decide you
don't want to work on it.

Stefan

Re: [RFC v2 0/6] block/io: avoid failure caused by misaligned BLKZEROOUT ioctl

Posted by Fiona Ebner 3 days, 7 hours ago

Hi Stefan,

On 02.02.26 at 11:15 PM, Stefan Hajnoczi wrote:
> On Fri, Jan 09, 2026 at 01:08:27PM +0100, Fiona Ebner wrote:
>> Previous discussion here:
>> https://lore.kernel.org/qemu-devel/20260105143416.737482-1-f.ebner@proxmox.com/
>>
>> Commit 5634622bcb ("file-posix: allow BLKZEROOUT with -t writeback")
>> enables the BLKZEROOUT ioctl when using 'writeback' cache, regressing
>> certain 'qemu-img convert' invocations, because of a pre-existing
>> issue. Namely, the BLKZEROOUT ioctl might fail with errno EINVAL when
>> the request is shorter than the block size of the block device.
>>
>> Stefan suggested prioritizing bl.pwrite_zeroes_alignment in
>> bdrv_co_do_zero_pwritev(). This RFC explores that approach and the
>> issues with qcow2 I encountered, where
>> bl.pwrite_zeroes_alignment = s->subcluster_size;
>> I would be happy to discuss potential solutions and whether we should
>> use this approach after all.
> 
> Hi Fiona,
> I wanted to continue this discussion. My thoughts are that making
> bdrv_co_do_zero_pwritev() use bl.pwrite_zeroes_alignment is the right
> long-term solution to keep all the padding logic in one place.
> 
> On the other hand, your series shows it involves fixing a bunch of test
> failures and that's not fun. The original bug that is being solved here
> is my doing, so feel free to hand this over to me if you decide you
> don't want to work on it.

In your other mail, you mentioned you'll ask Kevin for his opinion. So
in part, I was waiting for that. But I was also sidetracked by other
things, and it will be 1-2 more weeks until I can really focus on this
again. If that is too long, please go ahead and pick it up.

Best Regards,
Fiona

Re: [RFC v2 0/6] block/io: avoid failure caused by misaligned BLKZEROOUT ioctl

Posted by Kevin Wolf 3 days, 3 hours ago

On 05.02.2026 at 13:13, Fiona Ebner wrote:
> Hi Stefan,
> 
> On 02.02.26 at 11:15 PM, Stefan Hajnoczi wrote:
> > On Fri, Jan 09, 2026 at 01:08:27PM +0100, Fiona Ebner wrote:
> >> Previous discussion here:
> >> https://lore.kernel.org/qemu-devel/20260105143416.737482-1-f.ebner@proxmox.com/
> >>
> >> Commit 5634622bcb ("file-posix: allow BLKZEROOUT with -t writeback")
> >> enables the BLKZEROOUT ioctl when using 'writeback' cache, regressing
> >> certain 'qemu-img convert' invocations, because of a pre-existing
> >> issue. Namely, the BLKZEROOUT ioctl might fail with errno EINVAL when
> >> the request is shorter than the block size of the block device.
> >>
> >> Stefan suggested prioritizing bl.pwrite_zeroes_alignment in
> >> bdrv_co_do_zero_pwritev(). This RFC explores that approach and the
> >> issues with qcow2 I encountered, where
> >> bl.pwrite_zeroes_alignment = s->subcluster_size;
> >> I would be happy to discuss potential solutions and whether we should
> >> use this approach after all.
> > 
> > Hi Fiona,
> > I wanted to continue this discussion. My thoughts are that making
> > bdrv_co_do_zero_pwritev() use bl.pwrite_zeroes_alignment is the right
> > long-term solution to keep all the padding logic in one place.
> > 
> > On the other hand, your series shows it involves fixing a bunch of test
> > failures and that's not fun. The original bug that is being solved here
> > is my doing, so feel free to hand this over to me if you decide you
> > don't want to work on it.
> 
> in your other mail, you mentioned you'll ask Kevin for his opinion. So
> in part, I was waiting for that. But I also was side-tracked by other
> things, and it will be 1-2 more weeks until I can really focus on this
> again. If that is too long, please go ahead and pick it up.

I didn't review this thoroughly yet, but I agree that considering the
alignment from the start is the better solution and also more consistent
with what we're already doing for normal reads and writes.

We just need to make sure that we use the right alignments in the right
places, which can be a bit confusing with the fallbacks to buffered zero
writes here and there.

I assume that there is enough time left to do this before the 11.0
release and there is no need to take something like v1 as an
intermediate solution?

Kevin

Re: [RFC v2 0/6] block/io: avoid failure caused by misaligned BLKZEROOUT ioctl

Posted by Stefan Hajnoczi 3 days, 4 hours ago

On Thu, Feb 05, 2026 at 01:13:57PM +0100, Fiona Ebner wrote:
> Hi Stefan,
> 
> On 02.02.26 at 11:15 PM, Stefan Hajnoczi wrote:
> > On Fri, Jan 09, 2026 at 01:08:27PM +0100, Fiona Ebner wrote:
> >> Previous discussion here:
> >> https://lore.kernel.org/qemu-devel/20260105143416.737482-1-f.ebner@proxmox.com/
> >>
> >> Commit 5634622bcb ("file-posix: allow BLKZEROOUT with -t writeback")
> >> enables the BLKZEROOUT ioctl when using 'writeback' cache, regressing
> >> certain 'qemu-img convert' invocations, because of a pre-existing
> >> issue. Namely, the BLKZEROOUT ioctl might fail with errno EINVAL when
> >> the request is shorter than the block size of the block device.
> >>
> >> Stefan suggested prioritizing bl.pwrite_zeroes_alignment in
> >> bdrv_co_do_zero_pwritev(). This RFC explores that approach and the
> >> issues with qcow2 I encountered, where
> >> bl.pwrite_zeroes_alignment = s->subcluster_size;
> >> I would be happy to discuss potential solutions and whether we should
> >> use this approach after all.
> > 
> > Hi Fiona,
> > I wanted to continue this discussion. My thoughts are that making
> > bdrv_co_do_zero_pwritev() use bl.pwrite_zeroes_alignment is the right
> > long-term solution to keep all the padding logic in one place.
> > 
> > On the other hand, your series shows it involves fixing a bunch of test
> > failures and that's not fun. The original bug that is being solved here
> > is my doing, so feel free to hand this over to me if you decide you
> > don't want to work on it.
> 
> in your other mail, you mentioned you'll ask Kevin for his opinion. So
> in part, I was waiting for that. But I also was side-tracked by other
> things, and it will be 1-2 more weeks until I can really focus on this
> again. If that is too long, please go ahead and pick it up.

I have pinged him now.

My timeframe is similar. I can look into this as a background task and
if I make progress I'll share it with you.

Stefan

Re: [RFC v2 0/6] block/io: avoid failure caused by misaligned BLKZEROOUT ioctl

Posted by Stefan Hajnoczi 2 weeks, 6 days ago

On Fri, Jan 09, 2026 at 01:08:27PM +0100, Fiona Ebner wrote:
> Previous discussion here:
> https://lore.kernel.org/qemu-devel/20260105143416.737482-1-f.ebner@proxmox.com/
> 
> Commit 5634622bcb ("file-posix: allow BLKZEROOUT with -t writeback")
> enables the BLKZEROOUT ioctl when using 'writeback' cache, regressing
> certain 'qemu-img convert' invocations, because of a pre-existing
> issue. Namely, the BLKZEROOUT ioctl might fail with errno EINVAL when
> the request is shorter than the block size of the block device.
> 
> Stefan suggested prioritizing bl.pwrite_zeroes_alignment in
> bdrv_co_do_zero_pwritev(). This RFC explores that approach and the
> issues with qcow2 I encountered, where
> bl.pwrite_zeroes_alignment = s->subcluster_size;
> I would be happy to discuss potential solutions and whether we should
> use this approach after all.

These issues are a headache, but I think it's important for us to
consider them. They indicate that QEMU does not properly distinguish
between read/write and pwrite_zeroes constraints.

If we can agree on how the block layer should handle pwrite_zeroes
constraints in a consistent way that makes the tests pass, then that
should serve the QEMU block layer well in the future.

I will mention this patch series to Kevin as well so we can get his
opinion.

> 
> For example, in iotests 154 and 271, there are assertion failures,
> because the padded request extends beyond the end of the image:
> Assertion `offset + bytes <= bs->total_sectors * BDRV_SECTOR_SIZE ||
> child->perm & BLK_PERM_RESIZE' failed.
> The total image length is not necessarily aligned to the cluster size.
> This could be solved by shortening the relevant requests in
> bdrv_co_do_zero_pwritev() and submitting them without the
> BDRV_REQ_ZERO_WRITE flag and with bl.request_alignment as the
> alignment, see patch 5/6.
> 
> For iotest 179, I would need to avoid clearing BDRV_REQ_ZERO_WRITE for
> the head and tail parts as long as the buffer is fully zero.
> Otherwise, we end up with more 'data' sectors in the target map. See
> patch 6/6. With or without that, iotests 154 and 271 produce
> different output (I think it might be expected, but haven't checked in
> detail yet).
> 
> Another issue is exposed by iotest 177, where the (sub-)cluster size
> is 1MiB, but max-transfer is only 64KiB, leading to assertion failures,
> because max_transfer =
> QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_transfer, INT_MAX), align);
> evaluates to 0 (because align > bs->bl.max_transfer). This could be
> fixed by doing the QEMU_ALIGN_DOWN only if the value is bigger than
> align, see patch 4/6.
> 
> I'm also not sure what to do about iotests 204 and 177, which use
> 'opt-write-zero=15M' for the blkdebug driver (which assigns that value
> to pwrite_zeroes_alignment), making an is_power_of_2(align) assertion
> fail.
> 
> Yet another issue is the 'detect_zeroes' option. If the option is set,
> bdrv_aligned_pwritev() might set the BDRV_REQ_ZERO_WRITE flag even if
> the request is not aligned to pwrite_zeroes_alignment and the original
> bug could resurface.
> 
> Best Regards,
> Fiona
> 
> 
> Fiona Ebner (6):
>   block/io: pass alignment to bdrv_init_padding()
>   block/io: add 'bytes' parameter to bdrv_padding_rmw_read()
>   block/io: honor pwrite_zeroes_alignment in bdrv_co_do_zero_pwritev()
>   block/io: safeguard max transfer calculation in bdrv_aligned_pwritev()
>   block/io: handle image length not aligned to write zeroes alignment in
>     bdrv_co_do_zero_pwritev()
>   block/io: keep zero flag for head/tail parts of misaligned zero write
>     when possible
> 
>  block/io.c | 78 ++++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 55 insertions(+), 23 deletions(-)
> 
> -- 
> 2.47.3
> 
>