The BDRV_REQ_ZERO_WRITE is currently implemented in a way that first the
image is possibly preallocated and then the zero flag is added to all
clusters. This means that a copy-on-write operation may be needed when
writing to these clusters, despite having used preallocation, negating
one of the major benefits of preallocation.
Instead, try to forward the BDRV_REQ_ZERO_WRITE to the protocol driver,
and if the protocol driver can ensure that the new area reads as zeros,
we can skip setting the zero flag in the qcow2 layer.
Unfortunately, the same approach doesn't work for metadata
preallocation, so we'll still set the zero flag there.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block/qcow2.c | 22 +++++++++++++++++++---
tests/qemu-iotests/274.out | 4 ++--
2 files changed, 21 insertions(+), 5 deletions(-)
diff --git a/block/qcow2.c b/block/qcow2.c
index ad621fe404..b28e588942 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4170,9 +4170,25 @@ static int coroutine_fn qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
/* Allocate the data area */
new_file_size = allocation_start +
nb_new_data_clusters * s->cluster_size;
- /* Image file grows, so @exact does not matter */
- ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc, 0,
- errp);
+ /*
+ * Image file grows, so @exact does not matter.
+ *
+ * If we need to zero out the new area, try first whether the protocol
+ * driver can already take care of this.
+ */
+ if (flags & BDRV_REQ_ZERO_WRITE) {
+ ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc,
+ BDRV_REQ_ZERO_WRITE, errp);
+ if (ret >= 0) {
+ flags &= ~BDRV_REQ_ZERO_WRITE;
+ }
+ } else {
+ ret = -1;
+ }
+ if (ret < 0) {
+ ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc, 0,
+ errp);
+ }
if (ret < 0) {
error_prepend(errp, "Failed to resize underlying file: ");
qcow2_free_clusters(bs, allocation_start,
diff --git a/tests/qemu-iotests/274.out b/tests/qemu-iotests/274.out
index 179bd7ccaf..c355b52689 100644
--- a/tests/qemu-iotests/274.out
+++ b/tests/qemu-iotests/274.out
@@ -179,7 +179,7 @@ read 65536/65536 bytes at offset 9437184
10 MiB (0xa00000) bytes allocated at offset 5 MiB (0x500000)
[{ "start": 0, "length": 5242880, "depth": 1, "zero": true, "data": false},
-{ "start": 5242880, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 327680}]
+{ "start": 5242880, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 327680}]
=== preallocation=full ===
Formatting 'TEST_DIR/PID-base', fmt=qcow2 size=16777216 cluster_size=65536 lazy_refcounts=off refcount_bits=16
@@ -198,7 +198,7 @@ read 65536/65536 bytes at offset 11534336
4 MiB (0x400000) bytes allocated at offset 8 MiB (0x800000)
[{ "start": 0, "length": 8388608, "depth": 1, "zero": true, "data": false},
-{ "start": 8388608, "length": 4194304, "depth": 0, "zero": true, "data": false, "offset": 327680}]
+{ "start": 8388608, "length": 4194304, "depth": 0, "zero": false, "data": true, "offset": 327680}]
=== preallocation=off ===
Formatting 'TEST_DIR/PID-base', fmt=qcow2 size=393216 cluster_size=65536 lazy_refcounts=off refcount_bits=16
--
2.25.3
On 23.04.20 17:01, Kevin Wolf wrote: > The BDRV_REQ_ZERO_WRITE is currently implemented in a way that first the > image is possibly preallocated and then the zero flag is added to all > clusters. This means that a copy-on-write operation may be needed when > writing to these clusters, despite having used preallocation, negating > one of the major benefits of preallocation. > > Instead, try to forward the BDRV_REQ_ZERO_WRITE to the protocol driver, > and if the protocol driver can ensure that the new area reads as zeros, > we can skip setting the zero flag in the qcow2 layer. > > Unfortunately, the same approach doesn't work for metadata > preallocation, so we'll still set the zero flag there. > > Signed-off-by: Kevin Wolf <kwolf@redhat.com> > --- > block/qcow2.c | 22 +++++++++++++++++++--- > tests/qemu-iotests/274.out | 4 ++-- > 2 files changed, 21 insertions(+), 5 deletions(-) Oh, nice. I didn’t think you (or anyone else for that matter) would actually do this. :) With the errp thing fixed: Reviewed-by: Max Reitz <mreitz@redhat.com>
On 4/23/20 10:01 AM, Kevin Wolf wrote:
> The BDRV_REQ_ZERO_WRITE is currently implemented in a way that first the
> image is possibly preallocated and then the zero flag is added to all
> clusters. This means that a copy-on-write operation may be needed when
> writing to these clusters, despite having used preallocation, negating
> one of the major benefits of preallocation.
>
> Instead, try to forward the BDRV_REQ_ZERO_WRITE to the protocol driver,
> and if the protocol driver can ensure that the new area reads as zeros,
> we can skip setting the zero flag in the qcow2 layer.
Hmm. When we get block status, it is very easy to tell that something
reads as zero when the qcow2 file has the zero bit set, but when the
qcow2 file does not have the zero bit set, we have to then query the
format layer whether it reads as zeros (which is expensive enough for
some format layers that we no longer report things as reading as zero).
I'm worried that optimizing this case could penalize later block status.
We already can tell the difference between a cluster that has the zero
bit set but is not preallocated, vs. has the zero bit set and is
preallocated. Are we really forcing a copy-on-write to a cluster that
is marked zero but preallocated? Is the problem that we don't have a
way to distinguish between 'reads as zeroes, allocated, but we don't
know state of format layer' and 'reads as zeroes, allocated, and we know
format layer reads as zeroes'?
>
> Unfortunately, the same approach doesn't work for metadata
> preallocation, so we'll still set the zero flag there.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
> block/qcow2.c | 22 +++++++++++++++++++---
> tests/qemu-iotests/274.out | 4 ++--
> 2 files changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/block/qcow2.c b/block/qcow2.c
> index ad621fe404..b28e588942 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -4170,9 +4170,25 @@ static int coroutine_fn qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
> /* Allocate the data area */
> new_file_size = allocation_start +
> nb_new_data_clusters * s->cluster_size;
> - /* Image file grows, so @exact does not matter */
> - ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc, 0,
> - errp);
> + /*
> + * Image file grows, so @exact does not matter.
> + *
> + * If we need to zero out the new area, try first whether the protocol
> + * driver can already take care of this.
> + */
> + if (flags & BDRV_REQ_ZERO_WRITE) {
> + ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc,
> + BDRV_REQ_ZERO_WRITE, errp);
> + if (ret >= 0) {
> + flags &= ~BDRV_REQ_ZERO_WRITE;
> + }
Hmm - just noticing things: how does this series interplay with the
existing bdrv_has_zero_init_truncate? Should this series automatically
handle BDRV_REQ_ZERO_WRITE on a bdrv_co_truncate(PREALLOC_NONE) request
for all drivers that report true, even if that driver does not advertise
support for the BDRV_REQ_ZERO_WRITE flag?
> + } else {
> + ret = -1;
> + }
Here, ret == -1 does not imply whether errp is set - but annoyingly,
errp CAN be set if bdrv_co_truncate() failed.
> + if (ret < 0) {
> + ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc, 0,
> + errp);
And here, you are passing a possibly-set errp back to
bdrv_co_truncate(). That is a bug that can abort. You need to pass
NULL to the first bdrv_co_truncate() call or else clear errp prior to
trying a fallback to this second bdrv_co_truncate() call.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org
Am 23.04.2020 um 17:36 hat Eric Blake geschrieben:
> On 4/23/20 10:01 AM, Kevin Wolf wrote:
> > The BDRV_REQ_ZERO_WRITE is currently implemented in a way that first the
> > image is possibly preallocated and then the zero flag is added to all
> > clusters. This means that a copy-on-write operation may be needed when
> > writing to these clusters, despite having used preallocation, negating
> > one of the major benefits of preallocation.
> >
> > Instead, try to forward the BDRV_REQ_ZERO_WRITE to the protocol driver,
> > and if the protocol driver can ensure that the new area reads as zeros,
> > we can skip setting the zero flag in the qcow2 layer.
>
> Hmm. When we get block status, it is very easy to tell that something reads
> as zero when the qcow2 file has the zero bit set, but when the qcow2 file
> does not have the zero bit set, we have to then query the format layer
> whether it reads as zeros (which is expensive enough for some format layers
> that we no longer report things as reading as zero). I'm worried that
> optimizing this case could penalize later block status.
That's just how preallocation works. If you don't want that, you need
preallocation=off.
> We already can tell the difference between a cluster that has the zero bit
> set but is not preallocated, vs. has the zero bit set and is preallocated.
> Are we really forcing a copy-on-write to a cluster that is marked zero but
> preallocated? Is the problem that we don't have a way to distinguish
> between 'reads as zeroes, allocated, but we don't know state of format
> layer' and 'reads as zeroes, allocated, and we know format layer reads as
> zeroes'?
Basically, yes. If we had this, we could have a type of cluster where
writing to it still involves a metadata update (to clear the zero flag),
but never copy-on-write, even for partial writes.
I'm not sure if this would cover a very relevant case, though.
> >
> > Unfortunately, the same approach doesn't work for metadata
> > preallocation, so we'll still set the zero flag there.
> >
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> > block/qcow2.c | 22 +++++++++++++++++++---
> > tests/qemu-iotests/274.out | 4 ++--
> > 2 files changed, 21 insertions(+), 5 deletions(-)
> >
> > diff --git a/block/qcow2.c b/block/qcow2.c
> > index ad621fe404..b28e588942 100644
> > --- a/block/qcow2.c
> > +++ b/block/qcow2.c
> > @@ -4170,9 +4170,25 @@ static int coroutine_fn qcow2_co_truncate(BlockDriverState *bs, int64_t offset,
> > /* Allocate the data area */
> > new_file_size = allocation_start +
> > nb_new_data_clusters * s->cluster_size;
> > - /* Image file grows, so @exact does not matter */
> > - ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc, 0,
> > - errp);
> > + /*
> > + * Image file grows, so @exact does not matter.
> > + *
> > + * If we need to zero out the new area, try first whether the protocol
> > + * driver can already take care of this.
> > + */
> > + if (flags & BDRV_REQ_ZERO_WRITE) {
> > + ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc,
> > + BDRV_REQ_ZERO_WRITE, errp);
> > + if (ret >= 0) {
> > + flags &= ~BDRV_REQ_ZERO_WRITE;
> > + }
>
> Hmm - just noticing things: how does this series interplay with the existing
> bdrv_has_zero_init_truncate? Should this series automatically handle
> BDRV_REQ_ZERO_WRITE on a bdrv_co_truncate(PREALLOC_NONE) request for all
> drivers that report true, even if that driver does not advertise support for
> the BDRV_REQ_ZERO_WRITE flag?
Hmm... It feels risky to me.
> > + } else {
> > + ret = -1;
> > + }
>
> Here, ret == -1 does not imply whether errp is set - but annoyingly, errp
> CAN be set if bdrv_co_truncate() failed.
>
> > + if (ret < 0) {
> > + ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc, 0,
> > + errp);
>
> And here, you are passing a possibly-set errp back to bdrv_co_truncate().
> That is a bug that can abort. You need to pass NULL to the first
> bdrv_co_truncate() call or else clear errp prior to trying a fallback to
> this second bdrv_co_truncate() call.
Yes, you're right. If nothing else comes up, I'll fix this while
applying.
Kevin
On 4/23/20 11:04 AM, Kevin Wolf wrote:
>> Hmm. When we get block status, it is very easy to tell that something reads
>> as zero when the qcow2 file has the zero bit set, but when the qcow2 file
>> does not have the zero bit set, we have to then query the format layer
>> whether it reads as zeros (which is expensive enough for some format layers
>> that we no longer report things as reading as zero). I'm worried that
>> optimizing this case could penalize later block status.
>
> That's just how preallocation works. If you don't want that, you need
> preallocation=off.
Good point. And if I recall, didn't we already have a discussion (or
even patches) to optimize whether querying the format layer during block
status was even worth the effort, depending on heuristics of the size of
the format layer which in turn is based on whether there was
preallocation? So not a show-stopper.
>
>> We already can tell the difference between a cluster that has the zero bit
>> set but is not preallocated, vs. has the zero bit set and is preallocated.
>> Are we really forcing a copy-on-write to a cluster that is marked zero but
>> preallocated? Is the problem that we don't have a way to distinguish
>> between 'reads as zeroes, allocated, but we don't know state of format
>> layer' and 'reads as zeroes, allocated, and we know format layer reads as
>> zeroes'?
>
> Basically, yes. If we had this, we could have a type of cluster where
> writing to it still involves a metadata update (to clear the zero flag),
> but never copy-on-write, even for partial writes.
>
> I'm not sure if this would cover a very relevant case, though.
I also wonder if Berto's subcluster patches might play into this scenario.
>>
>> Hmm - just noticing things: how does this series interplay with the existing
>> bdrv_has_zero_init_truncate? Should this series automatically handle
>> BDRV_REQ_ZERO_WRITE on a bdrv_co_truncate(PREALLOC_NONE) request for all
>> drivers that report true, even if that driver does not advertise support for
>> the BDRV_REQ_ZERO_WRITE flag?
>
> Hmm... It feels risky to me.
Or worded differently, is bdrv_has_zero_init_truncate even still
necessary (when it is documented only to cover the PREALLOC_NONE case),
or should we get rid of it in favor of using BDRV_REQ_ZERO_WRITE
everywhere instead? (Which in turn involves visiting all drivers that
previously advertised bdrv_has_zero_init_truncate... but I already have
work in my tree trying to do that as part of preparing to add an
autoclear bit to qcow2 to make it faster to report when a qcow2 image is
known all-zero content...)
Looks like I'll be rebasing my work on top of this series.
>
>>> + } else {
>>> + ret = -1;
>>> + }
>>
>> Here, ret == -1 does not imply whether errp is set - but annoyingly, errp
>> CAN be set if bdrv_co_truncate() failed.
>>
>>> + if (ret < 0) {
>>> + ret = bdrv_co_truncate(bs->file, new_file_size, false, prealloc, 0,
>>> + errp);
>>
>> And here, you are passing a possibly-set errp back to bdrv_co_truncate().
>> That is a bug that can abort. You need to pass NULL to the first
>> bdrv_co_truncate() call or else clear errp prior to trying a fallback to
>> this second bdrv_co_truncate() call.
>
> Yes, you're right. If nothing else comes up, I'll fix this while
> applying.
>
> Kevin
>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org
© 2016 - 2026 Red Hat, Inc.