bdrv_co_block_status_above has several design problems with handling
short backing files:
1. With want_zeros=true, it may return ret with BDRV_BLOCK_ZERO but
without BDRV_BLOCK_ALLOCATED flag, when actually short backing file
which produces these after-EOF zeros is inside requested backing
sequence.
2. With want_zero=false, it may return pnum=0 prior to actual EOF,
because of EOF of short backing file.
Fix these things, making logic about short backing files clearer.
With fixed bdrv_block_status_above we also have to improve is_zero in
qcow2 code, otherwise iotest 154 will fail, because with this patch we
stop to merge zeros of different types (produced by fully unallocated
in the whole backing chain regions vs produced by short backing files).
Note also, that this patch leaves for another day the general problem
around block-status: misuse of BDRV_BLOCK_ALLOCATED as is-fs-allocated
vs go-to-backing.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
block/io.c | 39 ++++++++++++++++++++++++++++++---------
block/qcow2.c | 16 ++++++++++++++--
2 files changed, 44 insertions(+), 11 deletions(-)
diff --git a/block/io.c b/block/io.c
index 83ffc7d390..f2a89d9417 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2373,25 +2373,46 @@ bdrv_co_common_block_status_above(BlockDriverState *bs,
ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
file);
if (ret < 0) {
- break;
+ return ret;
}
- if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
+ if (*pnum == 0) {
+ if (first) {
+ return ret;
+ }
+
/*
- * Reading beyond the end of the file continues to read
- * zeroes, but we can only widen the result to the
- * unallocated length we learned from an earlier
- * iteration.
+ * The top layer deferred to this layer, and because this layer is
+ * short, any zeroes that we synthesize beyond EOF behave as if they
+ * were allocated at this layer
*/
+ assert(ret & BDRV_BLOCK_EOF);
*pnum = bytes;
+ if (file) {
+ *file = p;
+ }
+ return BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED;
}
- if (ret & (BDRV_BLOCK_ZERO | BDRV_BLOCK_DATA)) {
- break;
+ if (ret & BDRV_BLOCK_ALLOCATED) {
+ /* We've found the node and the status, we must return. */
+
+ if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
+ /*
+ * This level is also responsible for reads after EOF inside
+ * the unallocated region in the previous level.
+ */
+ *pnum = bytes;
+ }
+
+ return ret;
}
+
/* [offset, pnum] unallocated on this layer, which could be only
* the first part of [offset, bytes]. */
- bytes = MIN(bytes, *pnum);
+ assert(*pnum <= bytes);
+ bytes = *pnum;
first = false;
}
+
return ret;
}
diff --git a/block/qcow2.c b/block/qcow2.c
index 0cd2e6757e..ce4cf00770 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3827,8 +3827,20 @@ static bool is_zero(BlockDriverState *bs, int64_t offset, int64_t bytes)
if (!bytes) {
return true;
}
- res = bdrv_block_status_above(bs, NULL, offset, bytes, &nr, NULL, NULL);
- return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == bytes;
+
+ /*
+ * bdrv_block_status_above doesn't merge different types of zeros, for
+ * example, zeros which come from the region which is unallocated in
+ * the whole backing chain, and zeros which comes because of a short
+ * backing file. So, we need a loop.
+ */
+ do {
+ res = bdrv_block_status_above(bs, NULL, offset, bytes, &nr, NULL, NULL);
+ offset += nr;
+ bytes -= nr;
+ } while (res >= 0 && (res & BDRV_BLOCK_ZERO) && nr && bytes);
+
+ return res >= 0 && (res & BDRV_BLOCK_ZERO) && bytes == 0;
}
static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
--
2.21.0
On Wed 10 Jun 2020 02:04:22 PM CEST, Vladimir Sementsov-Ogievskiy wrote:
> + * The top layer deferred to this layer, and because this layer is
> + * short, any zeroes that we synthesize beyond EOF behave as if they
> + * were allocated at this layer
> */
> + assert(ret & BDRV_BLOCK_EOF);
> *pnum = bytes;
> + if (file) {
> + *file = p;
> + }
> + return BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED;
You don't add BDRV_BLOCK_EOF to the return code here ?
> + res = bdrv_block_status_above(bs, NULL, offset, bytes, &nr, NULL, NULL);
> + offset += nr;
> + bytes -= nr;
> + } while (res >= 0 && (res & BDRV_BLOCK_ZERO) && nr && bytes);
About this last "... && nr && bytes", I think 'nr' already implies
'bytes', maybe you want to use an assertion instead?
Berto
Thanks a lot for reviewing!
18.08.2020 17:15, Alberto Garcia wrote:
> On Wed 10 Jun 2020 02:04:22 PM CEST, Vladimir Sementsov-Ogievskiy wrote:
>> + * The top layer deferred to this layer, and because this layer is
>> + * short, any zeroes that we synthesize beyond EOF behave as if they
>> + * were allocated at this layer
>> */
>> + assert(ret & BDRV_BLOCK_EOF);
>> *pnum = bytes;
>> + if (file) {
>> + *file = p;
>> + }
>> + return BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED;
>
> You don't add BDRV_BLOCK_EOF to the return code here ?
No we shouldn't, as this is the end of backing file when the top layer is larger.
>
>> + res = bdrv_block_status_above(bs, NULL, offset, bytes, &nr, NULL, NULL);
>> + offset += nr;
>> + bytes -= nr;
>> + } while (res >= 0 && (res & BDRV_BLOCK_ZERO) && nr && bytes);
>
> About this last "... && nr && bytes", I think 'nr' already implies
> 'bytes', maybe you want to use an assertion instead?
No, on the last iteration, bytes would be 0 and nr is a last chunk. So, if we check
only nr, we'll do one extra call of bdrv_block_status_above with bytes=0, I don't
want do it.
>
> Berto
>
--
Best regards,
Vladimir
On Wed 19 Aug 2020 11:48:25 AM CEST, Vladimir Sementsov-Ogievskiy wrote:
>>> + * The top layer deferred to this layer, and because this layer is
>>> + * short, any zeroes that we synthesize beyond EOF behave as if they
>>> + * were allocated at this layer
>>> */
>>> + assert(ret & BDRV_BLOCK_EOF);
>>> *pnum = bytes;
>>> + if (file) {
>>> + *file = p;
>>> + }
>>> + return BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED;
>>
>> You don't add BDRV_BLOCK_EOF to the return code here ?
>
> No we shouldn't, as this is the end of backing file when the top layer
> is larger.
Ok, but maybe the original request also reaches EOF on the top layer.
An example that does not actually involve short backing files: let's say
that the size of the active image is 1MB and the backing file is 2MB.
We do 'write -z 960k 63k', that zeroes the last cluster minus a 1KB
tail, so qcow2_co_pwrite_zeroes() calls is_zero() to check if that last
1KB already contains zeroes.
bdrv_co_block_status_above() on the top layer returns BDRV_BLOCK_EOF: no
allocation, no zeros, we simply reached EOF. So we go to the backing
image to see what's there. This is also unallocated, there's no backing
file this time and want_zero is set, so this returns BDRV_BLOCK_ZERO.
However the backing file is longer so bdrv_co_block_status() does not
return BDRV_BLOCK_EOF this time.
If the backing file would have been the same size (1MB) we would have
BDRV_BLOCK_ZERO | BDRV_BLOCK_EOF, but if it's longer or (with your
patch) shorter then BDRV_BLOCK_EOF disappears.
Now, I don't see anyone else using BDRV_BLOCK_EOF for anything so this
does not have any practical effect at the moment, but is this worth
fixing?.
>>> + res = bdrv_block_status_above(bs, NULL, offset, bytes, &nr, NULL, NULL);
>>> + offset += nr;
>>> + bytes -= nr;
>>> + } while (res >= 0 && (res & BDRV_BLOCK_ZERO) && nr && bytes);
>>
>> About this last "... && nr && bytes", I think 'nr' already implies
>> 'bytes', maybe you want to use an assertion instead?
>
> No, on the last iteration, bytes would be 0 and nr is a last
> chunk. So, if we check only nr, we'll do one extra call of
> bdrv_block_status_above with bytes=0, I don't want do it.
You're right !
Berto
19.08.2020 14:34, Alberto Garcia wrote:
> On Wed 19 Aug 2020 11:48:25 AM CEST, Vladimir Sementsov-Ogievskiy wrote:
>>>> + * The top layer deferred to this layer, and because this layer is
>>>> + * short, any zeroes that we synthesize beyond EOF behave as if they
>>>> + * were allocated at this layer
>>>> */
>>>> + assert(ret & BDRV_BLOCK_EOF);
>>>> *pnum = bytes;
>>>> + if (file) {
>>>> + *file = p;
>>>> + }
>>>> + return BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED;
>>>
>>> You don't add BDRV_BLOCK_EOF to the return code here ?
>>
>> No we shouldn't, as this is the end of backing file when the top layer
>> is larger.
>
> Ok, but maybe the original request also reaches EOF on the top layer.
>
> An example that does not actually involve short backing files: let's say
> that the size of the active image is 1MB and the backing file is 2MB.
>
> We do 'write -z 960k 63k', that zeroes the last cluster minus a 1KB
> tail, so qcow2_co_pwrite_zeroes() calls is_zero() to check if that last
> 1KB already contains zeroes.
>
> bdrv_co_block_status_above() on the top layer returns BDRV_BLOCK_EOF: no
> allocation, no zeros, we simply reached EOF. So we go to the backing
> image to see what's there. This is also unallocated, there's no backing
> file this time and want_zero is set, so this returns BDRV_BLOCK_ZERO.
> However the backing file is longer so bdrv_co_block_status() does not
> return BDRV_BLOCK_EOF this time.
>
> If the backing file would have been the same size (1MB) we would have
> BDRV_BLOCK_ZERO | BDRV_BLOCK_EOF, but if it's longer or (with your
> patch) shorter then BDRV_BLOCK_EOF disappears.
>
> Now, I don't see anyone else using BDRV_BLOCK_EOF for anything so this
> does not have any practical effect at the moment, but is this worth
> fixing?.
That's the point of course, I'll fix.
So, if we get EOF on top layer, we should add it to final ret anyway, regardless of backing chain statuses.
>
>>>> + res = bdrv_block_status_above(bs, NULL, offset, bytes, &nr, NULL, NULL);
>>>> + offset += nr;
>>>> + bytes -= nr;
>>>> + } while (res >= 0 && (res & BDRV_BLOCK_ZERO) && nr && bytes);
>>>
>>> About this last "... && nr && bytes", I think 'nr' already implies
>>> 'bytes', maybe you want to use an assertion instead?
>>
>> No, on the last iteration, bytes would be 0 and nr is a last
>> chunk. So, if we check only nr, we'll do one extra call of
>> bdrv_block_status_above with bytes=0, I don't want do it.
>
> You're right !
>
> Berto
>
--
Best regards,
Vladimir
© 2016 - 2026 Red Hat, Inc.