[v1] fix & merge block_status_above and is_allocated_above

[PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago

Hi all!

I wanted to understand, what is the real difference between bdrv_block_status_above
and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
bdrv_block_status_above..

And I found the problem: bdrv_is_allocated_above considers space after EOF as
UNALLOCATED for intermediate nodes..

UNALLOCATED is not about allocation at fs level, but about should we go to backing or
not.. And it seems incorrect for me, as in case of short backing file, we'll read
zeroes after EOF, instead of going further by backing chain.

This leads to the following effect:

./qemu-img create -f qcow2 base.qcow2 2M
./qemu-io -c "write -P 0x1 0 2M" base.qcow2

./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M

Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
./qemu-io -c "read -P 0 1M 1M" top.qcow2
read 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)

But after commit guest visible state is changed, which seems wrong for me:
./qemu-img commit top.qcow2 -b mid.qcow2

./qemu-io -c "read -P 0 1M 1M" mid.qcow2
Pattern verification failed at offset 1048576, 1048576 bytes
read 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)

./qemu-io -c "read -P 1 1M 1M" mid.qcow2
read 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)


I don't know, is it a real bug, as I don't know, do we support backing file larger than
its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
to other problems.

=====

Hmm, bdrv_block_allocated_above behaves strange too:

with want_zero=true, it may report unallocated zeroes because of short backing files, which
are actually "allocated" in POV of backing chains. But I see this may influence only
qemu-img compare, and I don't see can it trigger some bug..

with want_zero=false, it may do no progress because of short backing file. Moreover it may
report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
onlyt top layer, so it seems OK. 

=====

So, I propose these series, still I'm not sure is there a real bug.

Vladimir Sementsov-Ogievskiy (4):
  block/io: fix bdrv_co_block_status_above
  block/io: bdrv_common_block_status_above: support include_base
  block/io: bdrv_common_block_status_above: support bs == base
  block/io: fix bdrv_is_allocated_above

 block/io.c                 | 104 ++++++++++++++++++-------------------
 tests/qemu-iotests/154.out |   4 +-
 2 files changed, 53 insertions(+), 55 deletions(-)

-- 
2.21.0

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago

16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> I wanted to understand, what is the real difference between bdrv_block_status_above
> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> bdrv_block_status_above..
> 
> And I found the problem: bdrv_is_allocated_above considers space after EOF as
> UNALLOCATED for intermediate nodes..
> 
> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> not.. And it seems incorrect for me, as in case of short backing file, we'll read
> zeroes after EOF, instead of going further by backing chain.
> 
> This leads to the following effect:
> 
> ./qemu-img create -f qcow2 base.qcow2 2M
> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> 
> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> 
> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> 
> But after commit guest visible state is changed, which seems wrong for me:
> ./qemu-img commit top.qcow2 -b mid.qcow2
> 
> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> Pattern verification failed at offset 1048576, 1048576 bytes
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> 
> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> 
> 
> I don't know, is it a real bug, as I don't know, do we support backing file larger than
> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
> to other problems.
> 
> =====
> 
> Hmm, bdrv_block_allocated_above behaves strange too:
> 
> with want_zero=true, it may report unallocated zeroes because of short backing files, which
> are actually "allocated" in POV of backing chains. But I see this may influence only
> qemu-img compare, and I don't see can it trigger some bug..
> 
> with want_zero=false, it may do no progress because of short backing file. Moreover it may
> report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
> onlyt top layer, so it seems OK.
> 
> =====
> 
> So, I propose these series, still I'm not sure is there a real bug.
> 
> Vladimir Sementsov-Ogievskiy (4):
>    block/io: fix bdrv_co_block_status_above
>    block/io: bdrv_common_block_status_above: support include_base
>    block/io: bdrv_common_block_status_above: support bs == base
>    block/io: fix bdrv_is_allocated_above
> 
>   block/io.c                 | 104 ++++++++++++++++++-------------------
>   tests/qemu-iotests/154.out |   4 +-
>   2 files changed, 53 insertions(+), 55 deletions(-)
> 


Interesting that the problem illustrated here is not fixed by the series, it's actually
relates to the fact that mirror does truncation with PREALLOC_MODE_OFF, which leads
to unallocated qcow2 clusters, which I think should be fixed too.

To illustrate the problem fixed by the series, we should commit to base:

# ./qemu-img commit top.qcow2 -b base.qcow2
Image committed.
# ./qemu-io -c "read -P 0 1M 1M" base.qcow2
Pattern verification failed at offset 1048576, 1048576 bytes
read 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 00.00 sec (5.366 GiB/sec and 5494.4149 ops/sec)

=======

Hmm, but how to fix the problem about truncate? I think truncate must not make
underlying backing available for read.. Discard operation doesn't do it.

So, actually on PREALLOC_MODE_OFF we must allocated L2 tables and mark new clusters
as ZERO?

To improve it, we can add "zero bit" to L1 table entry..

-- 
Best regards,
Vladimir

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Kevin Wolf 6 years, 2 months ago

Am 20.11.2019 um 11:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote:
> > Hi all!
> > 
> > I wanted to understand, what is the real difference between bdrv_block_status_above
> > and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> > bdrv_block_status_above..
> > 
> > And I found the problem: bdrv_is_allocated_above considers space after EOF as
> > UNALLOCATED for intermediate nodes..
> > 
> > UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> > not.. And it seems incorrect for me, as in case of short backing file, we'll read
> > zeroes after EOF, instead of going further by backing chain.
> > 
> > This leads to the following effect:
> > 
> > ./qemu-img create -f qcow2 base.qcow2 2M
> > ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> > 
> > ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> > ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> > 
> > Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> > ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> > read 1048576/1048576 bytes at offset 1048576
> > 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> > 
> > But after commit guest visible state is changed, which seems wrong for me:
> > ./qemu-img commit top.qcow2 -b mid.qcow2
> > 
> > ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> > Pattern verification failed at offset 1048576, 1048576 bytes
> > read 1048576/1048576 bytes at offset 1048576
> > 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> > 
> > ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> > read 1048576/1048576 bytes at offset 1048576
> > 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> > 
> > 
> > I don't know, is it a real bug, as I don't know, do we support backing file larger than
> > its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
> > to other problems.
> > 
> > =====
> > 
> > Hmm, bdrv_block_allocated_above behaves strange too:
> > 
> > with want_zero=true, it may report unallocated zeroes because of short backing files, which
> > are actually "allocated" in POV of backing chains. But I see this may influence only
> > qemu-img compare, and I don't see can it trigger some bug..
> > 
> > with want_zero=false, it may do no progress because of short backing file. Moreover it may
> > report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
> > onlyt top layer, so it seems OK.
> > 
> > =====
> > 
> > So, I propose these series, still I'm not sure is there a real bug.
> > 
> > Vladimir Sementsov-Ogievskiy (4):
> >    block/io: fix bdrv_co_block_status_above
> >    block/io: bdrv_common_block_status_above: support include_base
> >    block/io: bdrv_common_block_status_above: support bs == base
> >    block/io: fix bdrv_is_allocated_above
> > 
> >   block/io.c                 | 104 ++++++++++++++++++-------------------
> >   tests/qemu-iotests/154.out |   4 +-
> >   2 files changed, 53 insertions(+), 55 deletions(-)
> > 
> 
> 
> Interesting that the problem illustrated here is not fixed by the series, it's actually
> relates to the fact that mirror does truncation with PREALLOC_MODE_OFF, which leads
> to unallocated qcow2 clusters, which I think should be fixed too.

Yes, this is what I posted yesterday. (With a suggested quick fix, but
it turns out it was not quite correct, see below.)

> To illustrate the problem fixed by the series, we should commit to base:
> 
> # ./qemu-img commit top.qcow2 -b base.qcow2
> Image committed.
> # ./qemu-io -c "read -P 0 1M 1M" base.qcow2
> Pattern verification failed at offset 1048576, 1048576 bytes
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (5.366 GiB/sec and 5494.4149 ops/sec)

Ok, I'll try that later.

> Hmm, but how to fix the problem about truncate? I think truncate must
> not make underlying backing available for read.. Discard operation
> doesn't do it.
> 
> So, actually on PREALLOC_MODE_OFF we must allocated L2 tables and mark
> new clusters as ZERO?

Yes, we need to write zeroes to the new area if the backing file covers
it. We need to do this not only in mirror/commit/bdrv_commit(), but in
fact for all truncate operations: Berto mentioned on IRC yesterday that
you can get into the same situation with 'block_resize' monitor
commands.

So I tried to fix this yesterday, and I thought that I had a fix, when I
noticed that bdrv_co_do_zero_pwritev() takes a 32 bit bytes parameter.
So I'll still need to fix this. Other than that, I suppose the following
fix should work (but is probably a bit too invasive for -rc3).

Kevin

diff --git a/block/io.c b/block/io.c
index f75777f5ea..4118bf0118 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3382,6 +3382,32 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
         goto out;
     }

+    /*
+     * If the image has a backing file that is large enough that it would
+     * provide data for the new area, we cannot leave it unallocated because
+     * then the backing file content would become visible. Instead, zero-fill
+     * the area where backing file and new area overlap.
+     */
+    if (new_bytes && bs->backing && prealloc == PREALLOC_MODE_OFF) {
+        int64_t backing_len;
+
+        backing_len = bdrv_getlength(backing_bs(bs));
+        if (backing_len < 0) {
+            ret = backing_len;
+            goto out;
+        }
+
+        if (backing_len > old_size) {
+            /* FIXME bytes parameter is 32 bits */
+            ret = bdrv_co_do_zero_pwritev(child, old_size,
+                                          MIN(new_bytes, backing_len - old_size),
+                                          BDRV_REQ_ZERO_WRITE | BDRV_REQ_MAY_UNMAP, &req);
+            if (ret < 0) {
+                goto out;
+            }
+        }
+    }
+
     ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "Could not refresh total sector count");

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago

20.11.2019 14:44, Kevin Wolf wrote:
> Am 20.11.2019 um 11:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi all!
>>>
>>> I wanted to understand, what is the real difference between bdrv_block_status_above
>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>>> bdrv_block_status_above..
>>>
>>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>>> UNALLOCATED for intermediate nodes..
>>>
>>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>>> zeroes after EOF, instead of going further by backing chain.
>>>
>>> This leads to the following effect:
>>>
>>> ./qemu-img create -f qcow2 base.qcow2 2M
>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>>
>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>>
>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>>> read 1048576/1048576 bytes at offset 1048576
>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>>
>>> But after commit guest visible state is changed, which seems wrong for me:
>>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>>
>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>> read 1048576/1048576 bytes at offset 1048576
>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>>
>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>>> read 1048576/1048576 bytes at offset 1048576
>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>>
>>>
>>> I don't know, is it a real bug, as I don't know, do we support backing file larger than
>>> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
>>> to other problems.
>>>
>>> =====
>>>
>>> Hmm, bdrv_block_allocated_above behaves strange too:
>>>
>>> with want_zero=true, it may report unallocated zeroes because of short backing files, which
>>> are actually "allocated" in POV of backing chains. But I see this may influence only
>>> qemu-img compare, and I don't see can it trigger some bug..
>>>
>>> with want_zero=false, it may do no progress because of short backing file. Moreover it may
>>> report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
>>> onlyt top layer, so it seems OK.
>>>
>>> =====
>>>
>>> So, I propose these series, still I'm not sure is there a real bug.
>>>
>>> Vladimir Sementsov-Ogievskiy (4):
>>>     block/io: fix bdrv_co_block_status_above
>>>     block/io: bdrv_common_block_status_above: support include_base
>>>     block/io: bdrv_common_block_status_above: support bs == base
>>>     block/io: fix bdrv_is_allocated_above
>>>
>>>    block/io.c                 | 104 ++++++++++++++++++-------------------
>>>    tests/qemu-iotests/154.out |   4 +-
>>>    2 files changed, 53 insertions(+), 55 deletions(-)
>>>
>>
>>
>> Interesting that the problem illustrated here is not fixed by the series, it's actually
>> relates to the fact that mirror does truncation with PREALLOC_MODE_OFF, which leads
>> to unallocated qcow2 clusters, which I think should be fixed too.
> 
> Yes, this is what I posted yesterday. (With a suggested quick fix, but
> it turns out it was not quite correct, see below.)
> 
>> To illustrate the problem fixed by the series, we should commit to base:
>>
>> # ./qemu-img commit top.qcow2 -b base.qcow2
>> Image committed.
>> # ./qemu-io -c "read -P 0 1M 1M" base.qcow2
>> Pattern verification failed at offset 1048576, 1048576 bytes
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (5.366 GiB/sec and 5494.4149 ops/sec)
> 
> Ok, I'll try that later.
> 
>> Hmm, but how to fix the problem about truncate? I think truncate must
>> not make underlying backing available for read.. Discard operation
>> doesn't do it.
>>
>> So, actually on PREALLOC_MODE_OFF we must allocated L2 tables and mark
>> new clusters as ZERO?
> 
> Yes, we need to write zeroes to the new area if the backing file covers
> it. We need to do this not only in mirror/commit/bdrv_commit(), but in
> fact for all truncate operations: Berto mentioned on IRC yesterday that
> you can get into the same situation with 'block_resize' monitor
> commands.
> 
> So I tried to fix this yesterday, and I thought that I had a fix, when I
> noticed that bdrv_co_do_zero_pwritev() takes a 32 bit bytes parameter.
> So I'll still need to fix this. Other than that, I suppose the following
> fix should work (but is probably a bit too invasive for -rc3).
> 
> Kevin
> 
> diff --git a/block/io.c b/block/io.c
> index f75777f5ea..4118bf0118 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -3382,6 +3382,32 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
>           goto out;
>       }
> 
> +    /*
> +     * If the image has a backing file that is large enough that it would
> +     * provide data for the new area, we cannot leave it unallocated because
> +     * then the backing file content would become visible. Instead, zero-fill
> +     * the area where backing file and new area overlap.
> +     */
> +    if (new_bytes && bs->backing && prealloc == PREALLOC_MODE_OFF) {
> +        int64_t backing_len;
> +
> +        backing_len = bdrv_getlength(backing_bs(bs));
> +        if (backing_len < 0) {
> +            ret = backing_len;
> +            goto out;
> +        }
> +
> +        if (backing_len > old_size) {
> +            /* FIXME bytes parameter is 32 bits */
> +            ret = bdrv_co_do_zero_pwritev(child, old_size,
> +                                          MIN(new_bytes, backing_len - old_size),
> +                                          BDRV_REQ_ZERO_WRITE | BDRV_REQ_MAY_UNMAP, &req);
> +            if (ret < 0) {
> +                goto out;
> +            }
> +        }
> +    }
> +
>       ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
>       if (ret < 0) {
>           error_setg_errno(errp, -ret, "Could not refresh total sector count");
> 

I'm not sure that it is safe enough: we may not have opened backing at the moment, but
still it may exist and managed by user.

PREALLOC_MODE_OFF is documented as
# @off: no preallocation

- not very descriptive, but I think it's nothing about making backing file available
through new clusters.

I think PREALLOC_MODE_OFF should always make new clusters "BDRV_BLOCK_ALLOCATED". If
for some scenarios (are they exist at all?) we need to preallocate cluster in manner
that backing file would be visible through them, we'd better add another preallocation
mode which will directly document this behaviour, like PREALLOC_MODE_BACKING.

So, I'd consider PREALLOC_MODE_OFF as something that must not create UNALLOCATED (in POV
of backing chains) clusters, and should be fixed in all formats.. Or as a quick fix may
we may write zeros from bdrv_co_truncate, but independently of backing file existence
and length.

===

Also I think it's a wrong thing at all that qcow2 new file is transparent by default..
It should be transparent only when we create snapshots and incremental backups. But when
we create new disk for new vm it should be zeroed (and extending L1 table entry spec by
"zero bit" may help)

-- 
Best regards,
Vladimir

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Kevin Wolf 6 years, 2 months ago

Am 20.11.2019 um 13:04 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 20.11.2019 14:44, Kevin Wolf wrote:
> > Am 20.11.2019 um 11:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> 16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote:
> >>> Hi all!
> >>>
> >>> I wanted to understand, what is the real difference between bdrv_block_status_above
> >>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> >>> bdrv_block_status_above..
> >>>
> >>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
> >>> UNALLOCATED for intermediate nodes..
> >>>
> >>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> >>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
> >>> zeroes after EOF, instead of going further by backing chain.
> >>>
> >>> This leads to the following effect:
> >>>
> >>> ./qemu-img create -f qcow2 base.qcow2 2M
> >>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> >>>
> >>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> >>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> >>>
> >>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> >>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> >>> read 1048576/1048576 bytes at offset 1048576
> >>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> >>>
> >>> But after commit guest visible state is changed, which seems wrong for me:
> >>> ./qemu-img commit top.qcow2 -b mid.qcow2
> >>>
> >>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> >>> Pattern verification failed at offset 1048576, 1048576 bytes
> >>> read 1048576/1048576 bytes at offset 1048576
> >>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> >>>
> >>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> >>> read 1048576/1048576 bytes at offset 1048576
> >>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> >>>
> >>>
> >>> I don't know, is it a real bug, as I don't know, do we support backing file larger than
> >>> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
> >>> to other problems.
> >>>
> >>> =====
> >>>
> >>> Hmm, bdrv_block_allocated_above behaves strange too:
> >>>
> >>> with want_zero=true, it may report unallocated zeroes because of short backing files, which
> >>> are actually "allocated" in POV of backing chains. But I see this may influence only
> >>> qemu-img compare, and I don't see can it trigger some bug..
> >>>
> >>> with want_zero=false, it may do no progress because of short backing file. Moreover it may
> >>> report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
> >>> onlyt top layer, so it seems OK.
> >>>
> >>> =====
> >>>
> >>> So, I propose these series, still I'm not sure is there a real bug.
> >>>
> >>> Vladimir Sementsov-Ogievskiy (4):
> >>>     block/io: fix bdrv_co_block_status_above
> >>>     block/io: bdrv_common_block_status_above: support include_base
> >>>     block/io: bdrv_common_block_status_above: support bs == base
> >>>     block/io: fix bdrv_is_allocated_above
> >>>
> >>>    block/io.c                 | 104 ++++++++++++++++++-------------------
> >>>    tests/qemu-iotests/154.out |   4 +-
> >>>    2 files changed, 53 insertions(+), 55 deletions(-)
> >>>
> >>
> >>
> >> Interesting that the problem illustrated here is not fixed by the series, it's actually
> >> relates to the fact that mirror does truncation with PREALLOC_MODE_OFF, which leads
> >> to unallocated qcow2 clusters, which I think should be fixed too.
> > 
> > Yes, this is what I posted yesterday. (With a suggested quick fix, but
> > it turns out it was not quite correct, see below.)
> > 
> >> To illustrate the problem fixed by the series, we should commit to base:
> >>
> >> # ./qemu-img commit top.qcow2 -b base.qcow2
> >> Image committed.
> >> # ./qemu-io -c "read -P 0 1M 1M" base.qcow2
> >> Pattern verification failed at offset 1048576, 1048576 bytes
> >> read 1048576/1048576 bytes at offset 1048576
> >> 1 MiB, 1 ops; 00.00 sec (5.366 GiB/sec and 5494.4149 ops/sec)
> > 
> > Ok, I'll try that later.
> > 
> >> Hmm, but how to fix the problem about truncate? I think truncate must
> >> not make underlying backing available for read.. Discard operation
> >> doesn't do it.
> >>
> >> So, actually on PREALLOC_MODE_OFF we must allocated L2 tables and mark
> >> new clusters as ZERO?
> > 
> > Yes, we need to write zeroes to the new area if the backing file covers
> > it. We need to do this not only in mirror/commit/bdrv_commit(), but in
> > fact for all truncate operations: Berto mentioned on IRC yesterday that
> > you can get into the same situation with 'block_resize' monitor
> > commands.
> > 
> > So I tried to fix this yesterday, and I thought that I had a fix, when I
> > noticed that bdrv_co_do_zero_pwritev() takes a 32 bit bytes parameter.
> > So I'll still need to fix this. Other than that, I suppose the following
> > fix should work (but is probably a bit too invasive for -rc3).
> > 
> > Kevin
> > 
> > diff --git a/block/io.c b/block/io.c
> > index f75777f5ea..4118bf0118 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -3382,6 +3382,32 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
> >           goto out;
> >       }
> > 
> > +    /*
> > +     * If the image has a backing file that is large enough that it would
> > +     * provide data for the new area, we cannot leave it unallocated because
> > +     * then the backing file content would become visible. Instead, zero-fill
> > +     * the area where backing file and new area overlap.
> > +     */
> > +    if (new_bytes && bs->backing && prealloc == PREALLOC_MODE_OFF) {
> > +        int64_t backing_len;
> > +
> > +        backing_len = bdrv_getlength(backing_bs(bs));
> > +        if (backing_len < 0) {
> > +            ret = backing_len;
> > +            goto out;
> > +        }
> > +
> > +        if (backing_len > old_size) {
> > +            /* FIXME bytes parameter is 32 bits */
> > +            ret = bdrv_co_do_zero_pwritev(child, old_size,
> > +                                          MIN(new_bytes, backing_len - old_size),
> > +                                          BDRV_REQ_ZERO_WRITE | BDRV_REQ_MAY_UNMAP, &req);
> > +            if (ret < 0) {
> > +                goto out;
> > +            }
> > +        }
> > +    }
> > +
> >       ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
> >       if (ret < 0) {
> >           error_setg_errno(errp, -ret, "Could not refresh total sector count");
> > 
> 
> I'm not sure that it is safe enough: we may not have opened backing at
> the moment, but still it may exist and managed by user.

But then I think it's the user's responsibility to make sure that the
backing file still makes sense when they reattach it. You can't tell
QEMU that there is no backing file and then expect QEMU to take care of
your secret backing file.

Of course, we could unconditionally zero out the new area without
looking at the backing file, but I'm not sure if that is wanted.

> PREALLOC_MODE_OFF is documented as
> # @off: no preallocation
> 
> - not very descriptive, but I think it's nothing about making backing file available
> through new clusters.
> 
> I think PREALLOC_MODE_OFF should always make new clusters "BDRV_BLOCK_ALLOCATED". If
> for some scenarios (are they exist at all?) we need to preallocate cluster in manner
> that backing file would be visible through them, we'd better add another preallocation
> mode which will directly document this behaviour, like PREALLOC_MODE_BACKING.
> 
> So, I'd consider PREALLOC_MODE_OFF as something that must not create UNALLOCATED (in POV
> of backing chains) clusters, and should be fixed in all formats.. Or as a quick fix may
> we may write zeros from bdrv_co_truncate, but independently of backing file existence
> and length.

No, that would mean that all images must be preallocated because
BDRV_BLOCK_ALLOCATED isn't supposed to be set for sparse blocks.

Essentially, the mode that you're envisioning for it is the same as
PREALLOC_MODE_METADATA today (except that it would have to be supported
by every driver).

Obviously, that's wrong for PREALLOC_MODE_OFF.

> ===
> 
> Also I think it's a wrong thing at all that qcow2 new file is transparent by default..
> It should be transparent only when we create snapshots and incremental backups. But when
> we create new disk for new vm it should be zeroed (and extending L1 table entry spec by
> "zero bit" may help)

Why would a qcow2 file even have a backing file when you don't want to
ever access the backing file? Create your image without a backing file
and you get the behaviour you want.

Kevin

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago

20.11.2019 16:30, Kevin Wolf wrote:
> Am 20.11.2019 um 13:04 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 20.11.2019 14:44, Kevin Wolf wrote:
>>> Am 20.11.2019 um 11:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> 16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote:
>>>>> Hi all!
>>>>>
>>>>> I wanted to understand, what is the real difference between bdrv_block_status_above
>>>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>>>>> bdrv_block_status_above..
>>>>>
>>>>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>>>>> UNALLOCATED for intermediate nodes..
>>>>>
>>>>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>>>>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>>>>> zeroes after EOF, instead of going further by backing chain.
>>>>>
>>>>> This leads to the following effect:
>>>>>
>>>>> ./qemu-img create -f qcow2 base.qcow2 2M
>>>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>>>>
>>>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>>>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>>>>
>>>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>>>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>>>>> read 1048576/1048576 bytes at offset 1048576
>>>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>>>>
>>>>> But after commit guest visible state is changed, which seems wrong for me:
>>>>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>>>>
>>>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>>>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>>>> read 1048576/1048576 bytes at offset 1048576
>>>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>>>>
>>>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>>>>> read 1048576/1048576 bytes at offset 1048576
>>>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>>>>
>>>>>
>>>>> I don't know, is it a real bug, as I don't know, do we support backing file larger than
>>>>> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
>>>>> to other problems.
>>>>>
>>>>> =====
>>>>>
>>>>> Hmm, bdrv_block_allocated_above behaves strange too:
>>>>>
>>>>> with want_zero=true, it may report unallocated zeroes because of short backing files, which
>>>>> are actually "allocated" in POV of backing chains. But I see this may influence only
>>>>> qemu-img compare, and I don't see can it trigger some bug..
>>>>>
>>>>> with want_zero=false, it may do no progress because of short backing file. Moreover it may
>>>>> report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
>>>>> onlyt top layer, so it seems OK.
>>>>>
>>>>> =====
>>>>>
>>>>> So, I propose these series, still I'm not sure is there a real bug.
>>>>>
>>>>> Vladimir Sementsov-Ogievskiy (4):
>>>>>      block/io: fix bdrv_co_block_status_above
>>>>>      block/io: bdrv_common_block_status_above: support include_base
>>>>>      block/io: bdrv_common_block_status_above: support bs == base
>>>>>      block/io: fix bdrv_is_allocated_above
>>>>>
>>>>>     block/io.c                 | 104 ++++++++++++++++++-------------------
>>>>>     tests/qemu-iotests/154.out |   4 +-
>>>>>     2 files changed, 53 insertions(+), 55 deletions(-)
>>>>>
>>>>
>>>>
>>>> Interesting that the problem illustrated here is not fixed by the series, it's actually
>>>> relates to the fact that mirror does truncation with PREALLOC_MODE_OFF, which leads
>>>> to unallocated qcow2 clusters, which I think should be fixed too.
>>>
>>> Yes, this is what I posted yesterday. (With a suggested quick fix, but
>>> it turns out it was not quite correct, see below.)
>>>
>>>> To illustrate the problem fixed by the series, we should commit to base:
>>>>
>>>> # ./qemu-img commit top.qcow2 -b base.qcow2
>>>> Image committed.
>>>> # ./qemu-io -c "read -P 0 1M 1M" base.qcow2
>>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (5.366 GiB/sec and 5494.4149 ops/sec)
>>>
>>> Ok, I'll try that later.
>>>
>>>> Hmm, but how to fix the problem about truncate? I think truncate must
>>>> not make underlying backing available for read.. Discard operation
>>>> doesn't do it.
>>>>
>>>> So, actually on PREALLOC_MODE_OFF we must allocated L2 tables and mark
>>>> new clusters as ZERO?
>>>
>>> Yes, we need to write zeroes to the new area if the backing file covers
>>> it. We need to do this not only in mirror/commit/bdrv_commit(), but in
>>> fact for all truncate operations: Berto mentioned on IRC yesterday that
>>> you can get into the same situation with 'block_resize' monitor
>>> commands.
>>>
>>> So I tried to fix this yesterday, and I thought that I had a fix, when I
>>> noticed that bdrv_co_do_zero_pwritev() takes a 32 bit bytes parameter.
>>> So I'll still need to fix this. Other than that, I suppose the following
>>> fix should work (but is probably a bit too invasive for -rc3).
>>>
>>> Kevin
>>>
>>> diff --git a/block/io.c b/block/io.c
>>> index f75777f5ea..4118bf0118 100644
>>> --- a/block/io.c
>>> +++ b/block/io.c
>>> @@ -3382,6 +3382,32 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
>>>            goto out;
>>>        }
>>>
>>> +    /*
>>> +     * If the image has a backing file that is large enough that it would
>>> +     * provide data for the new area, we cannot leave it unallocated because
>>> +     * then the backing file content would become visible. Instead, zero-fill
>>> +     * the area where backing file and new area overlap.
>>> +     */
>>> +    if (new_bytes && bs->backing && prealloc == PREALLOC_MODE_OFF) {
>>> +        int64_t backing_len;
>>> +
>>> +        backing_len = bdrv_getlength(backing_bs(bs));
>>> +        if (backing_len < 0) {
>>> +            ret = backing_len;
>>> +            goto out;
>>> +        }
>>> +
>>> +        if (backing_len > old_size) {
>>> +            /* FIXME bytes parameter is 32 bits */
>>> +            ret = bdrv_co_do_zero_pwritev(child, old_size,
>>> +                                          MIN(new_bytes, backing_len - old_size),
>>> +                                          BDRV_REQ_ZERO_WRITE | BDRV_REQ_MAY_UNMAP, &req);
>>> +            if (ret < 0) {
>>> +                goto out;
>>> +            }
>>> +        }
>>> +    }
>>> +
>>>        ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
>>>        if (ret < 0) {
>>>            error_setg_errno(errp, -ret, "Could not refresh total sector count");
>>>
>>
>> I'm not sure that it is safe enough: we may not have opened backing at
>> the moment, but still it may exist and managed by user.
> 
> But then I think it's the user's responsibility to make sure that the
> backing file still makes sense when they reattach it. You can't tell
> QEMU that there is no backing file and then expect QEMU to take care of
> your secret backing file.
> 
> Of course, we could unconditionally zero out the new area without
> looking at the backing file, but I'm not sure if that is wanted.
> 
>> PREALLOC_MODE_OFF is documented as
>> # @off: no preallocation
>>
>> - not very descriptive, but I think it's nothing about making backing file available
>> through new clusters.
>>
>> I think PREALLOC_MODE_OFF should always make new clusters "BDRV_BLOCK_ALLOCATED". If
>> for some scenarios (are they exist at all?) we need to preallocate cluster in manner
>> that backing file would be visible through them, we'd better add another preallocation
>> mode which will directly document this behaviour, like PREALLOC_MODE_BACKING.
>>
>> So, I'd consider PREALLOC_MODE_OFF as something that must not create UNALLOCATED (in POV
>> of backing chains) clusters, and should be fixed in all formats.. Or as a quick fix may
>> we may write zeros from bdrv_co_truncate, but independently of backing file existence
>> and length.
> 
> No, that would mean that all images must be preallocated because
> BDRV_BLOCK_ALLOCATED isn't supposed to be set for sparse blocks.

Hmm. what is sparse blocks and how it relates to backing chain?

> 
> Essentially, the mode that you're envisioning for it is the same as
> PREALLOC_MODE_METADATA today (except that it would have to be supported
> by every driver).

As I understand, PREALLOC_MODE_METADATA preallocates DATA clusters, not ZERO.

Also note, that PREALLOC_MODE_OFF actually allocates some metadata: L1 table,
to mark new clusters as UNALLOCATED.

> 
> Obviously, that's wrong for PREALLOC_MODE_OFF.
> 
>> ===
>>
>> Also I think it's a wrong thing at all that qcow2 new file is transparent by default..
>> It should be transparent only when we create snapshots and incremental backups. But when
>> we create new disk for new vm it should be zeroed (and extending L1 table entry spec by
>> "zero bit" may help)
> 
> Why would a qcow2 file even have a backing file when you don't want to
> ever access the backing file? Create your image without a backing file
> and you get the behaviour you want.
> 
> Kevin
> 

Anyway, I think we should at least document that "off" mode make backing file visible through new
clusters. And with your patch, we should document that "off" mode correctly handles current
backing file (which may be backing file defined by qcow2 image, or may be set by some other
mechanism), and otherwise new clusters will be transparent.

-- 
Best regards,
Vladimir

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago

20.11.2019 15:04, Vladimir Sementsov-Ogievskiy wrote:
> 20.11.2019 14:44, Kevin Wolf wrote:
>> Am 20.11.2019 um 11:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>> 16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> I wanted to understand, what is the real difference between bdrv_block_status_above
>>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>>>> bdrv_block_status_above..
>>>>
>>>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>>>> UNALLOCATED for intermediate nodes..
>>>>
>>>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>>>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>>>> zeroes after EOF, instead of going further by backing chain.
>>>>
>>>> This leads to the following effect:
>>>>
>>>> ./qemu-img create -f qcow2 base.qcow2 2M
>>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>>>
>>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>>>
>>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>>>
>>>> But after commit guest visible state is changed, which seems wrong for me:
>>>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>>>
>>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>>>
>>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>>>
>>>>
>>>> I don't know, is it a real bug, as I don't know, do we support backing file larger than
>>>> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
>>>> to other problems.
>>>>
>>>> =====
>>>>
>>>> Hmm, bdrv_block_allocated_above behaves strange too:
>>>>
>>>> with want_zero=true, it may report unallocated zeroes because of short backing files, which
>>>> are actually "allocated" in POV of backing chains. But I see this may influence only
>>>> qemu-img compare, and I don't see can it trigger some bug..
>>>>
>>>> with want_zero=false, it may do no progress because of short backing file. Moreover it may
>>>> report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
>>>> onlyt top layer, so it seems OK.
>>>>
>>>> =====
>>>>
>>>> So, I propose these series, still I'm not sure is there a real bug.
>>>>
>>>> Vladimir Sementsov-Ogievskiy (4):
>>>>     block/io: fix bdrv_co_block_status_above
>>>>     block/io: bdrv_common_block_status_above: support include_base
>>>>     block/io: bdrv_common_block_status_above: support bs == base
>>>>     block/io: fix bdrv_is_allocated_above
>>>>
>>>>    block/io.c                 | 104 ++++++++++++++++++-------------------
>>>>    tests/qemu-iotests/154.out |   4 +-
>>>>    2 files changed, 53 insertions(+), 55 deletions(-)
>>>>
>>>
>>>
>>> Interesting that the problem illustrated here is not fixed by the series, it's actually
>>> relates to the fact that mirror does truncation with PREALLOC_MODE_OFF, which leads
>>> to unallocated qcow2 clusters, which I think should be fixed too.
>>
>> Yes, this is what I posted yesterday. (With a suggested quick fix, but
>> it turns out it was not quite correct, see below.)
>>
>>> To illustrate the problem fixed by the series, we should commit to base:
>>>
>>> # ./qemu-img commit top.qcow2 -b base.qcow2
>>> Image committed.
>>> # ./qemu-io -c "read -P 0 1M 1M" base.qcow2
>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>> read 1048576/1048576 bytes at offset 1048576
>>> 1 MiB, 1 ops; 00.00 sec (5.366 GiB/sec and 5494.4149 ops/sec)
>>
>> Ok, I'll try that later.
>>
>>> Hmm, but how to fix the problem about truncate? I think truncate must
>>> not make underlying backing available for read.. Discard operation
>>> doesn't do it.
>>>
>>> So, actually on PREALLOC_MODE_OFF we must allocated L2 tables and mark
>>> new clusters as ZERO?
>>
>> Yes, we need to write zeroes to the new area if the backing file covers
>> it. We need to do this not only in mirror/commit/bdrv_commit(), but in
>> fact for all truncate operations: Berto mentioned on IRC yesterday that
>> you can get into the same situation with 'block_resize' monitor
>> commands.
>>
>> So I tried to fix this yesterday, and I thought that I had a fix, when I
>> noticed that bdrv_co_do_zero_pwritev() takes a 32 bit bytes parameter.
>> So I'll still need to fix this. Other than that, I suppose the following
>> fix should work (but is probably a bit too invasive for -rc3).
>>
>> Kevin
>>
>> diff --git a/block/io.c b/block/io.c
>> index f75777f5ea..4118bf0118 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -3382,6 +3382,32 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
>>           goto out;
>>       }
>>
>> +    /*
>> +     * If the image has a backing file that is large enough that it would
>> +     * provide data for the new area, we cannot leave it unallocated because
>> +     * then the backing file content would become visible. Instead, zero-fill
>> +     * the area where backing file and new area overlap.
>> +     */
>> +    if (new_bytes && bs->backing && prealloc == PREALLOC_MODE_OFF) {
>> +        int64_t backing_len;
>> +
>> +        backing_len = bdrv_getlength(backing_bs(bs));
>> +        if (backing_len < 0) {
>> +            ret = backing_len;
>> +            goto out;
>> +        }
>> +
>> +        if (backing_len > old_size) {
>> +            /* FIXME bytes parameter is 32 bits */
>> +            ret = bdrv_co_do_zero_pwritev(child, old_size,
>> +                                          MIN(new_bytes, backing_len - old_size),
>> +                                          BDRV_REQ_ZERO_WRITE | BDRV_REQ_MAY_UNMAP, &req);
>> +            if (ret < 0) {
>> +                goto out;
>> +            }
>> +        }
>> +    }
>> +
>>       ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
>>       if (ret < 0) {
>>           error_setg_errno(errp, -ret, "Could not refresh total sector count");
>>
> 
> I'm not sure that it is safe enough: we may not have opened backing at the moment, but
> still it may exist and managed by user.
> 
> PREALLOC_MODE_OFF is documented as
> # @off: no preallocation
> 
> - not very descriptive, but I think it's nothing about making backing file available
> through new clusters.
> 
> I think PREALLOC_MODE_OFF should always make new clusters "BDRV_BLOCK_ALLOCATED". If
> for some scenarios (are they exist at all?) we need to preallocate cluster in manner
> that backing file would be visible through them, we'd better add another preallocation
> mode which will directly document this behaviour, like PREALLOC_MODE_BACKING.
> 
> So, I'd consider PREALLOC_MODE_OFF as something that must not create UNALLOCATED (in POV
> of backing chains) clusters, and should be fixed in all formats.. Or as a quick fix may
> we may write zeros from bdrv_co_truncate, but independently of backing file existence
> and length.
> 
> ===

Or visa-versa, if we can't change current qemu-img-create default, which means preallocation="off"
  === UNALLOCATED transaprent image, we may improve preallocation:"off" specification, mentioning
backing files, and add preallocation:"zero" mode, to be used in truncate.

> 
> Also I think it's a wrong thing at all that qcow2 new file is transparent by default..
> It should be transparent only when we create snapshots and incremental backups. But when
> we create new disk for new vm it should be zeroed (and extending L1 table entry spec by
> "zero bit" may help)
> 

-- 
Best regards,
Vladimir

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Kevin Wolf 6 years, 2 months ago

Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Hi all!
> 
> I wanted to understand, what is the real difference between bdrv_block_status_above
> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> bdrv_block_status_above..
> 
> And I found the problem: bdrv_is_allocated_above considers space after EOF as
> UNALLOCATED for intermediate nodes..
> 
> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> not.. And it seems incorrect for me, as in case of short backing file, we'll read
> zeroes after EOF, instead of going further by backing chain.
> 
> This leads to the following effect:
> 
> ./qemu-img create -f qcow2 base.qcow2 2M
> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> 
> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> 
> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> 
> But after commit guest visible state is changed, which seems wrong for me:
> ./qemu-img commit top.qcow2 -b mid.qcow2
> 
> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> Pattern verification failed at offset 1048576, 1048576 bytes
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> 
> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> 
> 
> I don't know, is it a real bug, as I don't know, do we support backing
> file larger than its parent. Still, I'm not sure that this behavior of
> bdrv_is_allocated_above don't lead to other problems.

Actually, this specific problem is completely unrelated to how the block
status functions deal with short backing files because they are only
ever called for backing files of the same length as their overlay.

The problem is that the commit job grows the backing file first without
making sure that the clusters in the new part read as zeros. After this,
the damage is done and bdrv_is_allocated_above() returns correctly that
the blocks are unallocated both in top.qcow2 and in mid.qcow2.

So the simple fix for 4.2 would be the following. Maybe we can find a
way to optimise it later (though probably it's not worth it because
short backing files are an uncommon case anyway).

Kevin


diff --git a/block/commit.c b/block/commit.c
index 23c90b3b91..a0c4f51caf 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -159,6 +159,11 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
         if (ret) {
             goto out;
         }
+        ret = blk_co_pwrite_zeroes(s->base, base_len, len - base_len,
+                                   BDRV_REQ_MAY_UNMAP);
+        if (ret < 0) {
+            goto out;
+        }
     }
 
     buf = blk_blockalign(s->top, COMMIT_BUFFER_SIZE);
diff --git a/block/mirror.c b/block/mirror.c
index f0f2d9dff1..2a34f2fad6 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -883,6 +883,12 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
             if (ret < 0) {
                 goto immediate_exit;
             }
+            ret = blk_co_pwrite_zeroes(s->target, base_length,
+                                       s->bdev_length - base_length,
+                                       BDRV_REQ_MAY_UNMAP);
+            if (ret < 0) {
+                goto immediate_exit;
+            }
         }
     }

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago

Ping?

Hi! Why so silent? Postpone this to 5.0? This is fixing the same problem with block
commit, like Kevin's series, just commit not to mid but to base..

16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> I wanted to understand, what is the real difference between bdrv_block_status_above
> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> bdrv_block_status_above..
> 
> And I found the problem: bdrv_is_allocated_above considers space after EOF as
> UNALLOCATED for intermediate nodes..
> 
> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> not.. And it seems incorrect for me, as in case of short backing file, we'll read
> zeroes after EOF, instead of going further by backing chain.
> 
> This leads to the following effect:
> 
> ./qemu-img create -f qcow2 base.qcow2 2M
> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> 
> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> 
> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> 
> But after commit guest visible state is changed, which seems wrong for me:
> ./qemu-img commit top.qcow2 -b mid.qcow2
> 
> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> Pattern verification failed at offset 1048576, 1048576 bytes
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> 
> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> 
> 
> I don't know, is it a real bug, as I don't know, do we support backing file larger than
> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
> to other problems.
> 
> =====
> 
> Hmm, bdrv_block_allocated_above behaves strange too:
> 
> with want_zero=true, it may report unallocated zeroes because of short backing files, which
> are actually "allocated" in POV of backing chains. But I see this may influence only
> qemu-img compare, and I don't see can it trigger some bug..
> 
> with want_zero=false, it may do no progress because of short backing file. Moreover it may
> report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
> onlyt top layer, so it seems OK.
> 
> =====
> 
> So, I propose these series, still I'm not sure is there a real bug.
> 
> Vladimir Sementsov-Ogievskiy (4):
>    block/io: fix bdrv_co_block_status_above
>    block/io: bdrv_common_block_status_above: support include_base
>    block/io: bdrv_common_block_status_above: support bs == base
>    block/io: fix bdrv_is_allocated_above
> 
>   block/io.c                 | 104 ++++++++++++++++++-------------------
>   tests/qemu-iotests/154.out |   4 +-
>   2 files changed, 53 insertions(+), 55 deletions(-)
> 


-- 
Best regards,
Vladimir

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Kevin Wolf 6 years, 2 months ago

Am 25.11.2019 um 11:08 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Ping?
> 
> Hi! Why so silent? Postpone this to 5.0? This is fixing the same
> problem with block commit, like Kevin's series, just commit not to mid
> but to base..

To be honest, I think by now we've found so many problems around short
backing files, each with a non-trivial fix, that I don't think we can
have a reasonably complete fix in -rc3 without risking breaking
everything. None of the problems are new, in fact I think they have
existed since day one of resize/commit, and nobody has reported problems
before, so they can't be hitting a large number of users.

So, reluctantly, I have to admit that both series and whatever we'll add
on top are probably better kept for 5.0 (and 4.2.1) rather than added
very late into 4.2.

Kevin

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago

25.11.2019 18:46, Kevin Wolf wrote:
> Am 25.11.2019 um 11:08 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> Ping?
>>
>> Hi! Why so silent? Postpone this to 5.0? This is fixing the same
>> problem with block commit, like Kevin's series, just commit not to mid
>> but to base..
> 
> To be honest, I think by now we've found so many problems around short
> backing files, each with a non-trivial fix, that I don't think we can
> have a reasonably complete fix in -rc3 without risking breaking
> everything. None of the problems are new, in fact I think they have
> existed since day one of resize/commit, and nobody has reported problems
> before, so they can't be hitting a large number of users.
> 
> So, reluctantly, I have to admit that both series and whatever we'll add
> on top are probably better kept for 5.0 (and 4.2.1) rather than added
> very late into 4.2.
> 

OK, I agree.


-- 
Best regards,
Vladimir

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Max Reitz 6 years, 2 months ago

On 16.11.19 17:34, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> I wanted to understand, what is the real difference between bdrv_block_status_above
> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> bdrv_block_status_above..
> 
> And I found the problem: bdrv_is_allocated_above considers space after EOF as
> UNALLOCATED for intermediate nodes..
> 
> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> not.. And it seems incorrect for me, as in case of short backing file, we'll read
> zeroes after EOF, instead of going further by backing chain.

Should we, though?  It absolutely makes sense to me to consider post-EOF
space as unallocated because, well, it is as unallocated as it gets.

So from my POV it would make more sense to fall back to the backing file
for post-EOF reads.

OTOH, I don’t know whether changing that behavior would qualify as a
possible security issue now, because maybe someone has sensitive
information in the tail of some disk and then truncated the overlay so
as to hide it?  But honestly, that seems ridiculous and I can’t imagine
people to do that.  (It would work only for the tail, and why not just
write zeroes there, which works everywhere?)  So in practice I don’t
believe that to be a problem.

Max

> This leads to the following effect:
> 
> ./qemu-img create -f qcow2 base.qcow2 2M
> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> 
> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> 
> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> 
> But after commit guest visible state is changed, which seems wrong for me:
> ./qemu-img commit top.qcow2 -b mid.qcow2
> 
> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> Pattern verification failed at offset 1048576, 1048576 bytes
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> 
> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> 
> 
> I don't know, is it a real bug, as I don't know, do we support backing file larger than
> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
> to other problems.
> 
> =====
> 
> Hmm, bdrv_block_allocated_above behaves strange too:
> 
> with want_zero=true, it may report unallocated zeroes because of short backing files, which
> are actually "allocated" in POV of backing chains. But I see this may influence only
> qemu-img compare, and I don't see can it trigger some bug..
> 
> with want_zero=false, it may do no progress because of short backing file. Moreover it may
> report EOF in the middle!! But want_zero=false used only in bdrv_is_allocated, which considers
> onlyt top layer, so it seems OK. 
> 
> =====
> 
> So, I propose these series, still I'm not sure is there a real bug.
> 
> Vladimir Sementsov-Ogievskiy (4):
>   block/io: fix bdrv_co_block_status_above
>   block/io: bdrv_common_block_status_above: support include_base
>   block/io: bdrv_common_block_status_above: support bs == base
>   block/io: fix bdrv_is_allocated_above
> 
>  block/io.c                 | 104 ++++++++++++++++++-------------------
>  tests/qemu-iotests/154.out |   4 +-
>  2 files changed, 53 insertions(+), 55 deletions(-)
>

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Denis V. Lunev 6 years, 2 months ago

On 11/19/19 1:22 PM, Max Reitz wrote:
> On 16.11.19 17:34, Vladimir Sementsov-Ogievskiy wrote:
>> Hi all!
>>
>> I wanted to understand, what is the real difference between bdrv_block_status_above
>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>> bdrv_block_status_above..
>>
>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>> UNALLOCATED for intermediate nodes..
>>
>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>> zeroes after EOF, instead of going further by backing chain.
> Should we, though?  It absolutely makes sense to me to consider post-EOF
> space as unallocated because, well, it is as unallocated as it gets.
>
> So from my POV it would make more sense to fall back to the backing file
> for post-EOF reads.
>
> OTOH, I don’t know whether changing that behavior would qualify as a
> possible security issue now, because maybe someone has sensitive
> information in the tail of some disk and then truncated the overlay so
> as to hide it?  But honestly, that seems ridiculous and I can’t imagine
> people to do that.  (It would work only for the tail, and why not just
> write zeroes there, which works everywhere?)  So in practice I don’t
> believe that to be a problem.
>
> Max

That seems to be wrong from my POW. Once we get block device truncated,
it exposed that tail to the guest with all zeroes.

Let us assume that we have virtual disk of length L. We create new top
delta of
length X (less then L) and new top delta after with length Y (more than L),
like the following:

[.........................] Y
[........] X
[...................] L

Once the guest creates FS  on state Y it relies on the fact that data from X
to Y is all zeroes.

Any operations with backing chain must keep guest content to be tha same,
i.e. if we commit from Y to L, virtual disk content should be preserved,
i.e.
read as all zero even if there is some data in L from X to L.

If we commit from X to Y, the range from X to L should remain all zeroes.

This is especially valid for backups, which can not be changed and are
validated by the software from time to time.

Does this makes sense?

Den

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago

19.11.2019 15:02, Denis Lunev wrote:
> On 11/19/19 1:22 PM, Max Reitz wrote:
>> On 16.11.19 17:34, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi all!
>>>
>>> I wanted to understand, what is the real difference between bdrv_block_status_above
>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>>> bdrv_block_status_above..
>>>
>>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>>> UNALLOCATED for intermediate nodes..
>>>
>>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>>> zeroes after EOF, instead of going further by backing chain.
>> Should we, though?  It absolutely makes sense to me to consider post-EOF
>> space as unallocated because, well, it is as unallocated as it gets.
>>
>> So from my POV it would make more sense to fall back to the backing file
>> for post-EOF reads.
>>
>> OTOH, I don’t know whether changing that behavior would qualify as a
>> possible security issue now, because maybe someone has sensitive
>> information in the tail of some disk and then truncated the overlay so
>> as to hide it?  But honestly, that seems ridiculous and I can’t imagine
>> people to do that.  (It would work only for the tail, and why not just
>> write zeroes there, which works everywhere?)  So in practice I don’t
>> believe that to be a problem.
>>
>> Max
> 
> That seems to be wrong from my POW. Once we get block device truncated,
> it exposed that tail to the guest with all zeroes.
> 
> Let us assume that we have virtual disk of length L. We create new top
> delta of
> length X (less then L) and new top delta after with length Y (more than L),
> like the following:
> 
> [.........................] Y
> [........] X
> [...................] L
> 
> Once the guest creates FS  on state Y it relies on the fact that data from X
> to Y is all zeroes.
> 
> Any operations with backing chain must keep guest content to be tha same,
> i.e. if we commit from Y to L, virtual disk content should be preserved,
> i.e.
> read as all zero even if there is some data in L from X to L.
> 
> If we commit from X to Y, the range from X to L should remain all zeroes.
> 
> This is especially valid for backups, which can not be changed and are
> validated by the software from time to time.
> 
> Does this makes sense?
> 
> Den
> 

Yes, if we consider space after EOF as unallocated, incremental backups are broken,
consider the following sequence:

starting with disk of size L
full backup -> state BASE
shrink disk to size X
incremental backup to empty qcow2 of size X, with backing to BASE -> state INC1
expand disk to size Y
incremental backup to empty qcow2 of size Y, with backing to INC1 -> state INC2

Now, if we read from backup INC2, we'll see data from BASE in range [X, L], which
we must not see.


INC2 [.........................] Y
INC1 [........] X
BASE [...................] L


Also, this example shows, that these series fixes real use-case: merge of some
incremental backups (by commit).

-- 
Best regards,
Vladimir

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Max Reitz 6 years, 2 months ago

On 19.11.19 13:02, Denis V. Lunev wrote:
> On 11/19/19 1:22 PM, Max Reitz wrote:
>> On 16.11.19 17:34, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi all!
>>>
>>> I wanted to understand, what is the real difference between bdrv_block_status_above
>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>>> bdrv_block_status_above..
>>>
>>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>>> UNALLOCATED for intermediate nodes..
>>>
>>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>>> zeroes after EOF, instead of going further by backing chain.
>> Should we, though?  It absolutely makes sense to me to consider post-EOF
>> space as unallocated because, well, it is as unallocated as it gets.
>>
>> So from my POV it would make more sense to fall back to the backing file
>> for post-EOF reads.
>>
>> OTOH, I don’t know whether changing that behavior would qualify as a
>> possible security issue now, because maybe someone has sensitive
>> information in the tail of some disk and then truncated the overlay so
>> as to hide it?  But honestly, that seems ridiculous and I can’t imagine
>> people to do that.  (It would work only for the tail, and why not just
>> write zeroes there, which works everywhere?)  So in practice I don’t
>> believe that to be a problem.
>>
>> Max
> 
> That seems to be wrong from my POW. Once we get block device truncated,
> it exposed that tail to the guest with all zeroes.
> 
> Let us assume that we have virtual disk of length L. We create new top
> delta of
> length X (less then L) and new top delta after with length Y (more than L),
> like the following:
> 
> [.........................] Y
> [........] X
> [...................] L
> 
> Once the guest creates FS  on state Y it relies on the fact that data from X
> to Y is all zeroes.
> 
> Any operations with backing chain must keep guest content to be tha same,
> i.e. if we commit from Y to L, virtual disk content should be preserved,
> i.e.
> read as all zero even if there is some data in L from X to L.
> 
> If we commit from X to Y, the range from X to L should remain all zeroes.
> 
> This is especially valid for backups, which can not be changed and are
> validated by the software from time to time.
> 
> Does this makes sense?

All right then.  But then there’s the case of commit not shrinking the
backing file, so the guest content won’t be the same if you commit a
short overlay into a longer backing file.

Max

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago

19.11.2019 15:20, Max Reitz wrote:
> On 19.11.19 13:02, Denis V. Lunev wrote:
>> On 11/19/19 1:22 PM, Max Reitz wrote:
>>> On 16.11.19 17:34, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> I wanted to understand, what is the real difference between bdrv_block_status_above
>>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>>>> bdrv_block_status_above..
>>>>
>>>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>>>> UNALLOCATED for intermediate nodes..
>>>>
>>>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>>>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>>>> zeroes after EOF, instead of going further by backing chain.
>>> Should we, though?  It absolutely makes sense to me to consider post-EOF
>>> space as unallocated because, well, it is as unallocated as it gets.
>>>
>>> So from my POV it would make more sense to fall back to the backing file
>>> for post-EOF reads.
>>>
>>> OTOH, I don’t know whether changing that behavior would qualify as a
>>> possible security issue now, because maybe someone has sensitive
>>> information in the tail of some disk and then truncated the overlay so
>>> as to hide it?  But honestly, that seems ridiculous and I can’t imagine
>>> people to do that.  (It would work only for the tail, and why not just
>>> write zeroes there, which works everywhere?)  So in practice I don’t
>>> believe that to be a problem.
>>>
>>> Max
>>
>> That seems to be wrong from my POW. Once we get block device truncated,
>> it exposed that tail to the guest with all zeroes.
>>
>> Let us assume that we have virtual disk of length L. We create new top
>> delta of
>> length X (less then L) and new top delta after with length Y (more than L),
>> like the following:
>>
>> [.........................] Y
>> [........] X
>> [...................] L
>>
>> Once the guest creates FS  on state Y it relies on the fact that data from X
>> to Y is all zeroes.
>>
>> Any operations with backing chain must keep guest content to be tha same,
>> i.e. if we commit from Y to L, virtual disk content should be preserved,
>> i.e.
>> read as all zero even if there is some data in L from X to L.
>>
>> If we commit from X to Y, the range from X to L should remain all zeroes.
>>
>> This is especially valid for backups, which can not be changed and are
>> validated by the software from time to time.
>>
>> Does this makes sense?
> 
> All right then.  But then there’s the case of commit not shrinking the
> backing file, so the guest content won’t be the same if you commit a
> short overlay into a longer backing file.
> 

Hmm. Isn't commit target truncated to source before operation?


-- 
Best regards,
Vladimir

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Kevin Wolf 6 years, 2 months ago

Am 19.11.2019 um 13:30 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 19.11.2019 15:20, Max Reitz wrote:
> > On 19.11.19 13:02, Denis V. Lunev wrote:
> >> On 11/19/19 1:22 PM, Max Reitz wrote:
> >>> On 16.11.19 17:34, Vladimir Sementsov-Ogievskiy wrote:
> >>>> Hi all!
> >>>>
> >>>> I wanted to understand, what is the real difference between bdrv_block_status_above
> >>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> >>>> bdrv_block_status_above..
> >>>>
> >>>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
> >>>> UNALLOCATED for intermediate nodes..
> >>>>
> >>>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> >>>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
> >>>> zeroes after EOF, instead of going further by backing chain.
> >>> Should we, though?  It absolutely makes sense to me to consider post-EOF
> >>> space as unallocated because, well, it is as unallocated as it gets.
> >>>
> >>> So from my POV it would make more sense to fall back to the backing file
> >>> for post-EOF reads.
> >>>
> >>> OTOH, I don’t know whether changing that behavior would qualify as a
> >>> possible security issue now, because maybe someone has sensitive
> >>> information in the tail of some disk and then truncated the overlay so
> >>> as to hide it?  But honestly, that seems ridiculous and I can’t imagine
> >>> people to do that.  (It would work only for the tail, and why not just
> >>> write zeroes there, which works everywhere?)  So in practice I don’t
> >>> believe that to be a problem.
> >>>
> >>> Max
> >>
> >> That seems to be wrong from my POW. Once we get block device truncated,
> >> it exposed that tail to the guest with all zeroes.
> >>
> >> Let us assume that we have virtual disk of length L. We create new top
> >> delta of
> >> length X (less then L) and new top delta after with length Y (more than L),
> >> like the following:
> >>
> >> [.........................] Y
> >> [........] X
> >> [...................] L
> >>
> >> Once the guest creates FS  on state Y it relies on the fact that data from X
> >> to Y is all zeroes.
> >>
> >> Any operations with backing chain must keep guest content to be tha same,
> >> i.e. if we commit from Y to L, virtual disk content should be preserved,
> >> i.e.
> >> read as all zero even if there is some data in L from X to L.
> >>
> >> If we commit from X to Y, the range from X to L should remain all zeroes.
> >>
> >> This is especially valid for backups, which can not be changed and are
> >> validated by the software from time to time.
> >>
> >> Does this makes sense?
> > 
> > All right then.  But then there’s the case of commit not shrinking the
> > backing file, so the guest content won’t be the same if you commit a
> > short overlay into a longer backing file.
> 
> Hmm. Isn't commit target truncated to source before operation?

Only if the target is smaller than the source.

Maybe we should change that, because I don't think it's expected that a
guest sees a larger disk, where old data reappears, after resizing
(shrinking) the active layer and then commiting it to the backing file.

Kevin

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Kevin Wolf 6 years, 2 months ago

Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Hi all!
> 
> I wanted to understand, what is the real difference between
> bdrv_block_status_above and bdrv_is_allocated_above, IMHO
> bdrv_is_allocated_above should work through bdrv_block_status_above..
> 
> And I found the problem: bdrv_is_allocated_above considers space after
> EOF as UNALLOCATED for intermediate nodes..
> 
> UNALLOCATED is not about allocation at fs level, but about should we
> go to backing or not.. And it seems incorrect for me, as in case of
> short backing file, we'll read zeroes after EOF, instead of going
> further by backing chain.

We actually have documentation what it means:

 * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
 *                       layer rather than any backing, set by block layer

Say we have a short overlay, like this:

base.qcow2:     AAAAAAAA
overlay.qcow2:  BBBB

Then of course the content of block 5 (the one after EOF of
overlay.qcow2) is still determined by overlay.qcow2, which can be easily
verified by reading it from overlay.qcow2 (produces zeros) and from
base.qcow2 (produces As).

So the correct result when querying the block status of block 5 on
overlay.qcow2 is BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.

Interestingly, we already fixed the opposite case (large overlay over
short backing file) in commit e88ae2264d9 from May 2014 according to the
same logic.

> This leads to the following effect:
> 
> ./qemu-img create -f qcow2 base.qcow2 2M
> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> 
> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> 
> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> 
> But after commit guest visible state is changed, which seems wrong for me:
> ./qemu-img commit top.qcow2 -b mid.qcow2
> 
> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> Pattern verification failed at offset 1048576, 1048576 bytes
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> 
> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> 
> 
> I don't know, is it a real bug, as I don't know, do we support backing
> file larger than its parent. Still, I'm not sure that this behavior of
> bdrv_is_allocated_above don't lead to other problems.

I agree it's a bug.

Your fix doesn't look right to me, though. You leave the buggy behaviour
of bdrv_co_block_status() as it is and then add four patches to work
around it in some (but not all) callers of it.

All that it should take to fix this is making the bs->backing check
independent from want_zero and let it set ALLOCATED. What I expected
would be something like the below patch.

But it doesn't seem to fully fix the problem (though 'alloc 1M 1M' in
qemu-io shows that the range is now considered allocated), so probably
there is still a separate bug in bdrv_is_allocated_above().

And I think we'll want an iotests case for both cases (short overlay,
short backing file).

Kevin


diff --git a/block/io.c b/block/io.c
index f75777f5ea..5eafcff01a 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2359,16 +2359,17 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
 
     if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
         ret |= BDRV_BLOCK_ALLOCATED;
-    } else if (want_zero) {
-        if (bdrv_unallocated_blocks_are_zero(bs)) {
-            ret |= BDRV_BLOCK_ZERO;
-        } else if (bs->backing) {
-            BlockDriverState *bs2 = bs->backing->bs;
-            int64_t size2 = bdrv_getlength(bs2);
-
-            if (size2 >= 0 && offset >= size2) {
+    } else if (want_zero && bdrv_unallocated_blocks_are_zero(bs)) {
+        ret |= BDRV_BLOCK_ZERO;
+    } else if (bs->backing) {
+        BlockDriverState *bs2 = bs->backing->bs;
+        int64_t size2 = bdrv_getlength(bs2);
+
+        if (size2 >= 0 && offset >= size2) {
+            if (want_zero) {
                 ret |= BDRV_BLOCK_ZERO;
             }
+            ret |= BDRV_BLOCK_ALLOCATED;
         }
     }

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago

19.11.2019 15:05, Kevin Wolf wrote:
> Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> Hi all!
>>
>> I wanted to understand, what is the real difference between
>> bdrv_block_status_above and bdrv_is_allocated_above, IMHO
>> bdrv_is_allocated_above should work through bdrv_block_status_above..
>>
>> And I found the problem: bdrv_is_allocated_above considers space after
>> EOF as UNALLOCATED for intermediate nodes..
>>
>> UNALLOCATED is not about allocation at fs level, but about should we
>> go to backing or not.. And it seems incorrect for me, as in case of
>> short backing file, we'll read zeroes after EOF, instead of going
>> further by backing chain.
> 
> We actually have documentation what it means:
> 
>   * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>   *                       layer rather than any backing, set by block layer
> 
> Say we have a short overlay, like this:
> 
> base.qcow2:     AAAAAAAA
> overlay.qcow2:  BBBB
> 
> Then of course the content of block 5 (the one after EOF of
> overlay.qcow2) is still determined by overlay.qcow2, which can be easily
> verified by reading it from overlay.qcow2 (produces zeros) and from
> base.qcow2 (produces As).
> 
> So the correct result when querying the block status of block 5 on
> overlay.qcow2 is BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.
> 
> Interestingly, we already fixed the opposite case (large overlay over
> short backing file) in commit e88ae2264d9 from May 2014 according to the
> same logic.
> 
>> This leads to the following effect:
>>
>> ./qemu-img create -f qcow2 base.qcow2 2M
>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>
>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>
>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>
>> But after commit guest visible state is changed, which seems wrong for me:
>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>
>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>> Pattern verification failed at offset 1048576, 1048576 bytes
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>
>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>
>>
>> I don't know, is it a real bug, as I don't know, do we support backing
>> file larger than its parent. Still, I'm not sure that this behavior of
>> bdrv_is_allocated_above don't lead to other problems.
> 
> I agree it's a bug.
> 
> Your fix doesn't look right to me, though. You leave the buggy behaviour
> of bdrv_co_block_status() as it is and then add four patches to work
> around it in some (but not all) callers of it.
> 
> All that it should take to fix this is making the bs->backing check
> independent from want_zero and let it set ALLOCATED. What I expected
> would be something like the below patch.
> 
> But it doesn't seem to fully fix the problem (though 'alloc 1M 1M' in
> qemu-io shows that the range is now considered allocated), so probably
> there is still a separate bug in bdrv_is_allocated_above().
> 
> And I think we'll want an iotests case for both cases (short overlay,
> short backing file).
> 
> Kevin
> 
> 
> diff --git a/block/io.c b/block/io.c
> index f75777f5ea..5eafcff01a 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2359,16 +2359,17 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>   
>       if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
>           ret |= BDRV_BLOCK_ALLOCATED;
> -    } else if (want_zero) {
> -        if (bdrv_unallocated_blocks_are_zero(bs)) {
> -            ret |= BDRV_BLOCK_ZERO;
> -        } else if (bs->backing) {
> -            BlockDriverState *bs2 = bs->backing->bs;
> -            int64_t size2 = bdrv_getlength(bs2);
> -
> -            if (size2 >= 0 && offset >= size2) {
> +    } else if (want_zero && bdrv_unallocated_blocks_are_zero(bs)) {
> +        ret |= BDRV_BLOCK_ZERO;
> +    } else if (bs->backing) {
> +        BlockDriverState *bs2 = bs->backing->bs;
> +        int64_t size2 = bdrv_getlength(bs2);
> +
> +        if (size2 >= 0 && offset >= size2) {
> +            if (want_zero) {
>                   ret |= BDRV_BLOCK_ZERO;
>               }
> +            ret |= BDRV_BLOCK_ALLOCATED;

No, I think it's wrong, as it's not allocated at bs, but at bs->backing->bs.


So, bdrv_co_block_status works correct, what we can change about it, is not
to return pnum=0 if requested range after eof, but return pnum=bytes, together
with BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO | BDRV_BLOCK_EOF.

And actual problem is in bdrv_block_status_above and bdrv_is_allocated_above, which
I'm fixing.



-- 
Best regards,
Vladimir

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Kevin Wolf 6 years, 2 months ago

Am 19.11.2019 um 13:17 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 19.11.2019 15:05, Kevin Wolf wrote:
> > Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> Hi all!
> >>
> >> I wanted to understand, what is the real difference between
> >> bdrv_block_status_above and bdrv_is_allocated_above, IMHO
> >> bdrv_is_allocated_above should work through bdrv_block_status_above..
> >>
> >> And I found the problem: bdrv_is_allocated_above considers space after
> >> EOF as UNALLOCATED for intermediate nodes..
> >>
> >> UNALLOCATED is not about allocation at fs level, but about should we
> >> go to backing or not.. And it seems incorrect for me, as in case of
> >> short backing file, we'll read zeroes after EOF, instead of going
> >> further by backing chain.
> > 
> > We actually have documentation what it means:
> > 
> >   * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
> >   *                       layer rather than any backing, set by block layer
> > 
> > Say we have a short overlay, like this:
> > 
> > base.qcow2:     AAAAAAAA
> > overlay.qcow2:  BBBB
> > 
> > Then of course the content of block 5 (the one after EOF of
> > overlay.qcow2) is still determined by overlay.qcow2, which can be easily
> > verified by reading it from overlay.qcow2 (produces zeros) and from
> > base.qcow2 (produces As).
> > 
> > So the correct result when querying the block status of block 5 on
> > overlay.qcow2 is BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.

Do you agree with this assessment?

> > Interestingly, we already fixed the opposite case (large overlay over
> > short backing file) in commit e88ae2264d9 from May 2014 according to the
> > same logic.
> > 
> >> This leads to the following effect:
> >>
> >> ./qemu-img create -f qcow2 base.qcow2 2M
> >> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> >>
> >> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> >> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> >>
> >> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> >> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> >> read 1048576/1048576 bytes at offset 1048576
> >> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> >>
> >> But after commit guest visible state is changed, which seems wrong for me:
> >> ./qemu-img commit top.qcow2 -b mid.qcow2
> >>
> >> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> >> Pattern verification failed at offset 1048576, 1048576 bytes
> >> read 1048576/1048576 bytes at offset 1048576
> >> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> >>
> >> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> >> read 1048576/1048576 bytes at offset 1048576
> >> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> >>
> >>
> >> I don't know, is it a real bug, as I don't know, do we support backing
> >> file larger than its parent. Still, I'm not sure that this behavior of
> >> bdrv_is_allocated_above don't lead to other problems.
> > 
> > I agree it's a bug.
> > 
> > Your fix doesn't look right to me, though. You leave the buggy behaviour
> > of bdrv_co_block_status() as it is and then add four patches to work
> > around it in some (but not all) callers of it.
> > 
> > All that it should take to fix this is making the bs->backing check
> > independent from want_zero and let it set ALLOCATED. What I expected
> > would be something like the below patch.
> > 
> > But it doesn't seem to fully fix the problem (though 'alloc 1M 1M' in
> > qemu-io shows that the range is now considered allocated), so probably
> > there is still a separate bug in bdrv_is_allocated_above().
> > 
> > And I think we'll want an iotests case for both cases (short overlay,
> > short backing file).
> > 
> > Kevin
> > 
> > 
> > diff --git a/block/io.c b/block/io.c
> > index f75777f5ea..5eafcff01a 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -2359,16 +2359,17 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
> >   
> >       if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
> >           ret |= BDRV_BLOCK_ALLOCATED;
> > -    } else if (want_zero) {
> > -        if (bdrv_unallocated_blocks_are_zero(bs)) {
> > -            ret |= BDRV_BLOCK_ZERO;
> > -        } else if (bs->backing) {
> > -            BlockDriverState *bs2 = bs->backing->bs;
> > -            int64_t size2 = bdrv_getlength(bs2);
> > -
> > -            if (size2 >= 0 && offset >= size2) {
> > +    } else if (want_zero && bdrv_unallocated_blocks_are_zero(bs)) {
> > +        ret |= BDRV_BLOCK_ZERO;
> > +    } else if (bs->backing) {
> > +        BlockDriverState *bs2 = bs->backing->bs;
> > +        int64_t size2 = bdrv_getlength(bs2);
> > +
> > +        if (size2 >= 0 && offset >= size2) {
> > +            if (want_zero) {
> >                   ret |= BDRV_BLOCK_ZERO;
> >               }
> > +            ret |= BDRV_BLOCK_ALLOCATED;
> 
> No, I think it's wrong, as it's not allocated at bs, but at
> bs->backing->bs.

This code implements the logic that I explained above. In my example,
the zeros are produced by overlay.qcow2 (because we're reading after its
EOF), not by base.qcow2.

If it were allocated at base.qcow2, you wouldn't read zeros, but the
data from (the longer) base.qcow2.

The code I posted isn't quite correct yet, but I think that the core of
the problem is already present in bdrv_co_block_status() and not only in
its callers.

Kevin

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago

19.11.2019 15:17, Vladimir Sementsov-Ogievskiy wrote:
> 19.11.2019 15:05, Kevin Wolf wrote:
>> Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>> Hi all!
>>>
>>> I wanted to understand, what is the real difference between
>>> bdrv_block_status_above and bdrv_is_allocated_above, IMHO
>>> bdrv_is_allocated_above should work through bdrv_block_status_above..
>>>
>>> And I found the problem: bdrv_is_allocated_above considers space after
>>> EOF as UNALLOCATED for intermediate nodes..
>>>
>>> UNALLOCATED is not about allocation at fs level, but about should we
>>> go to backing or not.. And it seems incorrect for me, as in case of
>>> short backing file, we'll read zeroes after EOF, instead of going
>>> further by backing chain.
>>
>> We actually have documentation what it means:
>>
>>   * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>>   *                       layer rather than any backing, set by block layer
>>
>> Say we have a short overlay, like this:
>>
>> base.qcow2:     AAAAAAAA
>> overlay.qcow2:  BBBB
>>
>> Then of course the content of block 5 (the one after EOF of
>> overlay.qcow2) is still determined by overlay.qcow2, which can be easily
>> verified by reading it from overlay.qcow2 (produces zeros) and from
>> base.qcow2 (produces As).
>>
>> So the correct result when querying the block status of block 5 on
>> overlay.qcow2 is BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.
>>
>> Interestingly, we already fixed the opposite case (large overlay over
>> short backing file) in commit e88ae2264d9 from May 2014 according to the
>> same logic.
>>
>>> This leads to the following effect:
>>>
>>> ./qemu-img create -f qcow2 base.qcow2 2M
>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>>
>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>>
>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>>> read 1048576/1048576 bytes at offset 1048576
>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>>
>>> But after commit guest visible state is changed, which seems wrong for me:
>>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>>
>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>> read 1048576/1048576 bytes at offset 1048576
>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>>
>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>>> read 1048576/1048576 bytes at offset 1048576
>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>>
>>>
>>> I don't know, is it a real bug, as I don't know, do we support backing
>>> file larger than its parent. Still, I'm not sure that this behavior of
>>> bdrv_is_allocated_above don't lead to other problems.
>>
>> I agree it's a bug.
>>
>> Your fix doesn't look right to me, though. You leave the buggy behaviour
>> of bdrv_co_block_status() as it is and then add four patches to work
>> around it in some (but not all) callers of it.
>>
>> All that it should take to fix this is making the bs->backing check
>> independent from want_zero and let it set ALLOCATED. What I expected
>> would be something like the below patch.
>>
>> But it doesn't seem to fully fix the problem (though 'alloc 1M 1M' in
>> qemu-io shows that the range is now considered allocated), so probably
>> there is still a separate bug in bdrv_is_allocated_above().
>>
>> And I think we'll want an iotests case for both cases (short overlay,
>> short backing file).
>>
>> Kevin
>>
>>
>> diff --git a/block/io.c b/block/io.c
>> index f75777f5ea..5eafcff01a 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -2359,16 +2359,17 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>>       if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
>>           ret |= BDRV_BLOCK_ALLOCATED;
>> -    } else if (want_zero) {
>> -        if (bdrv_unallocated_blocks_are_zero(bs)) {
>> -            ret |= BDRV_BLOCK_ZERO;
>> -        } else if (bs->backing) {
>> -            BlockDriverState *bs2 = bs->backing->bs;
>> -            int64_t size2 = bdrv_getlength(bs2);
>> -
>> -            if (size2 >= 0 && offset >= size2) {
>> +    } else if (want_zero && bdrv_unallocated_blocks_are_zero(bs)) {
>> +        ret |= BDRV_BLOCK_ZERO;
>> +    } else if (bs->backing) {
>> +        BlockDriverState *bs2 = bs->backing->bs;
>> +        int64_t size2 = bdrv_getlength(bs2);
>> +
>> +        if (size2 >= 0 && offset >= size2) {
>> +            if (want_zero) {
>>                   ret |= BDRV_BLOCK_ZERO;
>>               }
>> +            ret |= BDRV_BLOCK_ALLOCATED;
> 
> No, I think it's wrong, as it's not allocated at bs, but at bs->backing->bs.
> 
> 
> So, bdrv_co_block_status works correct, what we can change about it, is not
> to return pnum=0 if requested range after eof, but return pnum=bytes, together
> with BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO | BDRV_BLOCK_EOF.

But this will break users which rely on pnum=0.

> 
> And actual problem is in bdrv_block_status_above and bdrv_is_allocated_above, which
> I'm fixing.
> 
> 
> 


-- 
Best regards,
Vladimir

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago

19.11.2019 15:32, Vladimir Sementsov-Ogievskiy wrote:
> 19.11.2019 15:17, Vladimir Sementsov-Ogievskiy wrote:
>> 19.11.2019 15:05, Kevin Wolf wrote:
>>> Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> Hi all!
>>>>
>>>> I wanted to understand, what is the real difference between
>>>> bdrv_block_status_above and bdrv_is_allocated_above, IMHO
>>>> bdrv_is_allocated_above should work through bdrv_block_status_above..
>>>>
>>>> And I found the problem: bdrv_is_allocated_above considers space after
>>>> EOF as UNALLOCATED for intermediate nodes..
>>>>
>>>> UNALLOCATED is not about allocation at fs level, but about should we
>>>> go to backing or not.. And it seems incorrect for me, as in case of
>>>> short backing file, we'll read zeroes after EOF, instead of going
>>>> further by backing chain.
>>>
>>> We actually have documentation what it means:
>>>
>>>   * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>>>   *                       layer rather than any backing, set by block layer
>>>
>>> Say we have a short overlay, like this:
>>>
>>> base.qcow2:     AAAAAAAA
>>> overlay.qcow2:  BBBB
>>>
>>> Then of course the content of block 5 (the one after EOF of
>>> overlay.qcow2) is still determined by overlay.qcow2, which can be easily
>>> verified by reading it from overlay.qcow2 (produces zeros) and from
>>> base.qcow2 (produces As).
>>>
>>> So the correct result when querying the block status of block 5 on
>>> overlay.qcow2 is BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.
>>>
>>> Interestingly, we already fixed the opposite case (large overlay over
>>> short backing file) in commit e88ae2264d9 from May 2014 according to the
>>> same logic.
>>>
>>>> This leads to the following effect:
>>>>
>>>> ./qemu-img create -f qcow2 base.qcow2 2M
>>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>>>
>>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>>>
>>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>>>
>>>> But after commit guest visible state is changed, which seems wrong for me:
>>>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>>>
>>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>>>
>>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>>>
>>>>
>>>> I don't know, is it a real bug, as I don't know, do we support backing
>>>> file larger than its parent. Still, I'm not sure that this behavior of
>>>> bdrv_is_allocated_above don't lead to other problems.
>>>
>>> I agree it's a bug.
>>>
>>> Your fix doesn't look right to me, though. You leave the buggy behaviour
>>> of bdrv_co_block_status() as it is and then add four patches to work
>>> around it in some (but not all) callers of it.
>>>
>>> All that it should take to fix this is making the bs->backing check
>>> independent from want_zero and let it set ALLOCATED. What I expected
>>> would be something like the below patch.
>>>
>>> But it doesn't seem to fully fix the problem (though 'alloc 1M 1M' in
>>> qemu-io shows that the range is now considered allocated), so probably
>>> there is still a separate bug in bdrv_is_allocated_above().
>>>
>>> And I think we'll want an iotests case for both cases (short overlay,
>>> short backing file).
>>>
>>> Kevin
>>>
>>>
>>> diff --git a/block/io.c b/block/io.c
>>> index f75777f5ea..5eafcff01a 100644
>>> --- a/block/io.c
>>> +++ b/block/io.c
>>> @@ -2359,16 +2359,17 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>>>       if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
>>>           ret |= BDRV_BLOCK_ALLOCATED;
>>> -    } else if (want_zero) {
>>> -        if (bdrv_unallocated_blocks_are_zero(bs)) {
>>> -            ret |= BDRV_BLOCK_ZERO;
>>> -        } else if (bs->backing) {
>>> -            BlockDriverState *bs2 = bs->backing->bs;
>>> -            int64_t size2 = bdrv_getlength(bs2);
>>> -
>>> -            if (size2 >= 0 && offset >= size2) {
>>> +    } else if (want_zero && bdrv_unallocated_blocks_are_zero(bs)) {
>>> +        ret |= BDRV_BLOCK_ZERO;
>>> +    } else if (bs->backing) {
>>> +        BlockDriverState *bs2 = bs->backing->bs;
>>> +        int64_t size2 = bdrv_getlength(bs2);
>>> +
>>> +        if (size2 >= 0 && offset >= size2) {
>>> +            if (want_zero) {
>>>                   ret |= BDRV_BLOCK_ZERO;
>>>               }
>>> +            ret |= BDRV_BLOCK_ALLOCATED;
>>
>> No, I think it's wrong, as it's not allocated at bs, but at bs->backing->bs.
>>
>>
>> So, bdrv_co_block_status works correct, what we can change about it, is not
>> to return pnum=0 if requested range after eof, but return pnum=bytes, together
>> with BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO | BDRV_BLOCK_EOF.
> 
> But this will break users which rely on pnum=0.

Possibly, we may add flag to bdrv_co_block_status, to behave in such manner, and
set it to true when call from bdrv_block_status_above/bdrv_is_allocated_above.

> 
>>
>> And actual problem is in bdrv_block_status_above and bdrv_is_allocated_above, which
>> I'm fixing.
>>
>>
>>
> 
> 


-- 
Best regards,
Vladimir

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago

19.11.2019 15:34, Vladimir Sementsov-Ogievskiy wrote:
> 19.11.2019 15:32, Vladimir Sementsov-Ogievskiy wrote:
>> 19.11.2019 15:17, Vladimir Sementsov-Ogievskiy wrote:
>>> 19.11.2019 15:05, Kevin Wolf wrote:
>>>> Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>> Hi all!
>>>>>
>>>>> I wanted to understand, what is the real difference between
>>>>> bdrv_block_status_above and bdrv_is_allocated_above, IMHO
>>>>> bdrv_is_allocated_above should work through bdrv_block_status_above..
>>>>>
>>>>> And I found the problem: bdrv_is_allocated_above considers space after
>>>>> EOF as UNALLOCATED for intermediate nodes..
>>>>>
>>>>> UNALLOCATED is not about allocation at fs level, but about should we
>>>>> go to backing or not.. And it seems incorrect for me, as in case of
>>>>> short backing file, we'll read zeroes after EOF, instead of going
>>>>> further by backing chain.
>>>>
>>>> We actually have documentation what it means:
>>>>
>>>>   * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>>>>   *                       layer rather than any backing, set by block layer
>>>>
>>>> Say we have a short overlay, like this:
>>>>
>>>> base.qcow2:     AAAAAAAA
>>>> overlay.qcow2:  BBBB
>>>>
>>>> Then of course the content of block 5 (the one after EOF of
>>>> overlay.qcow2) is still determined by overlay.qcow2, which can be easily
>>>> verified by reading it from overlay.qcow2 (produces zeros) and from
>>>> base.qcow2 (produces As).
>>>>
>>>> So the correct result when querying the block status of block 5 on
>>>> overlay.qcow2 is BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.
>>>>
>>>> Interestingly, we already fixed the opposite case (large overlay over
>>>> short backing file) in commit e88ae2264d9 from May 2014 according to the
>>>> same logic.
>>>>
>>>>> This leads to the following effect:
>>>>>
>>>>> ./qemu-img create -f qcow2 base.qcow2 2M
>>>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>>>>
>>>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>>>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>>>>
>>>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>>>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>>>>> read 1048576/1048576 bytes at offset 1048576
>>>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>>>>
>>>>> But after commit guest visible state is changed, which seems wrong for me:
>>>>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>>>>
>>>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>>>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>>>> read 1048576/1048576 bytes at offset 1048576
>>>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>>>>
>>>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>>>>> read 1048576/1048576 bytes at offset 1048576
>>>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>>>>
>>>>>
>>>>> I don't know, is it a real bug, as I don't know, do we support backing
>>>>> file larger than its parent. Still, I'm not sure that this behavior of
>>>>> bdrv_is_allocated_above don't lead to other problems.
>>>>
>>>> I agree it's a bug.
>>>>
>>>> Your fix doesn't look right to me, though. You leave the buggy behaviour
>>>> of bdrv_co_block_status() as it is and then add four patches to work
>>>> around it in some (but not all) callers of it.

The only caller is bdrv_co_block_status_above.

>>>>
>>>> All that it should take to fix this is making the bs->backing check
>>>> independent from want_zero and let it set ALLOCATED. What I expected
>>>> would be something like the below patch.
>>>>
>>>> But it doesn't seem to fully fix the problem (though 'alloc 1M 1M' in
>>>> qemu-io shows that the range is now considered allocated), so probably
>>>> there is still a separate bug in bdrv_is_allocated_above().
>>>>
>>>> And I think we'll want an iotests case for both cases (short overlay,
>>>> short backing file).
>>>>
>>>> Kevin
>>>>
>>>>
>>>> diff --git a/block/io.c b/block/io.c
>>>> index f75777f5ea..5eafcff01a 100644
>>>> --- a/block/io.c
>>>> +++ b/block/io.c
>>>> @@ -2359,16 +2359,17 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>>>>       if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
>>>>           ret |= BDRV_BLOCK_ALLOCATED;
>>>> -    } else if (want_zero) {
>>>> -        if (bdrv_unallocated_blocks_are_zero(bs)) {
>>>> -            ret |= BDRV_BLOCK_ZERO;
>>>> -        } else if (bs->backing) {
>>>> -            BlockDriverState *bs2 = bs->backing->bs;
>>>> -            int64_t size2 = bdrv_getlength(bs2);
>>>> -
>>>> -            if (size2 >= 0 && offset >= size2) {
>>>> +    } else if (want_zero && bdrv_unallocated_blocks_are_zero(bs)) {
>>>> +        ret |= BDRV_BLOCK_ZERO;
>>>> +    } else if (bs->backing) {
>>>> +        BlockDriverState *bs2 = bs->backing->bs;
>>>> +        int64_t size2 = bdrv_getlength(bs2);
>>>> +
>>>> +        if (size2 >= 0 && offset >= size2) {
>>>> +            if (want_zero) {
>>>>                   ret |= BDRV_BLOCK_ZERO;
>>>>               }
>>>> +            ret |= BDRV_BLOCK_ALLOCATED;
>>>
>>> No, I think it's wrong, as it's not allocated at bs, but at bs->backing->bs.
>>>
>>>
>>> So, bdrv_co_block_status works correct, what we can change about it, is not
>>> to return pnum=0 if requested range after eof, but return pnum=bytes, together
>>> with BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO | BDRV_BLOCK_EOF.
>>
>> But this will break users which rely on pnum=0.
> 
> Possibly, we may add flag to bdrv_co_block_status, to behave in such manner, and
> set it to true when call from bdrv_block_status_above/bdrv_is_allocated_above.
> 
>>
>>>
>>> And actual problem is in bdrv_block_status_above and bdrv_is_allocated_above, which
>>> I'm fixing.
>>>
>>>
>>>
>>
>>
> 
> 


-- 
Best regards,
Vladimir

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Stefan Hajnoczi 6 years, 2 months ago

On Sat, Nov 16, 2019 at 07:34:06PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> I wanted to understand, what is the real difference between bdrv_block_status_above
> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
> bdrv_block_status_above..
> 
> And I found the problem: bdrv_is_allocated_above considers space after EOF as
> UNALLOCATED for intermediate nodes..
> 
> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
> not.. And it seems incorrect for me, as in case of short backing file, we'll read
> zeroes after EOF, instead of going further by backing chain.
> 
> This leads to the following effect:
> 
> ./qemu-img create -f qcow2 base.qcow2 2M
> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
> 
> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
> 
> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
> 
> But after commit guest visible state is changed, which seems wrong for me:
> ./qemu-img commit top.qcow2 -b mid.qcow2
> 
> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
> Pattern verification failed at offset 1048576, 1048576 bytes
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
> 
> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
> read 1048576/1048576 bytes at offset 1048576
> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
> 
> 
> I don't know, is it a real bug, as I don't know, do we support backing file larger than
> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
> to other problems.

It seems correct that a backing file limits the virtual disk size of its
backing chain.

The "qemu-img commit" behavior seems counter-intuitive at first, but the
problem there is that the top-most image file is larger than its backing
file - not that the backing chain has differing sizes.  Committing
should not lose data, mid.qcow2 will be grown and then you get the
result you've observed.

I consider this a little weird but not a bug.  Does anyone else have
opinions?

Stefan

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago

19.11.2019 19:58, Stefan Hajnoczi wrote:
> On Sat, Nov 16, 2019 at 07:34:06PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> Hi all!
>>
>> I wanted to understand, what is the real difference between bdrv_block_status_above
>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work through
>> bdrv_block_status_above..
>>
>> And I found the problem: bdrv_is_allocated_above considers space after EOF as
>> UNALLOCATED for intermediate nodes..
>>
>> UNALLOCATED is not about allocation at fs level, but about should we go to backing or
>> not.. And it seems incorrect for me, as in case of short backing file, we'll read
>> zeroes after EOF, instead of going further by backing chain.
>>
>> This leads to the following effect:
>>
>> ./qemu-img create -f qcow2 base.qcow2 2M
>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>
>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>
>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>
>> But after commit guest visible state is changed, which seems wrong for me:
>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>
>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>> Pattern verification failed at offset 1048576, 1048576 bytes
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>
>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>> read 1048576/1048576 bytes at offset 1048576
>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>
>>
>> I don't know, is it a real bug, as I don't know, do we support backing file larger than
>> its parent. Still, I'm not sure that this behavior of bdrv_is_allocated_above don't lead
>> to other problems.
> 
> It seems correct that a backing file limits the virtual disk size of its
> backing chain.
> 
> The "qemu-img commit" behavior seems counter-intuitive at first, but the
> problem there is that the top-most image file is larger than its backing
> file - not that the backing chain has differing sizes.  Committing
> should not lose data, mid.qcow2 will be grown and then you get the
> result you've observed.
> 
> I consider this a little weird but not a bug.  Does anyone else have
> opinions?
> 

I thing, that the fact that read and block_status treats space after EOF
in different ways is a bug anyway.

Also, block_status behavior don't correspond to our definition of
BDRV_BLOCK_ALLOCATED (see Kevin's response)

Also, as I explained in parallel branch, such situation is possible for example
when we interleave incremental backup creation with vm disk resize.

So, then, we have incremental backup chain with different sizes like in
example. Assume we want to merge some sequential incremental backups into
one (aka, remove incremental backup), it may be done with help of commit.
And we get final image, where we see data from base backup instead of
zeros. This is illegal, vm should not get access to this data.

Possibly, similar thing is possible with snapshots.

-- 
Best regards,
Vladimir