[Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Vladimir Sementsov-Ogievskiy posted 2 patches 1 week ago
Test FreeBSD passed
Test docker-mingw@fedora passed
Test asan passed
Test docker-clang@ubuntu passed
Test checkpatch passed
Test s390x failed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20190812181146.26121-1-vsementsov@virtuozzo.com
Maintainers: Max Reitz <mreitz@redhat.com>, Kevin Wolf <kwolf@redhat.com>
block/raw-format.c         |  2 +-
tests/qemu-iotests/263     | 46 ++++++++++++++++++++++++++++++++++++++
tests/qemu-iotests/263.out | 12 ++++++++++
tests/qemu-iotests/group   |  1 +
4 files changed, 60 insertions(+), 1 deletion(-)
create mode 100755 tests/qemu-iotests/263
create mode 100644 tests/qemu-iotests/263.out

[Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Vladimir Sementsov-Ogievskiy 1 week ago
Hi all!

I'm not sure, is it a bug or a feature, but using qcow2 under raw is
broken. It should be either fixed like I propose (by Max's suggestion)
or somehow forbidden (just forbid backing-file supporting node to be
file child of raw-format node).

Vladimir Sementsov-Ogievskiy (2):
  block/raw-format: switch to BDRV_BLOCK_DATA with BDRV_BLOCK_RECURSE
  iotests: test mirroring qcow2 under raw format

 block/raw-format.c         |  2 +-
 tests/qemu-iotests/263     | 46 ++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/263.out | 12 ++++++++++
 tests/qemu-iotests/group   |  1 +
 4 files changed, 60 insertions(+), 1 deletion(-)
 create mode 100755 tests/qemu-iotests/263
 create mode 100644 tests/qemu-iotests/263.out

-- 
2.18.0


Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Max Reitz 1 week ago
On 12.08.19 20:11, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> I'm not sure, is it a bug or a feature, but using qcow2 under raw is
> broken. It should be either fixed like I propose (by Max's suggestion)
> or somehow forbidden (just forbid backing-file supporting node to be
> file child of raw-format node).

I agree, I think only filters should return BDRV_BLOCK_RAW.

(And not even them, they should just be handled transparently by
bdrv_co_block_status().  But that’s something for later.)

> Vladimir Sementsov-Ogievskiy (2):
>   block/raw-format: switch to BDRV_BLOCK_DATA with BDRV_BLOCK_RECURSE
>   iotests: test mirroring qcow2 under raw format
> 
>  block/raw-format.c         |  2 +-
>  tests/qemu-iotests/263     | 46 ++++++++++++++++++++++++++++++++++++++
>  tests/qemu-iotests/263.out | 12 ++++++++++
>  tests/qemu-iotests/group   |  1 +
>  4 files changed, 60 insertions(+), 1 deletion(-)
>  create mode 100755 tests/qemu-iotests/263
>  create mode 100644 tests/qemu-iotests/263.out

Thanks, applied to my block-next branch:

https://git.xanclic.moe/XanClic/qemu/commits/branch/block-next

Max

Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Kevin Wolf 1 week ago
Am 12.08.2019 um 21:46 hat Max Reitz geschrieben:
> On 12.08.19 20:11, Vladimir Sementsov-Ogievskiy wrote:
> > Hi all!
> > 
> > I'm not sure, is it a bug or a feature, but using qcow2 under raw is
> > broken. It should be either fixed like I propose (by Max's suggestion)
> > or somehow forbidden (just forbid backing-file supporting node to be
> > file child of raw-format node).
> 
> I agree, I think only filters should return BDRV_BLOCK_RAW.

If BDRV_BLOCK_RAW isn't suitable for raw, something went wrong. :-)

But anyway, raw is essentially a filter, at least if you don't use
the offset option. And BDRV_BLOCK_RAW shouldn't even depend on an
unchanged offset because the .bdrv_co_block_status implementation
returns the right offset.

And indeed, you can replace raw with blkdebug and it still fails (the
testcase from patch 2 fails, too, but it's obvious enough this way):

    $ ./qemu-img map --output=json --image-opts driver=qcow2,file.driver=file,file.filename=/tmp/test.qcow2 
    [{ "start": 0, "length": 1048576, "depth": 1, "zero": true, "data": false},
    { "start": 1048576, "length": 1048576, "depth": 1, "zero": false, "data": true, "offset": 327680},
    { "start": 2097152, "length": 65011712, "depth": 1, "zero": true, "data": false}]

    $ ./qemu-img map --output=json --image-opts driver=raw,file.driver=qcow2,file.file.driver=file,file.file.filename=/tmp/test.qcow2 
    [{ "start": 0, "length": 67108864, "depth": 0, "zero": false, "data": false}]

    $ ./qemu-img map --output=json --image-opts driver=blkdebug,image.driver=qcow2,image.file.driver=file,image.file.filename=/tmp/test.qcow2 
    [{ "start": 0, "length": 67108864, "depth": 0, "zero": false, "data": false}]

After applying your "deal with filters" series, blkdebug actually prints
the expected result and passes the iotests case, but raw still doesn't.
So you must have fixed something for filters that declare themselves
filters, even though that semantics should probably be coupled to
BDRV_BLOCK_RAW instead so that the raw format can benefit from it, too.

Kevin

Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Max Reitz 1 week ago
On 13.08.19 11:34, Kevin Wolf wrote:
> Am 12.08.2019 um 21:46 hat Max Reitz geschrieben:
>> On 12.08.19 20:11, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi all!
>>>
>>> I'm not sure, is it a bug or a feature, but using qcow2 under raw is
>>> broken. It should be either fixed like I propose (by Max's suggestion)
>>> or somehow forbidden (just forbid backing-file supporting node to be
>>> file child of raw-format node).
>>
>> I agree, I think only filters should return BDRV_BLOCK_RAW.
> 
> If BDRV_BLOCK_RAW isn't suitable for raw, something went wrong. :-)

Yes, its introduction.

> But anyway, raw is essentially a filter, at least if you don't use
> the offset option.

Which is precisely why it isn’t a filter.

Another reason is that it generally appears as a replacement for a
format driver.  You never insert it into some graph because you want it
as a filter.  Which is why behaving like a format driver in block_status
makes sense, in my opinion.

>                    And BDRV_BLOCK_RAW shouldn't even depend on an
> unchanged offset because the .bdrv_co_block_status implementation
> returns the right offset.
> 
> And indeed, you can replace raw with blkdebug and it still fails (the
> testcase from patch 2 fails, too, but it's obvious enough this way):
> 
>     $ ./qemu-img map --output=json --image-opts driver=qcow2,file.driver=file,file.filename=/tmp/test.qcow2 
>     [{ "start": 0, "length": 1048576, "depth": 1, "zero": true, "data": false},
>     { "start": 1048576, "length": 1048576, "depth": 1, "zero": false, "data": true, "offset": 327680},
>     { "start": 2097152, "length": 65011712, "depth": 1, "zero": true, "data": false}]
> 
>     $ ./qemu-img map --output=json --image-opts driver=raw,file.driver=qcow2,file.file.driver=file,file.file.filename=/tmp/test.qcow2 
>     [{ "start": 0, "length": 67108864, "depth": 0, "zero": false, "data": false}]
> 
>     $ ./qemu-img map --output=json --image-opts driver=blkdebug,image.driver=qcow2,image.file.driver=file,image.file.filename=/tmp/test.qcow2 
>     [{ "start": 0, "length": 67108864, "depth": 0, "zero": false, "data": false}]
> 
> After applying your "deal with filters" series, blkdebug actually prints
> the expected result and passes the iotests case, but raw still doesn't.
> So you must have fixed something for filters that declare themselves
> filters,

I hope to have fixed many things for filters that declare themselves
filters. ;-)

I think the problem is that filters just break backing chains right now.
 My series fixes that.  (So even if you have a blkdebug node on top of
your qcow2 node, you should realize that the node you really care about
is the qcow2 node.  If you look at the filter, you won’t see the backing
file and will thus interpret unallocated areas the wrong way.)

(The most important problem is that bdrv_is_allocated_above() currently
only goes to the backing_bs().  That’s fixed by patch 13, “block: Use
CAFs in block status functions”.)

>          even though that semantics should probably be coupled to
> BDRV_BLOCK_RAW instead so that the raw format can benefit from it, too.

I think BDRV_BLOCK_RAW should just be dropped.  I don’t see its purpose
for non-filters now that we have BDRV_BLOCK_RECURSE; and filters should
just be handled generically in bdrv_co_block_status().

Alternatively, we can decide that calling block_status on a filter node
is a bad idea, because the caller may not be prepared for the fact that
block_status will not return information for this node.  But that would
mean dropping BDRV_BLOCK_RAW, too.

Max

Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Max Reitz 1 week ago
On 12.08.19 21:46, Max Reitz wrote:
> On 12.08.19 20:11, Vladimir Sementsov-Ogievskiy wrote:
>> Hi all!
>>
>> I'm not sure, is it a bug or a feature, but using qcow2 under raw is
>> broken. It should be either fixed like I propose (by Max's suggestion)
>> or somehow forbidden (just forbid backing-file supporting node to be
>> file child of raw-format node).
> 
> I agree, I think only filters should return BDRV_BLOCK_RAW.
> 
> (And not even them, they should just be handled transparently by
> bdrv_co_block_status().  But that’s something for later.)
> 
>> Vladimir Sementsov-Ogievskiy (2):
>>   block/raw-format: switch to BDRV_BLOCK_DATA with BDRV_BLOCK_RECURSE
>>   iotests: test mirroring qcow2 under raw format
>>
>>  block/raw-format.c         |  2 +-
>>  tests/qemu-iotests/263     | 46 ++++++++++++++++++++++++++++++++++++++
>>  tests/qemu-iotests/263.out | 12 ++++++++++
>>  tests/qemu-iotests/group   |  1 +
>>  4 files changed, 60 insertions(+), 1 deletion(-)
>>  create mode 100755 tests/qemu-iotests/263
>>  create mode 100644 tests/qemu-iotests/263.out
> 
> Thanks, applied to my block-next branch:
> 
> https://git.xanclic.moe/XanClic/qemu/commits/branch/block-next

Oops, maybe not.  221 needs to be adjusted.

Max

Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Vladimir Sementsov-Ogievskiy 1 week ago
12.08.2019 22:50, Max Reitz wrote:
> On 12.08.19 21:46, Max Reitz wrote:
>> On 12.08.19 20:11, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi all!
>>>
>>> I'm not sure, is it a bug or a feature, but using qcow2 under raw is
>>> broken. It should be either fixed like I propose (by Max's suggestion)
>>> or somehow forbidden (just forbid backing-file supporting node to be
>>> file child of raw-format node).
>>
>> I agree, I think only filters should return BDRV_BLOCK_RAW.
>>
>> (And not even them, they should just be handled transparently by
>> bdrv_co_block_status().  But that’s something for later.)
>>
>>> Vladimir Sementsov-Ogievskiy (2):
>>>    block/raw-format: switch to BDRV_BLOCK_DATA with BDRV_BLOCK_RECURSE
>>>    iotests: test mirroring qcow2 under raw format
>>>
>>>   block/raw-format.c         |  2 +-
>>>   tests/qemu-iotests/263     | 46 ++++++++++++++++++++++++++++++++++++++
>>>   tests/qemu-iotests/263.out | 12 ++++++++++
>>>   tests/qemu-iotests/group   |  1 +
>>>   4 files changed, 60 insertions(+), 1 deletion(-)
>>>   create mode 100755 tests/qemu-iotests/263
>>>   create mode 100644 tests/qemu-iotests/263.out
>>
>> Thanks, applied to my block-next branch:
>>
>> https://git.xanclic.moe/XanClic/qemu/commits/branch/block-next
> 
> Oops, maybe not.  221 needs to be adjusted.
> 


Hmm yes, I forget to run tests.. Areas which were zero becomes data|zero, it
don't look good.

So, it's not quite right to report DATA | RECURSE, we actually should report
DATA_OR_ZERO | RECURSE, which is actually ALLOCATED | RECURSE, as otherwise
DATA will be set in final result (generic layer must not drop it, obviously).

ALLOCATED never returned by drivers but seems it should be. I'll think a bit and
resend something new.


-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Vladimir Sementsov-Ogievskiy 1 week ago
13.08.2019 11:39, Vladimir Sementsov-Ogievskiy wrote:
> 12.08.2019 22:50, Max Reitz wrote:
>> On 12.08.19 21:46, Max Reitz wrote:
>>> On 12.08.19 20:11, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> I'm not sure, is it a bug or a feature, but using qcow2 under raw is
>>>> broken. It should be either fixed like I propose (by Max's suggestion)
>>>> or somehow forbidden (just forbid backing-file supporting node to be
>>>> file child of raw-format node).
>>>
>>> I agree, I think only filters should return BDRV_BLOCK_RAW.
>>>
>>> (And not even them, they should just be handled transparently by
>>> bdrv_co_block_status().  But that’s something for later.)
>>>
>>>> Vladimir Sementsov-Ogievskiy (2):
>>>>    block/raw-format: switch to BDRV_BLOCK_DATA with BDRV_BLOCK_RECURSE
>>>>    iotests: test mirroring qcow2 under raw format
>>>>
>>>>   block/raw-format.c         |  2 +-
>>>>   tests/qemu-iotests/263     | 46 ++++++++++++++++++++++++++++++++++++++
>>>>   tests/qemu-iotests/263.out | 12 ++++++++++
>>>>   tests/qemu-iotests/group   |  1 +
>>>>   4 files changed, 60 insertions(+), 1 deletion(-)
>>>>   create mode 100755 tests/qemu-iotests/263
>>>>   create mode 100644 tests/qemu-iotests/263.out
>>>
>>> Thanks, applied to my block-next branch:
>>>
>>> https://git.xanclic.moe/XanClic/qemu/commits/branch/block-next
>>
>> Oops, maybe not.  221 needs to be adjusted.
>>
> 
> 
> Hmm yes, I forget to run tests.. Areas which were zero becomes data|zero, it
> don't look good.
> 
> So, it's not quite right to report DATA | RECURSE, we actually should report
> DATA_OR_ZERO | RECURSE, which is actually ALLOCATED | RECURSE, as otherwise
> DATA will be set in final result (generic layer must not drop it, obviously).
> 
> ALLOCATED never returned by drivers but seems it should be. I'll think a bit and
> resend something new.
> 
> 


Hmmm.. So, we have raw node, and assume backing chain under it. who should loop through it,
generic code or raw driver?

Now it all looks like generic code is responsible for looping through filtered chain (backing files
and filters) and driver is responsible for all it's children except for filtered child.

Or, driver may return something that says to generic child to handle the whole backing chain of returned
file at once, as it's another backing chain. And seems even RECURSE don't work correctly as it doesn't handle
the backing chain in this recursion. Why it works better than RAW - just because we return it together
with DATA flags and this DATA flag is kept anyway, independently of finding zeros or not.


-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Vladimir Sementsov-Ogievskiy 1 week ago
13.08.2019 12:01, Vladimir Sementsov-Ogievskiy wrote:
> 13.08.2019 11:39, Vladimir Sementsov-Ogievskiy wrote:
>> 12.08.2019 22:50, Max Reitz wrote:
>>> On 12.08.19 21:46, Max Reitz wrote:
>>>> On 12.08.19 20:11, Vladimir Sementsov-Ogievskiy wrote:
>>>>> Hi all!
>>>>>
>>>>> I'm not sure, is it a bug or a feature, but using qcow2 under raw is
>>>>> broken. It should be either fixed like I propose (by Max's suggestion)
>>>>> or somehow forbidden (just forbid backing-file supporting node to be
>>>>> file child of raw-format node).
>>>>
>>>> I agree, I think only filters should return BDRV_BLOCK_RAW.
>>>>
>>>> (And not even them, they should just be handled transparently by
>>>> bdrv_co_block_status().  But that’s something for later.)
>>>>
>>>>> Vladimir Sementsov-Ogievskiy (2):
>>>>>    block/raw-format: switch to BDRV_BLOCK_DATA with BDRV_BLOCK_RECURSE
>>>>>    iotests: test mirroring qcow2 under raw format
>>>>>
>>>>>   block/raw-format.c         |  2 +-
>>>>>   tests/qemu-iotests/263     | 46 ++++++++++++++++++++++++++++++++++++++
>>>>>   tests/qemu-iotests/263.out | 12 ++++++++++
>>>>>   tests/qemu-iotests/group   |  1 +
>>>>>   4 files changed, 60 insertions(+), 1 deletion(-)
>>>>>   create mode 100755 tests/qemu-iotests/263
>>>>>   create mode 100644 tests/qemu-iotests/263.out
>>>>
>>>> Thanks, applied to my block-next branch:
>>>>
>>>> https://git.xanclic.moe/XanClic/qemu/commits/branch/block-next
>>>
>>> Oops, maybe not.  221 needs to be adjusted.
>>>
>>
>>
>> Hmm yes, I forget to run tests.. Areas which were zero becomes data|zero, it
>> don't look good.
>>
>> So, it's not quite right to report DATA | RECURSE, we actually should report
>> DATA_OR_ZERO | RECURSE, which is actually ALLOCATED | RECURSE, as otherwise
>> DATA will be set in final result (generic layer must not drop it, obviously).
>>
>> ALLOCATED never returned by drivers but seems it should be. I'll think a bit and
>> resend something new.
>>
>>
> 
> 
> Hmmm.. So, we have raw node, and assume backing chain under it. who should loop through it,
> generic code or raw driver?
> 
> Now it all looks like generic code is responsible for looping through filtered chain (backing files
> and filters) and driver is responsible for all it's children except for filtered child.
> 
> Or, driver may return something that says to generic child to handle the whole backing chain of returned
> file at once, as it's another backing chain. And seems even RECURSE don't work correctly as it doesn't handle
> the backing chain in this recursion. Why it works better than RAW - just because we return it together
> with DATA flags and this DATA flag is kept anyway, independently of finding zeros or not.
> 
> 


Hmm, so, is it correct that we return DATA | RECURSE, if we are not really sure that it is data?

If we see at

  * BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer

seems like we should report DATA only if there is allocation..

  * DATA ZERO OFFSET_VALID
  *  t    t        t       sectors read as zero, returned file is zero at offset
  *  t    f        t       sectors read as valid from file at offset
  *  f    t        t       sectors preallocated, read as zero, returned file not

so, ZERO alone doesn't guarantee that we may safely read?

So, for qcow2 metadata-preallocated images, what about zero-init? We report DATA, and probably get ZERO from
file and have finally DAYA | ZERO which guarantees that read will return zeros, but is it true?

Finally, what "DATA" mean? That space is allocated and occupies disk space? Or it only  means only ALLOCATED i.e.
"read from this layer, not from backing" otherwise, and adds additional meaning to ZERO when used together, that
read will return zeros for sure?

-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Vladimir Sementsov-Ogievskiy 1 week ago
13.08.2019 12:33, Vladimir Sementsov-Ogievskiy wrote:
> 13.08.2019 12:01, Vladimir Sementsov-Ogievskiy wrote:
>> 13.08.2019 11:39, Vladimir Sementsov-Ogievskiy wrote:
>>> 12.08.2019 22:50, Max Reitz wrote:
>>>> On 12.08.19 21:46, Max Reitz wrote:
>>>>> On 12.08.19 20:11, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> Hi all!
>>>>>>
>>>>>> I'm not sure, is it a bug or a feature, but using qcow2 under raw is
>>>>>> broken. It should be either fixed like I propose (by Max's suggestion)
>>>>>> or somehow forbidden (just forbid backing-file supporting node to be
>>>>>> file child of raw-format node).
>>>>>
>>>>> I agree, I think only filters should return BDRV_BLOCK_RAW.
>>>>>
>>>>> (And not even them, they should just be handled transparently by
>>>>> bdrv_co_block_status().  But that’s something for later.)
>>>>>
>>>>>> Vladimir Sementsov-Ogievskiy (2):
>>>>>>    block/raw-format: switch to BDRV_BLOCK_DATA with BDRV_BLOCK_RECURSE
>>>>>>    iotests: test mirroring qcow2 under raw format
>>>>>>
>>>>>>   block/raw-format.c         |  2 +-
>>>>>>   tests/qemu-iotests/263     | 46 ++++++++++++++++++++++++++++++++++++++
>>>>>>   tests/qemu-iotests/263.out | 12 ++++++++++
>>>>>>   tests/qemu-iotests/group   |  1 +
>>>>>>   4 files changed, 60 insertions(+), 1 deletion(-)
>>>>>>   create mode 100755 tests/qemu-iotests/263
>>>>>>   create mode 100644 tests/qemu-iotests/263.out
>>>>>
>>>>> Thanks, applied to my block-next branch:
>>>>>
>>>>> https://git.xanclic.moe/XanClic/qemu/commits/branch/block-next
>>>>
>>>> Oops, maybe not.  221 needs to be adjusted.
>>>>
>>>
>>>
>>> Hmm yes, I forget to run tests.. Areas which were zero becomes data|zero, it
>>> don't look good.
>>>
>>> So, it's not quite right to report DATA | RECURSE, we actually should report
>>> DATA_OR_ZERO | RECURSE, which is actually ALLOCATED | RECURSE, as otherwise
>>> DATA will be set in final result (generic layer must not drop it, obviously).
>>>
>>> ALLOCATED never returned by drivers but seems it should be. I'll think a bit and
>>> resend something new.
>>>
>>>
>>
>>
>> Hmmm.. So, we have raw node, and assume backing chain under it. who should loop through it,
>> generic code or raw driver?
>>
>> Now it all looks like generic code is responsible for looping through filtered chain (backing files
>> and filters) and driver is responsible for all it's children except for filtered child.
>>
>> Or, driver may return something that says to generic child to handle the whole backing chain of returned
>> file at once, as it's another backing chain. And seems even RECURSE don't work correctly as it doesn't handle
>> the backing chain in this recursion. Why it works better than RAW - just because we return it together
>> with DATA flags and this DATA flag is kept anyway, independently of finding zeros or not.
>>
>>
> 
> 
> Hmm, so, is it correct that we return DATA | RECURSE, if we are not really sure that it is data?
> 
> If we see at
> 
>   * BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
> 
> seems like we should report DATA only if there is allocation..
> 
>   * DATA ZERO OFFSET_VALID
>   *  t    t        t       sectors read as zero, returned file is zero at offset
>   *  t    f        t       sectors read as valid from file at offset
>   *  f    t        t       sectors preallocated, read as zero, returned file not
> 
> so, ZERO alone doesn't guarantee that we may safely read?
> 
> So, for qcow2 metadata-preallocated images, what about zero-init? We report DATA, and probably get ZERO from
> file and have finally DAYA | ZERO which guarantees that read will return zeros, but is it true?
> 
> Finally, what "DATA" mean? That space is allocated and occupies disk space? Or it only  means only ALLOCATED i.e.
> "read from this layer, not from backing" otherwise, and adds additional meaning to ZERO when used together, that
> read will return zeros for sure?
> 


Continue self-discussion.

Consider closer the following case:
 >   * DATA ZERO OFFSET_VALID
 >   *  f    t        t       sectors preallocated, read as zero, returned file not

It actually means that we must not read, as read will return wrong data, when clusters are actually zero for guest.

It's OK, when for ex. qcow2 returns this combination and link to its file child: it means that if you read from qcow2
node, you'll see correct zeros, as qcow2 has special metadata which shows that these clusters are zero. But if you read
from file directly at returned offset you'll see garbage, so don't do it.

But what if some node returns this combination with file == itself? It actually means that you must not read, but you
should call block-status to understand that there are zeros. So, if some format can return this combination with
file == itself it means that you must not read it directly, but only after checking block status.

And file-posix is example of such driver. But file-posix holes are guaranteed to read as zero, so we can report DATA | ZERO.
But this will break user expirience which assumes that DATA means occupation of real disk space.

...
me go and re-read what we've documented in NBD protocol about block steus...

"DATA" turns into NBD_STATE_HOLE, which formally means nothing, and just notes that probably there is no disk space occupation
for this region.. So it's about disk space allocation and nothing about correctness of read.
and NBD_STATE_ZERO guarantees that region read as all zeroes.

Look at code in nbd/server.c.. Aha, it calls block_status_above and turns !ALLOCATED into HOLE. Which means that it will never
return HOLE for file-posix..





-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Kevin Wolf 1 week ago
Am 13.08.2019 um 13:14 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 13.08.2019 12:33, Vladimir Sementsov-Ogievskiy wrote:
> > 13.08.2019 12:01, Vladimir Sementsov-Ogievskiy wrote:
> >> 13.08.2019 11:39, Vladimir Sementsov-Ogievskiy wrote:
> >>> 12.08.2019 22:50, Max Reitz wrote:
> >>>> On 12.08.19 21:46, Max Reitz wrote:
> >>>>> On 12.08.19 20:11, Vladimir Sementsov-Ogievskiy wrote:
> >>>>>> Hi all!
> >>>>>>
> >>>>>> I'm not sure, is it a bug or a feature, but using qcow2 under raw is
> >>>>>> broken. It should be either fixed like I propose (by Max's suggestion)
> >>>>>> or somehow forbidden (just forbid backing-file supporting node to be
> >>>>>> file child of raw-format node).
> >>>>>
> >>>>> I agree, I think only filters should return BDRV_BLOCK_RAW.
> >>>>>
> >>>>> (And not even them, they should just be handled transparently by
> >>>>> bdrv_co_block_status().  But that’s something for later.)
> >>>>>
> >>>>>> Vladimir Sementsov-Ogievskiy (2):
> >>>>>>    block/raw-format: switch to BDRV_BLOCK_DATA with BDRV_BLOCK_RECURSE
> >>>>>>    iotests: test mirroring qcow2 under raw format
> >>>>>>
> >>>>>>   block/raw-format.c         |  2 +-
> >>>>>>   tests/qemu-iotests/263     | 46 ++++++++++++++++++++++++++++++++++++++
> >>>>>>   tests/qemu-iotests/263.out | 12 ++++++++++
> >>>>>>   tests/qemu-iotests/group   |  1 +
> >>>>>>   4 files changed, 60 insertions(+), 1 deletion(-)
> >>>>>>   create mode 100755 tests/qemu-iotests/263
> >>>>>>   create mode 100644 tests/qemu-iotests/263.out
> >>>>>
> >>>>> Thanks, applied to my block-next branch:
> >>>>>
> >>>>> https://git.xanclic.moe/XanClic/qemu/commits/branch/block-next
> >>>>
> >>>> Oops, maybe not.  221 needs to be adjusted.
> >>>>
> >>>
> >>>
> >>> Hmm yes, I forget to run tests.. Areas which were zero becomes data|zero, it
> >>> don't look good.
> >>>
> >>> So, it's not quite right to report DATA | RECURSE, we actually should report
> >>> DATA_OR_ZERO | RECURSE, which is actually ALLOCATED | RECURSE, as otherwise
> >>> DATA will be set in final result (generic layer must not drop it, obviously).
> >>>
> >>> ALLOCATED never returned by drivers but seems it should be. I'll think a bit and
> >>> resend something new.
> >>>
> >>>
> >>
> >>
> >> Hmmm.. So, we have raw node, and assume backing chain under it. who should loop through it,
> >> generic code or raw driver?
> >>
> >> Now it all looks like generic code is responsible for looping through filtered chain (backing files
> >> and filters) and driver is responsible for all it's children except for filtered child.
> >>
> >> Or, driver may return something that says to generic child to handle the whole backing chain of returned
> >> file at once, as it's another backing chain. And seems even RECURSE don't work correctly as it doesn't handle
> >> the backing chain in this recursion. Why it works better than RAW - just because we return it together
> >> with DATA flags and this DATA flag is kept anyway, independently of finding zeros or not.
> >>
> >>
> > 
> > 
> > Hmm, so, is it correct that we return DATA | RECURSE, if we are not really sure that it is data?
> > 
> > If we see at
> > 
> >   * BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
> > 
> > seems like we should report DATA only if there is allocation..
> > 
> >   * DATA ZERO OFFSET_VALID
> >   *  t    t        t       sectors read as zero, returned file is zero at offset
> >   *  t    f        t       sectors read as valid from file at offset
> >   *  f    t        t       sectors preallocated, read as zero, returned file not
> > 
> > so, ZERO alone doesn't guarantee that we may safely read?
> > 
> > So, for qcow2 metadata-preallocated images, what about zero-init? We report DATA, and probably get ZERO from
> > file and have finally DAYA | ZERO which guarantees that read will return zeros, but is it true?
> > 
> > Finally, what "DATA" mean? That space is allocated and occupies disk space? Or it only  means only ALLOCATED i.e.
> > "read from this layer, not from backing" otherwise, and adds additional meaning to ZERO when used together, that
> > read will return zeros for sure?

I think DATA means that the data for this block is provided by *file. I
wouldn't necessarily understand it to mean that the data actually takes
up physical disk space there.

> Continue self-discussion.
> 
> Consider closer the following case:
>  >   * DATA ZERO OFFSET_VALID
>  >   *  f    t        t       sectors preallocated, read as zero, returned file not
> 
> It actually means that we must not read, as read will return wrong
> data, when clusters are actually zero for guest.

It means that you need to read from bs itself to get the correct data
(which will be zero). Even though OFFSET_VALID is set, reading from
*file (typically bs->file->bs) at the returned offset might not give the
right result.

> It's OK, when for ex. qcow2 returns this combination and link to its
> file child: it means that if you read from qcow2 node, you'll see
> correct zeros, as qcow2 has special metadata which shows that these
> clusters are zero. But if you read from file directly at returned
> offset you'll see garbage, so don't do it.

Correct.

> But what if some node returns this combination with file == itself? It
> actually means that you must not read, but you should call
> block-status to understand that there are zeros. So, if some format
> can return this combination with file == itself it means that you must
> not read it directly, but only after checking block status.

This doesn't make sense to me. Reading from a node is always correct.

But you're right that DATA seems to mean something slightly different at
the protocol level because *file cannot have a meaningful value for the
lower layer there. In this case, DATA still seems to mean that the data
is fetched from the lower layer (i.e. the block device on which the file
system resides). For holes, this is not the case.

> And file-posix is example of such driver. But file-posix holes are guaranteed to read as zero, so we can report DATA | ZERO.
> But this will break user expirience which assumes that DATA means occupation of real disk space.

With the above explanation, DATA shouldn't be set for holes.

But it's still kind of inconsistent because OFFSET_VALID and the offset
refer to bs itself and not to the lower layer.

> ...
> me go and re-read what we've documented in NBD protocol about block steus...
> 
> "DATA" turns into NBD_STATE_HOLE, which formally means nothing, and just notes that probably there is no disk space occupation
> for this region.. So it's about disk space allocation and nothing about correctness of read.
> and NBD_STATE_ZERO guarantees that region read as all zeroes.
> 
> Look at code in nbd/server.c.. Aha, it calls block_status_above and turns !ALLOCATED into HOLE. Which means that it will never
> return HOLE for file-posix..

Hm... This is a mess. :-)

Kevin

Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Max Reitz 1 week ago
On 13.08.19 13:51, Kevin Wolf wrote:

[...]

> Hm... This is a mess. :-)

Just out of curiosity: Why?

Aren’t there only two things we really need from the block_status
infrastructure?

(1) Whether something is allocated in the given layer of the backing chain,

(2) Whether we know that a given range reads as zeroes.

Do we really need anything else?

Max

Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Eric Blake 1 week ago
On 8/13/19 9:31 AM, Max Reitz wrote:
> On 13.08.19 13:51, Kevin Wolf wrote:
> 
> [...]
> 
>> Hm... This is a mess. :-)
> 
> Just out of curiosity: Why?
> 
> Aren’t there only two things we really need from the block_status
> infrastructure?
> 
> (1) Whether something is allocated in the given layer of the backing chain,

(1)(a) - is it allocated in this layer
(1)(b) - is it allocated in any backing layer

> 
> (2) Whether we know that a given range reads as zeroes.
> 
> Do we really need anything else?

qemu-img map needs:

(3) What host-relative offset, if any, corresponds to a given
guest-visible offset.

I also need to find time to revisit my proposed patches on block_status
alignment - there are some cases where we want to ensure that one layer
with large granularity does not pick up mid-granularity changes in
status read from a backing layer with smaller granularity.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Vladimir Sementsov-Ogievskiy 1 week ago
13.08.2019 17:31, Max Reitz wrote:
> On 13.08.19 13:51, Kevin Wolf wrote:
> 
> [...]
> 
>> Hm... This is a mess. :-)
> 
> Just out of curiosity: Why?
> 
> Aren’t there only two things we really need from the block_status
> infrastructure?
> 
> (1) Whether something is allocated in the given layer of the backing chain,
> 
> (2) Whether we know that a given range reads as zeroes.
> 
> Do we really need anything else?
> 

qemu-img map?


1. We need to fix the bug somehow
2. We need to fix comment about different block-status flags, as it really
lacks information of what actually "DATA" means (together with *file).
And what finally means "allocated", can you define it precisely?
3. Fix nbd-server to be closer to NBD spec about block-status

I made several tries to imagine [1] and [2] but never succeeded..


-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Max Reitz 1 week ago
On 13.08.19 16:46, Vladimir Sementsov-Ogievskiy wrote:
> 13.08.2019 17:31, Max Reitz wrote:
>> On 13.08.19 13:51, Kevin Wolf wrote:
>>
>> [...]
>>
>>> Hm... This is a mess. :-)
>>
>> Just out of curiosity: Why?
>>
>> Aren’t there only two things we really need from the block_status
>> infrastructure?
>>
>> (1) Whether something is allocated in the given layer of the backing chain,
>>
>> (2) Whether we know that a given range reads as zeroes.
>>
>> Do we really need anything else?
>>
> 
> qemu-img map?

Which is a debugging tool.  So it doesn’t fall under “really” in my
book.  If removing everything but allocation+zero information would make
the code a lot simpler, I think that would be worth it.

> 1. We need to fix the bug somehow
> 2. We need to fix comment about different block-status flags, as it really
> lacks information of what actually "DATA" means (together with *file).
> And what finally means "allocated", can you define it precisely?

As I wrote in my other mails, I think the problem is that it’s just
unexpected that block_status automatically skips through for filters.
It shouldn’t, that’s just black magic that the caller should not rely on.

(We see precisely here that it’s wrong, because the callers are not
prepared for the allocation information returned to be associated with a
different node than what they passed.)

So my definition is just “If the node has a COW backing file and
block_status returns ‘not allocated’, the data will be there.
Otherwise, the data is in the current node.”  Yes, that means that
filters should appear as fully allocated.

Max

> 3. Fix nbd-server to be closer to NBD spec about block-status
> 
> I made several tries to imagine [1] and [2] but never succeeded..
> 
> 


Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Kevin Wolf 1 week ago
Am 13.08.2019 um 16:53 hat Max Reitz geschrieben:
> On 13.08.19 16:46, Vladimir Sementsov-Ogievskiy wrote:
> > 13.08.2019 17:31, Max Reitz wrote:
> >> On 13.08.19 13:51, Kevin Wolf wrote:
> >>
> >> [...]
> >>
> >>> Hm... This is a mess. :-)
> >>
> >> Just out of curiosity: Why?
> >>
> >> Aren’t there only two things we really need from the block_status
> >> infrastructure?
> >>
> >> (1) Whether something is allocated in the given layer of the backing chain,
> >>
> >> (2) Whether we know that a given range reads as zeroes.
> >>
> >> Do we really need anything else?
> >>
> > 
> > qemu-img map?
> 
> Which is a debugging tool.  So it doesn’t fall under “really” in my
> book.  If removing everything but allocation+zero information would make
> the code a lot simpler, I think that would be worth it.
> 
> > 1. We need to fix the bug somehow
> > 2. We need to fix comment about different block-status flags, as it really
> > lacks information of what actually "DATA" means (together with *file).
> > And what finally means "allocated", can you define it precisely?
> 
> As I wrote in my other mails, I think the problem is that it’s just
> unexpected that block_status automatically skips through for filters.
> It shouldn’t, that’s just black magic that the caller should not rely on.
> 
> (We see precisely here that it’s wrong, because the callers are not
> prepared for the allocation information returned to be associated with a
> different node than what they passed.)
> 
> So my definition is just “If the node has a COW backing file and
> block_status returns ‘not allocated’, the data will be there.
> Otherwise, the data is in the current node.”  Yes, that means that
> filters should appear as fully allocated.

You can do that, but then the callers need to learn to do the recursion
instead. After all, just copying everything if a filter is in the
subtree isn't the desired behaviour.

Kevin

Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Max Reitz 1 week ago
On 13.08.19 17:03, Kevin Wolf wrote:
> Am 13.08.2019 um 16:53 hat Max Reitz geschrieben:
>> On 13.08.19 16:46, Vladimir Sementsov-Ogievskiy wrote:
>>> 13.08.2019 17:31, Max Reitz wrote:
>>>> On 13.08.19 13:51, Kevin Wolf wrote:
>>>>
>>>> [...]
>>>>
>>>>> Hm... This is a mess. :-)
>>>>
>>>> Just out of curiosity: Why?
>>>>
>>>> Aren’t there only two things we really need from the block_status
>>>> infrastructure?
>>>>
>>>> (1) Whether something is allocated in the given layer of the backing chain,
>>>>
>>>> (2) Whether we know that a given range reads as zeroes.
>>>>
>>>> Do we really need anything else?
>>>>
>>>
>>> qemu-img map?
>>
>> Which is a debugging tool.  So it doesn’t fall under “really” in my
>> book.  If removing everything but allocation+zero information would make
>> the code a lot simpler, I think that would be worth it.
>>
>>> 1. We need to fix the bug somehow
>>> 2. We need to fix comment about different block-status flags, as it really
>>> lacks information of what actually "DATA" means (together with *file).
>>> And what finally means "allocated", can you define it precisely?
>>
>> As I wrote in my other mails, I think the problem is that it’s just
>> unexpected that block_status automatically skips through for filters.
>> It shouldn’t, that’s just black magic that the caller should not rely on.
>>
>> (We see precisely here that it’s wrong, because the callers are not
>> prepared for the allocation information returned to be associated with a
>> different node than what they passed.)
>>
>> So my definition is just “If the node has a COW backing file and
>> block_status returns ‘not allocated’, the data will be there.
>> Otherwise, the data is in the current node.”  Yes, that means that
>> filters should appear as fully allocated.
> 
> You can do that, but then the callers need to learn to do the recursion
> instead. After all, just copying everything if a filter is in the
> subtree isn't the desired behaviour.

Yes, hence the “deal with filters” series.

Max

Re: [Qemu-devel] [PATCH 0/2] deal with BDRV_BLOCK_RAW

Posted by Vladimir Sementsov-Ogievskiy 1 week ago
13.08.2019 14:51, Kevin Wolf wrote:
> Am 13.08.2019 um 13:14 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 13.08.2019 12:33, Vladimir Sementsov-Ogievskiy wrote:
>>> 13.08.2019 12:01, Vladimir Sementsov-Ogievskiy wrote:
>>>> 13.08.2019 11:39, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 12.08.2019 22:50, Max Reitz wrote:
>>>>>> On 12.08.19 21:46, Max Reitz wrote:
>>>>>>> On 12.08.19 20:11, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>> Hi all!
>>>>>>>>
>>>>>>>> I'm not sure, is it a bug or a feature, but using qcow2 under raw is
>>>>>>>> broken. It should be either fixed like I propose (by Max's suggestion)
>>>>>>>> or somehow forbidden (just forbid backing-file supporting node to be
>>>>>>>> file child of raw-format node).
>>>>>>>
>>>>>>> I agree, I think only filters should return BDRV_BLOCK_RAW.
>>>>>>>
>>>>>>> (And not even them, they should just be handled transparently by
>>>>>>> bdrv_co_block_status().  But that’s something for later.)
>>>>>>>
>>>>>>>> Vladimir Sementsov-Ogievskiy (2):
>>>>>>>>     block/raw-format: switch to BDRV_BLOCK_DATA with BDRV_BLOCK_RECURSE
>>>>>>>>     iotests: test mirroring qcow2 under raw format
>>>>>>>>
>>>>>>>>    block/raw-format.c         |  2 +-
>>>>>>>>    tests/qemu-iotests/263     | 46 ++++++++++++++++++++++++++++++++++++++
>>>>>>>>    tests/qemu-iotests/263.out | 12 ++++++++++
>>>>>>>>    tests/qemu-iotests/group   |  1 +
>>>>>>>>    4 files changed, 60 insertions(+), 1 deletion(-)
>>>>>>>>    create mode 100755 tests/qemu-iotests/263
>>>>>>>>    create mode 100644 tests/qemu-iotests/263.out
>>>>>>>
>>>>>>> Thanks, applied to my block-next branch:
>>>>>>>
>>>>>>> https://git.xanclic.moe/XanClic/qemu/commits/branch/block-next
>>>>>>
>>>>>> Oops, maybe not.  221 needs to be adjusted.
>>>>>>
>>>>>
>>>>>
>>>>> Hmm yes, I forget to run tests.. Areas which were zero becomes data|zero, it
>>>>> don't look good.
>>>>>
>>>>> So, it's not quite right to report DATA | RECURSE, we actually should report
>>>>> DATA_OR_ZERO | RECURSE, which is actually ALLOCATED | RECURSE, as otherwise
>>>>> DATA will be set in final result (generic layer must not drop it, obviously).
>>>>>
>>>>> ALLOCATED never returned by drivers but seems it should be. I'll think a bit and
>>>>> resend something new.
>>>>>
>>>>>
>>>>
>>>>
>>>> Hmmm.. So, we have raw node, and assume backing chain under it. who should loop through it,
>>>> generic code or raw driver?
>>>>
>>>> Now it all looks like generic code is responsible for looping through filtered chain (backing files
>>>> and filters) and driver is responsible for all it's children except for filtered child.
>>>>
>>>> Or, driver may return something that says to generic child to handle the whole backing chain of returned
>>>> file at once, as it's another backing chain. And seems even RECURSE don't work correctly as it doesn't handle
>>>> the backing chain in this recursion. Why it works better than RAW - just because we return it together
>>>> with DATA flags and this DATA flag is kept anyway, independently of finding zeros or not.
>>>>
>>>>
>>>
>>>
>>> Hmm, so, is it correct that we return DATA | RECURSE, if we are not really sure that it is data?
>>>
>>> If we see at
>>>
>>>    * BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
>>>
>>> seems like we should report DATA only if there is allocation..
>>>
>>>    * DATA ZERO OFFSET_VALID
>>>    *  t    t        t       sectors read as zero, returned file is zero at offset
>>>    *  t    f        t       sectors read as valid from file at offset
>>>    *  f    t        t       sectors preallocated, read as zero, returned file not
>>>
>>> so, ZERO alone doesn't guarantee that we may safely read?
>>>
>>> So, for qcow2 metadata-preallocated images, what about zero-init? We report DATA, and probably get ZERO from
>>> file and have finally DAYA | ZERO which guarantees that read will return zeros, but is it true?
>>>
>>> Finally, what "DATA" mean? That space is allocated and occupies disk space? Or it only  means only ALLOCATED i.e.
>>> "read from this layer, not from backing" otherwise, and adds additional meaning to ZERO when used together, that
>>> read will return zeros for sure?
> 
> I think DATA means that the data for this block is provided by *file. I
> wouldn't necessarily understand it to mean that the data actually takes
> up physical disk space there.
> 
>> Continue self-discussion.
>>
>> Consider closer the following case:
>>   >   * DATA ZERO OFFSET_VALID
>>   >   *  f    t        t       sectors preallocated, read as zero, returned file not
>>
>> It actually means that we must not read, as read will return wrong
>> data, when clusters are actually zero for guest.
> 
> It means that you need to read from bs itself to get the correct data
> (which will be zero). Even though OFFSET_VALID is set, reading from
> *file (typically bs->file->bs) at the returned offset might not give the
> right result.
> 
>> It's OK, when for ex. qcow2 returns this combination and link to its
>> file child: it means that if you read from qcow2 node, you'll see
>> correct zeros, as qcow2 has special metadata which shows that these
>> clusters are zero. But if you read from file directly at returned
>> offset you'll see garbage, so don't do it.
> 
> Correct.
> 
>> But what if some node returns this combination with file == itself? It
>> actually means that you must not read, but you should call
>> block-status to understand that there are zeros. So, if some format
>> can return this combination with file == itself it means that you must
>> not read it directly, but only after checking block status.
> 
> This doesn't make sense to me. Reading from a node is always correct.
> 
> But you're right that DATA seems to mean something slightly different at
> the protocol level because *file cannot have a meaningful value for the
> lower layer there. In this case, DATA still seems to mean that the data
> is fetched from the lower layer (i.e. the block device on which the file
> system resides). For holes, this is not the case.
> 
>> And file-posix is example of such driver. But file-posix holes are guaranteed to read as zero, so we can report DATA | ZERO.
>> But this will break user expirience which assumes that DATA means occupation of real disk space.
> 
> With the above explanation, DATA shouldn't be set for holes.
> 
> But it's still kind of inconsistent because OFFSET_VALID and the offset
> refer to bs itself and not to the lower layer.
> 
>> ...
>> me go and re-read what we've documented in NBD protocol about block steus...
>>
>> "DATA" turns into NBD_STATE_HOLE, which formally means nothing, and just notes that probably there is no disk space occupation
>> for this region.. So it's about disk space allocation and nothing about correctness of read.
>> and NBD_STATE_ZERO guarantees that region read as all zeroes.
>>
>> Look at code in nbd/server.c.. Aha, it calls block_status_above and turns !ALLOCATED into HOLE. Which means that it will never
>> return HOLE for file-posix..
> 
> Hm... This is a mess. :-)
> 


I'm afraid the these all are consequences of really different usages of BLOCK_STATUS:

1. For backing-chain aware block-jobs, to work with overlays. So backing children are different from the others,
as they are presenting overlays, which may be used in separate, and other children are not overlays and "more owned" by their parents

2. To understand where are zeroes and improve performance of copying loops (not copying zeroes, not sending zeroes through the wire,
  use effective write_zeroes)

3. To show file-mapping (qemu-img map)

4. Mirror use it to do DISCARD if "UNALLOCATED", but seems wrong for me now.. For which driver bdrv_block_status_above(source, NULL,...)
will return UNALLOCATED? Seems neither file-posix nor qcow2.

anything else?

May be we should split these use cases instead of trying to combine them using a lot of returned flags and parameters?


-- 
Best regards,
Vladimir