file-posix: add rough-block-status parameter

[Qemu-devel] [PATCH] file-posix: add rough-block-status parameter

Posted by Vladimir Sementsov-Ogievskiy 7 years, 1 month ago

bdrv_co_block_status digs bs->file for additional, more accurate search
for hole inside region, reported as DATA by bs since 5daa74a6ebc.

This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
knows, where are holes and where is data. But every block_status
request calls lseek additionally. Assume a big disk, full of
data, in any iterative copying block job (or img convert) we'll call
lseek(HOLE) on every iteration, and each of these lseeks will have to
iterate through all metadata up to the end of file. It's obviously
ineffective behavior. And for many scenarios we don't need this lseek
at all.

So, add an option to omit calling lseek.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---

Hi all!

This is a continuation of "do not lseek in qcow2 block_status" thread
(https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg05749.html)

 qapi/block-core.json |  6 +++++-
 block/file-posix.c   | 13 ++++++++++++-
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 762000f31f..ec5549e5c2 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2645,6 +2645,9 @@
 #                         migration.  May cause noticeable delays if the image
 #                         file is large, do not use in production.
 #                         (default: off) (since: 3.0)
+# @rough-block-status: If set, on block-status querying assume that the whole
+#                      file is data and do not try to find holes through system
+#                      calls (default: off) (since 4.0)
 #
 # Since: 2.9
 ##
@@ -2653,7 +2656,8 @@
             '*pr-manager': 'str',
             '*locking': 'OnOffAuto',
             '*aio': 'BlockdevAioOptions',
-            '*x-check-cache-dropped': 'bool' } }
+            '*x-check-cache-dropped': 'bool',
+            '*rough-block-status': 'bool' } }
 
 ##
 # @BlockdevOptionsNull:
diff --git a/block/file-posix.c b/block/file-posix.c
index d8f0b93752..3f6d76a5dc 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -167,6 +167,7 @@ typedef struct BDRVRawState {
     bool has_fallocate;
     bool needs_alignment;
     bool check_cache_dropped;
+    bool rough_block_status;
 
     PRManager *pr_mgr;
 } BDRVRawState;
@@ -439,6 +440,13 @@ static QemuOptsList raw_runtime_opts = {
             .type = QEMU_OPT_BOOL,
             .help = "check that page cache was dropped on live migration (default: off)"
         },
+        {
+            .name = "rough-block-status",
+            .type = QEMU_OPT_BOOL,
+            .help = "If set, on block-status querying assume that the whole "
+                    "file is data and do not try to find holes through system "
+                    "calls (default: off)"
+        },
         { /* end of list */ }
     },
 };
@@ -525,6 +533,8 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
 
     s->check_cache_dropped = qemu_opt_get_bool(opts, "x-check-cache-dropped",
                                                false);
+    s->rough_block_status = qemu_opt_get_bool(opts, "rough-block-status",
+                                              false);
 
     s->open_flags = open_flags;
     raw_parse_flags(bdrv_flags, &s->open_flags);
@@ -2424,6 +2434,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
                                             int64_t *map,
                                             BlockDriverState **file)
 {
+    BDRVRawState *s = bs->opaque;
     off_t data = 0, hole = 0;
     int ret;
 
@@ -2432,7 +2443,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
         return ret;
     }
 
-    if (!want_zero) {
+    if (!want_zero || s->rough_block_status) {
         *pnum = bytes;
         *map = offset;
         *file = bs;
-- 
2.18.0

Re: [Qemu-devel] [PATCH] file-posix: add rough-block-status parameter

Posted by Eric Blake 7 years, 1 month ago

On 1/8/19 1:45 PM, Vladimir Sementsov-Ogievskiy wrote:
> bdrv_co_block_status digs bs->file for additional, more accurate search
> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
> 
> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> knows, where are holes and where is data.

Not when it is has been preallocated (at which point the qcow2 metadata
says the entire file is data, even though much of it may still be zeroes).

> But every block_status
> request calls lseek additionally. Assume a big disk, full of
> data, in any iterative copying block job (or img convert) we'll call
> lseek(HOLE) on every iteration, and each of these lseeks will have to
> iterate through all metadata up to the end of file. It's obviously
> ineffective behavior. And for many scenarios we don't need this lseek
> at all.

This part is true, and we have had several complaints from various
places about the time spent doing lseek()s where the block layer fails
to cache what was learned on the previous iteration, coupled with
inefficient kernel/filesystem implementations where lseek is not as
cheap as it should be.

> 
> So, add an option to omit calling lseek.

The commit message would do well with an example showing the usage of
the parameter via a typical qemu-img command line that is currently
negatively impacted.  Or even better, where is the iotests change that
serves as a unit test of the new knob?

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
> 
> Hi all!
> 
> This is a continuation of "do not lseek in qcow2 block_status" thread
> (https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg05749.html)
> 
>  qapi/block-core.json |  6 +++++-
>  block/file-posix.c   | 13 ++++++++++++-
>  2 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 762000f31f..ec5549e5c2 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2645,6 +2645,9 @@
>  #                         migration.  May cause noticeable delays if the image
>  #                         file is large, do not use in production.
>  #                         (default: off) (since: 3.0)

Pre-existing: "off" isn't a valid QAPI bool; better would be fixing all
instances of this to "default: false".

> +# @rough-block-status: If set, on block-status querying assume that the whole
> +#                      file is data and do not try to find holes through system
> +#                      calls (default: off) (since 4.0)

You merely copied and pasted bad wording for the default section, but
you might as well fix it, whether or not we also fix the other instances
in a separate patch.

Possible better wording:
@rough-block-status: True to skip system calls that probe for holes when
querying for block status (default: false)

Bike-shedding: Since the point of this knob is to skip zero probing,
could we instead name it '@probe-zeroes: True to use lseek to probe for
holes (default: true)'?

>  #
>  # Since: 2.9
>  ##
> @@ -2653,7 +2656,8 @@
>              '*pr-manager': 'str',
>              '*locking': 'OnOffAuto',
>              '*aio': 'BlockdevAioOptions',
> -            '*x-check-cache-dropped': 'bool' } }
> +            '*x-check-cache-dropped': 'bool',
> +            '*rough-block-status': 'bool' } }
>  
>  ##
>  # @BlockdevOptionsNull:
> diff --git a/block/file-posix.c b/block/file-posix.c
> index d8f0b93752..3f6d76a5dc 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -167,6 +167,7 @@ typedef struct BDRVRawState {
>      bool has_fallocate;
>      bool needs_alignment;
>      bool check_cache_dropped;
> +    bool rough_block_status;
>  
>      PRManager *pr_mgr;
>  } BDRVRawState;
> @@ -439,6 +440,13 @@ static QemuOptsList raw_runtime_opts = {
>              .type = QEMU_OPT_BOOL,
>              .help = "check that page cache was dropped on live migration (default: off)"
>          },
> +        {
> +            .name = "rough-block-status",
> +            .type = QEMU_OPT_BOOL,
> +            .help = "If set, on block-status querying assume that the whole "
> +                    "file is data and do not try to find holes through system "
> +                    "calls (default: off)"

Whatever wording changes we make to QAPI should be copied here.

The idea makes sense, but without unit test coverage, I'm reluctant to
leave R-b yet.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH] file-posix: add rough-block-status parameter

Posted by Kevin Wolf 7 years ago

Am 08.01.2019 um 20:45 hat Vladimir Sementsov-Ogievskiy geschrieben:
> bdrv_co_block_status digs bs->file for additional, more accurate search
> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
> 
> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> knows, where are holes and where is data. But every block_status
> request calls lseek additionally. Assume a big disk, full of
> data, in any iterative copying block job (or img convert) we'll call
> lseek(HOLE) on every iteration, and each of these lseeks will have to
> iterate through all metadata up to the end of file. It's obviously
> ineffective behavior. And for many scenarios we don't need this lseek
> at all.
> 
> So, add an option to omit calling lseek.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

I don't think adding an option is a good idea, people will not know this
option or forget to add it. Also, it's an option that you want to
specify almost always, so if we do add an option, maybe it should be the
default and an option should be added to enable recursive mode?

I haven't checked the full list of bdrv_co_block_status() callers yet.
So far we're talking about 'qemu-img convert' or the mirror job. Maybe
we need to check other callers, too, but for now I'll stick to these.

The interesting case seems to be preallocated qcow2 images, where the
clusters are allocated on the qcow2 level, but not in the image file.
If we don't do the lseek there, we'll have to read in the data and will
only then detect that it is zero (qemu-img only, mirror doesn't do this
yet - should it have an option to enable a zero-detecting mode? Can
detect-zeroes on the target node achieve the same thing?). So at least
in qemu-img, we don't get a different result, it's just a bit slower.

Also note that this is only metadata preallocation; full preallocation
will still return allocated for the protocol layer and so it will always
be slow.

Now, if this isn't about correctness, but about performance, maybe we
can do some heuristics instead of adding a user option? A very simple
approach could be, look at the lower layers only if we got a large
allocated area. If the image was preallocated, everything is allocated,
so if bdrv_co_block_status() returns a short range, it's a hint that
qcow2 knows _something_ at least. (That something could just be that
someone discarded something from a preallocated image, so we can't jump
to the conclusion that it's not preallocated - but it's certainly more
likely). And for small buffer, just reading it in and checking whether
it is zero can't be that much slower than finding a hole first.

So how about we look at the protocol layer only if we got an allocated
range of at least a megabyte or something? Or maybe make it depend on
the cluster size because with 2 MB clusters that will be always the
case, even if the image isn't preallocated.

Kevin

Re: [Qemu-devel] [PATCH] file-posix: add rough-block-status parameter

Posted by Paolo Bonzini 7 years ago

On 09/01/19 12:23, Kevin Wolf wrote:
> Also note that this is only metadata preallocation; full preallocation
> will still return allocated for the protocol layer and so it will always
> be slow.

Full preallocation these days can create images with preallocated but
known-zero blocks, I think?

Paolo

Re: [Qemu-devel] [PATCH] file-posix: add rough-block-status parameter

Posted by Kevin Wolf 7 years ago

Am 09.01.2019 um 17:42 hat Paolo Bonzini geschrieben:
> On 09/01/19 12:23, Kevin Wolf wrote:
> > Also note that this is only metadata preallocation; full preallocation
> > will still return allocated for the protocol layer and so it will always
> > be slow.
> 
> Full preallocation these days can create images with preallocated but
> known-zero blocks, I think?

That would defeat one of the main purposes of preallocation because it
would still require COW and metadata updates on the first write.

If there is demand, we could add something like preallocation=data where
data clusters are preallocated but COW/metadata updates still happen at
runtime, but so far nobody has asked for it. Not sure when you would use
it either.

Kevin

Re: [Qemu-devel] [PATCH] file-posix: add rough-block-status parameter

Posted by Paolo Bonzini 7 years ago

On 09/01/19 17:51, Kevin Wolf wrote:
> Am 09.01.2019 um 17:42 hat Paolo Bonzini geschrieben:
>> On 09/01/19 12:23, Kevin Wolf wrote:
>>> Also note that this is only metadata preallocation; full preallocation
>>> will still return allocated for the protocol layer and so it will always
>>> be slow.
>>
>> Full preallocation these days can create images with preallocated but
>> known-zero blocks, I think?
> 
> That would defeat one of the main purposes of preallocation because it
> would still require COW and metadata updates on the first write.

Sorry I mean at the protocol level, like FALLOC_FL_ZERO_RANGE.  It would
still require metadata updates on the filesystem level, unlike "real"
full preallocation, but no qcow2 metadata updates.
> If there is demand, we could add something like preallocation=data where
> data clusters are preallocated but COW/metadata updates still happen at
> runtime, but so far nobody has asked for it. Not sure when you would use
> it either.
> 
> Kevin
>

Re: [Qemu-devel] [PATCH] file-posix: add rough-block-status parameter

Posted by Kevin Wolf 7 years ago

Am 09.01.2019 um 17:55 hat Paolo Bonzini geschrieben:
> On 09/01/19 17:51, Kevin Wolf wrote:
> > Am 09.01.2019 um 17:42 hat Paolo Bonzini geschrieben:
> >> On 09/01/19 12:23, Kevin Wolf wrote:
> >>> Also note that this is only metadata preallocation; full preallocation
> >>> will still return allocated for the protocol layer and so it will always
> >>> be slow.
> >>
> >> Full preallocation these days can create images with preallocated but
> >> known-zero blocks, I think?
> > 
> > That would defeat one of the main purposes of preallocation because it
> > would still require COW and metadata updates on the first write.
> 
> Sorry I mean at the protocol level, like FALLOC_FL_ZERO_RANGE.  It would
> still require metadata updates on the filesystem level, unlike "real"
> full preallocation, but no qcow2 metadata updates.

preallocation=full doesn't do that. preallocation=falloc is more like
it, though that is just posix_fallocate(), not FALLOC_FL_ZERO_RANGE. But
when called on a new file, it might result in the same thing? Not sure.

Kevin

Re: [Qemu-devel] [PATCH] file-posix: add rough-block-status parameter

Posted by Denis V.Lunev 7 years ago

On 1/9/19 8:09 PM, Kevin Wolf wrote:
> Am 09.01.2019 um 17:55 hat Paolo Bonzini geschrieben:
>> On 09/01/19 17:51, Kevin Wolf wrote:
>>> Am 09.01.2019 um 17:42 hat Paolo Bonzini geschrieben:
>>>> On 09/01/19 12:23, Kevin Wolf wrote:
>>>>> Also note that this is only metadata preallocation; full preallocation
>>>>> will still return allocated for the protocol layer and so it will always
>>>>> be slow.
>>>> Full preallocation these days can create images with preallocated but
>>>> known-zero blocks, I think?
>>> That would defeat one of the main purposes of preallocation because it
>>> would still require COW and metadata updates on the first write.
>> Sorry I mean at the protocol level, like FALLOC_FL_ZERO_RANGE.  It would
>> still require metadata updates on the filesystem level, unlike "real"
>> full preallocation, but no qcow2 metadata updates.
> preallocation=full doesn't do that. preallocation=falloc is more like
> it, though that is just posix_fallocate(), not FALLOC_FL_ZERO_RANGE. But
> when called on a new file, it might result in the same thing? Not sure.
>
> Kevin
From the point of the file structure "fallocated" space should be considered
as data as the space on the disc is really allocated and the same constraint
should be kept in on target.

Den

Re: [Qemu-devel] [PATCH] file-posix: add rough-block-status parameter

Posted by Eric Blake 7 years ago

On 1/9/19 10:51 AM, Kevin Wolf wrote:
> Am 09.01.2019 um 17:42 hat Paolo Bonzini geschrieben:
>> On 09/01/19 12:23, Kevin Wolf wrote:
>>> Also note that this is only metadata preallocation; full preallocation
>>> will still return allocated for the protocol layer and so it will always
>>> be slow.
>>
>> Full preallocation these days can create images with preallocated but
>> known-zero blocks, I think?
> 
> That would defeat one of the main purposes of preallocation because it
> would still require COW and metadata updates on the first write.
> 
> If there is demand, we could add something like preallocation=data where
> data clusters are preallocated but COW/metadata updates still happen at
> runtime, but so far nobody has asked for it. Not sure when you would use
> it either.

Except there HAS been talk about it before - such a mode makes it so
that you can guarantee a non-fragmented image, or even guarantee that
the entire guest-visible portion is contiguous and all qcow2 metadata is
at the front or back of the overall qcow2 file, making it very easy to
convert between qcow2 -> raw (strip the qcow2 metadata) or go in reverse
(wrap a raw into a qcow2) without having to shuffle the guest-visible
portions of the file.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [Qemu-block] [PATCH] file-posix: add rough-block-status parameter

Posted by Stefan Hajnoczi 7 years ago

On Tue, Jan 08, 2019 at 10:45:52PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> bdrv_co_block_status digs bs->file for additional, more accurate search
> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
> 
> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> knows, where are holes and where is data. But every block_status
> request calls lseek additionally. Assume a big disk, full of
> data, in any iterative copying block job (or img convert) we'll call
> lseek(HOLE) on every iteration, and each of these lseeks will have to
> iterate through all metadata up to the end of file. It's obviously
> ineffective behavior. And for many scenarios we don't need this lseek
> at all.
> 
> So, add an option to omit calling lseek.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
> 
> Hi all!
> 
> This is a continuation of "do not lseek in qcow2 block_status" thread
> (https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg05749.html)

I don't have time to participate in the discussion so my (uneducated?)
opinion doesn't count for much, but I wish it were possible to perform
this optimization automatically without introducing a new user-visible
option.

That way existing users benefit and new users don't have to learn about
a new option.

Stefan