bdrv_co_block_status digs bs->file for additional, more accurate search
for hole inside region, reported as DATA by bs since 5daa74a6ebc.
This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
knows, where are holes and where is data. But every block_status
request calls lseek additionally. Assume a big disk, full of
data, in any iterative copying block job (or img convert) we'll call
lseek(HOLE) on every iteration, and each of these lseeks will have to
iterate through all metadata up to the end of file. It's obviously
ineffective behavior. And for many scenarios we don't need this lseek
at all.
So, add an option to omit calling lseek.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
Hi all!
This is a continuation of "do not lseek in qcow2 block_status" thread
(https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg05749.html)
qapi/block-core.json | 6 +++++-
block/file-posix.c | 13 ++++++++++++-
2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 762000f31f..ec5549e5c2 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2645,6 +2645,9 @@
# migration. May cause noticeable delays if the image
# file is large, do not use in production.
# (default: off) (since: 3.0)
+# @rough-block-status: If set, on block-status querying assume that the whole
+# file is data and do not try to find holes through system
+# calls (default: off) (since 4.0)
#
# Since: 2.9
##
@@ -2653,7 +2656,8 @@
'*pr-manager': 'str',
'*locking': 'OnOffAuto',
'*aio': 'BlockdevAioOptions',
- '*x-check-cache-dropped': 'bool' } }
+ '*x-check-cache-dropped': 'bool',
+ '*rough-block-status': 'bool' } }
##
# @BlockdevOptionsNull:
diff --git a/block/file-posix.c b/block/file-posix.c
index d8f0b93752..3f6d76a5dc 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -167,6 +167,7 @@ typedef struct BDRVRawState {
bool has_fallocate;
bool needs_alignment;
bool check_cache_dropped;
+ bool rough_block_status;
PRManager *pr_mgr;
} BDRVRawState;
@@ -439,6 +440,13 @@ static QemuOptsList raw_runtime_opts = {
.type = QEMU_OPT_BOOL,
.help = "check that page cache was dropped on live migration (default: off)"
},
+ {
+ .name = "rough-block-status",
+ .type = QEMU_OPT_BOOL,
+ .help = "If set, on block-status querying assume that the whole "
+ "file is data and do not try to find holes through system "
+ "calls (default: off)"
+ },
{ /* end of list */ }
},
};
@@ -525,6 +533,8 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
s->check_cache_dropped = qemu_opt_get_bool(opts, "x-check-cache-dropped",
false);
+ s->rough_block_status = qemu_opt_get_bool(opts, "rough-block-status",
+ false);
s->open_flags = open_flags;
raw_parse_flags(bdrv_flags, &s->open_flags);
@@ -2424,6 +2434,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
int64_t *map,
BlockDriverState **file)
{
+ BDRVRawState *s = bs->opaque;
off_t data = 0, hole = 0;
int ret;
@@ -2432,7 +2443,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
return ret;
}
- if (!want_zero) {
+ if (!want_zero || s->rough_block_status) {
*pnum = bytes;
*map = offset;
*file = bs;
--
2.18.0
On 1/8/19 1:45 PM, Vladimir Sementsov-Ogievskiy wrote:
> bdrv_co_block_status digs bs->file for additional, more accurate search
> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
>
> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> knows, where are holes and where is data.
Not when it is has been preallocated (at which point the qcow2 metadata
says the entire file is data, even though much of it may still be zeroes).
> But every block_status
> request calls lseek additionally. Assume a big disk, full of
> data, in any iterative copying block job (or img convert) we'll call
> lseek(HOLE) on every iteration, and each of these lseeks will have to
> iterate through all metadata up to the end of file. It's obviously
> ineffective behavior. And for many scenarios we don't need this lseek
> at all.
This part is true, and we have had several complaints from various
places about the time spent doing lseek()s where the block layer fails
to cache what was learned on the previous iteration, coupled with
inefficient kernel/filesystem implementations where lseek is not as
cheap as it should be.
>
> So, add an option to omit calling lseek.
The commit message would do well with an example showing the usage of
the parameter via a typical qemu-img command line that is currently
negatively impacted. Or even better, where is the iotests change that
serves as a unit test of the new knob?
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>
> Hi all!
>
> This is a continuation of "do not lseek in qcow2 block_status" thread
> (https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg05749.html)
>
> qapi/block-core.json | 6 +++++-
> block/file-posix.c | 13 ++++++++++++-
> 2 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 762000f31f..ec5549e5c2 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2645,6 +2645,9 @@
> # migration. May cause noticeable delays if the image
> # file is large, do not use in production.
> # (default: off) (since: 3.0)
Pre-existing: "off" isn't a valid QAPI bool; better would be fixing all
instances of this to "default: false".
> +# @rough-block-status: If set, on block-status querying assume that the whole
> +# file is data and do not try to find holes through system
> +# calls (default: off) (since 4.0)
You merely copied and pasted bad wording for the default section, but
you might as well fix it, whether or not we also fix the other instances
in a separate patch.
Possible better wording:
@rough-block-status: True to skip system calls that probe for holes when
querying for block status (default: false)
Bike-shedding: Since the point of this knob is to skip zero probing,
could we instead name it '@probe-zeroes: True to use lseek to probe for
holes (default: true)'?
> #
> # Since: 2.9
> ##
> @@ -2653,7 +2656,8 @@
> '*pr-manager': 'str',
> '*locking': 'OnOffAuto',
> '*aio': 'BlockdevAioOptions',
> - '*x-check-cache-dropped': 'bool' } }
> + '*x-check-cache-dropped': 'bool',
> + '*rough-block-status': 'bool' } }
>
> ##
> # @BlockdevOptionsNull:
> diff --git a/block/file-posix.c b/block/file-posix.c
> index d8f0b93752..3f6d76a5dc 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -167,6 +167,7 @@ typedef struct BDRVRawState {
> bool has_fallocate;
> bool needs_alignment;
> bool check_cache_dropped;
> + bool rough_block_status;
>
> PRManager *pr_mgr;
> } BDRVRawState;
> @@ -439,6 +440,13 @@ static QemuOptsList raw_runtime_opts = {
> .type = QEMU_OPT_BOOL,
> .help = "check that page cache was dropped on live migration (default: off)"
> },
> + {
> + .name = "rough-block-status",
> + .type = QEMU_OPT_BOOL,
> + .help = "If set, on block-status querying assume that the whole "
> + "file is data and do not try to find holes through system "
> + "calls (default: off)"
Whatever wording changes we make to QAPI should be copied here.
The idea makes sense, but without unit test coverage, I'm reluctant to
leave R-b yet.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org
Am 08.01.2019 um 20:45 hat Vladimir Sementsov-Ogievskiy geschrieben: > bdrv_co_block_status digs bs->file for additional, more accurate search > for hole inside region, reported as DATA by bs since 5daa74a6ebc. > > This accuracy is not free: assume we have qcow2 disk. Actually, qcow2 > knows, where are holes and where is data. But every block_status > request calls lseek additionally. Assume a big disk, full of > data, in any iterative copying block job (or img convert) we'll call > lseek(HOLE) on every iteration, and each of these lseeks will have to > iterate through all metadata up to the end of file. It's obviously > ineffective behavior. And for many scenarios we don't need this lseek > at all. > > So, add an option to omit calling lseek. > > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> I don't think adding an option is a good idea, people will not know this option or forget to add it. Also, it's an option that you want to specify almost always, so if we do add an option, maybe it should be the default and an option should be added to enable recursive mode? I haven't checked the full list of bdrv_co_block_status() callers yet. So far we're talking about 'qemu-img convert' or the mirror job. Maybe we need to check other callers, too, but for now I'll stick to these. The interesting case seems to be preallocated qcow2 images, where the clusters are allocated on the qcow2 level, but not in the image file. If we don't do the lseek there, we'll have to read in the data and will only then detect that it is zero (qemu-img only, mirror doesn't do this yet - should it have an option to enable a zero-detecting mode? Can detect-zeroes on the target node achieve the same thing?). So at least in qemu-img, we don't get a different result, it's just a bit slower. Also note that this is only metadata preallocation; full preallocation will still return allocated for the protocol layer and so it will always be slow. Now, if this isn't about correctness, but about performance, maybe we can do some heuristics instead of adding a user option? A very simple approach could be, look at the lower layers only if we got a large allocated area. If the image was preallocated, everything is allocated, so if bdrv_co_block_status() returns a short range, it's a hint that qcow2 knows _something_ at least. (That something could just be that someone discarded something from a preallocated image, so we can't jump to the conclusion that it's not preallocated - but it's certainly more likely). And for small buffer, just reading it in and checking whether it is zero can't be that much slower than finding a hole first. So how about we look at the protocol layer only if we got an allocated range of at least a megabyte or something? Or maybe make it depend on the cluster size because with 2 MB clusters that will be always the case, even if the image isn't preallocated. Kevin
On 09/01/19 12:23, Kevin Wolf wrote: > Also note that this is only metadata preallocation; full preallocation > will still return allocated for the protocol layer and so it will always > be slow. Full preallocation these days can create images with preallocated but known-zero blocks, I think? Paolo
Am 09.01.2019 um 17:42 hat Paolo Bonzini geschrieben: > On 09/01/19 12:23, Kevin Wolf wrote: > > Also note that this is only metadata preallocation; full preallocation > > will still return allocated for the protocol layer and so it will always > > be slow. > > Full preallocation these days can create images with preallocated but > known-zero blocks, I think? That would defeat one of the main purposes of preallocation because it would still require COW and metadata updates on the first write. If there is demand, we could add something like preallocation=data where data clusters are preallocated but COW/metadata updates still happen at runtime, but so far nobody has asked for it. Not sure when you would use it either. Kevin
On 09/01/19 17:51, Kevin Wolf wrote: > Am 09.01.2019 um 17:42 hat Paolo Bonzini geschrieben: >> On 09/01/19 12:23, Kevin Wolf wrote: >>> Also note that this is only metadata preallocation; full preallocation >>> will still return allocated for the protocol layer and so it will always >>> be slow. >> >> Full preallocation these days can create images with preallocated but >> known-zero blocks, I think? > > That would defeat one of the main purposes of preallocation because it > would still require COW and metadata updates on the first write. Sorry I mean at the protocol level, like FALLOC_FL_ZERO_RANGE. It would still require metadata updates on the filesystem level, unlike "real" full preallocation, but no qcow2 metadata updates. > If there is demand, we could add something like preallocation=data where > data clusters are preallocated but COW/metadata updates still happen at > runtime, but so far nobody has asked for it. Not sure when you would use > it either. > > Kevin >
Am 09.01.2019 um 17:55 hat Paolo Bonzini geschrieben: > On 09/01/19 17:51, Kevin Wolf wrote: > > Am 09.01.2019 um 17:42 hat Paolo Bonzini geschrieben: > >> On 09/01/19 12:23, Kevin Wolf wrote: > >>> Also note that this is only metadata preallocation; full preallocation > >>> will still return allocated for the protocol layer and so it will always > >>> be slow. > >> > >> Full preallocation these days can create images with preallocated but > >> known-zero blocks, I think? > > > > That would defeat one of the main purposes of preallocation because it > > would still require COW and metadata updates on the first write. > > Sorry I mean at the protocol level, like FALLOC_FL_ZERO_RANGE. It would > still require metadata updates on the filesystem level, unlike "real" > full preallocation, but no qcow2 metadata updates. preallocation=full doesn't do that. preallocation=falloc is more like it, though that is just posix_fallocate(), not FALLOC_FL_ZERO_RANGE. But when called on a new file, it might result in the same thing? Not sure. Kevin
On 1/9/19 8:09 PM, Kevin Wolf wrote: > Am 09.01.2019 um 17:55 hat Paolo Bonzini geschrieben: >> On 09/01/19 17:51, Kevin Wolf wrote: >>> Am 09.01.2019 um 17:42 hat Paolo Bonzini geschrieben: >>>> On 09/01/19 12:23, Kevin Wolf wrote: >>>>> Also note that this is only metadata preallocation; full preallocation >>>>> will still return allocated for the protocol layer and so it will always >>>>> be slow. >>>> Full preallocation these days can create images with preallocated but >>>> known-zero blocks, I think? >>> That would defeat one of the main purposes of preallocation because it >>> would still require COW and metadata updates on the first write. >> Sorry I mean at the protocol level, like FALLOC_FL_ZERO_RANGE. It would >> still require metadata updates on the filesystem level, unlike "real" >> full preallocation, but no qcow2 metadata updates. > preallocation=full doesn't do that. preallocation=falloc is more like > it, though that is just posix_fallocate(), not FALLOC_FL_ZERO_RANGE. But > when called on a new file, it might result in the same thing? Not sure. > > Kevin From the point of the file structure "fallocated" space should be considered as data as the space on the disc is really allocated and the same constraint should be kept in on target. Den
On 1/9/19 10:51 AM, Kevin Wolf wrote: > Am 09.01.2019 um 17:42 hat Paolo Bonzini geschrieben: >> On 09/01/19 12:23, Kevin Wolf wrote: >>> Also note that this is only metadata preallocation; full preallocation >>> will still return allocated for the protocol layer and so it will always >>> be slow. >> >> Full preallocation these days can create images with preallocated but >> known-zero blocks, I think? > > That would defeat one of the main purposes of preallocation because it > would still require COW and metadata updates on the first write. > > If there is demand, we could add something like preallocation=data where > data clusters are preallocated but COW/metadata updates still happen at > runtime, but so far nobody has asked for it. Not sure when you would use > it either. Except there HAS been talk about it before - such a mode makes it so that you can guarantee a non-fragmented image, or even guarantee that the entire guest-visible portion is contiguous and all qcow2 metadata is at the front or back of the overall qcow2 file, making it very easy to convert between qcow2 -> raw (strip the qcow2 metadata) or go in reverse (wrap a raw into a qcow2) without having to shuffle the guest-visible portions of the file. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
On Tue, Jan 08, 2019 at 10:45:52PM +0300, Vladimir Sementsov-Ogievskiy wrote: > bdrv_co_block_status digs bs->file for additional, more accurate search > for hole inside region, reported as DATA by bs since 5daa74a6ebc. > > This accuracy is not free: assume we have qcow2 disk. Actually, qcow2 > knows, where are holes and where is data. But every block_status > request calls lseek additionally. Assume a big disk, full of > data, in any iterative copying block job (or img convert) we'll call > lseek(HOLE) on every iteration, and each of these lseeks will have to > iterate through all metadata up to the end of file. It's obviously > ineffective behavior. And for many scenarios we don't need this lseek > at all. > > So, add an option to omit calling lseek. > > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> > --- > > Hi all! > > This is a continuation of "do not lseek in qcow2 block_status" thread > (https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg05749.html) I don't have time to participate in the discussion so my (uneducated?) opinion doesn't count for much, but I wish it were possible to perform this optimization automatically without introducing a new user-visible option. That way existing users benefit and new users don't have to learn about a new option. Stefan
© 2016 - 2025 Red Hat, Inc.