[Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Vladimir Sementsov-Ogievskiy posted 1 patch 10 weeks ago
Failed in applying to current master (apply log)
qapi/block-core.json       |  5 +++-
include/block/block.h      |  1 +
include/block/block_int.h  |  1 +
block.c                    |  9 ++++++
block/io.c                 |  2 +-
qemu-options.hx            |  4 +++
tests/qemu-iotests/237     | 61 ++++++++++++++++++++++++++++++++++++++
tests/qemu-iotests/237.out | 12 ++++++++
tests/qemu-iotests/group   |  1 +
9 files changed, 94 insertions(+), 2 deletions(-)
create mode 100755 tests/qemu-iotests/237
create mode 100644 tests/qemu-iotests/237.out

[Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 10 weeks ago
drv_co_block_status digs bs->file for additional, more accurate search
for hole inside region, reported as DATA by bs since 5daa74a6ebc.

This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
knows, where are holes and where is data. But every block_status
request calls lseek additionally. Assume a big disk, full of
data, in any iterative copying block job (or img convert) we'll call
lseek(HOLE) on every iteration, and each of these lseeks will have to
iterate through all metadata up to the end of file. It's obviously
ineffective behavior. And for many scenarios we don't need this lseek
at all.

So, let's "5daa74a6ebc" by default, leaving an option to return
previous behavior, which is needed for scenarios with preallocated
images.

Add iotest illustrating new option semantics.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---

Hi all!

I don't call it v2, as it do completely another thing, but here is a
link to my previous try with discussion:
[PATCH] file-posix: add rough-block-status parameter
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg01322.html)

And the previous discussion on the topic is
"do not lseek in qcow2 block_status" thread
(https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg05749.html)

 qapi/block-core.json       |  5 +++-
 include/block/block.h      |  1 +
 include/block/block_int.h  |  1 +
 block.c                    |  9 ++++++
 block/io.c                 |  2 +-
 qemu-options.hx            |  4 +++
 tests/qemu-iotests/237     | 61 ++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/237.out | 12 ++++++++
 tests/qemu-iotests/group   |  1 +
 9 files changed, 94 insertions(+), 2 deletions(-)
 create mode 100755 tests/qemu-iotests/237
 create mode 100644 tests/qemu-iotests/237.out

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 762000f31f..9eed36f3f4 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3673,6 +3673,8 @@
 #                 (default: off)
 # @force-share:   force share all permission on added nodes.
 #                 Requires read-only=true. (Since 2.10)
+# @probe-zeroes:  Probe zeroes on protocol layer if format reports data
+#                 (default: false) (since 4.0)
 #
 # Remaining options are determined by the block driver.
 #
@@ -3686,7 +3688,8 @@
             '*read-only': 'bool',
             '*auto-read-only': 'bool',
             '*force-share': 'bool',
-            '*detect-zeroes': 'BlockdevDetectZeroesOptions' },
+            '*detect-zeroes': 'BlockdevDetectZeroesOptions',
+            '*probe-zeroes': 'bool' },
   'discriminator': 'driver',
   'data': {
       'blkdebug':   'BlockdevOptionsBlkdebug',
diff --git a/include/block/block.h b/include/block/block.h
index f70a843b72..1d7f4a6296 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -129,6 +129,7 @@ typedef struct HDGeometry {
 #define BDRV_OPT_AUTO_READ_ONLY "auto-read-only"
 #define BDRV_OPT_DISCARD        "discard"
 #define BDRV_OPT_FORCE_SHARE    "force-share"
+#define BDRV_OPT_PROBE_ZEROES   "probe-zeroes"
 
 
 #define BDRV_SECTOR_BITS   9
diff --git a/include/block/block_int.h b/include/block/block_int.h
index f605622216..68bddf06b8 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -753,6 +753,7 @@ struct BlockDriverState {
     QDict *options;
     QDict *explicit_options;
     BlockdevDetectZeroesOptions detect_zeroes;
+    bool probe_zeroes;
 
     /* The error object in use for blocking operations on backing_hd */
     Error *backing_blocker;
diff --git a/block.c b/block.c
index 4f5ff2cc12..23a17435c9 100644
--- a/block.c
+++ b/block.c
@@ -939,6 +939,7 @@ static void bdrv_inherited_options(int *child_flags, QDict *child_options,
     qdict_copy_default(child_options, parent_options, BDRV_OPT_CACHE_DIRECT);
     qdict_copy_default(child_options, parent_options, BDRV_OPT_CACHE_NO_FLUSH);
     qdict_copy_default(child_options, parent_options, BDRV_OPT_FORCE_SHARE);
+    qdict_copy_default(child_options, parent_options, BDRV_OPT_PROBE_ZEROES);
 
     /* Inherit the read-only option from the parent if it's not set */
     qdict_copy_default(child_options, parent_options, BDRV_OPT_READ_ONLY);
@@ -1063,6 +1064,7 @@ static void bdrv_backing_options(int *child_flags, QDict *child_options,
     qdict_copy_default(child_options, parent_options, BDRV_OPT_CACHE_DIRECT);
     qdict_copy_default(child_options, parent_options, BDRV_OPT_CACHE_NO_FLUSH);
     qdict_copy_default(child_options, parent_options, BDRV_OPT_FORCE_SHARE);
+    qdict_copy_default(child_options, parent_options, BDRV_OPT_PROBE_ZEROES);
 
     /* backing files always opened read-only */
     qdict_set_default_str(child_options, BDRV_OPT_READ_ONLY, "on");
@@ -1364,6 +1366,12 @@ QemuOptsList bdrv_runtime_opts = {
             .type = QEMU_OPT_BOOL,
             .help = "always accept other writers (default: off)",
         },
+        {
+            .name = BDRV_OPT_PROBE_ZEROES,
+            .type = QEMU_OPT_BOOL,
+            .help = "probe zeroes on protocol layer if format reports data "
+                    "(default: false)",
+        },
         { /* end of list */ }
     },
 };
@@ -1403,6 +1411,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockBackend *file,
     assert(drv != NULL);
 
     bs->force_share = qemu_opt_get_bool(opts, BDRV_OPT_FORCE_SHARE, false);
+    bs->probe_zeroes = qemu_opt_get_bool(opts, BDRV_OPT_PROBE_ZEROES, false);
 
     if (bs->force_share && (bs->open_flags & BDRV_O_RDWR)) {
         error_setg(errp,
diff --git a/block/io.c b/block/io.c
index bd9d688f8b..0c0cb3a17f 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2186,7 +2186,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
         }
     }
 
-    if (want_zero && local_file && local_file != bs &&
+    if (want_zero && bs->probe_zeroes && local_file && local_file != bs &&
         (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) &&
         (ret & BDRV_BLOCK_OFFSET_VALID)) {
         int64_t file_pnum;
diff --git a/qemu-options.hx b/qemu-options.hx
index d4f3564b78..7b8dfcfaf6 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -615,6 +615,7 @@ DEF("blockdev", HAS_ARG, QEMU_OPTION_blockdev,
     "-blockdev [driver=]driver[,node-name=N][,discard=ignore|unmap]\n"
     "          [,cache.direct=on|off][,cache.no-flush=on|off]\n"
     "          [,read-only=on|off][,detect-zeroes=on|off|unmap]\n"
+    "          [,probe-zeroes=on|off]\n"
     "          [,driver specific parameters...]\n"
     "                configure a block backend\n", QEMU_ARCH_ALL)
 STEXI
@@ -670,6 +671,8 @@ discard requests.
 conversion of plain zero writes by the OS to driver specific optimized
 zero write commands. You may even choose "unmap" if @var{discard} is set
 to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
+@item probe-zeroes=@var{probe-zeroes}
+Probe zeroes on protocol layer if format reports data. Default is "off".
 @end table
 
 @item Driver-specific options for @code{file}
@@ -793,6 +796,7 @@ DEF("drive", HAS_ARG, QEMU_OPTION_drive,
     "       [,werror=ignore|stop|report|enospc][,id=name][,aio=threads|native]\n"
     "       [,readonly=on|off][,copy-on-read=on|off]\n"
     "       [,discard=ignore|unmap][,detect-zeroes=on|off|unmap]\n"
+    "       [,probe-zeroes=on|off]\n"
     "       [[,bps=b]|[[,bps_rd=r][,bps_wr=w]]]\n"
     "       [[,iops=i]|[[,iops_rd=r][,iops_wr=w]]]\n"
     "       [[,bps_max=bm]|[[,bps_rd_max=rm][,bps_wr_max=wm]]]\n"
diff --git a/tests/qemu-iotests/237 b/tests/qemu-iotests/237
new file mode 100755
index 0000000000..9e66230c13
--- /dev/null
+++ b/tests/qemu-iotests/237
@@ -0,0 +1,61 @@
+#!/bin/bash
+#
+# Test probe-zeroes drive option
+#
+# Copyright (c) 2019 Virtuozzo International GmbH. All rights reserved.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=vsementsov@virtuozzo.com
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+status=1	# failure is the default!
+
+_cleanup()
+{
+	_cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+
+IMGOPTS='preallocation=metadata' _make_test_img 1M
+
+echo "== default map =="
+$QEMU_IMG map --output=json "$TEST_IMG"
+
+echo
+echo "== map with probe-zeroes=off =="
+$QEMU_IMG map --output=json --image-opts \
+        driver="$IMGFMT",file.filename="$TEST_IMG",probe-zeroes=off
+
+echo
+echo "== map with probe-zeroes=on =="
+$QEMU_IMG map --output=json --image-opts \
+        driver="$IMGFMT",file.filename="$TEST_IMG",probe-zeroes=on
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/237.out b/tests/qemu-iotests/237.out
new file mode 100644
index 0000000000..8747757da7
--- /dev/null
+++ b/tests/qemu-iotests/237.out
@@ -0,0 +1,12 @@
+QA output created by 237
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=metadata
+== default map ==
+[{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 327680}]
+
+== map with probe-zeroes=off ==
+[{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 327680}]
+
+== map with probe-zeroes=on ==
+[{ "start": 0, "length": 1044480, "depth": 0, "zero": true, "data": true, "offset": 327680},
+{ "start": 1044480, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": 1372160}]
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 61a6d98ebd..3064a04911 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -233,3 +233,4 @@
 233 auto quick
 234 auto quick migration
 235 auto quick
+237 auto quick
-- 
2.18.0


Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Kevin Wolf 10 weeks ago
Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
> drv_co_block_status digs bs->file for additional, more accurate search
> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
> 
> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> knows, where are holes and where is data. But every block_status
> request calls lseek additionally. Assume a big disk, full of
> data, in any iterative copying block job (or img convert) we'll call
> lseek(HOLE) on every iteration, and each of these lseeks will have to
> iterate through all metadata up to the end of file. It's obviously
> ineffective behavior. And for many scenarios we don't need this lseek
> at all.
> 
> So, let's "5daa74a6ebc" by default, leaving an option to return
> previous behavior, which is needed for scenarios with preallocated
> images.
> 
> Add iotest illustrating new option semantics.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

I still think that an option isn't a good solution and we should try use
some heuristics instead.

Kevin

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 10 weeks ago
11.01.2019 13:41, Kevin Wolf wrote:
> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> drv_co_block_status digs bs->file for additional, more accurate search
>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
>>
>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>> knows, where are holes and where is data. But every block_status
>> request calls lseek additionally. Assume a big disk, full of
>> data, in any iterative copying block job (or img convert) we'll call
>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>> iterate through all metadata up to the end of file. It's obviously
>> ineffective behavior. And for many scenarios we don't need this lseek
>> at all.
>>
>> So, let's "5daa74a6ebc" by default, leaving an option to return
>> previous behavior, which is needed for scenarios with preallocated
>> images.
>>
>> Add iotest illustrating new option semantics.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> 
> I still think that an option isn't a good solution and we should try use
> some heuristics instead.

Do you think that heuristics would be better than fair cache for lseek results?

I don't see good heuristics, neither want to implement lseek optimizations.
In our cases we don't need lseek under qcow2 at all, and it's obviously better
just don't lseek in these cases.

Moreover, as I understand, the cases when we need lseek are those when qcow2 thinks
that the cluster is data, but actually it's a hole on fs... Isn't it better to
support this thing in qcow2? A kind of allocated-zeroes cluster? We have enough
reserved bits in cluster descriptor to add this feature. And than we don't need
this kind of cheating when user "better knows" where are holes than format layer
which owns the data..

-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Kevin Wolf 8 weeks ago
Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 11.01.2019 13:41, Kevin Wolf wrote:
> > Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> drv_co_block_status digs bs->file for additional, more accurate search
> >> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
> >>
> >> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> >> knows, where are holes and where is data. But every block_status
> >> request calls lseek additionally. Assume a big disk, full of
> >> data, in any iterative copying block job (or img convert) we'll call
> >> lseek(HOLE) on every iteration, and each of these lseeks will have to
> >> iterate through all metadata up to the end of file. It's obviously
> >> ineffective behavior. And for many scenarios we don't need this lseek
> >> at all.
> >>
> >> So, let's "5daa74a6ebc" by default, leaving an option to return
> >> previous behavior, which is needed for scenarios with preallocated
> >> images.
> >>
> >> Add iotest illustrating new option semantics.
> >>
> >> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > 
> > I still think that an option isn't a good solution and we should try use
> > some heuristics instead.
> 
> Do you think that heuristics would be better than fair cache for lseek results?

I just played a bit with this (qemu-img convert only), and how much
caching lseek() results helps depends completely on the image. As it
happened, my test image was the worst case where caching didn't buy us
much. Obviously, I can just as easily construct an image where it makes
a huge difference. I think that most real-world images should be able to
take good advantage of it, though, and it doesn't hurt, so maybe that's
a first thing that we can do in any case. It might not be the complete
solution, though.

Let me explain my test images: The case where all of this actually
matters for qemu-img convert is fragmented qcow2 images. If your image
isn't fragmented, we don't do lseek() a lot anyway because a single
bdrv_block_status() call already gives you the information for the whole
image. So I constructed a fragmented image, by writing to it backwards:

./qemu-img create -f qcow2 /tmp/test.qcow2 1G
for i in $(seq 16384 -1 0); do
    echo "write $((i * 65536)) 64k"
done | ./qemu-io /tmp/test.qcow2

It's not really surprising that caching the lseek() results doesn't help
much there as we're moving backwards and lseek() only returns results
about the things after the current position, not before the current
position. So this is probably the worst case.

So I constructed a second image, which is fragmented, too, but starts at
the beginning of the image file:

./qemu-img create -f qcow2 /tmp/test_forward.qcow2 1G
for i in $(seq 0 2 16384); do
    echo "write $((i * 65536)) 64k"
done | ./qemu-io /tmp/test_forward.qcow2
for i in $(seq 1 2 16384); do
    echo "write $((i * 65536)) 64k"
done | ./qemu-io /tmp/test_forward.qcow2

Here caching makes a huge difference:

    time ./qemu-img convert -p -n $IMG null-co://

                        uncached        cached
    test.qcow2             ~145s         ~70s
    test_forward.qcow2     ~110s        ~0.2s

Not completely sure why there is such a big difference even in the
uncached case, but it seems to be reproducible. I haven't looked into
that more closely.

> I don't see good heuristics, neither want to implement lseek optimizations.
> In our cases we don't need lseek under qcow2 at all, and it's obviously better
> just don't lseek in these cases.

I also did the same thing with an image where I allocated 2 MB chunks
instead of 64k (backwards), and that brings it down to ~3.5s without
caching and ~2s with caching.

So if we implemented the heuristics and lseek caching, maybe we're good?

Kevin


diff --git a/block/file-posix.c b/block/file-posix.c
index 8aee7a3fb8..7272c7c99d 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -168,6 +168,12 @@ typedef struct BDRVRawState {
     bool needs_alignment;
     bool check_cache_dropped;

+    struct seek_data_cache {
+        bool        valid;
+        uint64_t    start;
+        uint64_t    end;
+    } seek_data_cache;
+
     PRManager *pr_mgr;
 } BDRVRawState;

@@ -1555,8 +1561,17 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
 {
     RawPosixAIOData *aiocb = opaque;
     BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;
+    struct seek_data_cache *sdc;
     int ret;

+    /* Invalidate seek_data_cache if it overlaps */
+    sdc = &s->seek_data_cache;
+    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
+                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
+    {
+        sdc->valid = false;
+    }
+
     /* First try to write zeros and unmap at the same time */

 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
@@ -1634,11 +1649,20 @@ static int handle_aiocb_discard(void *opaque)
     RawPosixAIOData *aiocb = opaque;
     int ret = -EOPNOTSUPP;
     BDRVRawState *s = aiocb->bs->opaque;
+    struct seek_data_cache *sdc;
 
     if (!s->has_discard) {
         return -ENOTSUP;
     }
 
+    /* Invalidate seek_data_cache if it overlaps */
+    sdc = &s->seek_data_cache;
+    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
+                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
+    {
+        sdc->valid = false;
+    }
+
     if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
 #ifdef BLKDISCARD
         do {
@@ -2424,6 +2448,8 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
                                             int64_t *map,
                                             BlockDriverState **file)
 {
+    BDRVRawState *s = bs->opaque;
+    struct seek_data_cache *sdc;
     off_t data = 0, hole = 0;
     int ret;
 
@@ -2439,6 +2465,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
         return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
     }
 
+    sdc = &s->seek_data_cache;
+    if (sdc->valid && sdc->start <= offset && sdc->end > offset) {
+        *pnum = MIN(bytes, sdc->end - offset);
+        *map = offset;
+        *file = bs;
+        return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+    }
+
     ret = find_allocation(bs, offset, &data, &hole);
     if (ret == -ENXIO) {
         /* Trailing hole */
@@ -2451,14 +2485,27 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
     } else if (data == offset) {
         /* On a data extent, compute bytes to the end of the extent,
          * possibly including a partial sector at EOF. */
-        *pnum = MIN(bytes, hole - offset);
+        *pnum = hole - offset;
         ret = BDRV_BLOCK_DATA;
     } else {
         /* On a hole, compute bytes to the beginning of the next extent.  */
         assert(hole == offset);
-        *pnum = MIN(bytes, data - offset);
+        *pnum = data - offset;
         ret = BDRV_BLOCK_ZERO;
     }
+
+    /* Caching allocated ranges is okay even if another process writes to the
+     * same file because we allow declaring things allocated even if there is a
+     * hole. However, we cannot cache holes without risking corruption. */
+    if (ret == BDRV_BLOCK_DATA) {
+        *sdc = (struct seek_data_cache) {
+            .valid  = true,
+            .start  = offset,
+            .end    = offset + *pnum,
+        };
+    }
+
+    *pnum = MIN(*pnum, bytes);
     *map = offset;
     *file = bs;
     return ret | BDRV_BLOCK_OFFSET_VALID;

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 8 weeks ago
22.01.2019 21:57, Kevin Wolf wrote:
> Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 11.01.2019 13:41, Kevin Wolf wrote:
>>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> drv_co_block_status digs bs->file for additional, more accurate search
>>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
>>>>
>>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>>>> knows, where are holes and where is data. But every block_status
>>>> request calls lseek additionally. Assume a big disk, full of
>>>> data, in any iterative copying block job (or img convert) we'll call
>>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>>>> iterate through all metadata up to the end of file. It's obviously
>>>> ineffective behavior. And for many scenarios we don't need this lseek
>>>> at all.
>>>>
>>>> So, let's "5daa74a6ebc" by default, leaving an option to return
>>>> previous behavior, which is needed for scenarios with preallocated
>>>> images.
>>>>
>>>> Add iotest illustrating new option semantics.
>>>>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>
>>> I still think that an option isn't a good solution and we should try use
>>> some heuristics instead.
>>
>> Do you think that heuristics would be better than fair cache for lseek results?
> 
> I just played a bit with this (qemu-img convert only), and how much
> caching lseek() results helps depends completely on the image. As it
> happened, my test image was the worst case where caching didn't buy us
> much. Obviously, I can just as easily construct an image where it makes
> a huge difference. I think that most real-world images should be able to
> take good advantage of it, though, and it doesn't hurt, so maybe that's
> a first thing that we can do in any case. It might not be the complete
> solution, though.

Hmm, and one more idea from Den:

We can detect preallocated image, comparing allocated size of real file with
number of non-zero qcow2 refcounts. So, real allocation is much less than
allocation in qcow2 point of view, we'll enable lseeks, otherwise - not.

-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 8 weeks ago
23.01.2019 15:04, Vladimir Sementsov-Ogievskiy wrote:
> 22.01.2019 21:57, Kevin Wolf wrote:
>> Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>> 11.01.2019 13:41, Kevin Wolf wrote:
>>>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>> drv_co_block_status digs bs->file for additional, more accurate search
>>>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
>>>>>
>>>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>>>>> knows, where are holes and where is data. But every block_status
>>>>> request calls lseek additionally. Assume a big disk, full of
>>>>> data, in any iterative copying block job (or img convert) we'll call
>>>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>>>>> iterate through all metadata up to the end of file. It's obviously
>>>>> ineffective behavior. And for many scenarios we don't need this lseek
>>>>> at all.
>>>>>
>>>>> So, let's "5daa74a6ebc" by default, leaving an option to return
>>>>> previous behavior, which is needed for scenarios with preallocated
>>>>> images.
>>>>>
>>>>> Add iotest illustrating new option semantics.
>>>>>
>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>
>>>> I still think that an option isn't a good solution and we should try use
>>>> some heuristics instead.
>>>
>>> Do you think that heuristics would be better than fair cache for lseek results?
>>
>> I just played a bit with this (qemu-img convert only), and how much
>> caching lseek() results helps depends completely on the image. As it
>> happened, my test image was the worst case where caching didn't buy us
>> much. Obviously, I can just as easily construct an image where it makes
>> a huge difference. I think that most real-world images should be able to
>> take good advantage of it, though, and it doesn't hurt, so maybe that's
>> a first thing that we can do in any case. It might not be the complete
>> solution, though.
> 
> Hmm, and one more idea from Den:
> 
> We can detect preallocated image, comparing allocated size of real file with
> number of non-zero qcow2 refcounts. So, real allocation is much less than
> allocation in qcow2 point of view, we'll enable lseeks, otherwise - not.
> 

Kevin, what do you think?

-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Kevin Wolf 8 weeks ago
Am 24.01.2019 um 15:37 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 23.01.2019 15:04, Vladimir Sementsov-Ogievskiy wrote:
> > 22.01.2019 21:57, Kevin Wolf wrote:
> >> Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >>> 11.01.2019 13:41, Kevin Wolf wrote:
> >>>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >>>>> drv_co_block_status digs bs->file for additional, more accurate search
> >>>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
> >>>>>
> >>>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> >>>>> knows, where are holes and where is data. But every block_status
> >>>>> request calls lseek additionally. Assume a big disk, full of
> >>>>> data, in any iterative copying block job (or img convert) we'll call
> >>>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
> >>>>> iterate through all metadata up to the end of file. It's obviously
> >>>>> ineffective behavior. And for many scenarios we don't need this lseek
> >>>>> at all.
> >>>>>
> >>>>> So, let's "5daa74a6ebc" by default, leaving an option to return
> >>>>> previous behavior, which is needed for scenarios with preallocated
> >>>>> images.
> >>>>>
> >>>>> Add iotest illustrating new option semantics.
> >>>>>
> >>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >>>>
> >>>> I still think that an option isn't a good solution and we should try use
> >>>> some heuristics instead.
> >>>
> >>> Do you think that heuristics would be better than fair cache for lseek results?
> >>
> >> I just played a bit with this (qemu-img convert only), and how much
> >> caching lseek() results helps depends completely on the image. As it
> >> happened, my test image was the worst case where caching didn't buy us
> >> much. Obviously, I can just as easily construct an image where it makes
> >> a huge difference. I think that most real-world images should be able to
> >> take good advantage of it, though, and it doesn't hurt, so maybe that's
> >> a first thing that we can do in any case. It might not be the complete
> >> solution, though.
> > 
> > Hmm, and one more idea from Den:
> > 
> > We can detect preallocated image, comparing allocated size of real file with
> > number of non-zero qcow2 refcounts. So, real allocation is much less than
> > allocation in qcow2 point of view, we'll enable lseeks, otherwise - not.
> > 
> 
> Kevin, what do you think?

I'm unsure. I think it requires scanning all refcount blocks in
qcow2_open(), right? This could be slow on huge images. On the other
hand, the first cluster allocation will probably do this anyway, so it
might be reasonable enough.

How would you communicate this? Another block_status return flag that
says "don't bother to ask the protocol layer" and which we would only
set in qcow2 if the probing came to the conclusion that it's not
preallocated?

Kevin

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Eric Blake 8 weeks ago
On 1/24/19 9:39 AM, Kevin Wolf wrote:

>>>
>>> Hmm, and one more idea from Den:
>>>
>>> We can detect preallocated image, comparing allocated size of real file with
>>> number of non-zero qcow2 refcounts. So, real allocation is much less than
>>> allocation in qcow2 point of view, we'll enable lseeks, otherwise - not.
>>>
>>
>> Kevin, what do you think?
> 
> I'm unsure. I think it requires scanning all refcount blocks in
> qcow2_open(), right? This could be slow on huge images. On the other
> hand, the first cluster allocation will probably do this anyway, so it
> might be reasonable enough.
> 

We could add a qcow2 header to cache the last known values, to make
things faster on open (will only benefit users that know to set/read the
cache; should probably be an autoclear-type feature to detect when an
older user modified the image but not the cache).

We can on first allocation for images being modified - but the use case
in question is 'qemu-img convert' which only reads the image, not writes
it, so yes, it adds a startup delay.  But if the delay is no worse than
the current code, or if it can end early, it may help.  We're using it
as some sort of heuristic; we need to compare the size of the underlying
file to the count of known-allocated clusters in the qcow2 layer (where
internal snapshots may throw things off, and where having a backing file
or not changes whether a sparse file is worth further checking for extra
precision for encountering holes on allocated clusters).

> How would you communicate this? Another block_status return flag that
> says "don't bother to ask the protocol layer" and which we would only
> set in qcow2 if the probing came to the conclusion that it's not
> preallocated?

That could work.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 8 weeks ago
24.01.2019 18:39, Kevin Wolf wrote:
> Am 24.01.2019 um 15:37 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 23.01.2019 15:04, Vladimir Sementsov-Ogievskiy wrote:
>>> 22.01.2019 21:57, Kevin Wolf wrote:
>>>> Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>> 11.01.2019 13:41, Kevin Wolf wrote:
>>>>>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>>> drv_co_block_status digs bs->file for additional, more accurate search
>>>>>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
>>>>>>>
>>>>>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>>>>>>> knows, where are holes and where is data. But every block_status
>>>>>>> request calls lseek additionally. Assume a big disk, full of
>>>>>>> data, in any iterative copying block job (or img convert) we'll call
>>>>>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>>>>>>> iterate through all metadata up to the end of file. It's obviously
>>>>>>> ineffective behavior. And for many scenarios we don't need this lseek
>>>>>>> at all.
>>>>>>>
>>>>>>> So, let's "5daa74a6ebc" by default, leaving an option to return
>>>>>>> previous behavior, which is needed for scenarios with preallocated
>>>>>>> images.
>>>>>>>
>>>>>>> Add iotest illustrating new option semantics.
>>>>>>>
>>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>>
>>>>>> I still think that an option isn't a good solution and we should try use
>>>>>> some heuristics instead.
>>>>>
>>>>> Do you think that heuristics would be better than fair cache for lseek results?
>>>>
>>>> I just played a bit with this (qemu-img convert only), and how much
>>>> caching lseek() results helps depends completely on the image. As it
>>>> happened, my test image was the worst case where caching didn't buy us
>>>> much. Obviously, I can just as easily construct an image where it makes
>>>> a huge difference. I think that most real-world images should be able to
>>>> take good advantage of it, though, and it doesn't hurt, so maybe that's
>>>> a first thing that we can do in any case. It might not be the complete
>>>> solution, though.
>>>
>>> Hmm, and one more idea from Den:
>>>
>>> We can detect preallocated image, comparing allocated size of real file with
>>> number of non-zero qcow2 refcounts. So, real allocation is much less than
>>> allocation in qcow2 point of view, we'll enable lseeks, otherwise - not.
>>>
>>
>> Kevin, what do you think?
> 
> I'm unsure. I think it requires scanning all refcount blocks in
> qcow2_open(), right? This could be slow on huge images. On the other
> hand, the first cluster allocation will probably do this anyway, so it
> might be reasonable enough.

Seems like better is doing it on first block_status, not on open.

> 
> How would you communicate this? Another block_status return flag that
> says "don't bother to ask the protocol layer" and which we would only
> set in qcow2 if the probing came to the conclusion that it's not
> preallocated?

Or bool flag on BlockDriverState.

> 
> Kevin
> 


-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 8 weeks ago
22.01.2019 21:57, Kevin Wolf wrote:
> Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 11.01.2019 13:41, Kevin Wolf wrote:
>>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> drv_co_block_status digs bs->file for additional, more accurate search
>>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
>>>>
>>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>>>> knows, where are holes and where is data. But every block_status
>>>> request calls lseek additionally. Assume a big disk, full of
>>>> data, in any iterative copying block job (or img convert) we'll call
>>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>>>> iterate through all metadata up to the end of file. It's obviously
>>>> ineffective behavior. And for many scenarios we don't need this lseek
>>>> at all.
>>>>
>>>> So, let's "5daa74a6ebc" by default, leaving an option to return
>>>> previous behavior, which is needed for scenarios with preallocated
>>>> images.
>>>>
>>>> Add iotest illustrating new option semantics.
>>>>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>
>>> I still think that an option isn't a good solution and we should try use
>>> some heuristics instead.
>>
>> Do you think that heuristics would be better than fair cache for lseek results?
> 
> I just played a bit with this (qemu-img convert only), and how much
> caching lseek() results helps depends completely on the image. As it
> happened, my test image was the worst case where caching didn't buy us
> much. Obviously, I can just as easily construct an image where it makes
> a huge difference. I think that most real-world images should be able to
> take good advantage of it, though, and it doesn't hurt, so maybe that's
> a first thing that we can do in any case. It might not be the complete
> solution, though.
> 
> Let me explain my test images: The case where all of this actually
> matters for qemu-img convert is fragmented qcow2 images. If your image
> isn't fragmented, we don't do lseek() a lot anyway because a single
> bdrv_block_status() call already gives you the information for the whole
> image. So I constructed a fragmented image, by writing to it backwards:
> 
> ./qemu-img create -f qcow2 /tmp/test.qcow2 1G
> for i in $(seq 16384 -1 0); do
>      echo "write $((i * 65536)) 64k"
> done | ./qemu-io /tmp/test.qcow2
> 
> It's not really surprising that caching the lseek() results doesn't help
> much there as we're moving backwards and lseek() only returns results
> about the things after the current position, not before the current
> position. So this is probably the worst case.
> 
> So I constructed a second image, which is fragmented, too, but starts at
> the beginning of the image file:
> 
> ./qemu-img create -f qcow2 /tmp/test_forward.qcow2 1G
> for i in $(seq 0 2 16384); do
>      echo "write $((i * 65536)) 64k"
> done | ./qemu-io /tmp/test_forward.qcow2
> for i in $(seq 1 2 16384); do
>      echo "write $((i * 65536)) 64k"
> done | ./qemu-io /tmp/test_forward.qcow2
> 
> Here caching makes a huge difference:
> 
>      time ./qemu-img convert -p -n $IMG null-co://
> 
>                          uncached        cached
>      test.qcow2             ~145s         ~70s
>      test_forward.qcow2     ~110s        ~0.2s

Unsure about your results, at least 0.2s means, that we benefit from cached read, not lseek.

= my results =

in short:

uncached read:
+--------------+--------+------+------+--------+
|              | master | you  |  me  | master |
+--------------+--------+------+------+--------+
| test         |   30.4 | 32.6 | 33.9 |   32.4 |
| test_forward |   28.3 | 33.5 | 32.9 |   32.8 |
+--------------+--------+------+------+--------+

('you' is your patch, 'me' is my simple patch, see below)

(I retested master, as first test run seemed noticeable faster than patched)
So, No significant difference may be noticed (if ignore first run:). Or I need a lot more test runs,
to calculate the average.
However, I don't expect any difference here, I'm afraid that lseek() time becomes noticeable in
comparison with read() for a lot larger disks.
Also, problems of lseek are mostly related to lseek bugs, now seems that my kernel is not buggy..
What kernel and fs do you use to get such a significant difference between cached/uncached lseek?


On the other hand, results with cached read are more interesting:

+--------------+--------+------+------+--------+
|              | master | you  |  me  | master |
+--------------+--------+------+------+--------+
| test         |   0.24 | 0.20 | 0.16 |   0.24 |
| test_forward |   0.24 | 0.16 | 0.16 |   0.24 |
+--------------+--------+------+------+--------+

And they show, that my patch wins. So no lseeks = no problems.
Moreover, keeping in mind, that we in Virtuozzo don't have scenarios,
where we'll benefit from lseeks, and after two slow-lseek bugs, it's
definitely safer for me to just keep one out-of-tree patch, than
rely on lseek-cache + lseek, which both are not needed in our case.
Of-course, lseek-cache is a good thing, and your patch seems reasonable,
but I'll go with my patch anyway, and I proposed an option to bring
such behavior to upstream, if someone wants it.

= more detailed tests run description =

1. master branch

]# sync; echo 3 > /proc/sys/vm/drop_caches
]# time ./qemu-img convert -p -n /tmp/test.qcow2 null-co://
     (100.00/100%)

real    0m30.402s
user    0m0.361s
sys     0m0.859s
]# time ./qemu-img convert -p -n /tmp/test.qcow2 null-co://
     (100.00/100%)

real    0m0.240s
user    0m0.173s
sys     0m0.300s
]# time ./qemu-img convert -p -n /tmp/test.qcow2 null-co://
     (100.00/100%)

real    0m0.245s
user    0m0.166s
sys     0m0.286s
]# sync; echo 3 > /proc/sys/vm/drop_caches
]# time ./qemu-img convert -p -n /tmp/test_forward.qcow2 null-co://
     (100.00/100%)

real    0m28.291s
user    0m0.343s
sys     0m0.943s
]# time ./qemu-img convert -p -n /tmp/test_forward.qcow2 null-co://
     (100.00/100%)

real    0m0.231s
user    0m0.154s
sys     0m0.308s
]# time ./qemu-img convert -p -n /tmp/test_forward.qcow2 null-co://
     (100.00/100%)

real    0m0.241s
user    0m0.158s
sys     0m0.284s

2. your patch

]# sync; echo 3 > /proc/sys/vm/drop_caches
]# time ./qemu-img convert -p -n /tmp/test.qcow2 null-co://
     (100.00/100%)

real    0m32.634s
user    0m0.328s
sys     0m0.884s
]# time ./qemu-img convert -p -n /tmp/test.qcow2 null-co://
     (100.00/100%)

real    0m0.208s
user    0m0.169s
sys     0m0.304s
]# time ./qemu-img convert -p -n /tmp/test.qcow2 null-co://
     (100.00/100%)

real    0m0.189s
user    0m0.155s
sys     0m0.263s
]# time ./qemu-img convert -p -n /tmp/test.qcow2 null-co://
     (100.00/100%)

real    0m0.194s
user    0m0.154s
sys     0m0.273s
]# sync; echo 3 > /proc/sys/vm/drop_caches
]# time ./qemu-img convert -p -n /tmp/test_forward.qcow2 null-co://
     (100.00/100%)

real    0m33.486s
user    0m0.374s
sys     0m0.959s
]# time ./qemu-img convert -p -n /tmp/test_forward.qcow2 null-co://
     (100.00/100%)

real    0m0.160s
user    0m0.173s
sys     0m0.253s
]# time ./qemu-img convert -p -n /tmp/test_forward.qcow2 null-co://
     (100.00/100%)

real    0m0.165s
user    0m0.186s
sys     0m0.216s
]# time ./qemu-img convert -p -n /tmp/test_forward.qcow2 null-co://
     (100.00/100%)

real    0m0.168s
user    0m0.160s
sys     0m0.253s

3. my patch

]# sync; echo 3 > /proc/sys/vm/drop_caches
]# time ./qemu-img convert -p -n /tmp/test.qcow2 null-co://
     (100.00/100%)

real    0m33.915s
user    0m0.299s
sys     0m0.848s
]# time ./qemu-img convert -p -n /tmp/test.qcow2 null-co://
     (100.00/100%)

real    0m0.158s
user    0m0.158s
sys     0m0.267s
]# time ./qemu-img convert -p -n /tmp/test.qcow2 null-co://
     (100.00/100%)

real    0m0.167s
user    0m0.159s
sys     0m0.246s
]# time ./qemu-img convert -p -n /tmp/test.qcow2 null-co://
     (100.00/100%)

real    0m0.169s
user    0m0.175s
sys     0m0.236s
]# sync; echo 3 > /proc/sys/vm/drop_caches
]# time ./qemu-img convert -p -n /tmp/test_forward.qcow2 null-co://
     (100.00/100%)

real    0m32.980s
user    0m0.377s
sys     0m0.924s
]# time ./qemu-img convert -p -n /tmp/test_forward.qcow2 null-co://
     (100.00/100%)

real    0m0.152s
user    0m0.149s
sys     0m0.265s
]# time ./qemu-img convert -p -n /tmp/test_forward.qcow2 null-co://
     (100.00/100%)

real    0m0.162s
user    0m0.174s
sys     0m0.214s
]# time ./qemu-img convert -p -n /tmp/test_forward.qcow2 null-co://
     (100.00/100%)

real    0m0.164s
user    0m0.166s
sys     0m0.239s


4. master retest

]# sync; echo 3 > /proc/sys/vm/drop_caches
]# time ./qemu-img convert -p -n /tmp/test.qcow2 null-co://
     (100.00/100%)

real    0m32.418s
user    0m0.347s
sys     0m0.856s
]# time ./qemu-img convert -p -n /tmp/test.qcow2 null-co://
     (100.00/100%)

real    0m0.236s
user    0m0.153s
sys     0m0.316s
]# time ./qemu-img convert -p -n /tmp/test.qcow2 null-co://
     (100.00/100%)

real    0m0.232s
user    0m0.173s
sys     0m0.263s
]# time ./qemu-img convert -p -n /tmp/test.qcow2 null-co://
     (100.00/100%)

real    0m0.244s
user    0m0.160s
sys     0m0.287s
]# sync; echo 3 > /proc/sys/vm/drop_caches
]# time ./qemu-img convert -p -n /tmp/test_forward.qcow2 null-co://
     (100.00/100%)

real    0m32.846s
user    0m0.385s
sys     0m0.987s
]# time ./qemu-img convert -p -n /tmp/test_forward.qcow2 null-co://
     (100.00/100%)

real    0m0.243s
user    0m0.173s
sys     0m0.299s
]# time ./qemu-img convert -p -n /tmp/test_forward.qcow2 null-co://
     (100.00/100%)

real    0m0.243s
user    0m0.164s
sys     0m0.274s
]# time ./qemu-img convert -p -n /tmp/test_forward.qcow2 null-co://
     (100.00/100%)

real    0m0.243s
user    0m0.159s
sys     0m0.282s


= My patch =

diff --git a/block/io.c b/block/io.c
index bd9d688f8b..45e4a52ded 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2186,34 +2186,6 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
          }
      }

-    if (want_zero && local_file && local_file != bs &&
-        (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) &&
-        (ret & BDRV_BLOCK_OFFSET_VALID)) {
-        int64_t file_pnum;
-        int ret2;
-
-        ret2 = bdrv_co_block_status(local_file, want_zero, local_map,
-                                    *pnum, &file_pnum, NULL, NULL);
-        if (ret2 >= 0) {
-            /* Ignore errors.  This is just providing extra information, it
-             * is useful but not necessary.
-             */
-            if (ret2 & BDRV_BLOCK_EOF &&
-                (!file_pnum || ret2 & BDRV_BLOCK_ZERO)) {
-                /*
-                 * It is valid for the format block driver to read
-                 * beyond the end of the underlying file's current
-                 * size; such areas read as zero.
-                 */
-                ret |= BDRV_BLOCK_ZERO;
-            } else {
-                /* Limit request to the range reported by the protocol driver */
-                *pnum = file_pnum;
-                ret |= (ret2 & BDRV_BLOCK_ZERO);
-            }
-        }
-    }
-
  out:
      bdrv_dec_in_flight(bs);
      if (ret >= 0 && offset + *pnum == total_size) {



--
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Kevin Wolf 8 weeks ago
Am 23.01.2019 um 12:53 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 22.01.2019 21:57, Kevin Wolf wrote:
> > Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> 11.01.2019 13:41, Kevin Wolf wrote:
> >>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >>>> drv_co_block_status digs bs->file for additional, more accurate search
> >>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
> >>>>
> >>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> >>>> knows, where are holes and where is data. But every block_status
> >>>> request calls lseek additionally. Assume a big disk, full of
> >>>> data, in any iterative copying block job (or img convert) we'll call
> >>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
> >>>> iterate through all metadata up to the end of file. It's obviously
> >>>> ineffective behavior. And for many scenarios we don't need this lseek
> >>>> at all.
> >>>>
> >>>> So, let's "5daa74a6ebc" by default, leaving an option to return
> >>>> previous behavior, which is needed for scenarios with preallocated
> >>>> images.
> >>>>
> >>>> Add iotest illustrating new option semantics.
> >>>>
> >>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >>>
> >>> I still think that an option isn't a good solution and we should try use
> >>> some heuristics instead.
> >>
> >> Do you think that heuristics would be better than fair cache for lseek results?
> > 
> > I just played a bit with this (qemu-img convert only), and how much
> > caching lseek() results helps depends completely on the image. As it
> > happened, my test image was the worst case where caching didn't buy us
> > much. Obviously, I can just as easily construct an image where it makes
> > a huge difference. I think that most real-world images should be able to
> > take good advantage of it, though, and it doesn't hurt, so maybe that's
> > a first thing that we can do in any case. It might not be the complete
> > solution, though.
> > 
> > Let me explain my test images: The case where all of this actually
> > matters for qemu-img convert is fragmented qcow2 images. If your image
> > isn't fragmented, we don't do lseek() a lot anyway because a single
> > bdrv_block_status() call already gives you the information for the whole
> > image. So I constructed a fragmented image, by writing to it backwards:
> > 
> > ./qemu-img create -f qcow2 /tmp/test.qcow2 1G
> > for i in $(seq 16384 -1 0); do
> >      echo "write $((i * 65536)) 64k"
> > done | ./qemu-io /tmp/test.qcow2
> > 
> > It's not really surprising that caching the lseek() results doesn't help
> > much there as we're moving backwards and lseek() only returns results
> > about the things after the current position, not before the current
> > position. So this is probably the worst case.
> > 
> > So I constructed a second image, which is fragmented, too, but starts at
> > the beginning of the image file:
> > 
> > ./qemu-img create -f qcow2 /tmp/test_forward.qcow2 1G
> > for i in $(seq 0 2 16384); do
> >      echo "write $((i * 65536)) 64k"
> > done | ./qemu-io /tmp/test_forward.qcow2
> > for i in $(seq 1 2 16384); do
> >      echo "write $((i * 65536)) 64k"
> > done | ./qemu-io /tmp/test_forward.qcow2
> > 
> > Here caching makes a huge difference:
> > 
> >      time ./qemu-img convert -p -n $IMG null-co://
> > 
> >                          uncached        cached
> >      test.qcow2             ~145s         ~70s
> >      test_forward.qcow2     ~110s        ~0.2s
> 
> Unsure about your results, at least 0.2s means, that we benefit from
> cached read, not lseek.

Yes, all reads are from the kernel page cache, this is on tmpfs.

I chose tmpfs for two reasons: I wanted to get expensive I/O out of the
way so that the lseek() performance is even visible; and tmpfs was
reported to perform especially bad for SEEK_DATA/HOLE (which my results
confirm). So yes, this setup really makes the lseek() calls stand out
much more than in the common case (which makes sense when you want to
fix the overhead introduced by them).

> = my results =
> 
> in short:
> 
> uncached read:
> +--------------+--------+------+------+--------+
> |              | master | you  |  me  | master |
> +--------------+--------+------+------+--------+
> | test         |   30.4 | 32.6 | 33.9 |   32.4 |
> | test_forward |   28.3 | 33.5 | 32.9 |   32.8 |
> +--------------+--------+------+------+--------+
> 
> ('you' is your patch, 'me' is my simple patch, see below)
> 
> (I retested master, as first test run seemed noticeable faster than patched)
> So, No significant difference may be noticed (if ignore first run:). Or I need a lot more test runs,
> to calculate the average.
> However, I don't expect any difference here, I'm afraid that lseek() time becomes noticeable in
> comparison with read() for a lot larger disks.
> Also, problems of lseek are mostly related to lseek bugs, now seems that my kernel is not buggy..
> What kernel and fs do you use to get such a significant difference between cached/uncached lseek?

$ uname -r
4.20.3-200.fc29.x86_64

> On the other hand, results with cached read are more interesting:
> 
> +--------------+--------+------+------+--------+
> |              | master | you  |  me  | master |
> +--------------+--------+------+------+--------+
> | test         |   0.24 | 0.20 | 0.16 |   0.24 |
> | test_forward |   0.24 | 0.16 | 0.16 |   0.24 |
> +--------------+--------+------+------+--------+
> 
> And they show, that my patch wins. So no lseeks = no problems.

This is not surprising. You included the worst case for my patch, but
not the worst case for your patch. If I include that, too, I get:

+-----------------+------------+------------+--------------+
|                 |   master   |  no lseek  | cached lseek |
+-----------------+------------+------------+--------------+
| on tmpfs:       |            |            |              |
|   test          |        115 |       0.16 |           58 |
|   test_forward  |        115 |       0.16 |         0.16 |
|   test_prealloc |       0.07 |       0.65 |         0.07 |
+-----------------+------------+------------+--------------+
| on xfs          |            |            |              |
|   test          |       0.20 |       0.16 |         0.18 |
|   test_forward  |       0.20 |       0.16 |         0.16 |
|   test_prealloc |       0.07 |       0.73 |         0.07 |
+-----------------+------------+------------+--------------+
| on xfs, -T none |            |            |              |
|   test          |       1.33 |       1.31 |         1.33 |
|   test_forward  |       1.00 |       1.00 |         1.00 |
|   test_prealloc |       0.08 |       0.65 |         0.08 |
+-----------------+------------+------------+--------------+

test_prealloc is an empty image with metadata preallocation:

$ ./qemu-img create -f qcow2 -o preallocation=metadata ~/tmp/test_prealloc.qcow2 1G

So your patch is a clear winner for test on tmpfs, but also a clear
loser for test_prealloc on all backends. Otherwise, I don't see much of
a significant difference.

> Moreover, keeping in mind, that we in Virtuozzo don't have scenarios,
> where we'll benefit from lseeks, and after two slow-lseek bugs, it's
> definitely safer for me to just keep one out-of-tree patch, than
> rely on lseek-cache + lseek, which both are not needed in our case.

With which filesystems did you have those slow lseek bugs?

I mean, I can understand your point that you're not using preallocation
anyway, but for upstream, that's obviously not going to work. My
priority is getting something that improves the situation for everyone.
If we then still need a user-visible option to squeeze out the last
millisecond, we can talk about it. But the option can't be the primary
solution for everyone.

> Of-course, lseek-cache is a good thing, and your patch seems reasonable,
> but I'll go with my patch anyway, and I proposed an option to bring
> such behavior to upstream, if someone wants it.

Okay, if you think it makes sense, I can post it as a proper patch.

Kevin

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 8 weeks ago
23.01.2019 19:33, Kevin Wolf wrote:
> Am 23.01.2019 um 12:53 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 22.01.2019 21:57, Kevin Wolf wrote:
>>> Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> 11.01.2019 13:41, Kevin Wolf wrote:
>>>>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>> drv_co_block_status digs bs->file for additional, more accurate search
>>>>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
>>>>>>
>>>>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>>>>>> knows, where are holes and where is data. But every block_status
>>>>>> request calls lseek additionally. Assume a big disk, full of
>>>>>> data, in any iterative copying block job (or img convert) we'll call
>>>>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>>>>>> iterate through all metadata up to the end of file. It's obviously
>>>>>> ineffective behavior. And for many scenarios we don't need this lseek
>>>>>> at all.
>>>>>>
>>>>>> So, let's "5daa74a6ebc" by default, leaving an option to return
>>>>>> previous behavior, which is needed for scenarios with preallocated
>>>>>> images.
>>>>>>
>>>>>> Add iotest illustrating new option semantics.
>>>>>>
>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>
>>>>> I still think that an option isn't a good solution and we should try use
>>>>> some heuristics instead.
>>>>
>>>> Do you think that heuristics would be better than fair cache for lseek results?
>>>
>>> I just played a bit with this (qemu-img convert only), and how much
>>> caching lseek() results helps depends completely on the image. As it
>>> happened, my test image was the worst case where caching didn't buy us
>>> much. Obviously, I can just as easily construct an image where it makes
>>> a huge difference. I think that most real-world images should be able to
>>> take good advantage of it, though, and it doesn't hurt, so maybe that's
>>> a first thing that we can do in any case. It might not be the complete
>>> solution, though.
>>>
>>> Let me explain my test images: The case where all of this actually
>>> matters for qemu-img convert is fragmented qcow2 images. If your image
>>> isn't fragmented, we don't do lseek() a lot anyway because a single
>>> bdrv_block_status() call already gives you the information for the whole
>>> image. So I constructed a fragmented image, by writing to it backwards:
>>>
>>> ./qemu-img create -f qcow2 /tmp/test.qcow2 1G
>>> for i in $(seq 16384 -1 0); do
>>>       echo "write $((i * 65536)) 64k"
>>> done | ./qemu-io /tmp/test.qcow2
>>>
>>> It's not really surprising that caching the lseek() results doesn't help
>>> much there as we're moving backwards and lseek() only returns results
>>> about the things after the current position, not before the current
>>> position. So this is probably the worst case.
>>>
>>> So I constructed a second image, which is fragmented, too, but starts at
>>> the beginning of the image file:
>>>
>>> ./qemu-img create -f qcow2 /tmp/test_forward.qcow2 1G
>>> for i in $(seq 0 2 16384); do
>>>       echo "write $((i * 65536)) 64k"
>>> done | ./qemu-io /tmp/test_forward.qcow2
>>> for i in $(seq 1 2 16384); do
>>>       echo "write $((i * 65536)) 64k"
>>> done | ./qemu-io /tmp/test_forward.qcow2
>>>
>>> Here caching makes a huge difference:
>>>
>>>       time ./qemu-img convert -p -n $IMG null-co://
>>>
>>>                           uncached        cached
>>>       test.qcow2             ~145s         ~70s
>>>       test_forward.qcow2     ~110s        ~0.2s
>>
>> Unsure about your results, at least 0.2s means, that we benefit from
>> cached read, not lseek.
> 
> Yes, all reads are from the kernel page cache, this is on tmpfs.
> 
> I chose tmpfs for two reasons: I wanted to get expensive I/O out of the
> way so that the lseek() performance is even visible; and tmpfs was
> reported to perform especially bad for SEEK_DATA/HOLE (which my results
> confirm). So yes, this setup really makes the lseek() calls stand out
> much more than in the common case (which makes sense when you want to
> fix the overhead introduced by them).

Ok, missed this. On the other hand tmpfs is not a real production case..

> 
>> = my results =
>>
>> in short:
>>
>> uncached read:
>> +--------------+--------+------+------+--------+
>> |              | master | you  |  me  | master |
>> +--------------+--------+------+------+--------+
>> | test         |   30.4 | 32.6 | 33.9 |   32.4 |
>> | test_forward |   28.3 | 33.5 | 32.9 |   32.8 |
>> +--------------+--------+------+------+--------+
>>
>> ('you' is your patch, 'me' is my simple patch, see below)
>>
>> (I retested master, as first test run seemed noticeable faster than patched)
>> So, No significant difference may be noticed (if ignore first run:). Or I need a lot more test runs,
>> to calculate the average.
>> However, I don't expect any difference here, I'm afraid that lseek() time becomes noticeable in
>> comparison with read() for a lot larger disks.
>> Also, problems of lseek are mostly related to lseek bugs, now seems that my kernel is not buggy..
>> What kernel and fs do you use to get such a significant difference between cached/uncached lseek?
> 
> $ uname -r
> 4.20.3-200.fc29.x86_64

I just didn't guess or missed that you run tests on tmps.

> 
>> On the other hand, results with cached read are more interesting:
>>
>> +--------------+--------+------+------+--------+
>> |              | master | you  |  me  | master |
>> +--------------+--------+------+------+--------+
>> | test         |   0.24 | 0.20 | 0.16 |   0.24 |
>> | test_forward |   0.24 | 0.16 | 0.16 |   0.24 |
>> +--------------+--------+------+------+--------+
>>
>> And they show, that my patch wins. So no lseeks = no problems.
> 
> This is not surprising. You included the worst case for my patch, but
> not the worst case for your patch. If I include that, too, I get:
> 
> +-----------------+------------+------------+--------------+
> |                 |   master   |  no lseek  | cached lseek |
> +-----------------+------------+------------+--------------+
> | on tmpfs:       |            |            |              |
> |   test          |        115 |       0.16 |           58 |
> |   test_forward  |        115 |       0.16 |         0.16 |
> |   test_prealloc |       0.07 |       0.65 |         0.07 |
> +-----------------+------------+------------+--------------+
> | on xfs          |            |            |              |
> |   test          |       0.20 |       0.16 |         0.18 |
> |   test_forward  |       0.20 |       0.16 |         0.16 |
> |   test_prealloc |       0.07 |       0.73 |         0.07 |
> +-----------------+------------+------------+--------------+
> | on xfs, -T none |            |            |              |
> |   test          |       1.33 |       1.31 |         1.33 |
> |   test_forward  |       1.00 |       1.00 |         1.00 |
> |   test_prealloc |       0.08 |       0.65 |         0.08 |
> +-----------------+------------+------------+--------------+
> 
> test_prealloc is an empty image with metadata preallocation:
> 
> $ ./qemu-img create -f qcow2 -o preallocation=metadata ~/tmp/test_prealloc.qcow2 1G
> 
> So your patch is a clear winner for test on tmpfs, but also a clear
> loser for test_prealloc on all backends. Otherwise, I don't see much of
> a significant difference.

Yes, of course.

> 
>> Moreover, keeping in mind, that we in Virtuozzo don't have scenarios,
>> where we'll benefit from lseeks, and after two slow-lseek bugs, it's
>> definitely safer for me to just keep one out-of-tree patch, than
>> rely on lseek-cache + lseek, which both are not needed in our case.
> 
> With which filesystems did you have those slow lseek bugs?

ext4. For example, the first bug was fixed by this:
https://www.spinics.net/lists/linux-ext4/msg51346.html

> 
> I mean, I can understand your point that you're not using preallocation
> anyway, but for upstream, that's obviously not going to work. My
> priority is getting something that improves the situation for everyone.
> If we then still need a user-visible option to squeeze out the last
> millisecond, we can talk about it. But the option can't be the primary
> solution for everyone.
> 
>> Of-course, lseek-cache is a good thing, and your patch seems reasonable,
>> but I'll go with my patch anyway, and I proposed an option to bring
>> such behavior to upstream, if someone wants it.
> 
> Okay, if you think it makes sense, I can post it as a proper patch.
> 
> Kevin
> 


-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Kevin Wolf 8 weeks ago
Am 24.01.2019 um 15:36 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 23.01.2019 19:33, Kevin Wolf wrote:
> > Am 23.01.2019 um 12:53 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> 22.01.2019 21:57, Kevin Wolf wrote:
> >>> Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >>>> 11.01.2019 13:41, Kevin Wolf wrote:
> >>>>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >>>>>> drv_co_block_status digs bs->file for additional, more accurate search
> >>>>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
> >>>>>>
> >>>>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> >>>>>> knows, where are holes and where is data. But every block_status
> >>>>>> request calls lseek additionally. Assume a big disk, full of
> >>>>>> data, in any iterative copying block job (or img convert) we'll call
> >>>>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
> >>>>>> iterate through all metadata up to the end of file. It's obviously
> >>>>>> ineffective behavior. And for many scenarios we don't need this lseek
> >>>>>> at all.
> >>>>>>
> >>>>>> So, let's "5daa74a6ebc" by default, leaving an option to return
> >>>>>> previous behavior, which is needed for scenarios with preallocated
> >>>>>> images.
> >>>>>>
> >>>>>> Add iotest illustrating new option semantics.
> >>>>>>
> >>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >>>>>
> >>>>> I still think that an option isn't a good solution and we should try use
> >>>>> some heuristics instead.
> >>>>
> >>>> Do you think that heuristics would be better than fair cache for lseek results?
> >>>
> >>> I just played a bit with this (qemu-img convert only), and how much
> >>> caching lseek() results helps depends completely on the image. As it
> >>> happened, my test image was the worst case where caching didn't buy us
> >>> much. Obviously, I can just as easily construct an image where it makes
> >>> a huge difference. I think that most real-world images should be able to
> >>> take good advantage of it, though, and it doesn't hurt, so maybe that's
> >>> a first thing that we can do in any case. It might not be the complete
> >>> solution, though.
> >>>
> >>> Let me explain my test images: The case where all of this actually
> >>> matters for qemu-img convert is fragmented qcow2 images. If your image
> >>> isn't fragmented, we don't do lseek() a lot anyway because a single
> >>> bdrv_block_status() call already gives you the information for the whole
> >>> image. So I constructed a fragmented image, by writing to it backwards:
> >>>
> >>> ./qemu-img create -f qcow2 /tmp/test.qcow2 1G
> >>> for i in $(seq 16384 -1 0); do
> >>>       echo "write $((i * 65536)) 64k"
> >>> done | ./qemu-io /tmp/test.qcow2
> >>>
> >>> It's not really surprising that caching the lseek() results doesn't help
> >>> much there as we're moving backwards and lseek() only returns results
> >>> about the things after the current position, not before the current
> >>> position. So this is probably the worst case.
> >>>
> >>> So I constructed a second image, which is fragmented, too, but starts at
> >>> the beginning of the image file:
> >>>
> >>> ./qemu-img create -f qcow2 /tmp/test_forward.qcow2 1G
> >>> for i in $(seq 0 2 16384); do
> >>>       echo "write $((i * 65536)) 64k"
> >>> done | ./qemu-io /tmp/test_forward.qcow2
> >>> for i in $(seq 1 2 16384); do
> >>>       echo "write $((i * 65536)) 64k"
> >>> done | ./qemu-io /tmp/test_forward.qcow2
> >>>
> >>> Here caching makes a huge difference:
> >>>
> >>>       time ./qemu-img convert -p -n $IMG null-co://
> >>>
> >>>                           uncached        cached
> >>>       test.qcow2             ~145s         ~70s
> >>>       test_forward.qcow2     ~110s        ~0.2s
> >>
> >> Unsure about your results, at least 0.2s means, that we benefit from
> >> cached read, not lseek.
> > 
> > Yes, all reads are from the kernel page cache, this is on tmpfs.
> > 
> > I chose tmpfs for two reasons: I wanted to get expensive I/O out of the
> > way so that the lseek() performance is even visible; and tmpfs was
> > reported to perform especially bad for SEEK_DATA/HOLE (which my results
> > confirm). So yes, this setup really makes the lseek() calls stand out
> > much more than in the common case (which makes sense when you want to
> > fix the overhead introduced by them).
> 
> Ok, missed this. On the other hand tmpfs is not a real production case..

Yes, I fully agree. But it was a simple case where I knew there is a
problem.

I also have a bug report on XFS with an image that is very fragmented on
the file system level. But I don't know how to produce such a file to
run benchmarks on it.

Kevin

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 8 weeks ago
24.01.2019 18:31, Kevin Wolf wrote:
> Am 24.01.2019 um 15:36 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 23.01.2019 19:33, Kevin Wolf wrote:
>>> Am 23.01.2019 um 12:53 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> 22.01.2019 21:57, Kevin Wolf wrote:
>>>>> Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>> 11.01.2019 13:41, Kevin Wolf wrote:
>>>>>>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>>>> drv_co_block_status digs bs->file for additional, more accurate search
>>>>>>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
>>>>>>>>
>>>>>>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>>>>>>>> knows, where are holes and where is data. But every block_status
>>>>>>>> request calls lseek additionally. Assume a big disk, full of
>>>>>>>> data, in any iterative copying block job (or img convert) we'll call
>>>>>>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>>>>>>>> iterate through all metadata up to the end of file. It's obviously
>>>>>>>> ineffective behavior. And for many scenarios we don't need this lseek
>>>>>>>> at all.
>>>>>>>>
>>>>>>>> So, let's "5daa74a6ebc" by default, leaving an option to return
>>>>>>>> previous behavior, which is needed for scenarios with preallocated
>>>>>>>> images.
>>>>>>>>
>>>>>>>> Add iotest illustrating new option semantics.
>>>>>>>>
>>>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>>>
>>>>>>> I still think that an option isn't a good solution and we should try use
>>>>>>> some heuristics instead.
>>>>>>
>>>>>> Do you think that heuristics would be better than fair cache for lseek results?
>>>>>
>>>>> I just played a bit with this (qemu-img convert only), and how much
>>>>> caching lseek() results helps depends completely on the image. As it
>>>>> happened, my test image was the worst case where caching didn't buy us
>>>>> much. Obviously, I can just as easily construct an image where it makes
>>>>> a huge difference. I think that most real-world images should be able to
>>>>> take good advantage of it, though, and it doesn't hurt, so maybe that's
>>>>> a first thing that we can do in any case. It might not be the complete
>>>>> solution, though.
>>>>>
>>>>> Let me explain my test images: The case where all of this actually
>>>>> matters for qemu-img convert is fragmented qcow2 images. If your image
>>>>> isn't fragmented, we don't do lseek() a lot anyway because a single
>>>>> bdrv_block_status() call already gives you the information for the whole
>>>>> image. So I constructed a fragmented image, by writing to it backwards:
>>>>>
>>>>> ./qemu-img create -f qcow2 /tmp/test.qcow2 1G
>>>>> for i in $(seq 16384 -1 0); do
>>>>>        echo "write $((i * 65536)) 64k"
>>>>> done | ./qemu-io /tmp/test.qcow2
>>>>>
>>>>> It's not really surprising that caching the lseek() results doesn't help
>>>>> much there as we're moving backwards and lseek() only returns results
>>>>> about the things after the current position, not before the current
>>>>> position. So this is probably the worst case.
>>>>>
>>>>> So I constructed a second image, which is fragmented, too, but starts at
>>>>> the beginning of the image file:
>>>>>
>>>>> ./qemu-img create -f qcow2 /tmp/test_forward.qcow2 1G
>>>>> for i in $(seq 0 2 16384); do
>>>>>        echo "write $((i * 65536)) 64k"
>>>>> done | ./qemu-io /tmp/test_forward.qcow2
>>>>> for i in $(seq 1 2 16384); do
>>>>>        echo "write $((i * 65536)) 64k"
>>>>> done | ./qemu-io /tmp/test_forward.qcow2
>>>>>
>>>>> Here caching makes a huge difference:
>>>>>
>>>>>        time ./qemu-img convert -p -n $IMG null-co://
>>>>>
>>>>>                            uncached        cached
>>>>>        test.qcow2             ~145s         ~70s
>>>>>        test_forward.qcow2     ~110s        ~0.2s
>>>>
>>>> Unsure about your results, at least 0.2s means, that we benefit from
>>>> cached read, not lseek.
>>>
>>> Yes, all reads are from the kernel page cache, this is on tmpfs.
>>>
>>> I chose tmpfs for two reasons: I wanted to get expensive I/O out of the
>>> way so that the lseek() performance is even visible; and tmpfs was
>>> reported to perform especially bad for SEEK_DATA/HOLE (which my results
>>> confirm). So yes, this setup really makes the lseek() calls stand out
>>> much more than in the common case (which makes sense when you want to
>>> fix the overhead introduced by them).
>>
>> Ok, missed this. On the other hand tmpfs is not a real production case..
> 
> Yes, I fully agree. But it was a simple case where I knew there is a
> problem.
> 
> I also have a bug report on XFS with an image that is very fragmented on
> the file system level. But I don't know how to produce such a file to
> run benchmarks on it.
> 

I've experimented around very fragmented images, but didn't find lseek problems,
may be because I don't have enough big hdd to test. Here is a program I used to
produce fragmented file. The idea is fallocate all space, and then reallocate it.
Usage is as simple as

./frag /data/test 500G

Attached code may be ugly, I didn't prepare it for publishing(

-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Kevin Wolf 10 weeks ago
Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 11.01.2019 13:41, Kevin Wolf wrote:
> > Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> drv_co_block_status digs bs->file for additional, more accurate search
> >> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
> >>
> >> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> >> knows, where are holes and where is data. But every block_status
> >> request calls lseek additionally. Assume a big disk, full of
> >> data, in any iterative copying block job (or img convert) we'll call
> >> lseek(HOLE) on every iteration, and each of these lseeks will have to
> >> iterate through all metadata up to the end of file. It's obviously
> >> ineffective behavior. And for many scenarios we don't need this lseek
> >> at all.
> >>
> >> So, let's "5daa74a6ebc" by default, leaving an option to return
> >> previous behavior, which is needed for scenarios with preallocated
> >> images.
> >>
> >> Add iotest illustrating new option semantics.
> >>
> >> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > 
> > I still think that an option isn't a good solution and we should try use
> > some heuristics instead.
> 
> Do you think that heuristics would be better than fair cache for lseek
> results?

I don't think lseek() results are cachable, at least for images that can
share BLK_PERM_WRITE between processes.

> I don't see good heuristics, neither want to implement lseek optimizations.
> In our cases we don't need lseek under qcow2 at all, and it's obviously better
> just don't lseek in these cases.

I suggested one: Pass large contiguous allocated ranges to the protocol
driver, but just assume that the allocation status is correct in the
format layer if they are small.

> Moreover, as I understand, the cases when we need lseek are those when qcow2 thinks
> that the cluster is data, but actually it's a hole on fs... Isn't it better to
> support this thing in qcow2? A kind of allocated-zeroes cluster? We have enough
> reserved bits in cluster descriptor to add this feature. And than we don't need
> this kind of cheating when user "better knows" where are holes than format layer
> which owns the data..

The point of metadata preallocation is that you don't have to update the
metadata when you write to the image. Having to clear a bit in the L2
entry would negate that.

Maybe another option would be to use a compatible feature flag in qcow2
that is set during creation with preallocation=metadata, and we would
only go to the lower layer if the flag is set. The downside side is that
this would only be effective for new images, so copying old preallocated
images would be slowed down considerably.

Kevin

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 10 weeks ago
11.01.2019 15:21, Kevin Wolf wrote:
> Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 11.01.2019 13:41, Kevin Wolf wrote:
>>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> drv_co_block_status digs bs->file for additional, more accurate search
>>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
>>>>
>>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>>>> knows, where are holes and where is data. But every block_status
>>>> request calls lseek additionally. Assume a big disk, full of
>>>> data, in any iterative copying block job (or img convert) we'll call
>>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>>>> iterate through all metadata up to the end of file. It's obviously
>>>> ineffective behavior. And for many scenarios we don't need this lseek
>>>> at all.
>>>>
>>>> So, let's "5daa74a6ebc" by default, leaving an option to return
>>>> previous behavior, which is needed for scenarios with preallocated
>>>> images.
>>>>
>>>> Add iotest illustrating new option semantics.
>>>>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>
>>> I still think that an option isn't a good solution and we should try use
>>> some heuristics instead.
>>
>> Do you think that heuristics would be better than fair cache for lseek
>> results?
> 
> I don't think lseek() results are cachable, at least for images that can
> share BLK_PERM_WRITE between processes.
> 
>> I don't see good heuristics, neither want to implement lseek optimizations.
>> In our cases we don't need lseek under qcow2 at all, and it's obviously better
>> just don't lseek in these cases.
> 
> I suggested one: Pass large contiguous allocated ranges to the protocol
> driver, but just assume that the allocation status is correct in the
> format layer if they are small.

So, for fully allocated image, we will call lseek always?

> 
>> Moreover, as I understand, the cases when we need lseek are those when qcow2 thinks
>> that the cluster is data, but actually it's a hole on fs... Isn't it better to
>> support this thing in qcow2? A kind of allocated-zeroes cluster? We have enough
>> reserved bits in cluster descriptor to add this feature. And than we don't need
>> this kind of cheating when user "better knows" where are holes than format layer
>> which owns the data..
> 
> The point of metadata preallocation is that you don't have to update the
> metadata when you write to the image. Having to clear a bit in the L2
> entry would negate that.
> 
> Maybe another option would be to use a compatible feature flag in qcow2
> that is set during creation with preallocation=metadata, and we would
> only go to the lower layer if the flag is set. The downside side is that
> this would only be effective for new images, so copying old preallocated
> images would be slowed down considerably.
> 
> Kevin
> 


-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Kevin Wolf 10 weeks ago
Am 11.01.2019 um 13:59 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 11.01.2019 15:21, Kevin Wolf wrote:
> > Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> 11.01.2019 13:41, Kevin Wolf wrote:
> >>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >>>> drv_co_block_status digs bs->file for additional, more accurate search
> >>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
> >>>>
> >>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> >>>> knows, where are holes and where is data. But every block_status
> >>>> request calls lseek additionally. Assume a big disk, full of
> >>>> data, in any iterative copying block job (or img convert) we'll call
> >>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
> >>>> iterate through all metadata up to the end of file. It's obviously
> >>>> ineffective behavior. And for many scenarios we don't need this lseek
> >>>> at all.
> >>>>
> >>>> So, let's "5daa74a6ebc" by default, leaving an option to return
> >>>> previous behavior, which is needed for scenarios with preallocated
> >>>> images.
> >>>>
> >>>> Add iotest illustrating new option semantics.
> >>>>
> >>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >>>
> >>> I still think that an option isn't a good solution and we should try use
> >>> some heuristics instead.
> >>
> >> Do you think that heuristics would be better than fair cache for lseek
> >> results?
> > 
> > I don't think lseek() results are cachable, at least for images that can
> > share BLK_PERM_WRITE between processes.
> > 
> >> I don't see good heuristics, neither want to implement lseek optimizations.
> >> In our cases we don't need lseek under qcow2 at all, and it's obviously better
> >> just don't lseek in these cases.
> > 
> > I suggested one: Pass large contiguous allocated ranges to the protocol
> > driver, but just assume that the allocation status is correct in the
> > format layer if they are small.
> 
> So, for fully allocated image, we will call lseek always?

If they are fully contiguous, yes. But that's a single lseek() call per
image then instead of an lseek() for every 64k, so not a big problem.

In the more realistic case, you will still call lseek() occasionally
because you will have some fragmentation, but the fragments can be
rather large. But it should still significantly reduce them compared to
now because you skip it for those parts with small contiguous
allocations where lseek() would be called a lot today.

Kevin

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 10 weeks ago
11.01.2019 16:15, Kevin Wolf wrote:
> Am 11.01.2019 um 13:59 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 11.01.2019 15:21, Kevin Wolf wrote:
>>> Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> 11.01.2019 13:41, Kevin Wolf wrote:
>>>>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>> drv_co_block_status digs bs->file for additional, more accurate search
>>>>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
>>>>>>
>>>>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>>>>>> knows, where are holes and where is data. But every block_status
>>>>>> request calls lseek additionally. Assume a big disk, full of
>>>>>> data, in any iterative copying block job (or img convert) we'll call
>>>>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>>>>>> iterate through all metadata up to the end of file. It's obviously
>>>>>> ineffective behavior. And for many scenarios we don't need this lseek
>>>>>> at all.
>>>>>>
>>>>>> So, let's "5daa74a6ebc" by default, leaving an option to return
>>>>>> previous behavior, which is needed for scenarios with preallocated
>>>>>> images.
>>>>>>
>>>>>> Add iotest illustrating new option semantics.
>>>>>>
>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>
>>>>> I still think that an option isn't a good solution and we should try use
>>>>> some heuristics instead.
>>>>
>>>> Do you think that heuristics would be better than fair cache for lseek
>>>> results?
>>>
>>> I don't think lseek() results are cachable, at least for images that can
>>> share BLK_PERM_WRITE between processes.
>>>
>>>> I don't see good heuristics, neither want to implement lseek optimizations.
>>>> In our cases we don't need lseek under qcow2 at all, and it's obviously better
>>>> just don't lseek in these cases.
>>>
>>> I suggested one: Pass large contiguous allocated ranges to the protocol
>>> driver, but just assume that the allocation status is correct in the
>>> format layer if they are small.
>>
>> So, for fully allocated image, we will call lseek always?
> 
> If they are fully contiguous, yes. But that's a single lseek() call per
> image then instead of an lseek() for every 64k, so not a big problem.

lseek is called on each mirror iteration, why one per image?

> 
> In the more realistic case, you will still call lseek() occasionally
> because you will have some fragmentation, but the fragments can be
> rather large. But it should still significantly reduce them compared to
> now because you skip it for those parts with small contiguous
> allocations where lseek() would be called a lot today.
> 
> Kevin
> 

Ok, you propose not to call lseek for small enough data regions reported by
format layer. And for images which are less fragmented, this helps less or don't
help.

Why do you think it is better?

For not preallocated images it is worse, as it covers less cases. So, for our
scenarios it is worse.

The only case, when heuristic works better, is when user have preallocated image,
but don't know about new option, which returns old behavior. We are not interested
in this case and can't go this way, as it doesn't guarantee, that some customer will
not come again with lseek-related problems.

Don't you like what Eric propose, about binding behavior switch to existing
detect-zeroes option?

Or, we can add an opposite option, to enable new behavior, keeping the old one by
default. So, all stays as is, and who need uses new option. Heuristic may be
implemented then too.

-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Eric Blake 10 weeks ago
On 1/11/19 10:09 AM, Vladimir Sementsov-Ogievskiy wrote:

>>>> I suggested one: Pass large contiguous allocated ranges to the protocol
>>>> driver, but just assume that the allocation status is correct in the
>>>> format layer if they are small.
>>>
>>> So, for fully allocated image, we will call lseek always?
>>
>> If they are fully contiguous, yes. But that's a single lseek() call per
>> image then instead of an lseek() for every 64k, so not a big problem.
> 
> lseek is called on each mirror iteration, why one per image?

If the image has no holes, then lseek(0, SEEK_HOLE) will return EOF, and
then you know that there are no holes, and you don't need to make any
further lseek() calls.  Hence, once per image.  A fully-allocated file
that has areas that read as known zeroes can be determined by fiemap
(but not by lseek, which can only detect unallocated holes) - but we
already know that while fiemap has more information, it also has more
problems (you cannot use it safely without sync, but sync makes it too
slow to use), so that is a non-starter.

>>
>> In the more realistic case, you will still call lseek() occasionally
>> because you will have some fragmentation, but the fragments can be
>> rather large. But it should still significantly reduce them compared to
>> now because you skip it for those parts with small contiguous
>> allocations where lseek() would be called a lot today.
>>
>> Kevin
>>
> 
> Ok, you propose not to call lseek for small enough data regions reported by
> format layer. And for images which are less fragmented, this helps less or don't
> help.

Indeed - pick some threshold (maybe 16 clusters); if block status of the
format layer returns something smaller than the threshold, don't bother
refining the answer further by doing block status of the protocol layer
(if the caller is iterating over an image 1 cluster at a time, then the
threshold will never be tripped and thus you'll never do an lseek); but
where the block status of the format layer is large, we are reading the
file in larger chunks so we end up with fewer lseeks in the long run
anyways.

> 
> Why do you think it is better?
> 
> For not preallocated images it is worse, as it covers less cases. So, for our
> scenarios it is worse.

Anywhere that you skip calling lseek(), you end up missing out on
opportunities to optimize had you instead been able to learn from lseek
that you were on a hole after all.  So it becomes a balancing question:
how much time is spent on probing for whether an optimization is even
possible, vs. how much time is spent if the probe succeeded and you can
then optimize.  For a fully-allocated image, all of the time spent
probing is wasted (you never find a hole, so every probe was wasted).
So it is indeed a tradeoff when picking your heuristics, of trying to
balance how likely the probe will justify the time spent on the probe.

> The only case, when heuristic works better, is when user have preallocated image,
> but don't know about new option, which returns old behavior. We are not interested
> in this case and can't go this way, as it doesn't guarantee, that some customer will
> not come again with lseek-related problems.
> 
> Don't you like what Eric propose, about binding behavior switch to existing
> detect-zeroes option?
> 
> Or, we can add an opposite option, to enable new behavior, keeping the old one by
> default. So, all stays as is, and who need uses new option. Heuristic may be
> implemented then too.
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 10 weeks ago

On 11.01.2019 17:04, Eric Blake wrote:
> On 1/11/19 10:09 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>>>> I suggested one: Pass large contiguous allocated ranges to the protocol
>>>>> driver, but just assume that the allocation status is correct in the
>>>>> format layer if they are small.
>>>>
>>>> So, for fully allocated image, we will call lseek always?
>>>
>>> If they are fully contiguous, yes. But that's a single lseek() call per
>>> image then instead of an lseek() for every 64k, so not a big problem.
>>
>> lseek is called on each mirror iteration, why one per image?
> 
> If the image has no holes, then lseek(0, SEEK_HOLE) will return EOF, and
> then you know that there are no holes, and you don't need to make any
> further lseek() calls.  Hence, once per image.  A fully-allocated file
> that has areas that read as known zeroes can be determined by fiemap
> (but not by lseek, which can only detect unallocated holes) - but we
> already know that while fiemap has more information, it also has more
> problems (you cannot use it safely without sync, but sync makes it too
> slow to use), so that is a non-starter.
> 
>>>
>>> In the more realistic case, you will still call lseek() occasionally
>>> because you will have some fragmentation, but the fragments can be
>>> rather large. But it should still significantly reduce them compared to
>>> now because you skip it for those parts with small contiguous
>>> allocations where lseek() would be called a lot today.
>>>
>>> Kevin
>>>
>>
>> Ok, you propose not to call lseek for small enough data regions reported by
>> format layer. And for images which are less fragmented, this helps less or don't
>> help.
> 
> Indeed - pick some threshold (maybe 16 clusters); if block status of the
> format layer returns something smaller than the threshold, don't bother
> refining the answer further by doing block status of the protocol layer
> (if the caller is iterating over an image 1 cluster at a time, then the
> threshold will never be tripped and thus you'll never do an lseek); but
> where the block status of the format layer is large, we are reading the
> file in larger chunks so we end up with fewer lseeks in the long run
> anyways.
> 
>>
>> Why do you think it is better?
>>
>> For not preallocated images it is worse, as it covers less cases. So, for our
>> scenarios it is worse.
> 
> Anywhere that you skip calling lseek(), you end up missing out on
> opportunities to optimize had you instead been able to learn from lseek
> that you were on a hole after all.

What are the cases when we'll benefit from lseek except preallocated images?

   So it becomes a balancing question:
> how much time is spent on probing for whether an optimization is even
> possible, vs. how much time is spent if the probe succeeded and you can
> then optimize.  For a fully-allocated image, all of the time spent
> probing is wasted (you never find a hole, so every probe was wasted).
> So it is indeed a tradeoff when picking your heuristics, of trying to
> balance how likely the probe will justify the time spent on the probe.
> 
>> The only case, when heuristic works better, is when user have preallocated image,
>> but don't know about new option, which returns old behavior. We are not interested
>> in this case and can't go this way, as it doesn't guarantee, that some customer will
>> not come again with lseek-related problems.
>>
>> Don't you like what Eric propose, about binding behavior switch to existing
>> detect-zeroes option?
>>
>> Or, we can add an opposite option, to enable new behavior, keeping the old one by
>> default. So, all stays as is, and who need uses new option. Heuristic may be
>> implemented then too.
>>
> 

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Eric Blake 10 weeks ago
On 1/10/19 7:20 AM, Vladimir Sementsov-Ogievskiy wrote:
> drv_co_block_status digs bs->file for additional, more accurate search
> for hole inside region, reported as DATA by bs since 5daa74a6ebc.

s/region, reported/regions reported/

> 
> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> knows, where are holes and where is data. But every block_status

Not quite true. qcow2 knows where some holes are, but does not know if
there are even more holes hiding in the sections marked data (such as
because of how the disk was pre-allocated).

> request calls lseek additionally. Assume a big disk, full of
> data, in any iterative copying block job (or img convert) we'll call
> lseek(HOLE) on every iteration, and each of these lseeks will have to
> iterate through all metadata up to the end of file. It's obviously
> ineffective behavior. And for many scenarios we don't need this lseek
> at all.

How much performance can we buy back without any knobs at all, if we
just taught posix-file.c to cache lseek() results?  That is, when
visiting a file sequentially, if lseek(fd, 0, SEEK_HOLE) returns EOF on
our first block status query, then all subsequent block status queries
fall within what we know to be data, and we can skip the lseek() calls.

> 
> So, let's "5daa74a6ebc" by default, leaving an option to return

s/let's/let's undo/

> previous behavior, which is needed for scenarios with preallocated
> images.
> 
> Add iotest illustrating new option semantics.

And none of the existing iotests fail due to changed output?  Does that
mean our testsuite does not have good coverage of pre-allocation modes
where the zero probe mattered?

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
> 

> +++ b/qapi/block-core.json
> @@ -3673,6 +3673,8 @@
>  #                 (default: off)
>  # @force-share:   force share all permission on added nodes.
>  #                 Requires read-only=true. (Since 2.10)
> +# @probe-zeroes:  Probe zeroes on protocol layer if format reports data
> +#                 (default: false) (since 4.0)

Why do we need a new bool?  Can we instead...

>  #
>  # Remaining options are determined by the block driver.
>  #
> @@ -3686,7 +3688,8 @@
>              '*read-only': 'bool',
>              '*auto-read-only': 'bool',
>              '*force-share': 'bool',
> -            '*detect-zeroes': 'BlockdevDetectZeroesOptions' },
> +            '*detect-zeroes': 'BlockdevDetectZeroesOptions',
> +            '*probe-zeroes': 'bool' },

...add another enum value to 'detect-zeroes', since after all, what we
are really doing is fine-tuning what level of zero probing we are
willing to do?


> +++ b/qemu-options.hx
> @@ -615,6 +615,7 @@ DEF("blockdev", HAS_ARG, QEMU_OPTION_blockdev,
>      "-blockdev [driver=]driver[,node-name=N][,discard=ignore|unmap]\n"
>      "          [,cache.direct=on|off][,cache.no-flush=on|off]\n"
>      "          [,read-only=on|off][,detect-zeroes=on|off|unmap]\n"
> +    "          [,probe-zeroes=on|off]\n"
>      "          [,driver specific parameters...]\n"
>      "                configure a block backend\n", QEMU_ARCH_ALL)
>  STEXI
> @@ -670,6 +671,8 @@ discard requests.
>  conversion of plain zero writes by the OS to driver specific optimized
>  zero write commands. You may even choose "unmap" if @var{discard} is set
>  to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
> +@item probe-zeroes=@var{probe-zeroes}
> +Probe zeroes on protocol layer if format reports data. Default is "off".

Again, fitting this into detect-zeroes seems better than inventing a new
knob.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 10 weeks ago
10.01.2019 23:51, Eric Blake wrote:
> On 1/10/19 7:20 AM, Vladimir Sementsov-Ogievskiy wrote:
>> drv_co_block_status digs bs->file for additional, more accurate search
>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
> 
> s/region, reported/regions reported/
> 
>>
>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>> knows, where are holes and where is data. But every block_status
> 
> Not quite true. qcow2 knows where some holes are, but does not know if
> there are even more holes hiding in the sections marked data (such as
> because of how the disk was pre-allocated).
> 
>> request calls lseek additionally. Assume a big disk, full of
>> data, in any iterative copying block job (or img convert) we'll call
>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>> iterate through all metadata up to the end of file. It's obviously
>> ineffective behavior. And for many scenarios we don't need this lseek
>> at all.
> 
> How much performance can we buy back without any knobs at all, if we
> just taught posix-file.c to cache lseek() results?  That is, when
> visiting a file sequentially, if lseek(fd, 0, SEEK_HOLE) returns EOF on
> our first block status query, then all subsequent block status queries
> fall within what we know to be data, and we can skip the lseek() calls.

EOF is bad mark I think. We may have a small hole not far from EOF, which
will lead to the same performance, but not EOF returned.

I think, proper implementation of lseek cache should work, but it is more
difficult than just disable lseek. And in scenarios without preallocation
we don't need lseek, so again better is disable it.

And I don't sure that qemu is a proper place for lseek optimizations.

And another way may be to select cases when fiemap is safe and use it.

Again, for scenarios when we don't need nor lseek, nor fiemap, neither
cache for them the best option is not use them.

> 
>>
>> So, let's "5daa74a6ebc" by default, leaving an option to return
> 
> s/let's/let's undo/
> 
>> previous behavior, which is needed for scenarios with preallocated
>> images.
>>
>> Add iotest illustrating new option semantics.
> 
> And none of the existing iotests fail due to changed output?  Does that
> mean our testsuite does not have good coverage of pre-allocation modes
> where the zero probe mattered?

To be honest, I didn't run all the tests, only several.

Do it now, and found that, yes, at least 102 broken. Will fix it with the following
version. Strange, do patchew run tests on patches in list now?

> 
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>
> 
>> +++ b/qapi/block-core.json
>> @@ -3673,6 +3673,8 @@
>>   #                 (default: off)
>>   # @force-share:   force share all permission on added nodes.
>>   #                 Requires read-only=true. (Since 2.10)
>> +# @probe-zeroes:  Probe zeroes on protocol layer if format reports data
>> +#                 (default: false) (since 4.0)
> 
> Why do we need a new bool?  Can we instead...
> 
>>   #
>>   # Remaining options are determined by the block driver.
>>   #
>> @@ -3686,7 +3688,8 @@
>>               '*read-only': 'bool',
>>               '*auto-read-only': 'bool',
>>               '*force-share': 'bool',
>> -            '*detect-zeroes': 'BlockdevDetectZeroesOptions' },
>> +            '*detect-zeroes': 'BlockdevDetectZeroesOptions',
>> +            '*probe-zeroes': 'bool' },
> 
> ...add another enum value to 'detect-zeroes', since after all, what we
> are really doing is fine-tuning what level of zero probing we are
> willing to do?

Should we? Or I just bind old behavior to detect-zeroes=on? And then, if needed,
we'll add intermediate modes.

> 
> 
>> +++ b/qemu-options.hx
>> @@ -615,6 +615,7 @@ DEF("blockdev", HAS_ARG, QEMU_OPTION_blockdev,
>>       "-blockdev [driver=]driver[,node-name=N][,discard=ignore|unmap]\n"
>>       "          [,cache.direct=on|off][,cache.no-flush=on|off]\n"
>>       "          [,read-only=on|off][,detect-zeroes=on|off|unmap]\n"
>> +    "          [,probe-zeroes=on|off]\n"
>>       "          [,driver specific parameters...]\n"
>>       "                configure a block backend\n", QEMU_ARCH_ALL)
>>   STEXI
>> @@ -670,6 +671,8 @@ discard requests.
>>   conversion of plain zero writes by the OS to driver specific optimized
>>   zero write commands. You may even choose "unmap" if @var{discard} is set
>>   to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
>> +@item probe-zeroes=@var{probe-zeroes}
>> +Probe zeroes on protocol layer if format reports data. Default is "off".
> 
> Again, fitting this into detect-zeroes seems better than inventing a new
> knob.
> 

No objections. But description of detect-zeroes are more about writes, should
we change them somehow?

# Describes the operation mode for the automatic conversion of plain
# zero writes by the OS to driver specific optimized zero write commands.

...

# @detect-zeroes: detect and optimize zero writes (Since 2.1)



-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Eric Blake 10 weeks ago
On 1/11/19 1:54 AM, Vladimir Sementsov-Ogievskiy wrote:

>>
>> How much performance can we buy back without any knobs at all, if we
>> just taught posix-file.c to cache lseek() results?  That is, when
>> visiting a file sequentially, if lseek(fd, 0, SEEK_HOLE) returns EOF on
>> our first block status query, then all subsequent block status queries
>> fall within what we know to be data, and we can skip the lseek() calls.
> 
> EOF is bad mark I think. We may have a small hole not far from EOF, which
> will lead to the same performance, but not EOF returned.

EOF was just an example for a file that has no holes. But even for a
file with holes, caching should help.  That is, if I have a raw file with:

1M data | 1M hole | 1M data | EOF

but a qcow2 file that was created in an out-of-order fashion, so that
all clusters are discontiguous, then our current code may do something
like the following sequence:

block_status 0 - maps to 64k of host file at offset 0x20000
 - lseek(0x20000) detects that we are in a data portion, and the next
hole begins at 0x100000
 - but because the next cluster is not at 0x30000, we throw away the
information for 0x30000 to 0x100000
block_status 64k - maps to 64k of host file at offset 0x40000
 - lseek(0x40000) detects that we are in a data portion, and the next
hole begins at 0x100000
 - but because the next cluster is not at 0x50000, we throw away the
information for 0x50000 to 0x100000
block status 128k - maps to 64k of host file at offset 0x30000
 - lseek(0x30000) detects that we are in a data portion, and the next
hole begins at 0x100000
...

Even a dumb most-recent use cache will speed this up: both the second
and third queries above can be avoided because we know that both 0x40000
and 0x30000 the second query at 0x40000 can be skipped (0x40000 is
between our most recent lseek at 0x20000 and hole at 0x10000)

Make the cache slightly larger, or use a bitmap with 2 bits per cluster
(tracking unknown, known-data, known-hole), with proper flushing of the
cache as we write to the image, or whatever, and we should automatically
get some performance improvements by using fewer lseek() anywhere that
we remember what previous lseek() already told us, with no knobs needed.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 10 weeks ago
11.01.2019 19:02, Eric Blake wrote:
> On 1/11/19 1:54 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>>
>>> How much performance can we buy back without any knobs at all, if we
>>> just taught posix-file.c to cache lseek() results?  That is, when
>>> visiting a file sequentially, if lseek(fd, 0, SEEK_HOLE) returns EOF on
>>> our first block status query, then all subsequent block status queries
>>> fall within what we know to be data, and we can skip the lseek() calls.
>>
>> EOF is bad mark I think. We may have a small hole not far from EOF, which
>> will lead to the same performance, but not EOF returned.
> 
> EOF was just an example for a file that has no holes. But even for a
> file with holes, caching should help.  That is, if I have a raw file with:
> 
> 1M data | 1M hole | 1M data | EOF
> 
> but a qcow2 file that was created in an out-of-order fashion, so that
> all clusters are discontiguous, then our current code may do something
> like the following sequence:
> 
> block_status 0 - maps to 64k of host file at offset 0x20000
>   - lseek(0x20000) detects that we are in a data portion, and the next
> hole begins at 0x100000
>   - but because the next cluster is not at 0x30000, we throw away the
> information for 0x30000 to 0x100000
> block_status 64k - maps to 64k of host file at offset 0x40000
>   - lseek(0x40000) detects that we are in a data portion, and the next
> hole begins at 0x100000
>   - but because the next cluster is not at 0x50000, we throw away the
> information for 0x50000 to 0x100000
> block status 128k - maps to 64k of host file at offset 0x30000
>   - lseek(0x30000) detects that we are in a data portion, and the next
> hole begins at 0x100000
> ...
> 
> Even a dumb most-recent use cache will speed this up: both the second
> and third queries above can be avoided because we know that both 0x40000
> and 0x30000 the second query at 0x40000 can be skipped (0x40000 is
> between our most recent lseek at 0x20000 and hole at 0x10000)

Is it correct just use results from previous iterations? In mirror source
is active and may change.

> 
> Make the cache slightly larger, or use a bitmap with 2 bits per cluster
> (tracking unknown, known-data, known-hole), with proper flushing of the
> cache as we write to the image, or whatever, and we should automatically
> get some performance improvements by using fewer lseek() anywhere that
> we remember what previous lseek() already told us, with no knobs needed.
> 

So the cache should consider all writes and discards. And it is obviously
more difficult to implement it, than just don't call this lseek. And I
don't understand, why cache + lseek is better for the case when we don't
need nor the lseek neither the cache. Is this all to not add an option?
Also Kevin objects to caching lseek in parallel sub-thread.

-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Eric Blake 10 weeks ago
On 1/11/19 10:22 AM, Vladimir Sementsov-Ogievskiy wrote:

>> Even a dumb most-recent use cache will speed this up: both the second
>> and third queries above can be avoided because we know that both 0x40000
>> and 0x30000 the second query at 0x40000 can be skipped (0x40000 is
>> between our most recent lseek at 0x20000 and hole at 0x10000)
> 
> Is it correct just use results from previous iterations? In mirror source
> is active and may change.

If you keep a cache, you have to keep the cache up-to-date. Any writes
to an area that is covered by the known-hole cache have to flush the
cache, so that the next block status no longer sees a known-hole and
ends up doing another lseek.  Or, if the cache has enough state to track
unknown/known-hole/known-data, then writes update the cache to be
known-data, and future block status can skip the lseek by using the
results of the cache.

> 
>>
>> Make the cache slightly larger, or use a bitmap with 2 bits per cluster
>> (tracking unknown, known-data, known-hole), with proper flushing of the
>> cache as we write to the image, or whatever, and we should automatically
>> get some performance improvements by using fewer lseek() anywhere that
>> we remember what previous lseek() already told us, with no knobs needed.
>>
> 
> So the cache should consider all writes and discards. And it is obviously
> more difficult to implement it, than just don't call this lseek. And I
> don't understand, why cache + lseek is better for the case when we don't
> need nor the lseek neither the cache. Is this all to not add an option?
> Also Kevin objects to caching lseek in parallel sub-thread.

Keven objected to caching anything if the image has multiple writers,
where an outside process could change the file allocation in between our
reads. But multiple writers is rare - in fact, our image locking for
qcow2 formats tries to prevent multiple writers.  Having multiple
threads within one process writing is fine, as long as they properly
coordinate writes to the lseek cache so that readers never see a stale
claim of a hole - although a stale claim of data is safe.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Eric Blake 10 weeks ago
On 1/11/19 10:02 AM, Eric Blake wrote:

> Even a dumb most-recent use cache will speed this up: both the second
> and third queries above can be avoided because we know that both 0x40000
> and 0x30000 the second query at 0x40000 can be skipped (0x40000 is
> between our most recent lseek at 0x20000 and hole at 0x10000)
> 
> Make the cache slightly larger, or use a bitmap with 2 bits per cluster
> (tracking unknown, known-data, known-hole), with proper flushing of the
> cache as we write to the image, or whatever, and we should automatically
> get some performance improvements by using fewer lseek() anywhere that
> we remember what previous lseek() already told us, with no knobs needed.

In fact, block/iscsi.c has such a cache; look at iscsi_allocmap_update()
and friends; maybe we should generalize that code in order to reuse the
same concepts in file-posix.c.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 10 weeks ago
11.01.2019 10:54, Vladimir Sementsov-Ogievskiy wrote:
> 10.01.2019 23:51, Eric Blake wrote:
>> On 1/10/19 7:20 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> drv_co_block_status digs bs->file for additional, more accurate search
>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
>>
>> s/region, reported/regions reported/
>>
>>>
>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>>> knows, where are holes and where is data. But every block_status
>>
>> Not quite true. qcow2 knows where some holes are, but does not know if
>> there are even more holes hiding in the sections marked data (such as
>> because of how the disk was pre-allocated).
>>
>>> request calls lseek additionally. Assume a big disk, full of
>>> data, in any iterative copying block job (or img convert) we'll call
>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>>> iterate through all metadata up to the end of file. It's obviously
>>> ineffective behavior. And for many scenarios we don't need this lseek
>>> at all.
>>
>> How much performance can we buy back without any knobs at all, if we
>> just taught posix-file.c to cache lseek() results?  That is, when
>> visiting a file sequentially, if lseek(fd, 0, SEEK_HOLE) returns EOF on
>> our first block status query, then all subsequent block status queries
>> fall within what we know to be data, and we can skip the lseek() calls.
> 
> EOF is bad mark I think. We may have a small hole not far from EOF, which
> will lead to the same performance, but not EOF returned.
> 
> I think, proper implementation of lseek cache should work, but it is more
> difficult than just disable lseek. And in scenarios without preallocation
> we don't need lseek, so again better is disable it.
> 
> And I don't sure that qemu is a proper place for lseek optimizations.
> 
> And another way may be to select cases when fiemap is safe and use it.
> 
> Again, for scenarios when we don't need nor lseek, nor fiemap, neither
> cache for them the best option is not use them.
> 
>>
>>>
>>> So, let's "5daa74a6ebc" by default, leaving an option to return
>>
>> s/let's/let's undo/
>>
>>> previous behavior, which is needed for scenarios with preallocated
>>> images.
>>>
>>> Add iotest illustrating new option semantics.
>>
>> And none of the existing iotests fail due to changed output?  Does that
>> mean our testsuite does not have good coverage of pre-allocation modes
>> where the zero probe mattered?
> 
> To be honest, I didn't run all the tests, only several.
> 
> Do it now, and found that, yes, at least 102 broken. Will fix it with the following
> version. Strange, do patchew run tests on patches in list now?
> 
>>
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>
>>
>>> +++ b/qapi/block-core.json
>>> @@ -3673,6 +3673,8 @@
>>>   #                 (default: off)
>>>   # @force-share:   force share all permission on added nodes.
>>>   #                 Requires read-only=true. (Since 2.10)
>>> +# @probe-zeroes:  Probe zeroes on protocol layer if format reports data
>>> +#                 (default: false) (since 4.0)
>>
>> Why do we need a new bool?  Can we instead...
>>
>>>   #
>>>   # Remaining options are determined by the block driver.
>>>   #
>>> @@ -3686,7 +3688,8 @@
>>>               '*read-only': 'bool',
>>>               '*auto-read-only': 'bool',
>>>               '*force-share': 'bool',
>>> -            '*detect-zeroes': 'BlockdevDetectZeroesOptions' },
>>> +            '*detect-zeroes': 'BlockdevDetectZeroesOptions',
>>> +            '*probe-zeroes': 'bool' },
>>
>> ...add another enum value to 'detect-zeroes', since after all, what we
>> are really doing is fine-tuning what level of zero probing we are
>> willing to do?
> 
> Should we? Or I just bind old behavior to detect-zeroes=on? And then, if needed,
> we'll add intermediate modes.
> 
>>
>>
>>> +++ b/qemu-options.hx
>>> @@ -615,6 +615,7 @@ DEF("blockdev", HAS_ARG, QEMU_OPTION_blockdev,
>>>       "-blockdev [driver=]driver[,node-name=N][,discard=ignore|unmap]\n"
>>>       "          [,cache.direct=on|off][,cache.no-flush=on|off]\n"
>>>       "          [,read-only=on|off][,detect-zeroes=on|off|unmap]\n"
>>> +    "          [,probe-zeroes=on|off]\n"
>>>       "          [,driver specific parameters...]\n"
>>>       "                configure a block backend\n", QEMU_ARCH_ALL)
>>>   STEXI
>>> @@ -670,6 +671,8 @@ discard requests.
>>>   conversion of plain zero writes by the OS to driver specific optimized
>>>   zero write commands. You may even choose "unmap" if @var{discard} is set
>>>   to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
>>> +@item probe-zeroes=@var{probe-zeroes}
>>> +Probe zeroes on protocol layer if format reports data. Default is "off".
>>
>> Again, fitting this into detect-zeroes seems better than inventing a new
>> knob.
>>
> 
> No objections. But description of detect-zeroes are more about writes, should
> we change them somehow?
> 
> # Describes the operation mode for the automatic conversion of plain
> # zero writes by the OS to driver specific optimized zero write commands.
> 
> ...
> 
> # @detect-zeroes: detect and optimize zero writes (Since 2.1)
> 
> 
> 

Hm, just note, that detect-zeroes was about writes and should be set on target, and
the new thing is about block-status and should be set on source, and this thing should
be described in spec as well.


-- 
Best regards,
Vladimir