[Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Vladimir Sementsov-Ogievskiy posted 1 patch 1 week ago
Failed in applying to current master (apply log)
qapi/block-core.json       |  5 +++-
include/block/block.h      |  1 +
include/block/block_int.h  |  1 +
block.c                    |  9 ++++++
block/io.c                 |  2 +-
qemu-options.hx            |  4 +++
tests/qemu-iotests/237     | 61 ++++++++++++++++++++++++++++++++++++++
tests/qemu-iotests/237.out | 12 ++++++++
tests/qemu-iotests/group   |  1 +
9 files changed, 94 insertions(+), 2 deletions(-)
create mode 100755 tests/qemu-iotests/237
create mode 100644 tests/qemu-iotests/237.out

[Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 1 week ago
drv_co_block_status digs bs->file for additional, more accurate search
for hole inside region, reported as DATA by bs since 5daa74a6ebc.

This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
knows, where are holes and where is data. But every block_status
request calls lseek additionally. Assume a big disk, full of
data, in any iterative copying block job (or img convert) we'll call
lseek(HOLE) on every iteration, and each of these lseeks will have to
iterate through all metadata up to the end of file. It's obviously
ineffective behavior. And for many scenarios we don't need this lseek
at all.

So, let's "5daa74a6ebc" by default, leaving an option to return
previous behavior, which is needed for scenarios with preallocated
images.

Add iotest illustrating new option semantics.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---

Hi all!

I don't call it v2, as it do completely another thing, but here is a
link to my previous try with discussion:
[PATCH] file-posix: add rough-block-status parameter
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg01322.html)

And the previous discussion on the topic is
"do not lseek in qcow2 block_status" thread
(https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg05749.html)

 qapi/block-core.json       |  5 +++-
 include/block/block.h      |  1 +
 include/block/block_int.h  |  1 +
 block.c                    |  9 ++++++
 block/io.c                 |  2 +-
 qemu-options.hx            |  4 +++
 tests/qemu-iotests/237     | 61 ++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/237.out | 12 ++++++++
 tests/qemu-iotests/group   |  1 +
 9 files changed, 94 insertions(+), 2 deletions(-)
 create mode 100755 tests/qemu-iotests/237
 create mode 100644 tests/qemu-iotests/237.out

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 762000f31f..9eed36f3f4 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3673,6 +3673,8 @@
 #                 (default: off)
 # @force-share:   force share all permission on added nodes.
 #                 Requires read-only=true. (Since 2.10)
+# @probe-zeroes:  Probe zeroes on protocol layer if format reports data
+#                 (default: false) (since 4.0)
 #
 # Remaining options are determined by the block driver.
 #
@@ -3686,7 +3688,8 @@
             '*read-only': 'bool',
             '*auto-read-only': 'bool',
             '*force-share': 'bool',
-            '*detect-zeroes': 'BlockdevDetectZeroesOptions' },
+            '*detect-zeroes': 'BlockdevDetectZeroesOptions',
+            '*probe-zeroes': 'bool' },
   'discriminator': 'driver',
   'data': {
       'blkdebug':   'BlockdevOptionsBlkdebug',
diff --git a/include/block/block.h b/include/block/block.h
index f70a843b72..1d7f4a6296 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -129,6 +129,7 @@ typedef struct HDGeometry {
 #define BDRV_OPT_AUTO_READ_ONLY "auto-read-only"
 #define BDRV_OPT_DISCARD        "discard"
 #define BDRV_OPT_FORCE_SHARE    "force-share"
+#define BDRV_OPT_PROBE_ZEROES   "probe-zeroes"
 
 
 #define BDRV_SECTOR_BITS   9
diff --git a/include/block/block_int.h b/include/block/block_int.h
index f605622216..68bddf06b8 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -753,6 +753,7 @@ struct BlockDriverState {
     QDict *options;
     QDict *explicit_options;
     BlockdevDetectZeroesOptions detect_zeroes;
+    bool probe_zeroes;
 
     /* The error object in use for blocking operations on backing_hd */
     Error *backing_blocker;
diff --git a/block.c b/block.c
index 4f5ff2cc12..23a17435c9 100644
--- a/block.c
+++ b/block.c
@@ -939,6 +939,7 @@ static void bdrv_inherited_options(int *child_flags, QDict *child_options,
     qdict_copy_default(child_options, parent_options, BDRV_OPT_CACHE_DIRECT);
     qdict_copy_default(child_options, parent_options, BDRV_OPT_CACHE_NO_FLUSH);
     qdict_copy_default(child_options, parent_options, BDRV_OPT_FORCE_SHARE);
+    qdict_copy_default(child_options, parent_options, BDRV_OPT_PROBE_ZEROES);
 
     /* Inherit the read-only option from the parent if it's not set */
     qdict_copy_default(child_options, parent_options, BDRV_OPT_READ_ONLY);
@@ -1063,6 +1064,7 @@ static void bdrv_backing_options(int *child_flags, QDict *child_options,
     qdict_copy_default(child_options, parent_options, BDRV_OPT_CACHE_DIRECT);
     qdict_copy_default(child_options, parent_options, BDRV_OPT_CACHE_NO_FLUSH);
     qdict_copy_default(child_options, parent_options, BDRV_OPT_FORCE_SHARE);
+    qdict_copy_default(child_options, parent_options, BDRV_OPT_PROBE_ZEROES);
 
     /* backing files always opened read-only */
     qdict_set_default_str(child_options, BDRV_OPT_READ_ONLY, "on");
@@ -1364,6 +1366,12 @@ QemuOptsList bdrv_runtime_opts = {
             .type = QEMU_OPT_BOOL,
             .help = "always accept other writers (default: off)",
         },
+        {
+            .name = BDRV_OPT_PROBE_ZEROES,
+            .type = QEMU_OPT_BOOL,
+            .help = "probe zeroes on protocol layer if format reports data "
+                    "(default: false)",
+        },
         { /* end of list */ }
     },
 };
@@ -1403,6 +1411,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockBackend *file,
     assert(drv != NULL);
 
     bs->force_share = qemu_opt_get_bool(opts, BDRV_OPT_FORCE_SHARE, false);
+    bs->probe_zeroes = qemu_opt_get_bool(opts, BDRV_OPT_PROBE_ZEROES, false);
 
     if (bs->force_share && (bs->open_flags & BDRV_O_RDWR)) {
         error_setg(errp,
diff --git a/block/io.c b/block/io.c
index bd9d688f8b..0c0cb3a17f 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2186,7 +2186,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
         }
     }
 
-    if (want_zero && local_file && local_file != bs &&
+    if (want_zero && bs->probe_zeroes && local_file && local_file != bs &&
         (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) &&
         (ret & BDRV_BLOCK_OFFSET_VALID)) {
         int64_t file_pnum;
diff --git a/qemu-options.hx b/qemu-options.hx
index d4f3564b78..7b8dfcfaf6 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -615,6 +615,7 @@ DEF("blockdev", HAS_ARG, QEMU_OPTION_blockdev,
     "-blockdev [driver=]driver[,node-name=N][,discard=ignore|unmap]\n"
     "          [,cache.direct=on|off][,cache.no-flush=on|off]\n"
     "          [,read-only=on|off][,detect-zeroes=on|off|unmap]\n"
+    "          [,probe-zeroes=on|off]\n"
     "          [,driver specific parameters...]\n"
     "                configure a block backend\n", QEMU_ARCH_ALL)
 STEXI
@@ -670,6 +671,8 @@ discard requests.
 conversion of plain zero writes by the OS to driver specific optimized
 zero write commands. You may even choose "unmap" if @var{discard} is set
 to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
+@item probe-zeroes=@var{probe-zeroes}
+Probe zeroes on protocol layer if format reports data. Default is "off".
 @end table
 
 @item Driver-specific options for @code{file}
@@ -793,6 +796,7 @@ DEF("drive", HAS_ARG, QEMU_OPTION_drive,
     "       [,werror=ignore|stop|report|enospc][,id=name][,aio=threads|native]\n"
     "       [,readonly=on|off][,copy-on-read=on|off]\n"
     "       [,discard=ignore|unmap][,detect-zeroes=on|off|unmap]\n"
+    "       [,probe-zeroes=on|off]\n"
     "       [[,bps=b]|[[,bps_rd=r][,bps_wr=w]]]\n"
     "       [[,iops=i]|[[,iops_rd=r][,iops_wr=w]]]\n"
     "       [[,bps_max=bm]|[[,bps_rd_max=rm][,bps_wr_max=wm]]]\n"
diff --git a/tests/qemu-iotests/237 b/tests/qemu-iotests/237
new file mode 100755
index 0000000000..9e66230c13
--- /dev/null
+++ b/tests/qemu-iotests/237
@@ -0,0 +1,61 @@
+#!/bin/bash
+#
+# Test probe-zeroes drive option
+#
+# Copyright (c) 2019 Virtuozzo International GmbH. All rights reserved.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=vsementsov@virtuozzo.com
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+status=1	# failure is the default!
+
+_cleanup()
+{
+	_cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+
+IMGOPTS='preallocation=metadata' _make_test_img 1M
+
+echo "== default map =="
+$QEMU_IMG map --output=json "$TEST_IMG"
+
+echo
+echo "== map with probe-zeroes=off =="
+$QEMU_IMG map --output=json --image-opts \
+        driver="$IMGFMT",file.filename="$TEST_IMG",probe-zeroes=off
+
+echo
+echo "== map with probe-zeroes=on =="
+$QEMU_IMG map --output=json --image-opts \
+        driver="$IMGFMT",file.filename="$TEST_IMG",probe-zeroes=on
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/237.out b/tests/qemu-iotests/237.out
new file mode 100644
index 0000000000..8747757da7
--- /dev/null
+++ b/tests/qemu-iotests/237.out
@@ -0,0 +1,12 @@
+QA output created by 237
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=metadata
+== default map ==
+[{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 327680}]
+
+== map with probe-zeroes=off ==
+[{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 327680}]
+
+== map with probe-zeroes=on ==
+[{ "start": 0, "length": 1044480, "depth": 0, "zero": true, "data": true, "offset": 327680},
+{ "start": 1044480, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": 1372160}]
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 61a6d98ebd..3064a04911 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -233,3 +233,4 @@
 233 auto quick
 234 auto quick migration
 235 auto quick
+237 auto quick
-- 
2.18.0


Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Kevin Wolf 6 days ago
Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
> drv_co_block_status digs bs->file for additional, more accurate search
> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
> 
> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> knows, where are holes and where is data. But every block_status
> request calls lseek additionally. Assume a big disk, full of
> data, in any iterative copying block job (or img convert) we'll call
> lseek(HOLE) on every iteration, and each of these lseeks will have to
> iterate through all metadata up to the end of file. It's obviously
> ineffective behavior. And for many scenarios we don't need this lseek
> at all.
> 
> So, let's "5daa74a6ebc" by default, leaving an option to return
> previous behavior, which is needed for scenarios with preallocated
> images.
> 
> Add iotest illustrating new option semantics.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

I still think that an option isn't a good solution and we should try use
some heuristics instead.

Kevin

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 6 days ago
11.01.2019 13:41, Kevin Wolf wrote:
> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> drv_co_block_status digs bs->file for additional, more accurate search
>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
>>
>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>> knows, where are holes and where is data. But every block_status
>> request calls lseek additionally. Assume a big disk, full of
>> data, in any iterative copying block job (or img convert) we'll call
>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>> iterate through all metadata up to the end of file. It's obviously
>> ineffective behavior. And for many scenarios we don't need this lseek
>> at all.
>>
>> So, let's "5daa74a6ebc" by default, leaving an option to return
>> previous behavior, which is needed for scenarios with preallocated
>> images.
>>
>> Add iotest illustrating new option semantics.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> 
> I still think that an option isn't a good solution and we should try use
> some heuristics instead.

Do you think that heuristics would be better than fair cache for lseek results?

I don't see good heuristics, neither want to implement lseek optimizations.
In our cases we don't need lseek under qcow2 at all, and it's obviously better
just don't lseek in these cases.

Moreover, as I understand, the cases when we need lseek are those when qcow2 thinks
that the cluster is data, but actually it's a hole on fs... Isn't it better to
support this thing in qcow2? A kind of allocated-zeroes cluster? We have enough
reserved bits in cluster descriptor to add this feature. And than we don't need
this kind of cheating when user "better knows" where are holes than format layer
which owns the data..

-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Kevin Wolf 6 days ago
Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 11.01.2019 13:41, Kevin Wolf wrote:
> > Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> drv_co_block_status digs bs->file for additional, more accurate search
> >> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
> >>
> >> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> >> knows, where are holes and where is data. But every block_status
> >> request calls lseek additionally. Assume a big disk, full of
> >> data, in any iterative copying block job (or img convert) we'll call
> >> lseek(HOLE) on every iteration, and each of these lseeks will have to
> >> iterate through all metadata up to the end of file. It's obviously
> >> ineffective behavior. And for many scenarios we don't need this lseek
> >> at all.
> >>
> >> So, let's "5daa74a6ebc" by default, leaving an option to return
> >> previous behavior, which is needed for scenarios with preallocated
> >> images.
> >>
> >> Add iotest illustrating new option semantics.
> >>
> >> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > 
> > I still think that an option isn't a good solution and we should try use
> > some heuristics instead.
> 
> Do you think that heuristics would be better than fair cache for lseek
> results?

I don't think lseek() results are cachable, at least for images that can
share BLK_PERM_WRITE between processes.

> I don't see good heuristics, neither want to implement lseek optimizations.
> In our cases we don't need lseek under qcow2 at all, and it's obviously better
> just don't lseek in these cases.

I suggested one: Pass large contiguous allocated ranges to the protocol
driver, but just assume that the allocation status is correct in the
format layer if they are small.

> Moreover, as I understand, the cases when we need lseek are those when qcow2 thinks
> that the cluster is data, but actually it's a hole on fs... Isn't it better to
> support this thing in qcow2? A kind of allocated-zeroes cluster? We have enough
> reserved bits in cluster descriptor to add this feature. And than we don't need
> this kind of cheating when user "better knows" where are holes than format layer
> which owns the data..

The point of metadata preallocation is that you don't have to update the
metadata when you write to the image. Having to clear a bit in the L2
entry would negate that.

Maybe another option would be to use a compatible feature flag in qcow2
that is set during creation with preallocation=metadata, and we would
only go to the lower layer if the flag is set. The downside side is that
this would only be effective for new images, so copying old preallocated
images would be slowed down considerably.

Kevin

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 6 days ago
11.01.2019 15:21, Kevin Wolf wrote:
> Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 11.01.2019 13:41, Kevin Wolf wrote:
>>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> drv_co_block_status digs bs->file for additional, more accurate search
>>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
>>>>
>>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>>>> knows, where are holes and where is data. But every block_status
>>>> request calls lseek additionally. Assume a big disk, full of
>>>> data, in any iterative copying block job (or img convert) we'll call
>>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>>>> iterate through all metadata up to the end of file. It's obviously
>>>> ineffective behavior. And for many scenarios we don't need this lseek
>>>> at all.
>>>>
>>>> So, let's "5daa74a6ebc" by default, leaving an option to return
>>>> previous behavior, which is needed for scenarios with preallocated
>>>> images.
>>>>
>>>> Add iotest illustrating new option semantics.
>>>>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>
>>> I still think that an option isn't a good solution and we should try use
>>> some heuristics instead.
>>
>> Do you think that heuristics would be better than fair cache for lseek
>> results?
> 
> I don't think lseek() results are cachable, at least for images that can
> share BLK_PERM_WRITE between processes.
> 
>> I don't see good heuristics, neither want to implement lseek optimizations.
>> In our cases we don't need lseek under qcow2 at all, and it's obviously better
>> just don't lseek in these cases.
> 
> I suggested one: Pass large contiguous allocated ranges to the protocol
> driver, but just assume that the allocation status is correct in the
> format layer if they are small.

So, for fully allocated image, we will call lseek always?

> 
>> Moreover, as I understand, the cases when we need lseek are those when qcow2 thinks
>> that the cluster is data, but actually it's a hole on fs... Isn't it better to
>> support this thing in qcow2? A kind of allocated-zeroes cluster? We have enough
>> reserved bits in cluster descriptor to add this feature. And than we don't need
>> this kind of cheating when user "better knows" where are holes than format layer
>> which owns the data..
> 
> The point of metadata preallocation is that you don't have to update the
> metadata when you write to the image. Having to clear a bit in the L2
> entry would negate that.
> 
> Maybe another option would be to use a compatible feature flag in qcow2
> that is set during creation with preallocation=metadata, and we would
> only go to the lower layer if the flag is set. The downside side is that
> this would only be effective for new images, so copying old preallocated
> images would be slowed down considerably.
> 
> Kevin
> 


-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Kevin Wolf 6 days ago
Am 11.01.2019 um 13:59 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 11.01.2019 15:21, Kevin Wolf wrote:
> > Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> 11.01.2019 13:41, Kevin Wolf wrote:
> >>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >>>> drv_co_block_status digs bs->file for additional, more accurate search
> >>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
> >>>>
> >>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> >>>> knows, where are holes and where is data. But every block_status
> >>>> request calls lseek additionally. Assume a big disk, full of
> >>>> data, in any iterative copying block job (or img convert) we'll call
> >>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
> >>>> iterate through all metadata up to the end of file. It's obviously
> >>>> ineffective behavior. And for many scenarios we don't need this lseek
> >>>> at all.
> >>>>
> >>>> So, let's "5daa74a6ebc" by default, leaving an option to return
> >>>> previous behavior, which is needed for scenarios with preallocated
> >>>> images.
> >>>>
> >>>> Add iotest illustrating new option semantics.
> >>>>
> >>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> >>>
> >>> I still think that an option isn't a good solution and we should try use
> >>> some heuristics instead.
> >>
> >> Do you think that heuristics would be better than fair cache for lseek
> >> results?
> > 
> > I don't think lseek() results are cachable, at least for images that can
> > share BLK_PERM_WRITE between processes.
> > 
> >> I don't see good heuristics, neither want to implement lseek optimizations.
> >> In our cases we don't need lseek under qcow2 at all, and it's obviously better
> >> just don't lseek in these cases.
> > 
> > I suggested one: Pass large contiguous allocated ranges to the protocol
> > driver, but just assume that the allocation status is correct in the
> > format layer if they are small.
> 
> So, for fully allocated image, we will call lseek always?

If they are fully contiguous, yes. But that's a single lseek() call per
image then instead of an lseek() for every 64k, so not a big problem.

In the more realistic case, you will still call lseek() occasionally
because you will have some fragmentation, but the fragments can be
rather large. But it should still significantly reduce them compared to
now because you skip it for those parts with small contiguous
allocations where lseek() would be called a lot today.

Kevin

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 6 days ago
11.01.2019 16:15, Kevin Wolf wrote:
> Am 11.01.2019 um 13:59 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 11.01.2019 15:21, Kevin Wolf wrote:
>>> Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> 11.01.2019 13:41, Kevin Wolf wrote:
>>>>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>> drv_co_block_status digs bs->file for additional, more accurate search
>>>>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
>>>>>>
>>>>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>>>>>> knows, where are holes and where is data. But every block_status
>>>>>> request calls lseek additionally. Assume a big disk, full of
>>>>>> data, in any iterative copying block job (or img convert) we'll call
>>>>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>>>>>> iterate through all metadata up to the end of file. It's obviously
>>>>>> ineffective behavior. And for many scenarios we don't need this lseek
>>>>>> at all.
>>>>>>
>>>>>> So, let's "5daa74a6ebc" by default, leaving an option to return
>>>>>> previous behavior, which is needed for scenarios with preallocated
>>>>>> images.
>>>>>>
>>>>>> Add iotest illustrating new option semantics.
>>>>>>
>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>
>>>>> I still think that an option isn't a good solution and we should try use
>>>>> some heuristics instead.
>>>>
>>>> Do you think that heuristics would be better than fair cache for lseek
>>>> results?
>>>
>>> I don't think lseek() results are cachable, at least for images that can
>>> share BLK_PERM_WRITE between processes.
>>>
>>>> I don't see good heuristics, neither want to implement lseek optimizations.
>>>> In our cases we don't need lseek under qcow2 at all, and it's obviously better
>>>> just don't lseek in these cases.
>>>
>>> I suggested one: Pass large contiguous allocated ranges to the protocol
>>> driver, but just assume that the allocation status is correct in the
>>> format layer if they are small.
>>
>> So, for fully allocated image, we will call lseek always?
> 
> If they are fully contiguous, yes. But that's a single lseek() call per
> image then instead of an lseek() for every 64k, so not a big problem.

lseek is called on each mirror iteration, why one per image?

> 
> In the more realistic case, you will still call lseek() occasionally
> because you will have some fragmentation, but the fragments can be
> rather large. But it should still significantly reduce them compared to
> now because you skip it for those parts with small contiguous
> allocations where lseek() would be called a lot today.
> 
> Kevin
> 

Ok, you propose not to call lseek for small enough data regions reported by
format layer. And for images which are less fragmented, this helps less or don't
help.

Why do you think it is better?

For not preallocated images it is worse, as it covers less cases. So, for our
scenarios it is worse.

The only case, when heuristic works better, is when user have preallocated image,
but don't know about new option, which returns old behavior. We are not interested
in this case and can't go this way, as it doesn't guarantee, that some customer will
not come again with lseek-related problems.

Don't you like what Eric propose, about binding behavior switch to existing
detect-zeroes option?

Or, we can add an opposite option, to enable new behavior, keeping the old one by
default. So, all stays as is, and who need uses new option. Heuristic may be
implemented then too.

-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Eric Blake 6 days ago
On 1/11/19 10:09 AM, Vladimir Sementsov-Ogievskiy wrote:

>>>> I suggested one: Pass large contiguous allocated ranges to the protocol
>>>> driver, but just assume that the allocation status is correct in the
>>>> format layer if they are small.
>>>
>>> So, for fully allocated image, we will call lseek always?
>>
>> If they are fully contiguous, yes. But that's a single lseek() call per
>> image then instead of an lseek() for every 64k, so not a big problem.
> 
> lseek is called on each mirror iteration, why one per image?

If the image has no holes, then lseek(0, SEEK_HOLE) will return EOF, and
then you know that there are no holes, and you don't need to make any
further lseek() calls.  Hence, once per image.  A fully-allocated file
that has areas that read as known zeroes can be determined by fiemap
(but not by lseek, which can only detect unallocated holes) - but we
already know that while fiemap has more information, it also has more
problems (you cannot use it safely without sync, but sync makes it too
slow to use), so that is a non-starter.

>>
>> In the more realistic case, you will still call lseek() occasionally
>> because you will have some fragmentation, but the fragments can be
>> rather large. But it should still significantly reduce them compared to
>> now because you skip it for those parts with small contiguous
>> allocations where lseek() would be called a lot today.
>>
>> Kevin
>>
> 
> Ok, you propose not to call lseek for small enough data regions reported by
> format layer. And for images which are less fragmented, this helps less or don't
> help.

Indeed - pick some threshold (maybe 16 clusters); if block status of the
format layer returns something smaller than the threshold, don't bother
refining the answer further by doing block status of the protocol layer
(if the caller is iterating over an image 1 cluster at a time, then the
threshold will never be tripped and thus you'll never do an lseek); but
where the block status of the format layer is large, we are reading the
file in larger chunks so we end up with fewer lseeks in the long run
anyways.

> 
> Why do you think it is better?
> 
> For not preallocated images it is worse, as it covers less cases. So, for our
> scenarios it is worse.

Anywhere that you skip calling lseek(), you end up missing out on
opportunities to optimize had you instead been able to learn from lseek
that you were on a hole after all.  So it becomes a balancing question:
how much time is spent on probing for whether an optimization is even
possible, vs. how much time is spent if the probe succeeded and you can
then optimize.  For a fully-allocated image, all of the time spent
probing is wasted (you never find a hole, so every probe was wasted).
So it is indeed a tradeoff when picking your heuristics, of trying to
balance how likely the probe will justify the time spent on the probe.

> The only case, when heuristic works better, is when user have preallocated image,
> but don't know about new option, which returns old behavior. We are not interested
> in this case and can't go this way, as it doesn't guarantee, that some customer will
> not come again with lseek-related problems.
> 
> Don't you like what Eric propose, about binding behavior switch to existing
> detect-zeroes option?
> 
> Or, we can add an opposite option, to enable new behavior, keeping the old one by
> default. So, all stays as is, and who need uses new option. Heuristic may be
> implemented then too.
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 6 days ago

On 11.01.2019 17:04, Eric Blake wrote:
> On 1/11/19 10:09 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>>>> I suggested one: Pass large contiguous allocated ranges to the protocol
>>>>> driver, but just assume that the allocation status is correct in the
>>>>> format layer if they are small.
>>>>
>>>> So, for fully allocated image, we will call lseek always?
>>>
>>> If they are fully contiguous, yes. But that's a single lseek() call per
>>> image then instead of an lseek() for every 64k, so not a big problem.
>>
>> lseek is called on each mirror iteration, why one per image?
> 
> If the image has no holes, then lseek(0, SEEK_HOLE) will return EOF, and
> then you know that there are no holes, and you don't need to make any
> further lseek() calls.  Hence, once per image.  A fully-allocated file
> that has areas that read as known zeroes can be determined by fiemap
> (but not by lseek, which can only detect unallocated holes) - but we
> already know that while fiemap has more information, it also has more
> problems (you cannot use it safely without sync, but sync makes it too
> slow to use), so that is a non-starter.
> 
>>>
>>> In the more realistic case, you will still call lseek() occasionally
>>> because you will have some fragmentation, but the fragments can be
>>> rather large. But it should still significantly reduce them compared to
>>> now because you skip it for those parts with small contiguous
>>> allocations where lseek() would be called a lot today.
>>>
>>> Kevin
>>>
>>
>> Ok, you propose not to call lseek for small enough data regions reported by
>> format layer. And for images which are less fragmented, this helps less or don't
>> help.
> 
> Indeed - pick some threshold (maybe 16 clusters); if block status of the
> format layer returns something smaller than the threshold, don't bother
> refining the answer further by doing block status of the protocol layer
> (if the caller is iterating over an image 1 cluster at a time, then the
> threshold will never be tripped and thus you'll never do an lseek); but
> where the block status of the format layer is large, we are reading the
> file in larger chunks so we end up with fewer lseeks in the long run
> anyways.
> 
>>
>> Why do you think it is better?
>>
>> For not preallocated images it is worse, as it covers less cases. So, for our
>> scenarios it is worse.
> 
> Anywhere that you skip calling lseek(), you end up missing out on
> opportunities to optimize had you instead been able to learn from lseek
> that you were on a hole after all.

What are the cases when we'll benefit from lseek except preallocated images?

   So it becomes a balancing question:
> how much time is spent on probing for whether an optimization is even
> possible, vs. how much time is spent if the probe succeeded and you can
> then optimize.  For a fully-allocated image, all of the time spent
> probing is wasted (you never find a hole, so every probe was wasted).
> So it is indeed a tradeoff when picking your heuristics, of trying to
> balance how likely the probe will justify the time spent on the probe.
> 
>> The only case, when heuristic works better, is when user have preallocated image,
>> but don't know about new option, which returns old behavior. We are not interested
>> in this case and can't go this way, as it doesn't guarantee, that some customer will
>> not come again with lseek-related problems.
>>
>> Don't you like what Eric propose, about binding behavior switch to existing
>> detect-zeroes option?
>>
>> Or, we can add an opposite option, to enable new behavior, keeping the old one by
>> default. So, all stays as is, and who need uses new option. Heuristic may be
>> implemented then too.
>>
> 

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Eric Blake 6 days ago
On 1/10/19 7:20 AM, Vladimir Sementsov-Ogievskiy wrote:
> drv_co_block_status digs bs->file for additional, more accurate search
> for hole inside region, reported as DATA by bs since 5daa74a6ebc.

s/region, reported/regions reported/

> 
> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
> knows, where are holes and where is data. But every block_status

Not quite true. qcow2 knows where some holes are, but does not know if
there are even more holes hiding in the sections marked data (such as
because of how the disk was pre-allocated).

> request calls lseek additionally. Assume a big disk, full of
> data, in any iterative copying block job (or img convert) we'll call
> lseek(HOLE) on every iteration, and each of these lseeks will have to
> iterate through all metadata up to the end of file. It's obviously
> ineffective behavior. And for many scenarios we don't need this lseek
> at all.

How much performance can we buy back without any knobs at all, if we
just taught posix-file.c to cache lseek() results?  That is, when
visiting a file sequentially, if lseek(fd, 0, SEEK_HOLE) returns EOF on
our first block status query, then all subsequent block status queries
fall within what we know to be data, and we can skip the lseek() calls.

> 
> So, let's "5daa74a6ebc" by default, leaving an option to return

s/let's/let's undo/

> previous behavior, which is needed for scenarios with preallocated
> images.
> 
> Add iotest illustrating new option semantics.

And none of the existing iotests fail due to changed output?  Does that
mean our testsuite does not have good coverage of pre-allocation modes
where the zero probe mattered?

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
> 

> +++ b/qapi/block-core.json
> @@ -3673,6 +3673,8 @@
>  #                 (default: off)
>  # @force-share:   force share all permission on added nodes.
>  #                 Requires read-only=true. (Since 2.10)
> +# @probe-zeroes:  Probe zeroes on protocol layer if format reports data
> +#                 (default: false) (since 4.0)

Why do we need a new bool?  Can we instead...

>  #
>  # Remaining options are determined by the block driver.
>  #
> @@ -3686,7 +3688,8 @@
>              '*read-only': 'bool',
>              '*auto-read-only': 'bool',
>              '*force-share': 'bool',
> -            '*detect-zeroes': 'BlockdevDetectZeroesOptions' },
> +            '*detect-zeroes': 'BlockdevDetectZeroesOptions',
> +            '*probe-zeroes': 'bool' },

...add another enum value to 'detect-zeroes', since after all, what we
are really doing is fine-tuning what level of zero probing we are
willing to do?


> +++ b/qemu-options.hx
> @@ -615,6 +615,7 @@ DEF("blockdev", HAS_ARG, QEMU_OPTION_blockdev,
>      "-blockdev [driver=]driver[,node-name=N][,discard=ignore|unmap]\n"
>      "          [,cache.direct=on|off][,cache.no-flush=on|off]\n"
>      "          [,read-only=on|off][,detect-zeroes=on|off|unmap]\n"
> +    "          [,probe-zeroes=on|off]\n"
>      "          [,driver specific parameters...]\n"
>      "                configure a block backend\n", QEMU_ARCH_ALL)
>  STEXI
> @@ -670,6 +671,8 @@ discard requests.
>  conversion of plain zero writes by the OS to driver specific optimized
>  zero write commands. You may even choose "unmap" if @var{discard} is set
>  to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
> +@item probe-zeroes=@var{probe-zeroes}
> +Probe zeroes on protocol layer if format reports data. Default is "off".

Again, fitting this into detect-zeroes seems better than inventing a new
knob.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 6 days ago
10.01.2019 23:51, Eric Blake wrote:
> On 1/10/19 7:20 AM, Vladimir Sementsov-Ogievskiy wrote:
>> drv_co_block_status digs bs->file for additional, more accurate search
>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
> 
> s/region, reported/regions reported/
> 
>>
>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>> knows, where are holes and where is data. But every block_status
> 
> Not quite true. qcow2 knows where some holes are, but does not know if
> there are even more holes hiding in the sections marked data (such as
> because of how the disk was pre-allocated).
> 
>> request calls lseek additionally. Assume a big disk, full of
>> data, in any iterative copying block job (or img convert) we'll call
>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>> iterate through all metadata up to the end of file. It's obviously
>> ineffective behavior. And for many scenarios we don't need this lseek
>> at all.
> 
> How much performance can we buy back without any knobs at all, if we
> just taught posix-file.c to cache lseek() results?  That is, when
> visiting a file sequentially, if lseek(fd, 0, SEEK_HOLE) returns EOF on
> our first block status query, then all subsequent block status queries
> fall within what we know to be data, and we can skip the lseek() calls.

EOF is bad mark I think. We may have a small hole not far from EOF, which
will lead to the same performance, but not EOF returned.

I think, proper implementation of lseek cache should work, but it is more
difficult than just disable lseek. And in scenarios without preallocation
we don't need lseek, so again better is disable it.

And I don't sure that qemu is a proper place for lseek optimizations.

And another way may be to select cases when fiemap is safe and use it.

Again, for scenarios when we don't need nor lseek, nor fiemap, neither
cache for them the best option is not use them.

> 
>>
>> So, let's "5daa74a6ebc" by default, leaving an option to return
> 
> s/let's/let's undo/
> 
>> previous behavior, which is needed for scenarios with preallocated
>> images.
>>
>> Add iotest illustrating new option semantics.
> 
> And none of the existing iotests fail due to changed output?  Does that
> mean our testsuite does not have good coverage of pre-allocation modes
> where the zero probe mattered?

To be honest, I didn't run all the tests, only several.

Do it now, and found that, yes, at least 102 broken. Will fix it with the following
version. Strange, do patchew run tests on patches in list now?

> 
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>
> 
>> +++ b/qapi/block-core.json
>> @@ -3673,6 +3673,8 @@
>>   #                 (default: off)
>>   # @force-share:   force share all permission on added nodes.
>>   #                 Requires read-only=true. (Since 2.10)
>> +# @probe-zeroes:  Probe zeroes on protocol layer if format reports data
>> +#                 (default: false) (since 4.0)
> 
> Why do we need a new bool?  Can we instead...
> 
>>   #
>>   # Remaining options are determined by the block driver.
>>   #
>> @@ -3686,7 +3688,8 @@
>>               '*read-only': 'bool',
>>               '*auto-read-only': 'bool',
>>               '*force-share': 'bool',
>> -            '*detect-zeroes': 'BlockdevDetectZeroesOptions' },
>> +            '*detect-zeroes': 'BlockdevDetectZeroesOptions',
>> +            '*probe-zeroes': 'bool' },
> 
> ...add another enum value to 'detect-zeroes', since after all, what we
> are really doing is fine-tuning what level of zero probing we are
> willing to do?

Should we? Or I just bind old behavior to detect-zeroes=on? And then, if needed,
we'll add intermediate modes.

> 
> 
>> +++ b/qemu-options.hx
>> @@ -615,6 +615,7 @@ DEF("blockdev", HAS_ARG, QEMU_OPTION_blockdev,
>>       "-blockdev [driver=]driver[,node-name=N][,discard=ignore|unmap]\n"
>>       "          [,cache.direct=on|off][,cache.no-flush=on|off]\n"
>>       "          [,read-only=on|off][,detect-zeroes=on|off|unmap]\n"
>> +    "          [,probe-zeroes=on|off]\n"
>>       "          [,driver specific parameters...]\n"
>>       "                configure a block backend\n", QEMU_ARCH_ALL)
>>   STEXI
>> @@ -670,6 +671,8 @@ discard requests.
>>   conversion of plain zero writes by the OS to driver specific optimized
>>   zero write commands. You may even choose "unmap" if @var{discard} is set
>>   to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
>> +@item probe-zeroes=@var{probe-zeroes}
>> +Probe zeroes on protocol layer if format reports data. Default is "off".
> 
> Again, fitting this into detect-zeroes seems better than inventing a new
> knob.
> 

No objections. But description of detect-zeroes are more about writes, should
we change them somehow?

# Describes the operation mode for the automatic conversion of plain
# zero writes by the OS to driver specific optimized zero write commands.

...

# @detect-zeroes: detect and optimize zero writes (Since 2.1)



-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Eric Blake 6 days ago
On 1/11/19 1:54 AM, Vladimir Sementsov-Ogievskiy wrote:

>>
>> How much performance can we buy back without any knobs at all, if we
>> just taught posix-file.c to cache lseek() results?  That is, when
>> visiting a file sequentially, if lseek(fd, 0, SEEK_HOLE) returns EOF on
>> our first block status query, then all subsequent block status queries
>> fall within what we know to be data, and we can skip the lseek() calls.
> 
> EOF is bad mark I think. We may have a small hole not far from EOF, which
> will lead to the same performance, but not EOF returned.

EOF was just an example for a file that has no holes. But even for a
file with holes, caching should help.  That is, if I have a raw file with:

1M data | 1M hole | 1M data | EOF

but a qcow2 file that was created in an out-of-order fashion, so that
all clusters are discontiguous, then our current code may do something
like the following sequence:

block_status 0 - maps to 64k of host file at offset 0x20000
 - lseek(0x20000) detects that we are in a data portion, and the next
hole begins at 0x100000
 - but because the next cluster is not at 0x30000, we throw away the
information for 0x30000 to 0x100000
block_status 64k - maps to 64k of host file at offset 0x40000
 - lseek(0x40000) detects that we are in a data portion, and the next
hole begins at 0x100000
 - but because the next cluster is not at 0x50000, we throw away the
information for 0x50000 to 0x100000
block status 128k - maps to 64k of host file at offset 0x30000
 - lseek(0x30000) detects that we are in a data portion, and the next
hole begins at 0x100000
...

Even a dumb most-recent use cache will speed this up: both the second
and third queries above can be avoided because we know that both 0x40000
and 0x30000 the second query at 0x40000 can be skipped (0x40000 is
between our most recent lseek at 0x20000 and hole at 0x10000)

Make the cache slightly larger, or use a bitmap with 2 bits per cluster
(tracking unknown, known-data, known-hole), with proper flushing of the
cache as we write to the image, or whatever, and we should automatically
get some performance improvements by using fewer lseek() anywhere that
we remember what previous lseek() already told us, with no knobs needed.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 6 days ago
11.01.2019 19:02, Eric Blake wrote:
> On 1/11/19 1:54 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>>
>>> How much performance can we buy back without any knobs at all, if we
>>> just taught posix-file.c to cache lseek() results?  That is, when
>>> visiting a file sequentially, if lseek(fd, 0, SEEK_HOLE) returns EOF on
>>> our first block status query, then all subsequent block status queries
>>> fall within what we know to be data, and we can skip the lseek() calls.
>>
>> EOF is bad mark I think. We may have a small hole not far from EOF, which
>> will lead to the same performance, but not EOF returned.
> 
> EOF was just an example for a file that has no holes. But even for a
> file with holes, caching should help.  That is, if I have a raw file with:
> 
> 1M data | 1M hole | 1M data | EOF
> 
> but a qcow2 file that was created in an out-of-order fashion, so that
> all clusters are discontiguous, then our current code may do something
> like the following sequence:
> 
> block_status 0 - maps to 64k of host file at offset 0x20000
>   - lseek(0x20000) detects that we are in a data portion, and the next
> hole begins at 0x100000
>   - but because the next cluster is not at 0x30000, we throw away the
> information for 0x30000 to 0x100000
> block_status 64k - maps to 64k of host file at offset 0x40000
>   - lseek(0x40000) detects that we are in a data portion, and the next
> hole begins at 0x100000
>   - but because the next cluster is not at 0x50000, we throw away the
> information for 0x50000 to 0x100000
> block status 128k - maps to 64k of host file at offset 0x30000
>   - lseek(0x30000) detects that we are in a data portion, and the next
> hole begins at 0x100000
> ...
> 
> Even a dumb most-recent use cache will speed this up: both the second
> and third queries above can be avoided because we know that both 0x40000
> and 0x30000 the second query at 0x40000 can be skipped (0x40000 is
> between our most recent lseek at 0x20000 and hole at 0x10000)

Is it correct just use results from previous iterations? In mirror source
is active and may change.

> 
> Make the cache slightly larger, or use a bitmap with 2 bits per cluster
> (tracking unknown, known-data, known-hole), with proper flushing of the
> cache as we write to the image, or whatever, and we should automatically
> get some performance improvements by using fewer lseek() anywhere that
> we remember what previous lseek() already told us, with no knobs needed.
> 

So the cache should consider all writes and discards. And it is obviously
more difficult to implement it, than just don't call this lseek. And I
don't understand, why cache + lseek is better for the case when we don't
need nor the lseek neither the cache. Is this all to not add an option?
Also Kevin objects to caching lseek in parallel sub-thread.

-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Eric Blake 6 days ago
On 1/11/19 10:22 AM, Vladimir Sementsov-Ogievskiy wrote:

>> Even a dumb most-recent use cache will speed this up: both the second
>> and third queries above can be avoided because we know that both 0x40000
>> and 0x30000 the second query at 0x40000 can be skipped (0x40000 is
>> between our most recent lseek at 0x20000 and hole at 0x10000)
> 
> Is it correct just use results from previous iterations? In mirror source
> is active and may change.

If you keep a cache, you have to keep the cache up-to-date. Any writes
to an area that is covered by the known-hole cache have to flush the
cache, so that the next block status no longer sees a known-hole and
ends up doing another lseek.  Or, if the cache has enough state to track
unknown/known-hole/known-data, then writes update the cache to be
known-data, and future block status can skip the lseek by using the
results of the cache.

> 
>>
>> Make the cache slightly larger, or use a bitmap with 2 bits per cluster
>> (tracking unknown, known-data, known-hole), with proper flushing of the
>> cache as we write to the image, or whatever, and we should automatically
>> get some performance improvements by using fewer lseek() anywhere that
>> we remember what previous lseek() already told us, with no knobs needed.
>>
> 
> So the cache should consider all writes and discards. And it is obviously
> more difficult to implement it, than just don't call this lseek. And I
> don't understand, why cache + lseek is better for the case when we don't
> need nor the lseek neither the cache. Is this all to not add an option?
> Also Kevin objects to caching lseek in parallel sub-thread.

Keven objected to caching anything if the image has multiple writers,
where an outside process could change the file allocation in between our
reads. But multiple writers is rare - in fact, our image locking for
qcow2 formats tries to prevent multiple writers.  Having multiple
threads within one process writing is fine, as long as they properly
coordinate writes to the lseek cache so that readers never see a stale
claim of a hole - although a stale claim of data is safe.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Eric Blake 6 days ago
On 1/11/19 10:02 AM, Eric Blake wrote:

> Even a dumb most-recent use cache will speed this up: both the second
> and third queries above can be avoided because we know that both 0x40000
> and 0x30000 the second query at 0x40000 can be skipped (0x40000 is
> between our most recent lseek at 0x20000 and hole at 0x10000)
> 
> Make the cache slightly larger, or use a bitmap with 2 bits per cluster
> (tracking unknown, known-data, known-hole), with proper flushing of the
> cache as we write to the image, or whatever, and we should automatically
> get some performance improvements by using fewer lseek() anywhere that
> we remember what previous lseek() already told us, with no knobs needed.

In fact, block/iscsi.c has such a cache; look at iscsi_allocmap_update()
and friends; maybe we should generalize that code in order to reuse the
same concepts in file-posix.c.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH] block: don't probe zeroes in bs->file by default on block_status

Posted by Vladimir Sementsov-Ogievskiy 6 days ago
11.01.2019 10:54, Vladimir Sementsov-Ogievskiy wrote:
> 10.01.2019 23:51, Eric Blake wrote:
>> On 1/10/19 7:20 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> drv_co_block_status digs bs->file for additional, more accurate search
>>> for hole inside region, reported as DATA by bs since 5daa74a6ebc.
>>
>> s/region, reported/regions reported/
>>
>>>
>>> This accuracy is not free: assume we have qcow2 disk. Actually, qcow2
>>> knows, where are holes and where is data. But every block_status
>>
>> Not quite true. qcow2 knows where some holes are, but does not know if
>> there are even more holes hiding in the sections marked data (such as
>> because of how the disk was pre-allocated).
>>
>>> request calls lseek additionally. Assume a big disk, full of
>>> data, in any iterative copying block job (or img convert) we'll call
>>> lseek(HOLE) on every iteration, and each of these lseeks will have to
>>> iterate through all metadata up to the end of file. It's obviously
>>> ineffective behavior. And for many scenarios we don't need this lseek
>>> at all.
>>
>> How much performance can we buy back without any knobs at all, if we
>> just taught posix-file.c to cache lseek() results?  That is, when
>> visiting a file sequentially, if lseek(fd, 0, SEEK_HOLE) returns EOF on
>> our first block status query, then all subsequent block status queries
>> fall within what we know to be data, and we can skip the lseek() calls.
> 
> EOF is bad mark I think. We may have a small hole not far from EOF, which
> will lead to the same performance, but not EOF returned.
> 
> I think, proper implementation of lseek cache should work, but it is more
> difficult than just disable lseek. And in scenarios without preallocation
> we don't need lseek, so again better is disable it.
> 
> And I don't sure that qemu is a proper place for lseek optimizations.
> 
> And another way may be to select cases when fiemap is safe and use it.
> 
> Again, for scenarios when we don't need nor lseek, nor fiemap, neither
> cache for them the best option is not use them.
> 
>>
>>>
>>> So, let's "5daa74a6ebc" by default, leaving an option to return
>>
>> s/let's/let's undo/
>>
>>> previous behavior, which is needed for scenarios with preallocated
>>> images.
>>>
>>> Add iotest illustrating new option semantics.
>>
>> And none of the existing iotests fail due to changed output?  Does that
>> mean our testsuite does not have good coverage of pre-allocation modes
>> where the zero probe mattered?
> 
> To be honest, I didn't run all the tests, only several.
> 
> Do it now, and found that, yes, at least 102 broken. Will fix it with the following
> version. Strange, do patchew run tests on patches in list now?
> 
>>
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>
>>
>>> +++ b/qapi/block-core.json
>>> @@ -3673,6 +3673,8 @@
>>>   #                 (default: off)
>>>   # @force-share:   force share all permission on added nodes.
>>>   #                 Requires read-only=true. (Since 2.10)
>>> +# @probe-zeroes:  Probe zeroes on protocol layer if format reports data
>>> +#                 (default: false) (since 4.0)
>>
>> Why do we need a new bool?  Can we instead...
>>
>>>   #
>>>   # Remaining options are determined by the block driver.
>>>   #
>>> @@ -3686,7 +3688,8 @@
>>>               '*read-only': 'bool',
>>>               '*auto-read-only': 'bool',
>>>               '*force-share': 'bool',
>>> -            '*detect-zeroes': 'BlockdevDetectZeroesOptions' },
>>> +            '*detect-zeroes': 'BlockdevDetectZeroesOptions',
>>> +            '*probe-zeroes': 'bool' },
>>
>> ...add another enum value to 'detect-zeroes', since after all, what we
>> are really doing is fine-tuning what level of zero probing we are
>> willing to do?
> 
> Should we? Or I just bind old behavior to detect-zeroes=on? And then, if needed,
> we'll add intermediate modes.
> 
>>
>>
>>> +++ b/qemu-options.hx
>>> @@ -615,6 +615,7 @@ DEF("blockdev", HAS_ARG, QEMU_OPTION_blockdev,
>>>       "-blockdev [driver=]driver[,node-name=N][,discard=ignore|unmap]\n"
>>>       "          [,cache.direct=on|off][,cache.no-flush=on|off]\n"
>>>       "          [,read-only=on|off][,detect-zeroes=on|off|unmap]\n"
>>> +    "          [,probe-zeroes=on|off]\n"
>>>       "          [,driver specific parameters...]\n"
>>>       "                configure a block backend\n", QEMU_ARCH_ALL)
>>>   STEXI
>>> @@ -670,6 +671,8 @@ discard requests.
>>>   conversion of plain zero writes by the OS to driver specific optimized
>>>   zero write commands. You may even choose "unmap" if @var{discard} is set
>>>   to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
>>> +@item probe-zeroes=@var{probe-zeroes}
>>> +Probe zeroes on protocol layer if format reports data. Default is "off".
>>
>> Again, fitting this into detect-zeroes seems better than inventing a new
>> knob.
>>
> 
> No objections. But description of detect-zeroes are more about writes, should
> we change them somehow?
> 
> # Describes the operation mode for the automatic conversion of plain
> # zero writes by the OS to driver specific optimized zero write commands.
> 
> ...
> 
> # @detect-zeroes: detect and optimize zero writes (Since 2.1)
> 
> 
> 

Hm, just note, that detect-zeroes was about writes and should be set on target, and
the new thing is about block-status and should be set on source, and this thing should
be described in spec as well.


-- 
Best regards,
Vladimir