From: Kevin Wolf <kwolf@redhat.com>
To: qemu-block@nongnu.org
Cc: kwolf@redhat.com, famz@redhat.com, qemu-devel@nongnu.org, mreitz@redhat.com, stefanha@redhat.com, pbonzini@redhat.com
Date: Wed, 11 Apr 2018 18:39:39 +0200
Message-Id: <20180411163940.2523-19-kwolf@redhat.com>
In-Reply-To: <20180411163940.2523-1-kwolf@redhat.com>
References: <20180411163940.2523-1-kwolf@redhat.com>
Subject: [Qemu-devel] [PATCH 18/19] block: Allow graph changes in bdrv_drain_all_begin/end sections

bdrv_drain_all_*() used bdrv_next() to iterate over all root nodes and did
a subtree drain for each of them. This works fine as long as the graph is
static, but sadly, reality looks different. If the graph changes so that
root nodes are added or removed, we would have to compensate for this.
bdrv_next() returns each root node only once even if it's the root node for
multiple BlockBackends or for a monitor-owned block driver tree, which
would only complicate things.
The much easier and more obviously correct way is to fundamentally change
the way the functions work: iterate over all BlockDriverStates, no matter
who owns them, and drain them individually. Compensation is only necessary
when a new BDS is created inside a drain_all section. Removal of a BDS
doesn't require any action because it's gone afterwards anyway.

This change means that some nodes get a higher bs->quiesce_counter now
because each node propagates its individual drain to all of its parents. In
the old subtree drain, propagation back to the parent that made the
recursive drain request was avoided. While this isn't perfectly beautiful,
accepting the higher counts seems preferable to adding drain code to
multiple other places that modify the graph.

The test case is changed to account for the higher counts where necessary.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block.h     |  1 +
 include/block/block_int.h |  1 +
 block.c                   | 20 +++++++++++++++-
 block/io.c                | 58 ++++++++++++++++++++++++++++++++++++------------
 tests/test-bdrv-drain.c   | 37 +++++++++++++++++-------------
 5 files changed, 87 insertions(+), 30 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index de2cba2c74..0b59d519c5 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -412,6 +412,7 @@ BlockDriverState *bdrv_lookup_bs(const char *device,
                                  Error **errp);
 bool bdrv_chain_contains(BlockDriverState *top, BlockDriverState *base);
 BlockDriverState *bdrv_next_node(BlockDriverState *bs);
+BlockDriverState *bdrv_next_all_states(BlockDriverState *bs);
 
 typedef struct BdrvNextIterator {
     enum {
diff --git a/include/block/block_int.h b/include/block/block_int.h
index dc6985e3ae..2c86a7b53f 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -804,6 +804,7 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
     int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
     BdrvRequestFlags flags);
 
+extern unsigned int bdrv_drain_all_count;
 void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent);
 void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent);
 
diff --git a/block.c b/block.c
index 330238de19..c1b339c4a4 100644
--- a/block.c
+++ b/block.c
@@ -332,6 +332,10 @@ BlockDriverState *bdrv_new(void)
 
     qemu_co_queue_init(&bs->flush_queue);
 
+    for (i = 0; i < bdrv_drain_all_count; i++) {
+        bdrv_drained_begin(bs);
+    }
+
     QTAILQ_INSERT_TAIL(&all_bdrv_states, bs, bs_list);
 
     return bs;
@@ -1160,7 +1164,7 @@ static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
                             int open_flags, Error **errp)
 {
     Error *local_err = NULL;
-    int ret;
+    int i, ret;
 
     bdrv_assign_node_name(bs, node_name, &local_err);
     if (local_err) {
@@ -1208,6 +1212,12 @@ static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
     assert(bdrv_min_mem_align(bs) != 0);
     assert(is_power_of_2(bs->bl.request_alignment));
 
+    for (i = 0; i < bs->quiesce_counter; i++) {
+        if (drv->bdrv_co_drain_begin) {
+            drv->bdrv_co_drain_begin(bs);
+        }
+    }
+
     return 0;
 open_failed:
     bs->drv = NULL;
@@ -4034,6 +4044,14 @@ BlockDriverState *bdrv_next_node(BlockDriverState *bs)
     return QTAILQ_NEXT(bs, node_list);
 }
 
+BlockDriverState *bdrv_next_all_states(BlockDriverState *bs)
+{
+    if (!bs) {
+        return QTAILQ_FIRST(&all_bdrv_states);
+    }
+    return QTAILQ_NEXT(bs, bs_list);
+}
+
 const char *bdrv_get_node_name(const BlockDriverState *bs)
 {
     return bs->node_name;
diff --git a/block/io.c b/block/io.c
index db6810b35f..af65c3ec2f 100644
--- a/block/io.c
+++ b/block/io.c
@@ -38,6 +38,8 @@
 /* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
 #define MAX_BOUNCE_BUFFER (32768 << BDRV_SECTOR_BITS)
 
+static AioWait drain_all_aio_wait;
+
 static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
     int64_t offset, int bytes, BdrvRequestFlags flags);
 
@@ -445,6 +447,29 @@ static void bdrv_drain_assert_idle(BlockDriverState *bs)
     }
 }
 
+unsigned int bdrv_drain_all_count = 0;
+
+static bool bdrv_drain_all_poll(void)
+{
+    BlockDriverState *bs = NULL;
+    bool result = false;
+
+    /* Execute pending BHs first (may modify the graph) and check everything
+     * else only after the BHs have executed. */
+    while (aio_poll(qemu_get_aio_context(), false));
+
+    /* bdrv_drain_poll() with top_level = false can't make changes to the
+     * graph, so iterating bdrv_next_all_states() is safe. */
+    while ((bs = bdrv_next_all_states(bs))) {
+        AioContext *aio_context = bdrv_get_aio_context(bs);
+        aio_context_acquire(aio_context);
+        result |= bdrv_drain_poll(bs, false, false);
+        aio_context_release(aio_context);
+    }
+
+    return result;
+}
+
 /*
  * Wait for pending requests to complete across all BlockDriverStates
  *
@@ -459,45 +484,51 @@ static void bdrv_drain_assert_idle(BlockDriverState *bs)
  */
 void bdrv_drain_all_begin(void)
 {
-    BlockDriverState *bs;
-    BdrvNextIterator it;
+    BlockDriverState *bs = NULL;
 
     if (qemu_in_coroutine()) {
         bdrv_co_yield_to_drain(NULL, true, false, NULL, true);
         return;
     }
 
-    /* BDRV_POLL_WHILE() for a node can only be called from its own I/O thread
-     * or the main loop AioContext. We potentially use BDRV_POLL_WHILE() on
-     * nodes in several different AioContexts, so make sure we're in the main
-     * context. */
+    /* AIO_WAIT_WHILE() with a NULL context can only be called from the main
+     * loop AioContext, so make sure we're in the main context. */
     assert(qemu_get_current_aio_context() == qemu_get_aio_context());
+    assert(bdrv_drain_all_count < INT_MAX);
+    bdrv_drain_all_count++;
 
-    for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
+    /* Quiesce all nodes, without polling in-flight requests yet. The graph
+     * cannot change during this loop. */
+    while ((bs = bdrv_next_all_states(bs))) {
         AioContext *aio_context = bdrv_get_aio_context(bs);
 
         aio_context_acquire(aio_context);
-        bdrv_do_drained_begin(bs, true, NULL, true);
+        bdrv_do_drained_begin(bs, false, NULL, false);
         aio_context_release(aio_context);
     }
 
-    for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
+    /* Now poll the in-flight requests */
+    AIO_WAIT_WHILE(&drain_all_aio_wait, NULL, bdrv_drain_all_poll());
+
+    while ((bs = bdrv_next_all_states(bs))) {
         bdrv_drain_assert_idle(bs);
     }
 }
 
 void bdrv_drain_all_end(void)
 {
-    BlockDriverState *bs;
-    BdrvNextIterator it;
+    BlockDriverState *bs = NULL;
 
-    for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
+    while ((bs = bdrv_next_all_states(bs))) {
        AioContext *aio_context = bdrv_get_aio_context(bs);
 
         aio_context_acquire(aio_context);
-        bdrv_do_drained_end(bs, true, NULL);
+        bdrv_do_drained_end(bs, false, NULL);
         aio_context_release(aio_context);
     }
+
+    assert(bdrv_drain_all_count > 0);
+    bdrv_drain_all_count--;
 }
 
 void bdrv_drain_all(void)
@@ -620,6 +651,7 @@ void bdrv_inc_in_flight(BlockDriverState *bs)
 void bdrv_wakeup(BlockDriverState *bs)
 {
     aio_wait_kick(bdrv_get_aio_wait(bs));
+    aio_wait_kick(&drain_all_aio_wait);
 }
 
 void bdrv_dec_in_flight(BlockDriverState *bs)
diff --git a/tests/test-bdrv-drain.c b/tests/test-bdrv-drain.c
index 79d845ecbb..97ca0743c6 100644
--- a/tests/test-bdrv-drain.c
+++ b/tests/test-bdrv-drain.c
@@ -174,7 +174,8 @@ static void do_drain_end(enum drain_type drain_type, BlockDriverState *bs)
     }
 }
 
-static void test_drv_cb_common(enum drain_type drain_type, bool recursive)
+static void test_drv_cb_common(enum drain_type drain_type, int top_drain_count,
+                               int backing_drain_count)
 {
     BlockBackend *blk;
     BlockDriverState *bs, *backing;
@@ -205,8 +206,8 @@ static void test_drv_cb_common(enum drain_type drain_type, bool recursive)
 
     do_drain_begin(drain_type, bs);
 
-    g_assert_cmpint(s->drain_count, ==, 1);
-    g_assert_cmpint(backing_s->drain_count, ==, !!recursive);
+    g_assert_cmpint(s->drain_count, ==, top_drain_count);
+    g_assert_cmpint(backing_s->drain_count, ==, backing_drain_count);
 
     do_drain_end(drain_type, bs);
 
@@ -225,8 +226,8 @@ static void test_drv_cb_common(enum drain_type drain_type, bool recursive)
     do_drain_begin(drain_type, bs);
 
     g_assert_cmpint(aio_ret, ==, 0);
-    g_assert_cmpint(s->drain_count, ==, 1);
-    g_assert_cmpint(backing_s->drain_count, ==, !!recursive);
+    g_assert_cmpint(s->drain_count, ==, top_drain_count);
+    g_assert_cmpint(backing_s->drain_count, ==, backing_drain_count);
 
     do_drain_end(drain_type, bs);
 
@@ -240,17 +241,17 @@ static void test_drv_cb_common(enum drain_type drain_type, bool recursive)
 
 static void test_drv_cb_drain_all(void)
 {
-    test_drv_cb_common(BDRV_DRAIN_ALL, true);
+    test_drv_cb_common(BDRV_DRAIN_ALL, 2, 1);
 }
 
 static void test_drv_cb_drain(void)
 {
-    test_drv_cb_common(BDRV_DRAIN, false);
+    test_drv_cb_common(BDRV_DRAIN, 1, 0);
 }
 
 static void test_drv_cb_drain_subtree(void)
 {
-    test_drv_cb_common(BDRV_SUBTREE_DRAIN, true);
+    test_drv_cb_common(BDRV_SUBTREE_DRAIN, 1, 1);
 }
 
 static void test_drv_cb_co_drain_all(void)
@@ -268,7 +269,9 @@ static void test_drv_cb_co_drain_subtree(void)
     call_in_coroutine(test_drv_cb_drain_subtree);
 }
 
-static void test_quiesce_common(enum drain_type drain_type, bool recursive)
+static void test_quiesce_common(enum drain_type drain_type,
+                                int top_quiesce_count,
+                                int backing_quiesce_count)
 {
     BlockBackend *blk;
     BlockDriverState *bs, *backing;
@@ -286,8 +289,8 @@ static void test_quiesce_common(enum drain_type drain_type, bool recursive)
 
     do_drain_begin(drain_type, bs);
 
-    g_assert_cmpint(bs->quiesce_counter, ==, 1);
-    g_assert_cmpint(backing->quiesce_counter, ==, !!recursive);
+    g_assert_cmpint(bs->quiesce_counter, ==, top_quiesce_count);
+    g_assert_cmpint(backing->quiesce_counter, ==, backing_quiesce_count);
 
     do_drain_end(drain_type, bs);
 
@@ -301,17 +304,17 @@ static void test_quiesce_common(enum drain_type drain_type, bool recursive)
 
 static void test_quiesce_drain_all(void)
 {
-    test_quiesce_common(BDRV_DRAIN_ALL, true);
+    test_quiesce_common(BDRV_DRAIN_ALL, 2, 1);
 }
 
 static void test_quiesce_drain(void)
 {
-    test_quiesce_common(BDRV_DRAIN, false);
+    test_quiesce_common(BDRV_DRAIN, 1, 0);
 }
 
 static void test_quiesce_drain_subtree(void)
 {
-    test_quiesce_common(BDRV_SUBTREE_DRAIN, true);
+    test_quiesce_common(BDRV_SUBTREE_DRAIN, 1, 1);
 }
 
 static void test_quiesce_co_drain_all(void)
@@ -348,6 +351,8 @@ static void test_nested(void)
 
     for (outer = 0; outer < DRAIN_TYPE_MAX; outer++) {
         for (inner = 0; inner < DRAIN_TYPE_MAX; inner++) {
+            int top_quiesce = 2 + (outer == BDRV_DRAIN_ALL) +
+                                  (inner == BDRV_DRAIN_ALL);
             int backing_quiesce = (outer != BDRV_DRAIN) +
                                   (inner != BDRV_DRAIN);
 
@@ -359,9 +364,9 @@ static void test_nested(void)
             do_drain_begin(outer, bs);
             do_drain_begin(inner, bs);
 
-            g_assert_cmpint(bs->quiesce_counter, ==, 2);
+            g_assert_cmpint(bs->quiesce_counter, ==, top_quiesce);
             g_assert_cmpint(backing->quiesce_counter, ==, backing_quiesce);
-            g_assert_cmpint(s->drain_count, ==, 2);
+            g_assert_cmpint(s->drain_count, ==, top_quiesce);
             g_assert_cmpint(backing_s->drain_count, ==, backing_quiesce);
 
             do_drain_end(inner, bs);
-- 
2.13.6