From nobody Tue Feb 10 11:56:00 2026
From: Michael Tokarev <mjt@tls.msk.ru>
To: qemu-devel@nongnu.org
Cc: qemu-stable@nongnu.org, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
 Michael Tokarev
Subject: [Stable-10.0.7 72/81] nvme: Fix coroutine waking
Date: Fri, 21 Nov 2025 21:44:11 +0300
Message-ID: <20251121184424.1137669-72-mjt@tls.msk.ru>
X-Mailer: git-send-email 2.47.3
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Hanna Czenczek

nvme wakes the request coroutine via qemu_coroutine_enter() from a BH
scheduled in the BDS AioContext.  This may not be the same context as
the one in which the request originally ran, which would be wrong:
- It could mean we enter the coroutine before it yields,
- We would move the coroutine into a different context.

(Can be reproduced with multiqueue by adding a usleep(100000) before
the `while (data.ret == -EINPROGRESS)` loop.)

To fix that, use aio_co_wake() to run the coroutine in its home
context.  Just like in the preceding iscsi and nfs patches, we can
drop the trivial nvme_rw_cb_bh() and use aio_co_wake() directly.
With this, we can remove NVMeCoData.ctx.

Note the check of data->co == NULL to bypass the BH/yield combination
in case nvme_rw_cb() is called from nvme_submit_command(): We probably
want to keep this fast path for performance reasons, but we have to be
quite careful about it:
- We cannot overload .ret for this, but have to use a dedicated
  .skip_yield field.  Otherwise, if nvme_rw_cb() runs in a different
  thread than the coroutine, it may see .ret set and skip the yield,
  while nvme_rw_cb() will still schedule a BH for waking.  Therefore,
  the signal to skip the yield can only be set in nvme_rw_cb() if
  waking too is skipped, which is independent from communicating the
  return value.
- We can only skip the yield if nvme_rw_cb() actually runs in the
  request coroutine.
  Otherwise (specifically if they run in different AioContexts), the
  order between this function's execution and the coroutine yielding
  (or not yielding) is not reliable.
- There is no point to yielding in a loop; there are no spurious
  wakes, so once we yield, we will only be re-entered once the command
  is done.  Replace `while` by `if`.

Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek
Message-ID: <20251110154854.151484-9-hreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi
Reviewed-by: Kevin Wolf
Signed-off-by: Kevin Wolf
(cherry picked from commit 0f142cbd919fcb6cea7aa176f7e4939925806dd9)
Signed-off-by: Michael Tokarev

diff --git a/block/nvme.c b/block/nvme.c
index 3366519e57..2e93abdb4b 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -1171,25 +1171,35 @@ fail:
 
 typedef struct {
     Coroutine *co;
+    bool skip_yield;
     int ret;
-    AioContext *ctx;
 } NVMeCoData;
 
-static void nvme_rw_cb_bh(void *opaque)
-{
-    NVMeCoData *data = opaque;
-    qemu_coroutine_enter(data->co);
-}
-
 static void nvme_rw_cb(void *opaque, int ret)
 {
     NVMeCoData *data = opaque;
+
     data->ret = ret;
-    if (!data->co) {
-        /* The rw coroutine hasn't yielded, don't try to enter. */
-        return;
+
+    if (data->co == qemu_coroutine_self()) {
+        /*
+         * Fast path: We are inside of the request coroutine (through
+         * nvme_submit_command, nvme_deferred_fn, nvme_process_completion).
+         * We can set data->skip_yield here to keep the coroutine from
+         * yielding, and then we don't need to schedule a BH to wake it.
+         */
+        data->skip_yield = true;
+    } else {
+        /*
+         * Safe to call: The case where we run in the request coroutine is
+         * handled above, so we must be independent of it; and without
+         * skip_yield set, the coroutine will yield.
+         * No need to release NVMeQueuePair.lock (we are called without it
+         * held).  (Note: If we enter the coroutine here, @data will
+         * probably be dangling once aio_co_wake() returns.)
+         */
+        aio_co_wake(data->co);
     }
-    replay_bh_schedule_oneshot_event(data->ctx, nvme_rw_cb_bh, data);
 }
 
 static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
@@ -1213,7 +1223,7 @@ static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
         .cdw12 = cpu_to_le32(cdw12),
     };
     NVMeCoData data = {
-        .ctx = bdrv_get_aio_context(bs),
+        .co = qemu_coroutine_self(),
         .ret = -EINPROGRESS,
     };
 
@@ -1230,9 +1240,7 @@ static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
         return r;
     }
     nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-
-    data.co = qemu_coroutine_self();
-    while (data.ret == -EINPROGRESS) {
+    if (!data.skip_yield) {
         qemu_coroutine_yield();
     }
 
@@ -1328,7 +1336,7 @@ static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
         .nsid = cpu_to_le32(s->nsid),
     };
     NVMeCoData data = {
-        .ctx = bdrv_get_aio_context(bs),
+        .co = qemu_coroutine_self(),
         .ret = -EINPROGRESS,
     };
 
@@ -1336,9 +1344,7 @@ static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
     req = nvme_get_free_req(ioq);
     assert(req);
     nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-
-    data.co = qemu_coroutine_self();
-    if (data.ret == -EINPROGRESS) {
+    if (!data.skip_yield) {
         qemu_coroutine_yield();
     }
 
@@ -1379,7 +1385,7 @@ static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
     };
 
     NVMeCoData data = {
-        .ctx = bdrv_get_aio_context(bs),
+        .co = qemu_coroutine_self(),
         .ret = -EINPROGRESS,
     };
 
@@ -1399,9 +1405,7 @@ static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
     assert(req);
 
     nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-
-    data.co = qemu_coroutine_self();
-    while (data.ret == -EINPROGRESS) {
+    if (!data.skip_yield) {
         qemu_coroutine_yield();
     }
 
@@ -1429,7 +1433,7 @@ static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
     };
 
     NVMeCoData data = {
-        .ctx = bdrv_get_aio_context(bs),
+        .co = qemu_coroutine_self(),
         .ret = -EINPROGRESS,
     };
 
@@ -1474,9 +1478,7 @@ static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
     trace_nvme_dsm(s, offset, bytes);
 
     nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-
-    data.co = qemu_coroutine_self();
-    while (data.ret == -EINPROGRESS) {
+    if (!data.skip_yield) {
         qemu_coroutine_yield();
     }
 
-- 
2.47.3
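[Editorial note, not part of the patch] The control flow the patch establishes — the completion callback sets a skip_yield flag when it fires inside the request coroutine, and otherwise wakes the coroutine in its home context — can be modelled outside QEMU. Below is a minimal Python sketch, an assumption-laden illustration rather than QEMU code: NVMeCoData, nvme_rw_cb and request() mirror names from the patch, while qemu_coroutine_self() and aio_co_wake() are simulated with an identity token and a wake list, and the coroutine itself with a plain generator.

```python
class NVMeCoData:
    """Models the patched NVMeCoData: a skip_yield flag instead of a ctx."""
    def __init__(self):
        self.skip_yield = False
        self.ret = -1            # stands in for -EINPROGRESS

current_coroutine = None         # stands in for qemu_coroutine_self()
pending_wakes = []               # wakes that aio_co_wake() would schedule

def nvme_rw_cb(data, owner, ret):
    """Completion callback; 'owner' is the coroutine that submitted."""
    data.ret = ret
    if owner is current_coroutine:
        # Fast path: we ran inside the request coroutine itself, so it
        # can simply skip the upcoming yield -- no wake-up is needed.
        data.skip_yield = True
    else:
        # Slow path: wake the coroutine in its home context.
        pending_wakes.append(owner)

def request(submit):
    """Models a request coroutine: submit, then yield unless skipped."""
    global current_coroutine
    data = NVMeCoData()
    me = object()                # identity token for this "coroutine"
    current_coroutine = me
    submit(data, me)             # may complete synchronously
    current_coroutine = None
    if not data.skip_yield:      # the patch's `if`, formerly a `while`
        yield                    # wait to be woken exactly once
    return data.ret
```

In the synchronous case the generator finishes without ever yielding (the fast path); in the asynchronous case it yields once and is resumed by the wake, which is why the old `while (data.ret == -EINPROGRESS)` polling loop can become a plain `if`.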