From: Stefan Hajnoczi
To: qemu-devel@nongnu.org
Cc: "Michael S. Tsirkin", Kevin Wolf, Richard Henderson, Stefan Hajnoczi,
 Philippe Mathieu-Daudé, Fam Zheng, Paolo Bonzini, Hanna Reitz,
 qemu-block@nongnu.org, qemu-stable@nongnu.org, Lukáš Doktor
Subject: [PULL for-10.2 1/1] Revert "nvme: Fix coroutine waking"
Date: Mon, 15 Dec 2025 09:57:03 -0500
Message-ID: <20251215145703.31841-2-stefanha@redhat.com>
In-Reply-To: <20251215145703.31841-1-stefanha@redhat.com>
References: <20251215145703.31841-1-stefanha@redhat.com>

From: Hanna Czenczek

This reverts commit 0f142cbd919fcb6cea7aa176f7e4939925806dd9.

Said commit changed the replay_bh_schedule_oneshot_event() in
nvme_rw_cb() to aio_co_wake(), allowing the request coroutine to be
entered directly (instead of only being scheduled for later execution).
This can cause the device to become stalled like so:

It is possible that after completion the request coroutine goes on to
submit another request without yielding, e.g. a flush after a write to
emulate FUA. This will likely cause a nested nvme_process_completion()
call because nvme_rw_cb() itself is called from there.

(After submitting a request, we invoke nvme_process_completion() through
defer_call(); but the fact that nvme_process_completion() ran in the
first place indicates that we are not in a call-deferring section, so
defer_call() will call nvme_process_completion() immediately.)

If this inner nvme_process_completion() loop then processes any
completions, it will write the final completion queue (CQ) head index to
the CQ head doorbell, and subsequently execution will return to the
outer nvme_process_completion() loop. Even if this loop now finds no
further completions, it still processed at least one completion before,
or it would not have called the nvme_rw_cb() which led to nesting.
Therefore, it will now write the exact same CQ head index value to the
doorbell, which effectively is an unrecoverable error[1].

Therefore, nesting of nvme_process_completion() does not work at this
point. Reverting said commit removes the nesting (by scheduling the
request coroutine instead of entering it immediately), and so fixes the
stall.

On the downside, reverting said commit breaks multiqueue for nvme, but
better to have single-queue working than neither. For 11.0, we will
have a solution that makes both work.

A side note: There is a comment in nvme_process_completion() above
qemu_bh_schedule() that claims nesting works, as long as it is done
through the completion_bh.
I am quite sure that is not true, for two reasons:
- The problem described above, which is even worse when going through
  nvme_process_completion_bh() because that function unconditionally
  writes to the CQ head doorbell,
- nvme_process_completion_bh() never takes q->lock, so
  nvme_process_completion() unlocking it will likely abort.

Given the lack of reports of such aborts, I believe that completion_bh
simply is unused in practice.

[1] See the NVMe Base Specification revision 2.3, page 180, figure 152:
    “Invalid Doorbell Write Value: A host attempted to write an invalid
    doorbell value. Some possible causes of this error are: [...] the
    value written is the same as the previously written doorbell value.”

    To even be notified of this error, we would need to send an
    Asynchronous Event Request to the admin queue (p. 178ff), which we
    don’t do, and then to handle it, we would need to delete and
    recreate the queue (p. 88, section 3.3.1.2 Queue Usage).

Cc: qemu-stable@nongnu.org
Reported-by: Lukáš Doktor
Tested-by: Lukáš Doktor
Signed-off-by: Hanna Czenczek
Message-id: 20251215141540.88915-1-hreitz@redhat.com
Signed-off-by: Stefan Hajnoczi
---
 block/nvme.c | 56 +++++++++++++++++++++++++---------------------------
 1 file changed, 27 insertions(+), 29 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 919e14cef9..c3d3b99d1f 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -1200,36 +1200,26 @@ fail:
 
 typedef struct {
     Coroutine *co;
-    bool skip_yield;
     int ret;
+    AioContext *ctx;
 } NVMeCoData;
 
+static void nvme_rw_cb_bh(void *opaque)
+{
+    NVMeCoData *data = opaque;
+    qemu_coroutine_enter(data->co);
+}
+
 /* Put into NVMeRequest.cb, so runs in the BDS's main AioContext */
 static void nvme_rw_cb(void *opaque, int ret)
 {
     NVMeCoData *data = opaque;
-
     data->ret = ret;
-
-    if (data->co == qemu_coroutine_self()) {
-        /*
-         * Fast path: We are inside of the request coroutine (through
-         * nvme_submit_command, nvme_deferred_fn, nvme_process_completion).
-         * We can set data->skip_yield here to keep the coroutine from
-         * yielding, and then we don't need to schedule a BH to wake it.
-         */
-        data->skip_yield = true;
-    } else {
-        /*
-         * Safe to call: The case where we run in the request coroutine is
-         * handled above, so we must be independent of it; and without
-         * skip_yield set, the coroutine will yield.
-         * No need to release NVMeQueuePair.lock (we are called without it
-         * held). (Note: If we enter the coroutine here, @data will
-         * probably be dangling once aio_co_wake() returns.)
-         */
-        aio_co_wake(data->co);
+    if (!data->co) {
+        /* The rw coroutine hasn't yielded, don't try to enter. */
+        return;
     }
+    replay_bh_schedule_oneshot_event(data->ctx, nvme_rw_cb_bh, data);
 }
 
 static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
@@ -1253,7 +1243,7 @@ static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
         .cdw12 = cpu_to_le32(cdw12),
     };
     NVMeCoData data = {
-        .co = qemu_coroutine_self(),
+        .ctx = bdrv_get_aio_context(bs),
         .ret = -EINPROGRESS,
     };
 
@@ -1270,7 +1260,9 @@ static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
         return r;
     }
     nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-    if (!data.skip_yield) {
+
+    data.co = qemu_coroutine_self();
+    while (data.ret == -EINPROGRESS) {
         qemu_coroutine_yield();
     }
 
@@ -1366,7 +1358,7 @@ static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
         .nsid = cpu_to_le32(s->nsid),
     };
     NVMeCoData data = {
-        .co = qemu_coroutine_self(),
+        .ctx = bdrv_get_aio_context(bs),
         .ret = -EINPROGRESS,
     };
 
@@ -1374,7 +1366,9 @@ static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
     req = nvme_get_free_req(ioq);
     assert(req);
     nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-    if (!data.skip_yield) {
+
+    data.co = qemu_coroutine_self();
+    if (data.ret == -EINPROGRESS) {
         qemu_coroutine_yield();
     }
 
@@ -1415,7 +1409,7 @@ static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
     };
 
     NVMeCoData data = {
-        .co = qemu_coroutine_self(),
+        .ctx = bdrv_get_aio_context(bs),
         .ret = -EINPROGRESS,
     };
 
@@ -1435,7 +1429,9 @@ static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
     assert(req);
 
     nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-    if (!data.skip_yield) {
+
+    data.co = qemu_coroutine_self();
+    while (data.ret == -EINPROGRESS) {
         qemu_coroutine_yield();
     }
 
@@ -1463,7 +1459,7 @@ static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
     };
 
     NVMeCoData data = {
-        .co = qemu_coroutine_self(),
+        .ctx = bdrv_get_aio_context(bs),
         .ret = -EINPROGRESS,
     };
 
@@ -1508,7 +1504,9 @@ static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
     trace_nvme_dsm(s, offset, bytes);
 
     nvme_submit_command(ioq, req, &cmd, nvme_rw_cb, &data);
-    if (!data.skip_yield) {
+
+    data.co = qemu_coroutine_self();
+    while (data.ret == -EINPROGRESS) {
         qemu_coroutine_yield();
     }
 
-- 
2.52.0
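
Illustration only, not part of the patch: the stall described in the
commit message can be sketched with a toy model. Everything below is
hypothetical (ToyQueue, toy_process_completion(), toy_scenario() and so
on are invented names, not QEMU code); it only assumes what the commit
message states, namely that the completion loop writes the CQ head
doorbell once it has made progress, and that writing the same value
twice is an error the driver cannot recover from.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_CQ_SIZE 8

/* Toy stand-in for one queue pair; all names here are hypothetical. */
typedef struct {
    int pending;        /* completions posted by the device, not yet consumed */
    int head;           /* driver-side CQ head index */
    int last_doorbell;  /* last value written to the doorbell, -1 if none */
    bool stalled;       /* set when the same value is written twice */
    bool wake_directly; /* true: enter the coroutine from the callback */
    int deferred_wakes; /* wake-ups queued for later (the BH-like path) */
    int followups;      /* coroutines that submit another request on wake */
} ToyQueue;

static void toy_process_completion(ToyQueue *q);

/* Models the doorbell write; a repeated value is treated as fatal. */
static void toy_write_doorbell(ToyQueue *q)
{
    assert(q->head >= 0 && q->head < TOY_CQ_SIZE);
    if (q->head == q->last_doorbell) {
        q->stalled = true;
    }
    q->last_doorbell = q->head;
}

/* Request coroutine after wake-up: may submit a follow-up (e.g. a flush). */
static void toy_coroutine_resume(ToyQueue *q)
{
    if (q->followups > 0) {
        q->followups--;
        /* Submitting runs completion processing right away (no deferral). */
        toy_process_completion(q);
    }
}

static void toy_process_completion(ToyQueue *q)
{
    bool progress = false;

    while (q->pending > 0) {
        q->pending--;
        q->head = (q->head + 1) % TOY_CQ_SIZE;
        progress = true;
        if (q->wake_directly) {
            toy_coroutine_resume(q);   /* may nest into this function */
        } else {
            q->deferred_wakes++;       /* wake later, outside this loop */
        }
    }
    if (progress) {
        toy_write_doorbell(q);
    }
}

/* Runs the wake-ups that were queued instead of entered directly. */
static void toy_run_deferred(ToyQueue *q)
{
    while (q->deferred_wakes > 0) {
        q->deferred_wakes--;
        toy_coroutine_resume(q);
    }
}

/* Two completions are pending; the first coroutine submits a follow-up. */
static bool toy_scenario(bool wake_directly)
{
    ToyQueue q = {
        .pending = 2,
        .last_doorbell = -1,
        .wake_directly = wake_directly,
        .followups = 1,
    };

    toy_process_completion(&q);
    toy_run_deferred(&q);
    return q.stalled;
}
```

In this model, toy_scenario(true) reports a stall: the nested call
consumes the second completion and writes head index 2, then the outer
loop writes 2 again. toy_scenario(false) does not, because both
completions are consumed by a single loop and the wake-ups only run
after its one doorbell write.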