From: Junnan Zhang
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg
Cc: Junnan Zhang, Shouxin Sun, Junnan Zhang, Qiliang Yuan,
    Zhaolong Zhang, Yaxuan Liu, linux-nvme@lists.infradead.org,
    linux-kernel@vger.kernel.org
Subject: [PATCH] nvme-pci: fix potential I/O hang when CQ is full
Date: Mon, 9 Feb 2026 20:10:19 +0800
Message-ID: <20260209121020.119853-1-zhangjn_dev@163.com>
X-Mailer: git-send-email 2.43.0

When an NVMe interrupt is triggered, the current implementation first
handles the CQE and then updates the CQ head. This opens a timing
window in which the lower layer can see the CQ as full, which in turn
leads to an I/O hang, as described below:

1. NVMe interrupt handling flow: nvme_handle_cqe ->
   nvme_pci_complete_rq -> ... -> blk_mq_put_tag (when the request is
   not added to a completion batch), which frees the tag and lets the
   NVMe driver continue issuing commands.
2. The NVMe driver issues a new command while the CQ head has not yet
   been updated.
3. The underlying layer finishes processing the new command
   immediately and attempts to place it into the completion queue. It
   then detects that the CQ is full and discards the command.
4. The NVMe interrupt flow from step 1 subsequently updates the CQ
   head.

The sequence diagram is as follows:

driver                  irq                         underlying(virtual/hardware)
------                  ------                      ------
1. Wait for tag         1. Read CQE                 CQ is full,
                                                    wait for head update
                        2. Handle CQE
                        3. Wake up tag
                           (blk_mq_put_tag)
2. Get tag
3. Issue new cmd
                                                    1. Process cmd
                                                    2. Try write to CQ
                                                    3. CQ is full, discard cmd!
                        4. Update CQ head (LATE!)
4. Cmd timeout

In this scenario, the NVMe driver observes that the command never
completes and reports a hung task error:

[ 7128.239445] INFO: task kworker/u128:1:912 blocked for more than 122 seconds.
[ 7128.241536] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7128.242675] task:kworker/u128:1  state:D stack:    0 pid:  912 ppid:     2 flags:0x00004000
[ 7128.243862] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[ 7128.244736] Call Trace:
[ 7128.245283]  __schedule+0x2ea/0x640
[ 7128.245951]  schedule+0x46/0xb0
[ 7128.246576]  schedule_timeout+0x1a7/0x2b0
[ 7128.247364]  ? __next_timer_interrupt+0x110/0x110
[ 7128.248281]  io_schedule_timeout+0x4c/0x80
[ 7128.249129]  wait_for_common_io.constprop.0+0x80/0xf0
[ 7128.250083]  __nvme_disable_io_queues+0x14d/0x1a0 [nvme]
[ 7128.251084]  ? nvme_del_queue_end+0x20/0x20 [nvme]
[ 7128.252016]  nvme_dev_disable+0x20b/0x210 [nvme]
[ 7128.252908]  nvme_remove+0x6d/0x1b0 [nvme]
[ 7128.253709]  pci_device_remove+0x38/0xa0
[ 7128.254423]  __device_release_driver+0x172/0x260
[ 7128.255189]  device_release_driver+0x24/0x30
[ 7128.255937]  pci_stop_bus_device+0x6c/0x90
[ 7128.256659]  pci_stop_and_remove_bus_device+0xe/0x20
[ 7128.257594]  disable_slot+0x49/0x90
[ 7128.258336]  acpiphp_disable_and_eject_slot+0x15/0x90
[ 7128.259302]  hotplug_event+0xc8/0x220
[ 7128.260080]  ? acpiphp_post_dock_fixup+0xc0/0xc0
[ 7128.260993]  acpiphp_hotplug_notify+0x20/0x40
[ 7128.261835]  acpi_device_hotplug+0x8c/0x1d0
[ 7128.262702]  acpi_hotplug_work_fn+0x3d/0x50
[ 7128.263532]  process_one_work+0x1ad/0x350
[ 7128.264329]  worker_thread+0x49/0x310
[ 7128.265114]  ? rescuer_thread+0x370/0x370
[ 7128.265945]  kthread+0xfb/0x140
[ 7128.266650]  ? kthread_park+0x90/0x90
[ 7128.267435]  ret_from_fork+0x1f/0x30

Reproducing method:
In a cloud-native environment, SPDK vfio-user emulates an NVMe disk
for VMs. The issue can be reliably reproduced by repeatedly attaching
and detaching one disk via a host shell script that calls
"virsh attach-interface" and "virsh detach-interface".

Once the issue is reproduced, SPDK logs the following messages:

vfio_user.c:1856:post_completion: ERROR: cqid:0 full (tail=6, head=7,
inflight io: 2)
vfio_user.c:1116:fail_ctrlr: ERROR: failing controller

Correspondingly, the guest kernel reports a hung task error with the
call stack shown above.

Fix: update the CQ head first and ring the doorbell, then process the
CQEs; completing the requests clears the tag bitmap to indicate that
free slots are available for further command submission.
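In outline, nvme_poll_cq() becomes a two-pass loop. The sketch below
is a simplified rendering of the diff that follows (declarations
merged and the original comments replaced), not a drop-in patch:

  static inline bool nvme_poll_cq(struct nvme_queue *nvmeq,
                                  struct io_comp_batch *iob)
  {
          u16 start = nvmeq->cq_head, end;
          bool found = false;

          /* Pass 1: consume pending CQEs, advancing only the head. */
          while (nvme_cqe_pending(nvmeq)) {
                  found = true;
                  dma_rmb();      /* order phase check vs. CQE reads */
                  nvme_update_cq_head(nvmeq);
          }
          end = nvmeq->cq_head;

          if (found) {
                  /* Tell the device the CQ slots are free... */
                  nvme_ring_cq_doorbell(nvmeq);
                  /* ...then complete requests, releasing their tags. */
                  while (start != end) {
                          nvme_handle_cqe(nvmeq, iob, start);
                          if (++start == nvmeq->q_depth)
                                  start = 0;
                  }
          }
          return found;
  }

With this ordering, a tag freed in pass 2 can only be reused for a
submission the device sees after the head update, so the CQ-full
discard shown in the sequence diagram above can no longer occur.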
Fixes: 324b494c2862 ("nvme-pci: Remove two-pass completions")
Signed-off-by: Shouxin Sun
Signed-off-by: Junnan Zhang
Signed-off-by: Qiliang Yuan
Signed-off-by: Zhaolong Zhang
Signed-off-by: Yaxuan Liu
Signed-off-by: Junnan Zhang
---
 drivers/nvme/host/pci.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d86f2565a92c..904f45761cd2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1438,7 +1438,9 @@ static inline bool nvme_poll_cq(struct nvme_queue *nvmeq,
 		struct io_comp_batch *iob)
 {
 	bool found = false;
+	u16 start, end;
 
+	start = nvmeq->cq_head;
 	while (nvme_cqe_pending(nvmeq)) {
 		found = true;
 		/*
@@ -1446,12 +1448,19 @@ static inline bool nvme_poll_cq(struct nvme_queue *nvmeq,
 		 * the cqe requires a full read memory barrier
 		 */
 		dma_rmb();
-		nvme_handle_cqe(nvmeq, iob, nvmeq->cq_head);
 		nvme_update_cq_head(nvmeq);
 	}
+	end = nvmeq->cq_head;
 
-	if (found)
+	if (found) {
 		nvme_ring_cq_doorbell(nvmeq);
+		while (start != end) {
+			nvme_handle_cqe(nvmeq, iob, start);
+			if (++start == nvmeq->q_depth)
+				start = 0;
+		}
+	}
+
 	return found;
 }
 
-- 
2.43.0