When an NVMe interrupt is triggered, the current implementation first
handles the CQE and only then updates the CQ head. This opens a timing
window in which the lower layer can see the completion queue as full,
which in turn leads to an I/O hang as described below:
1. NVMe interrupt handling flow: nvme_handle_cqe -> nvme_pci_complete_rq
   -> ... -> blk_mq_put_tag() (when the request is not added to a
   completion batch), which notifies the NVMe driver that it can continue
   issuing commands.
2. The NVMe driver issues a new command while the CQ head has not yet
   been updated.
3. The lower layer finishes processing the new command immediately and
   attempts to place a completion into the CQ. It finds the CQ full and
   discards the command (see the sketch after this list).
4. The NVMe interrupt flow from step 1 subsequently updates the CQ head,
   which is too late.
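For illustration, the "CQ is full" check in step 3 reduces to circular-
buffer arithmetic on the queue indices. A minimal sketch, using a
hypothetical helper (this is not SPDK or kernel code):

	/*
	 * Illustration only: the device may write a completion into the
	 * CQ only if advancing its tail would not collide with the
	 * host's head.  While the host defers the head update, slots it
	 * has already consumed still look occupied to the device.
	 */
	static bool cq_is_full(unsigned int tail, unsigned int head,
			       unsigned int depth)
	{
		return (tail + 1) % depth == head;
	}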
The sequence diagram is as follows:
driver                  irq                      underlying(virtual/hardware)
------                  ------                   ------
1. Wait for tag
                        1. Read CQE              CQ is full, wait for head update
                        2. Handle CQE
                        3. Wake up tag
2. Get tag                 (blk_mq_put_tag)
3. Issue new cmd
                                                 1. Process cmd
                                                 2. Try write to CQ
                                                 3. CQ is full, discard cmd!
                        4. Update CQ head
                           (LATE!)
4. Cmd timeout
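Mapping the irq column onto the pre-patch code (reconstructed from the
lines removed by the diff below):

	/*
	 * Pre-patch flow: each CQE is handled -- which may release a tag
	 * and wake a waiting submitter -- inside the loop, but the device
	 * only learns the new head when the doorbell is rung after the
	 * loop.  That gap is the window shown in the diagram.
	 */
	while (nvme_cqe_pending(nvmeq)) {
		found = true;
		dma_rmb();
		nvme_handle_cqe(nvmeq, iob, nvmeq->cq_head); /* irq 2-3 */
		nvme_update_cq_head(nvmeq);	/* local head only */
	}
	if (found)
		nvme_ring_cq_doorbell(nvmeq);	/* irq 4: LATE! */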
In this scenario, the command never completes, and the kernel eventually
reports a hung task error:
[ 7128.239445] INFO: task kworker/u128:1:912 blocked for more than 122 seconds.
[ 7128.241536] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7128.242675] task:kworker/u128:1 state:D stack: 0 pid: 912 ppid: 2 flags:0x00004000
[ 7128.243862] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[ 7128.244736] Call Trace:
[ 7128.245283] __schedule+0x2ea/0x640
[ 7128.245951] schedule+0x46/0xb0
[ 7128.246576] schedule_timeout+0x1a7/0x2b0
[ 7128.247364] ? __next_timer_interrupt+0x110/0x110
[ 7128.248281] io_schedule_timeout+0x4c/0x80
[ 7128.249129] wait_for_common_io.constprop.0+0x80/0xf0
[ 7128.250083] __nvme_disable_io_queues+0x14d/0x1a0 [nvme]
[ 7128.251084] ? nvme_del_queue_end+0x20/0x20 [nvme]
[ 7128.252016] nvme_dev_disable+0x20b/0x210 [nvme]
[ 7128.252908] nvme_remove+0x6d/0x1b0 [nvme]
[ 7128.253709] pci_device_remove+0x38/0xa0
[ 7128.254423] __device_release_driver+0x172/0x260
[ 7128.255189] device_release_driver+0x24/0x30
[ 7128.255937] pci_stop_bus_device+0x6c/0x90
[ 7128.256659] pci_stop_and_remove_bus_device+0xe/0x20
[ 7128.257594] disable_slot+0x49/0x90
[ 7128.258336] acpiphp_disable_and_eject_slot+0x15/0x90
[ 7128.259302] hotplug_event+0xc8/0x220
[ 7128.260080] ? acpiphp_post_dock_fixup+0xc0/0xc0
[ 7128.260993] acpiphp_hotplug_notify+0x20/0x40
[ 7128.261835] acpi_device_hotplug+0x8c/0x1d0
[ 7128.262702] acpi_hotplug_work_fn+0x3d/0x50
[ 7128.263532] process_one_work+0x1ad/0x350
[ 7128.264329] worker_thread+0x49/0x310
[ 7128.265114] ? rescuer_thread+0x370/0x370
[ 7128.265945] kthread+0xfb/0x140
[ 7128.266650] ? kthread_park+0x90/0x90
[ 7128.267435] ret_from_fork+0x1f/0x30
How to reproduce:
In a cloud-native environment, SPDK vfio-user emulates an NVMe disk for
VMs. The issue can be reliably reproduced by repeatedly attaching and
detaching one disk via a host shell script that calls virsh
attach-interface and virsh detach-interface. Once the issue is
reproduced, SPDK logs the following messages:
vfio_user.c:1856:post_completion: ERROR: cqid:0 full
(tail=6, head=7, inflight io: 2)
vfio_user.c:1116:fail_ctrlr: ERROR: failing controller
Correspondingly, the guest kernel reports the hung task with the call
stack shown above. Note that tail=6 with head=7 means the next CQ slot
the device would write is still owned by the host, i.e. the queue is
full from the device's point of view.
Fix: update the CQ head (and ring the doorbell) first, and only then
process the CQEs, clearing the tag bitmap to indicate that free slots
are available for further command submission.
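Condensed from the diff below, the reordered poll loop first advances
the head over all pending CQEs and rings the doorbell, so the device
regains CQ space before any tag can be released; only then are the
CQEs handled:

	start = nvmeq->cq_head;
	while (nvme_cqe_pending(nvmeq)) {
		found = true;
		dma_rmb();
		nvme_update_cq_head(nvmeq);
	}
	end = nvmeq->cq_head;

	if (found) {
		nvme_ring_cq_doorbell(nvmeq);	/* device sees new head */
		while (start != end) {
			nvme_handle_cqe(nvmeq, iob, start);
			if (++start == nvmeq->q_depth)
				start = 0;	/* wrap around the ring */
		}
	}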
Fixes: 324b494c2862 ("nvme-pci: Remove two-pass completions")
Signed-off-by: Shouxin Sun <sunshx@chinatelecom.cn>
Signed-off-by: Junnan Zhang <zhangjn11@chinatelecom.cn>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
Signed-off-by: Zhaolong Zhang <zhangzl68@chinatelecom.cn>
Signed-off-by: Yaxuan Liu <liuyx92@chinatelecom.cn>
Signed-off-by: Junnan Zhang <zhangjn_dev@163.com>
---
drivers/nvme/host/pci.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d86f2565a92c..904f45761cd2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1438,7 +1438,9 @@ static inline bool nvme_poll_cq(struct nvme_queue *nvmeq,
 				   struct io_comp_batch *iob)
 {
 	bool found = false;
+	u16 start, end;
 
+	start = nvmeq->cq_head;
 	while (nvme_cqe_pending(nvmeq)) {
 		found = true;
 		/*
@@ -1446,12 +1448,19 @@ static inline bool nvme_poll_cq(struct nvme_queue *nvmeq,
 		 * the cqe requires a full read memory barrier
 		 */
 		dma_rmb();
-		nvme_handle_cqe(nvmeq, iob, nvmeq->cq_head);
 		nvme_update_cq_head(nvmeq);
 	}
+	end = nvmeq->cq_head;
 
-	if (found)
+	if (found) {
 		nvme_ring_cq_doorbell(nvmeq);
+		while (start != end) {
+			nvme_handle_cqe(nvmeq, iob, start);
+			if (++start == nvmeq->q_depth)
+				start = 0;
+		}
+	}
+
 	return found;
 }
--
2.43.0