nvme-pci: detect I/O queue depth changes after reset

[RFC PATCH 0/1] nvme-pci: detect I/O queue depth changes after reset

Posted by guzebing 1 week, 5 days ago

From: Guzebing <guzebing@bytedance.com>

We have hit a case where an NVMe firmware activation made the controller
report a different CAP.MQES value after the following controller reset.
This was seen in production on a Memblaze PBlaze5 510 device:

  model:        P5510DS0384T00
  old firmware: 224005A0, CAP.MQES-derived queue depth 1024
  new firmware: 224005F0, CAP.MQES-derived queue depth 128
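
For reference, the depths above come from the 16-bit, 0's based MQES
field in the low bits of CAP.  A minimal user-space sketch of that
derivation follows.  cap_mqes(), mqes_to_depth() and the host_limit
parameter are illustrative names, not the driver's code, and it is an
assumption here that the host clamps the result to its own limit.

```c
#include <stdint.h>

/* CAP.MQES occupies bits 15:0 of the 64-bit CAP register (0's based). */
static inline uint32_t cap_mqes(uint64_t cap)
{
	return (uint32_t)(cap & 0xffff);
}

/*
 * Depth as a host could derive it: MQES + 1, clamped to a host-side
 * limit.  "host_limit" is a hypothetical parameter for this sketch.
 */
static inline uint32_t mqes_to_depth(uint64_t cap, uint32_t host_limit)
{
	uint32_t depth = cap_mqes(cap) + 1;

	return depth < host_limit ? depth : host_limit;
}
```

With a host limit of 1024, a hypothetical CAP whose MQES field is 0x7f
yields a depth of 128, and any MQES of 0x3ff or above yields 1024.  The
raw CAP values of the device above were not recorded here.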

One way to hit this path is to activate the new firmware and then reset
the controller:

  nvme fw-download /dev/nvmeX -f fw_bin.tar
  nvme fw-activate /dev/nvmeX -s 2 -a 1
  nvme reset /dev/nvmeX

When the I/O queue depth derived from CAP.MQES became smaller after that
reset, the driver failed to recover any usable I/O queues.  In our
production kernel this was logged as:

  nvme nvme0: IO queues lost

The namespaces were then removed, so the corresponding block device
disappeared.  The opposite direction is less visible: if the CAP-derived
depth becomes larger, reset can complete without an error and the block
device can remain usable, but the live queue depth state is not updated
consistently.

The reason is that reset updates only part of the live queue depth state.
The nvme-pci reset path disables the controller, re-enables it, re-reads
CAP, and recalculates:

  dev->q_depth
  ctrl->sqsize

from the new CAP.MQES value.  Later, however, nvme_create_io_queues()
reuses the existing struct nvme_queue entries.  nvme_alloc_queue()
returns immediately when the queue already exists, so the old values
remain in:

  nvmeq->q_depth
  nvmeq->cq_dma_addr
  nvmeq->sq_dma_addr

The Create CQ/SQ commands then request queues with nvmeq->q_depth entries
(encoded in the command as nvmeq->q_depth - 1) at the old SQ/CQ DMA
addresses; the newly computed dev->q_depth is not used.  The blk-mq side
also keeps the old depth: the reset path updates the number of hardware
queues through blk_mq_update_nr_hw_queues(), but it does not resize the
existing tag set or update its queue_depth.
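
The stale state can be reproduced with a small user-space model.  This
is only a sketch: the toy_* types and functions are hypothetical
stand-ins for the driver structures, and only the depth fields are
modeled (no DMA, no blk-mq).

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_queue {
	bool allocated;
	uint32_t q_depth;	/* stands in for nvmeq->q_depth */
};

struct toy_dev {
	uint32_t q_depth;	/* stands in for dev->q_depth */
	uint32_t sqsize;	/* stands in for ctrl->sqsize (0's based) */
	struct toy_queue ioq;
};

/* Reset-time recalculation from the new CAP-derived depth. */
static void toy_reset_recalc(struct toy_dev *dev, uint32_t new_depth)
{
	dev->q_depth = new_depth;
	dev->sqsize = new_depth - 1;
}

/* Mirrors the early return: an existing queue is reused unchanged. */
static void toy_alloc_queue(struct toy_dev *dev)
{
	if (dev->ioq.allocated)
		return;
	dev->ioq.allocated = true;
	dev->ioq.q_depth = dev->q_depth;
}

/* 0's based queue size that would be placed in the create command. */
static uint32_t toy_create_qsize(const struct toy_dev *dev)
{
	return dev->ioq.q_depth - 1;
}
```

Running toy_reset_recalc(&dev, 1024) and toy_alloc_queue(&dev), then
toy_reset_recalc(&dev, 128) and toy_alloc_queue(&dev) again, leaves
dev.q_depth at 128 while toy_create_qsize(&dev) still returns 1023.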

This explains the observed shrink failure and the expected grow case:

  * If the CAP-derived depth becomes smaller, the driver may try to create
    an I/O queue with the old larger nvmeq->q_depth.  A controller that
    now enforces the smaller CAP.MQES-derived limit can reject the Create
    CQ/SQ command.  If no I/O queues are recovered, nvme-pci removes the
    namespaces, so the block device disappears.

  * If the CAP-derived depth becomes larger, the old nvmeq->q_depth is
    still within the new controller limit.  Queue creation can therefore
    succeed and the device can remain usable, but the live state is
    inconsistent: dev->q_depth and ctrl->sqsize reflect the new capability
    while nvmeq queue resources and the blk-mq tag set still reflect the
    old depth.  The larger depth is not used until the controller is
    removed and probed again.

There are two broad ways to address this.

The direct fix would be to make reset recovery handle a changed live queue
depth.  That would require updating or rebuilding the nvmeq depth and
SQ/CQ DMA allocations, and resizing the block-layer depth state
consistently, including the blk-mq tag set, scheduler tags when present,
and queue->nr_requests.  That is broader than an nvme-pci-only change
and needs block layer review.

This RFC instead takes the smaller approach of detecting the reset-time
CAP.MQES change and making it visible.  If the live I/O queue depth
shrinks, reset recovery is failed before recreating I/O queues.  If it
grows, the driver warns and continues with the existing queue resources.
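
The policy can be sketched in user-space C as below.  This is not the
patch itself: the function names, the messages and the -1 return value
are placeholders chosen for illustration.

```c
#include <stdint.h>
#include <stdio.h>

enum depth_change {
	DEPTH_UNCHANGED,
	DEPTH_GREW,	/* warn, keep the existing queue resources */
	DEPTH_SHRANK,	/* fail reset before recreating I/O queues */
};

static enum depth_change classify_depth_change(uint32_t live_depth,
					       uint32_t new_depth)
{
	if (new_depth < live_depth)
		return DEPTH_SHRANK;
	if (new_depth > live_depth)
		return DEPTH_GREW;
	return DEPTH_UNCHANGED;
}

/* Returns 0 to continue reset recovery, a negative value to fail it. */
static int check_depth_after_reset(uint32_t live_depth, uint32_t new_depth)
{
	switch (classify_depth_change(live_depth, new_depth)) {
	case DEPTH_SHRANK:
		fprintf(stderr, "queue depth shrank %u -> %u, failing reset\n",
			live_depth, new_depth);
		return -1;	/* placeholder error value */
	case DEPTH_GREW:
		fprintf(stderr, "queue depth grew %u -> %u, keeping %u\n",
			live_depth, new_depth, live_depth);
		return 0;
	default:
		return 0;
	}
}
```

Here live_depth is the depth the existing queues were created with and
new_depth is the value recomputed from CAP.MQES after the reset.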

Feedback would be appreciated on whether this detection is useful on its
own, or whether nvme-pci should instead support full live queue-depth
resizing together with the required blk-mq changes.

Guzebing (1):
  nvme-pci: detect I/O queue depth changes after reset

 drivers/nvme/host/pci.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

-- 
2.20.1

Re: [RFC PATCH 0/1] nvme-pci: detect I/O queue depth changes after reset

Posted by Christoph Hellwig 1 week, 5 days ago

On Wed, May 27, 2026 at 03:53:19PM +0800, guzebing wrote:
> This RFC instead takes the smaller approach of detecting the reset-time
> CAP.MQES change and making it visible.  If the live I/O queue depth
> shrinks, reset recovery is failed before recreating I/O queues.  If it
> grows, the driver warns and continues with the existing queue resources.

Unlike the other version this at least sounds doable without creating
a complete mess.  So if we can live with this version that'd make me
much happier.

Re: [RFC PATCH 0/1] nvme-pci: detect I/O queue depth changes after reset

Posted by Keith Busch 1 week, 4 days ago

On Wed, May 27, 2026 at 03:19:51PM +0200, Christoph Hellwig wrote:
> On Wed, May 27, 2026 at 03:53:19PM +0800, guzebing wrote:
> > This RFC instead takes the smaller approach of detecting the reset-time
> > CAP.MQES change and making it visible.  If the live I/O queue depth
> > shrinks, reset recovery is failed before recreating I/O queues.  If it
> > grows, the driver warns and continues with the existing queue resources.
> 
> Unlike the other version this at least sounds doable without creating
> a complete mess.  So if we can live with this version that'd make me
> much happier.

Abandoning the device here sounds pretty harsh when blk-mq already
supports user space decreasing the q-depth on a live queue. Why can't
the driver do the same thing?

Re: [RFC PATCH 0/1] nvme-pci: detect I/O queue depth changes after reset

Posted by Christoph Hellwig 1 week, 4 days ago

On Wed, May 27, 2026 at 09:03:02PM -0600, Keith Busch wrote:
> On Wed, May 27, 2026 at 03:19:51PM +0200, Christoph Hellwig wrote:
> > On Wed, May 27, 2026 at 03:53:19PM +0800, guzebing wrote:
> > > This RFC instead takes the smaller approach of detecting the reset-time
> > > CAP.MQES change and making it visible.  If the live I/O queue depth
> > > shrinks, reset recovery is failed before recreating I/O queues.  If it
> > > grows, the driver warns and continues with the existing queue resources.
> > 
> > Unlike the other version this at least sounds doable without creating
> > a complete mess.  So if we can live with this version that'd make me
> > much happier.
> 
> Abandoning the device here sounds pretty harsh when blk-mq already
> supports user space decreasing the q-depth on a live queue. Why can't
> the driver do the same thing?

Reducing capabilities with a firmware upgrade is a losing proposition,
there's just way too many things that can go wrong.

And reducing the queue depth is one of the hairiest operations in
blk-mq.  I wish we had never supported it, but allowing it with a
remove trigger sounds really bad.

Re: [RFC PATCH 0/1] nvme-pci: detect I/O queue depth changes after reset

Posted by guzebing 1 week, 4 days ago

On 2026/5/27 21:19, Christoph Hellwig wrote:
> On Wed, May 27, 2026 at 03:53:19PM +0800, guzebing wrote:
>> This RFC instead takes the smaller approach of detecting the reset-time
>> CAP.MQES change and making it visible.  If the live I/O queue depth
>> shrinks, reset recovery is failed before recreating I/O queues.  If it
>> grows, the driver warns and continues with the existing queue resources.
> 
> Unlike the other version this at least sounds doable without creating
> a complete mess.  So if we can live with this version that'd make me
> much happier.

Thanks, Christoph.

Yes, that is the direction I would like to take here.

The goal of this RFC is to avoid the live queue-depth resize path for 
now and keep the reset recovery policy explicit: fail reset before 
recreating I/O queues if the CAP.MQES-derived depth shrinks, and warn 
but keep using the existing queue resources if it grows.

I will keep this lightweight approach unless there are objections, and 
will wait a bit for other comments before sending a non-RFC version.

Thanks,
Guzebing