[PATCH] nvme: reserve a keep-alive admin tag for all transports

Chao Shi posted 1 patch 1 month, 2 weeks ago
There is a newer version of this series
drivers/nvme/host/core.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
[PATCH] nvme: reserve a keep-alive admin tag for all transports
Posted by Chao Shi 1 month, 2 weeks ago
nvme_keep_alive_work() always allocates with BLK_MQ_REQ_RESERVED, but
nvme_alloc_admin_tag_set() only sets reserved_tags for fabrics.  Since
commit b58da2d270db ("nvme: update keep alive interval when kato is
modified"), userspace can start keep-alive on any transport via Set
Features (KATO), after which the allocation trips WARN_ON_ONCE() in
blk_mq_get_tag() and fails with -EWOULDBLOCK:

  nvme nvme0: keep-alive failed: -11

Reserve one admin tag for keep-alive on all transports.  Fabrics keeps
two, the second being for the connect command.

Fixes: b58da2d270db ("nvme: update keep alive interval when kato is modified")

Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).

Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---

Reproducer (run as root on an unpatched kernel with a PCIe NVMe device):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
            struct nvme_admin_cmd cmd = {0};
            int fd = open("/dev/nvme0", O_RDWR);
            if (fd < 0) { perror("open"); return 1; }
            cmd.opcode = 0x09;       /* SET_FEATURES */
            cmd.cdw10  = 0x0f;       /* Feature ID: KATO */
            cmd.cdw11  = 5;          /* KATO = 5 seconds */
            if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
                    perror("ioctl");
                    return 1;
            }
            return 0;
    }

Within ~kato/2 seconds after the program exits, dmesg shows:

    nvme nvme0: keep alive interval updated from 0 ms to 5000 ms
    WARNING: CPU: 0 PID: ... at block/blk-mq-tag.c:148 blk_mq_get_tag+...
    nvme nvme0: keep-alive failed: -11

 drivers/nvme/host/core.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7bf228df6001..6db02ecde6d1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4850,8 +4850,13 @@ int nvme_alloc_admin_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
 	memset(set, 0, sizeof(*set));
 	set->ops = ops;
 	set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
+	/*
+	 * Reserve one tag for keep-alive, which is allocated with
+	 * BLK_MQ_REQ_RESERVED and can be enabled on any transport via the
+	 * KATO feature.  Fabrics needs a second reserved tag for connect.
+	 */
+	set->reserved_tags = 1;
 	if (ctrl->ops->flags & NVME_F_FABRICS)
-		/* Reserved for fabric connect and keep alive */
 		set->reserved_tags = 2;
 	set->numa_node = ctrl->numa_node;
 	if (ctrl->ops->flags & NVME_F_BLOCKING)
-- 
2.43.0
Re: [PATCH] nvme: reserve a keep-alive admin tag for all transports
Posted by Keith Busch 1 month, 2 weeks ago
On Mon, Apr 27, 2026 at 10:29:11PM -0400, Chao Shi wrote:
> nvme_keep_alive_work() always allocates with BLK_MQ_REQ_RESERVED, but
> nvme_alloc_admin_tag_set() only sets reserved_tags for fabrics.  Since
> commit b58da2d270db ("nvme: update keep alive interval when kato is
> modified"), userspace can start keep-alive on any transport via Set
> Features (KATO), after which the allocation trips WARN_ON_ONCE() in
> blk_mq_get_tag() and fails with -EWOULDBLOCK:
> 
>   nvme nvme0: keep-alive failed: -11
> 
> Reserve one admin tag for keep-alive on all transports.  Fabrics keeps
> two, the second being for the connect command.
 
> Fixes: b58da2d270db ("nvme: update keep alive interval when kato is modified")
> 
> Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).
> 
> Acked-by: Sungwoo Kim <iam@sung-woo.kim>
> Acked-by: Dave Tian <daveti@purdue.edu>
> Acked-by: Weidong Zhu <weizhu@fiu.edu>
> Signed-off-by: Chao Shi <coshi036@gmail.com>
> ---
> 
> Reproducer (run as root on an unpatched kernel with a PCIe NVMe device):

You have a PCI controller that doesn't return Invalid Field In Command
status to the KATO feature? That's weird, it's fabrics specific feature.
I think the right thing to do is simply skip the driver's KATO start for
PCI.
Re: [PATCH] nvme: reserve a keep-alive admin tag for all transports
Posted by Maurizio Lombardi 1 month, 2 weeks ago
On Tue Apr 28, 2026 at 8:47 AM CEST, Keith Busch wrote:
> On Mon, Apr 27, 2026 at 10:29:11PM -0400, Chao Shi wrote:
>> nvme_keep_alive_work() always allocates with BLK_MQ_REQ_RESERVED, but
>> nvme_alloc_admin_tag_set() only sets reserved_tags for fabrics.  Since
>> commit b58da2d270db ("nvme: update keep alive interval when kato is
>> modified"), userspace can start keep-alive on any transport via Set
>> Features (KATO), after which the allocation trips WARN_ON_ONCE() in
>> blk_mq_get_tag() and fails with -EWOULDBLOCK:
>> 
>>   nvme nvme0: keep-alive failed: -11
>> 
>> Reserve one admin tag for keep-alive on all transports.  Fabrics keeps
>> two, the second being for the connect command.
>  
>> Fixes: b58da2d270db ("nvme: update keep alive interval when kato is modified")
>> 
>> Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).
>> 
>> Acked-by: Sungwoo Kim <iam@sung-woo.kim>
>> Acked-by: Dave Tian <daveti@purdue.edu>
>> Acked-by: Weidong Zhu <weizhu@fiu.edu>
>> Signed-off-by: Chao Shi <coshi036@gmail.com>
>> ---
>> 
>> Reproducer (run as root on an unpatched kernel with a PCIe NVMe device):
>
> You have a PCI controller that doesn't return Invalid Field In Command
> status to the KATO feature? That's weird, it's fabrics specific feature.

Are you sure that it's fabrics-only?

The spec 2.0a, at section 5.27.1.12 Keep Alive Timer (Feature Identifier
0Fh)

says:
"Keep Alive Timeout (KATO):
 
This field specifies the timeout value for the Keep Alive feature in
milliseconds.  [...]
The default value for this field is 0h for NVMe transports that do not require use of the Keep Alive
feature (e.g., NVMe over PCIe). For NVMe transports that require use of the Keep Alive feature
(e.g., RDMA and TCP), the default value for this field is 1D4C0h "

To me, it sounds like for nvme-pci, keep alive isn't required, but could
be activated.


Maurizio
Re: [PATCH] nvme: reserve a keep-alive admin tag for all transports
Posted by Chao S 1 month ago
On Tue, Apr 28, 2026, Maurizio Lombardi wrote:
> The spec 2.0a, at section 5.27.1.12 Keep Alive Timer (Feature
> Identifier 0Fh) says:
> "Keep Alive Timeout (KATO):
> ...
> The default value for this field is 0h for NVMe transports that do
> not require use of the Keep Alive feature (e.g., NVMe over PCIe).
> For NVMe transports that require use of the Keep Alive feature
> (e.g., RDMA and TCP), the default value for this field is 1D4C0h"
>
> To me, it sounds like for nvme-pci, keep alive isn't required, but
> could be activated.

Thanks for the spec reference - that exact wording will be quoted in
the v2 commit message.  I am adding you to the v2 Cc list so you see
the next revision directly. Thanks again!

Chao

On Tue, Apr 28, 2026 at 3:15 AM Maurizio Lombardi <mlombard@arkamax.eu> wrote:
>
> On Tue Apr 28, 2026 at 8:47 AM CEST, Keith Busch wrote:
> > On Mon, Apr 27, 2026 at 10:29:11PM -0400, Chao Shi wrote:
> >> nvme_keep_alive_work() always allocates with BLK_MQ_REQ_RESERVED, but
> >> nvme_alloc_admin_tag_set() only sets reserved_tags for fabrics.  Since
> >> commit b58da2d270db ("nvme: update keep alive interval when kato is
> >> modified"), userspace can start keep-alive on any transport via Set
> >> Features (KATO), after which the allocation trips WARN_ON_ONCE() in
> >> blk_mq_get_tag() and fails with -EWOULDBLOCK:
> >>
> >>   nvme nvme0: keep-alive failed: -11
> >>
> >> Reserve one admin tag for keep-alive on all transports.  Fabrics keeps
> >> two, the second being for the connect command.
> >
> >> Fixes: b58da2d270db ("nvme: update keep alive interval when kato is modified")
> >>
> >> Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).
> >>
> >> Acked-by: Sungwoo Kim <iam@sung-woo.kim>
> >> Acked-by: Dave Tian <daveti@purdue.edu>
> >> Acked-by: Weidong Zhu <weizhu@fiu.edu>
> >> Signed-off-by: Chao Shi <coshi036@gmail.com>
> >> ---
> >>
> >> Reproducer (run as root on an unpatched kernel with a PCIe NVMe device):
> >
> > You have a PCI controller that doesn't return Invalid Field In Command
> > status to the KATO feature? That's weird, it's fabrics specific feature.
>
> Are you sure that it's fabrics-only?
>
> The spec 2.0a, at section 5.27.1.12 Keep Alive Timer (Feature Identifier
> 0Fh)
>
> says:
> "Keep Alive Timeout (KATO):
>
> This field specifies the timeout value for the Keep Alive feature in
> milliseconds.  [...]
> The default value for this field is 0h for NVMe transports that do not require use of the Keep Alive
> feature (e.g., NVMe over PCIe). For NVMe transports that require use of the Keep Alive feature
> (e.g., RDMA and TCP), the default value for this field is 1D4C0h "
>
> To me, it sounds like for nvme-pci, keep alive isn't required, but could
> be activated.
>
>
> Maurizio
Re: [PATCH] nvme: reserve a keep-alive admin tag for all transports
Posted by Keith Busch 1 month, 2 weeks ago
On Tue, Apr 28, 2026 at 09:15:10AM +0200, Maurizio Lombardi wrote:
> The spec 2.0a, at section 5.27.1.12 Keep Alive Timer (Feature Identifier
> 0Fh)
> 
> says:
> "Keep Alive Timeout (KATO):
>  
> This field specifies the timeout value for the Keep Alive feature in
> milliseconds.  [...]
> The default value for this field is 0h for NVMe transports that do not require use of the Keep Alive
> feature (e.g., NVMe over PCIe). For NVMe transports that require use of the Keep Alive feature
> (e.g., RDMA and TCP), the default value for this field is 1D4C0h "
> 
> To me, it sounds like for nvme-pci, keep alive isn't required, but could
> be activated.

The spec says the support is subject to the Transport binding
specification, which does not exist in the PCIe transport spec.
Re: [PATCH] nvme: reserve a keep-alive admin tag for all transports
Posted by Christoph Hellwig 1 month, 1 week ago
On Tue, Apr 28, 2026 at 08:24:35AM +0100, Keith Busch wrote:
> > This field specifies the timeout value for the Keep Alive feature in
> > milliseconds.  [...]
> > The default value for this field is 0h for NVMe transports that do not require use of the Keep Alive
> > feature (e.g., NVMe over PCIe). For NVMe transports that require use of the Keep Alive feature
> > (e.g., RDMA and TCP), the default value for this field is 1D4C0h "
> > 
> > To me, it sounds like for nvme-pci, keep alive isn't required, but could
> > be activated.
> 
> The spec says the support is subject to the Transport binding
> specification, which does not exist in the PCIe transport spec.

My memories from the fabrics working group back in the day is that we
explicitly intended to support it in PCIe.  The wording in the spec
referring to transport specs I can find is:

The NVMe Transport binding specification for the associated NVMe Transport
defines:

 o the minimum Keep Alive Timeout value, if any;
 o the maximum Keep Alive Timeout value, if any; and
 o if the Keep Alive Timer feature is required to be supported and enabled.

which does not read to me like there is any required language in the
transport spec to require keep alive.
Re: [PATCH] nvme: reserve a keep-alive admin tag for all transports
Posted by Chao S 1 month ago
On Fri, May 8, 2026, Christoph Hellwig wrote:
> My memories from the fabrics working group back in the day is that
> we explicitly intended to support it in PCIe.  The wording in the
> spec referring to transport specs I can find is:
>
> The NVMe Transport binding specification for the associated NVMe
> Transport defines:
>
>  o the minimum Keep Alive Timeout value, if any;
>  o the maximum Keep Alive Timeout value, if any; and
>  o if the Keep Alive Timer feature is required to be supported and
>    enabled.
>
> which does not read to me like there is any required language in
> the transport spec to require keep alive.

Thanks for recalling the working-group history and the spec citation - both
will appear in the v2 commit message.  "PCIe MAY support KATO,
not required" will be the foundation for v2's frame.

Chao


On Fri, May 8, 2026 at 5:04 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Apr 28, 2026 at 08:24:35AM +0100, Keith Busch wrote:
> > > This field specifies the timeout value for the Keep Alive feature in
> > > milliseconds.  [...]
> > > The default value for this field is 0h for NVMe transports that do not require use of the Keep Alive
> > > feature (e.g., NVMe over PCIe). For NVMe transports that require use of the Keep Alive feature
> > > (e.g., RDMA and TCP), the default value for this field is 1D4C0h "
> > >
> > > To me, it sounds like for nvme-pci, keep alive isn't required, but could
> > > be activated.
> >
> > The spec says the support is subject to the Transport binding
> > specification, which does not exist in the PCIe transport spec.
>
> My memories from the fabrics working group back in the day is that we
> explicitly intended to support it in PCIe.  The wording in the spec
> referring to transport specs I can find is:
>
> The NVMe Transport binding specification for the associated NVMe Transport
> defines:
>
>  o the minimum Keep Alive Timeout value, if any;
>  o the maximum Keep Alive Timeout value, if any; and
>  o if the Keep Alive Timer feature is required to be supported and enabled.
>
> which does not read to me like there is any required language in the
> transport spec to require keep alive.
Re: [PATCH] nvme: reserve a keep-alive admin tag for all transports
Posted by Keith Busch 1 month, 1 week ago
On Fri, May 08, 2026 at 11:04:27AM +0200, Christoph Hellwig wrote:
> On Tue, Apr 28, 2026 at 08:24:35AM +0100, Keith Busch wrote:
> > > This field specifies the timeout value for the Keep Alive feature in
> > > milliseconds.  [...]
> > > The default value for this field is 0h for NVMe transports that do not require use of the Keep Alive
> > > feature (e.g., NVMe over PCIe). For NVMe transports that require use of the Keep Alive feature
> > > (e.g., RDMA and TCP), the default value for this field is 1D4C0h "
> > > 
> > > To me, it sounds like for nvme-pci, keep alive isn't required, but could
> > > be activated.
> > 
> > The spec says the support is subject to the Transport binding
> > specification, which does not exist in the PCIe transport spec.
> 
> My memories from the fabrics working group back in the day is that we
> explicitly intended to support it in PCIe.  The wording in the spec
> referring to transport specs I can find is:
> 
> The NVMe Transport binding specification for the associated NVMe Transport
> defines:
> 
>  o the minimum Keep Alive Timeout value, if any;
>  o the maximum Keep Alive Timeout value, if any; and
>  o if the Keep Alive Timer feature is required to be supported and enabled.
> 
> which does not read to me like there is any required language in the
> transport spec to require keep alive.

So the absence of defining a minimum means it's simply optional? I
suppose I can see it that way as the intended interpretation, but seems
counter productive to do on PCIe when you can MMIO the controller status
register to verify liveness. If the controller responds successfully to
the feature, then I have to agree we need the host to do its part.
Re: [PATCH] nvme: reserve a keep-alive admin tag for all transports
Posted by Chao S 1 month ago
On Fri, May 8, 2026, Keith Busch wrote:
> So the absence of defining a minimum means it's simply optional?  I
> suppose I can see it that way as the intended interpretation, but
> seems counter productive to do on PCIe when you can MMIO the
> controller status register to verify liveness.  If the controller
> responds successfully to the feature, then I have to agree we need
> the host to do its part.

Thanks Keith - I will take that last sentence as the basis for v2.
The v2 commit message will cite the spec reading (NVMe 2.0a 5.27.1.12
plus the transport binding wording) explicitly and conclude that PCIe
MAY support KATO, in which case the host needs the reserved tag.

For data points: I also tested two Micron PCIe SSDs locally and both
report KAS=0, so the bug is likely field-rare on mainstream hardware
today.  The patch remains a cheap defensive change for spec-compliant
controllers that declare KAS != 0 or for emulated environments.

Chao

On Fri, May 8, 2026 at 5:31 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Fri, May 08, 2026 at 11:04:27AM +0200, Christoph Hellwig wrote:
> > On Tue, Apr 28, 2026 at 08:24:35AM +0100, Keith Busch wrote:
> > > > This field specifies the timeout value for the Keep Alive feature in
> > > > milliseconds.  [...]
> > > > The default value for this field is 0h for NVMe transports that do not require use of the Keep Alive
> > > > feature (e.g., NVMe over PCIe). For NVMe transports that require use of the Keep Alive feature
> > > > (e.g., RDMA and TCP), the default value for this field is 1D4C0h "
> > > >
> > > > To me, it sounds like for nvme-pci, keep alive isn't required, but could
> > > > be activated.
> > >
> > > The spec says the support is subject to the Transport binding
> > > specification, which does not exist in the PCIe transport spec.
> >
> > My memories from the fabrics working group back in the day is that we
> > explicitly intended to support it in PCIe.  The wording in the spec
> > referring to transport specs I can find is:
> >
> > The NVMe Transport binding specification for the associated NVMe Transport
> > defines:
> >
> >  o the minimum Keep Alive Timeout value, if any;
> >  o the maximum Keep Alive Timeout value, if any; and
> >  o if the Keep Alive Timer feature is required to be supported and enabled.
> >
> > which does not read to me like there is any required language in the
> > transport spec to require keep alive.
>
> So the absence of defining a minimum means it's simply optional? I
> suppose I can see it that way as the intended interpretation, but seems
> counter productive to do on PCIe when you can MMIO the controller status
> register to verify liveness. If the controller responds successfully to
> the feature, then I have to agree we need the host to do its part.
Re: [PATCH] nvme: reserve a keep-alive admin tag for all transports
Posted by Sagi Grimberg 1 month, 1 week ago

On 08/05/2026 12:31, Keith Busch wrote:
> On Fri, May 08, 2026 at 11:04:27AM +0200, Christoph Hellwig wrote:
>> On Tue, Apr 28, 2026 at 08:24:35AM +0100, Keith Busch wrote:
>>>> This field specifies the timeout value for the Keep Alive feature in
>>>> milliseconds.  [...]
>>>> The default value for this field is 0h for NVMe transports that do not require use of the Keep Alive
>>>> feature (e.g., NVMe over PCIe). For NVMe transports that require use of the Keep Alive feature
>>>> (e.g., RDMA and TCP), the default value for this field is 1D4C0h "
>>>>
>>>> To me, it sounds like for nvme-pci, keep alive isn't required, but could
>>>> be activated.
>>> The spec says the support is subject to the Transport binding
>>> specification, which does not exist in the PCIe transport spec.
>> My memories from the fabrics working group back in the day is that we
>> explicitly intended to support it in PCIe.  The wording in the spec
>> referring to transport specs I can find is:
>>
>> The NVMe Transport binding specification for the associated NVMe Transport
>> defines:
>>
>>   o the minimum Keep Alive Timeout value, if any;
>>   o the maximum Keep Alive Timeout value, if any; and
>>   o if the Keep Alive Timer feature is required to be supported and enabled.
>>
>> which does not read to me like there is any required language in the
>> transport spec to require keep alive.
> So the absence of defining a minimum means it's simply optional? I
> suppose I can see it that way as the intended interpretation, but seems
> counter productive to do on PCIe when you can MMIO the controller status
> register to verify liveness. If the controller responds successfully to
> the feature, then I have to agree we need the host to do its part.

Perhaps under a quirk? Would be interesting to understand which
pci devices support this...
Re: [PATCH] nvme: reserve a keep-alive admin tag for all transports
Posted by Chao S 1 month ago
On Fri, May 8, 2026, Sagi Grimberg wrote:
> Perhaps under a quirk? Would be interesting to understand which
> pci devices support this...

Thanks for the comments. I thought
always reserving an admin tag for keep-alive is simpler.

I queried the Identify Controller KAS field on the two PCIe NVMe
drives I have locally (Micron 7450 MTFDKBA960TFR and
MTFDLAL3T8THG-1BP1DFCYY), both report KAS=0.  I could not find
publicly documented PCIe NVMe controller that declares KAS != 0
either.
The spec (NVMe 2.0a 5.27.1.12 plus the transport binding wording
quoted upthread by Christoph) allows PCIe to support KATO; the host
should be defensive against that legal case.

The cost is a single admin tag - NVME_AQ_MQ_TAG_DEPTH stays at 30,
the regular pool goes from 30 to 29 on non-fabrics.  Fabrics has been
running with 28 regular tags without issue.

If a real PCIe controller is later found to advertise KAS != 0 and
misbehave with keep-alive, narrowing the activation with a quirk on
top of this change is straightforward.  I would prefer to land the
defensive default first.

I will send v2 shortly with this rationale in
the commit message.

Chao

On Sun, May 10, 2026 at 4:53 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
>
> On 08/05/2026 12:31, Keith Busch wrote:
> > On Fri, May 08, 2026 at 11:04:27AM +0200, Christoph Hellwig wrote:
> >> On Tue, Apr 28, 2026 at 08:24:35AM +0100, Keith Busch wrote:
> >>>> This field specifies the timeout value for the Keep Alive feature in
> >>>> milliseconds.  [...]
> >>>> The default value for this field is 0h for NVMe transports that do not require use of the Keep Alive
> >>>> feature (e.g., NVMe over PCIe). For NVMe transports that require use of the Keep Alive feature
> >>>> (e.g., RDMA and TCP), the default value for this field is 1D4C0h "
> >>>>
> >>>> To me, it sounds like for nvme-pci, keep alive isn't required, but could
> >>>> be activated.
> >>> The spec says the support is subject to the Transport binding
> >>> specification, which does not exist in the PCIe transport spec.
> >> My memories from the fabrics working group back in the day is that we
> >> explicitly intended to support it in PCIe.  The wording in the spec
> >> referring to transport specs I can find is:
> >>
> >> The NVMe Transport binding specification for the associated NVMe Transport
> >> defines:
> >>
> >>   o the minimum Keep Alive Timeout value, if any;
> >>   o the maximum Keep Alive Timeout value, if any; and
> >>   o if the Keep Alive Timer feature is required to be supported and enabled.
> >>
> >> which does not read to me like there is any required language in the
> >> transport spec to require keep alive.
> > So the absence of defining a minimum means it's simply optional? I
> > suppose I can see it that way as the intended interpretation, but seems
> > counter productive to do on PCIe when you can MMIO the controller status
> > register to verify liveness. If the controller responds successfully to
> > the feature, then I have to agree we need the host to do its part.
>
> Perhaps under a quirk? Would be interesting to understand which
> pci devices support this...