nvme_keep_alive_work() always allocates with BLK_MQ_REQ_RESERVED, but
nvme_alloc_admin_tag_set() only sets reserved_tags for fabrics. Since
commit b58da2d270db ("nvme: update keep alive interval when kato is
modified"), userspace can start keep-alive on any transport via Set
Features (KATO), after which the allocation trips WARN_ON_ONCE() in
blk_mq_get_tag() and fails with -EWOULDBLOCK:

  nvme nvme0: keep-alive failed: -11

Per NVMe 2.0a section 5.27.1.12 and the transport binding wording,
PCIe MAY support KATO. Reserve one admin tag on all transports so
the host is ready when a controller accepts the feature. Fabrics
keeps two, the second being for the connect command.

A quirk-based approach was considered but no PCIe controller
documented to declare KAS != 0 was found (two enterprise SSDs tested
locally report KAS=0), so an allowlist has no entries today.

Link: https://lore.kernel.org/linux-nvme/20260428022911.1288485-1-coshi036@gmail.com/
Fixes: b58da2d270db ("nvme: update keep alive interval when kato is modified")
Found by FuzzNvme (Syzkaller with FEMU fuzzing framework).
Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---
Reproducer (run as root on an unpatched kernel with a PCIe NVMe device):

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/nvme_ioctl.h>

  int main(void)
  {
          struct nvme_admin_cmd cmd = {0};
          int fd = open("/dev/nvme0", O_RDWR);

          if (fd < 0) { perror("open"); return 1; }

          cmd.opcode = 0x09;      /* SET_FEATURES */
          cmd.cdw10  = 0x0f;      /* Feature ID: KATO */
          cmd.cdw11  = 5;         /* KATO = 5 seconds */

          if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
                  perror("ioctl");
                  return 1;
          }
          return 0;
  }

Within ~kato/2 seconds after the program exits, dmesg shows:

  nvme nvme0: keep alive interval updated from 0 ms to 5000 ms
  WARNING: CPU: 0 PID: ... at block/blk-mq-tag.c:148 blk_mq_get_tag+...
  nvme nvme0: keep-alive failed: -11

Changes since v1:
  - Add spec citation (NVMe 2.0a 5.27.1.12 + transport binding wording)
    clarifying that PCIe MAY support KATO.
  - Discuss the quirk-based alternative suggested in v1 review and
    note that no PCIe controller declaring KAS != 0 is documented
    today (two enterprise SSDs tested locally report KAS=0).
  - Add Link: to v1 thread.

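For anyone who wants to check KAS on their own hardware, here is a minimal sketch of how the value could be read from userspace. It assumes KAS is the little-endian 16-bit field at bytes 321:320 of the Identify Controller data, in 100 ms units, as laid out in the NVMe base specification. The helper names kas_from_identify() and read_identify_ctrl() are made up for illustration and are not part of this patch or of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

#define IDENTIFY_LEN	4096	/* size of the Identify Controller data */
#define KAS_OFFSET	320	/* assumed byte offset of KAS (bytes 321:320) */

/* Decode the little-endian 16-bit KAS field; units are 100 ms. */
unsigned int kas_from_identify(const uint8_t *id)
{
	assert(id != NULL);
	return (unsigned int)id[KAS_OFFSET] |
	       ((unsigned int)id[KAS_OFFSET + 1] << 8);
}

/*
 * Issue Identify Controller (admin opcode 0x06, CNS = 1) on an open
 * controller fd such as /dev/nvme0. "id" must point to IDENTIFY_LEN
 * bytes. Returns the raw ioctl() result.
 */
int read_identify_ctrl(int fd, uint8_t *id)
{
	struct nvme_admin_cmd cmd;

	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode = 0x06;			/* Identify */
	cmd.addr = (uint64_t)(uintptr_t)id;	/* destination buffer */
	cmd.data_len = IDENTIFY_LEN;
	cmd.cdw10 = 1;				/* CNS 1: controller data */
	return ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
}
```

A caller would open the controller node as the reproducer does, call read_identify_ctrl(), and pass the buffer to kas_from_identify() on success. A result of 0 corresponds to the KAS=0 case described above.
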
drivers/nvme/host/core.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7bf228df6001..6db02ecde6d1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4850,8 +4850,13 @@ int nvme_alloc_admin_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
 	memset(set, 0, sizeof(*set));
 	set->ops = ops;
 	set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
+	/*
+	 * Reserve one tag for keep-alive, which is allocated with
+	 * BLK_MQ_REQ_RESERVED and can be enabled on any transport via the
+	 * KATO feature. Fabrics needs a second reserved tag for connect.
+	 */
+	set->reserved_tags = 1;
 	if (ctrl->ops->flags & NVME_F_FABRICS)
-		/* Reserved for fabric connect and keep alive */
 		set->reserved_tags = 2;
 	set->numa_node = ctrl->numa_node;
 	if (ctrl->ops->flags & NVME_F_BLOCKING)
--
2.43.0
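
The effect of the hunk can be illustrated with a toy userspace model. This is only a sketch under stated assumptions: struct toy_tag_set, toy_reserved_tags() and toy_get_reserved_tag() are hypothetical stand-ins, not blk-mq code, and the model only captures the idea that a reserved allocation fails with -EWOULDBLOCK when the reserved pool is empty or exhausted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the reserved part of an admin tag set; not kernel code. */
struct toy_tag_set {
	unsigned int reserved_tags;	/* size of the reserved pool */
	unsigned int reserved_in_use;	/* reserved tags currently handed out */
};

/* Mirrors the reserved_tags choice before and after the hunk above. */
unsigned int toy_reserved_tags(bool fabrics, bool patched)
{
	if (fabrics)
		return 2;		/* connect + keep-alive */
	return patched ? 1 : 0;		/* keep-alive only, or nothing */
}

/* Returns a tag index, or -EWOULDBLOCK when no reserved tag is available. */
int toy_get_reserved_tag(struct toy_tag_set *set)
{
	assert(set != NULL);
	if (set->reserved_in_use >= set->reserved_tags)
		return -EWOULDBLOCK;
	return (int)set->reserved_in_use++;
}
```

With toy_reserved_tags(false, false), the first toy_get_reserved_tag() call already fails, which corresponds to the "keep-alive failed: -11" message in the report. With toy_reserved_tags(false, true), one allocation succeeds.
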

On Fri, May 15, 2026 at 03:12:48AM -0400, Chao Shi wrote:
> Per NVMe 2.0a section 5.27.1.12 and the transport binding wording,
> PCIe MAY support KATO. Reserve one admin tag on all transports so
> the host is ready when a controller accepts the feature. Fabrics
> keeps two, the second being for the connect command.
>
> A quirk-based approach was considered but no PCIe controller
> documented to declare KAS != 0 was found (two enterprise SSDs tested
> locally report KAS=0), so an allowlist has no entries today.

I totally get it's optional for PCIe, but that also means it's the
host's option on whether it wants to use it, and there's no requirement
we have to. We just need the driver react correctly when someone tries
to do it.

I am skeptical anyone would produce a PCIe device that supports it, but
let's say someone does: what is the use case motivating enabling this
optional feature in this driver? If it's just because the option is
there, then I think we can just reject the user command submitting the
feature for PCIe transports, like I earlier suggested. Requiring an
active command will just harm idle power states.

On Wed, May 20, 2026 at 02:26:13PM -0600, Keith Busch wrote:
> On Fri, May 15, 2026 at 03:12:48AM -0400, Chao Shi wrote:
> > Per NVMe 2.0a section 5.27.1.12 and the transport binding wording,
> > PCIe MAY support KATO. Reserve one admin tag on all transports so
> > the host is ready when a controller accepts the feature. Fabrics
> > keeps two, the second being for the connect command.
> >
> > A quirk-based approach was considered but no PCIe controller
> > documented to declare KAS != 0 was found (two enterprise SSDs tested
> > locally report KAS=0), so an allowlist has no entries today.
>
> I totally get it's optional for PCIe, but that also means it's the
> host's option on whether it wants to use it, and there's no requirement
> we have to. We just need the driver react correctly when someone tries
> to do it.
>
> I am skeptical anyone would produce a PCIe device that supports it, but
> let's say someone does: what is the use case motivating enabling this
> optional feature in this driver? If it's just because the option is
> there, then I think we can just reject the user command submitting the
> feature for PCIe transports, like I earlier suggested. Requiring an
> active command will just harm idle power states.

I don't think that's quite the point. We'd have to add special
filtering to fix the reproducer. Compared to that just reserving
a tag and officially supporting the feature is much easier and a much
better story.

On Thu, May 21, 2026 at 10:25:49AM +0200, Christoph Hellwig wrote:
> On Wed, May 20, 2026 at 02:26:13PM -0600, Keith Busch wrote:
> > I am skeptical anyone would produce a PCIe device that supports it, but
> > let's say someone does: what is the use case motivating enabling this
> > optional feature in this driver? If it's just because the option is
> > there, then I think we can just reject the user command submitting the
> > feature for PCIe transports, like I earlier suggested. Requiring an
> > active command will just harm idle power states.
>
> I don't think that's quite the point. We'd have to add special
> filtering to fix the reproducer. Compared to that just reserving
> a tag and officially supporting the feature is much easier and a much
> better story.

This command is already special since we filter for it on the completion
side. We may want to selectively filter other Set Feature commands too.
For example, we don't want user space turning on Host Dispersed
Namespace Support, because this driver is not going to correctly react
to that one either.

On Thu, May 21, 2026, Keith Busch wrote:
> This command is already special since we filter for it on the completion
> side. We may want to selectively filter other Set Feature commands too.
> For example, we don't want user space turning on Host Dispersed
> Namespace Support, because this driver is not going to correctly react
> to that one either.

Thank you so much. I agree. v3 implements this: a filter in the
passthrough path (nvme_passthru_cmd_allowed) that rejects Set Features
commands the driver is not prepared to handle, returning -EOPNOTSUPP.
It starts with KATO on non-fabrics and is structured so other features
such as Host Dispersed Namespace Support can be added as needed.

Thanks also for the idle-power-states point - that is now part of the
rationale in the commit message, since an active keep-alive on PCIe
would prevent deeper idle states for no benefit.

Sent as a new series:
https://lore.kernel.org/linux-nvme/20260522152807.2061501-1-coshi036@gmail.com/

Chao

On Thu, May 21, 2026 at 10:38 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Thu, May 21, 2026 at 10:25:49AM +0200, Christoph Hellwig wrote:
> > On Wed, May 20, 2026 at 02:26:13PM -0600, Keith Busch wrote:
> > > I am skeptical anyone would produce a PCIe device that supports it, but
> > > let's say someone does: what is the use case motivating enabling this
> > > optional feature in this driver? If it's just because the option is
> > > there, then I think we can just reject the user command submitting the
> > > feature for PCIe transports, like I earlier suggested. Requiring an
> > > active command will just harm idle power states.
> >
> > I don't think that's quite the point. We'd have to add special
> > filtering to fix the reproducer. Compared to that just reserving
> > a tag and officially supporting the feature is much easier and a much
> > better story.
>
> This command is already special since we filter for it on the completion
> side. We may want to selectively filter other Set Feature commands too.
> For example, we don't want user space turning on Host Dispersed
> Namespace Support, because this driver is not going to correctly react
> to that one either.

On Thu, May 21, 2026 at 08:38:29AM -0600, Keith Busch wrote:
> This command is already special since we filter for it on the completion
> side. We may want to selectively filter other Set Feature commands too.
> For example, we don't want user space turning on Host Dispersed
> Namespace Support, because this driver is not going to correctly react
> to that one either.

True. So maybe start filtering out all these things will go wrong
commands. Chao, can you start on that for keep alive? We can then
extend it as needed.

On Thu, May 21, 2026, Christoph Hellwig wrote:
> True. So maybe start filtering out all these things will go wrong
> commands. Chao, can you start on that for keep alive? We can then
> extend it as needed.

Sure, thank you so much for your help and comments!

Done. v3 adds the filter (nvme_passthru_cmd_allowed) in the passthrough
path, rejecting KATO Set Features on non-fabrics with -EOPNOTSUPP, and
is structured to extend to other features later. The reserve-a-tag
change from v1/v2 is dropped.

For the blktests testcase you asked about earlier: I will send that
separately against the blktests tree.

Sent as a new series:
https://lore.kernel.org/linux-nvme/20260522152807.2061501-1-coshi036@gmail.com/

Chao

On Fri, May 22, 2026 at 8:14 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Thu, May 21, 2026 at 08:38:29AM -0600, Keith Busch wrote:
> > This command is already special since we filter for it on the completion
> > side. We may want to selectively filter other Set Feature commands too.
> > For example, we don't want user space turning on Host Dispersed
> > Namespace Support, because this driver is not going to correctly react
> > to that one either.
>
> True. So maybe start filtering out all these things will go wrong
> commands. Chao, can you start on that for keep alive? We can then
> extend it as needed.
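
The filtering approach described for v3 can be sketched roughly as follows. This is a userspace model written from the description in this thread, not the v3 code: the real nvme_passthru_cmd_allowed() signature is not shown here, so toy_passthru_cmd_allowed() and the TOY_* constants are hypothetical. The opcode 0x09 and feature ID 0x0f are the values used by the reproducer, and the feature ID is assumed to occupy CDW10 bits 7:0.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_ADMIN_SET_FEATURES	0x09	/* Set Features admin opcode */
#define TOY_FEAT_KATO		0x0f	/* Keep Alive Timer feature ID */

static_assert(TOY_FEAT_KATO <= 0xff, "feature ID must fit in CDW10 bits 7:0");

/*
 * Returns 0 if the passthrough command may be submitted, or -EOPNOTSUPP
 * if it is a Set Features command the host is not prepared to handle.
 */
int toy_passthru_cmd_allowed(bool fabrics, uint8_t opcode, uint32_t cdw10)
{
	uint8_t fid = cdw10 & 0xff;	/* Feature Identifier */

	if (opcode != TOY_ADMIN_SET_FEATURES)
		return 0;

	switch (fid) {
	case TOY_FEAT_KATO:
		/* Keep-alive is only driven by the host on fabrics. */
		return fabrics ? 0 : -EOPNOTSUPP;
	default:
		/* Further features can be added as extra cases. */
		return 0;
	}
}
```

In this model the reproducer's command (opcode 0x09, cdw10 0x0f) is rejected when fabrics is false and passes when it is true. Adding another feature to the filter means adding one more case label.
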

On Fri, May 15, 2026 at 03:12:48AM -0400, Chao Shi wrote:
> A quirk-based approach was considered but no PCIe controller
> documented to declare KAS != 0 was found (two enterprise SSDs tested
> locally report KAS=0), so an allowlist has no entries today.

Quirking for spec allowed behavior sounds odd. If we care about testing
KA for PCIe it should be trivial to implement in nvmet-epf in the
kernel, but I'm not sure there is much of a point in that.

> Reproducer (run as root on an unpatched kernel with a PCIe NVMe device):

Can you wire this up as a testcase in blktests?

The patch itself looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>