[PATCH v2] nvme: Let the blocklayer set timeouts for requests

Heyne, Maximilian posted 1 patch 2 months ago
drivers/nvme/host/core.c | 2 --
1 file changed, 2 deletions(-)
[PATCH v2] nvme: Let the blocklayer set timeouts for requests
Posted by Heyne, Maximilian 2 months ago
When initializing an nvme request which is about to be send to the block
layer, we do not need to initialize its timeout. If it's left
uninitialized at 0 the block layer will use the request queue's timeout
in blk_add_timer (via nvme_start_request which is called from
nvme_*_queue_rq). These timeouts are setup to either NVME_IO_TIMEOUT or
NVME_ADMIN_TIMEOUT when the request queues were created.

Because the io_timeout of the IO queues can be modified via sysfs, the
following situation can occur:

1) NVME_IO_TIMEOUT = 30 (default module parameter)
2) nvme1n1 is probed. IO queues default timeout is 30 s
3) manually change the IO timeout to 90 s
   echo 90000 > /sys/class/nvme/nvme1/nvme1n1/queue/io_timeout
4) Any call of __submit_sync_cmd on nvme1n1 to an IO queue will issue
   commands with the 30 s timeout instead of the wanted 90 s which might
   be more suitable for this device.

Commit 470e900c8036 ("nvme: refactor nvme_alloc_request") silently
changed the behavior for ioctl's already because it unconditionally
overrides the request's timeout that was set in nvme_init_request. If it
was unset by the user of the ioctl if will be overridden with 0 meaning
the block layer will pick the request queue's IO timeout.

Following up on that, this patch further improves the consistency of IO
timeout usage. However, there are still uses of NVME_IO_TIMEOUT which
could be inconsistent with what is set in the device's request_queue by
the user.

Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
---
 drivers/nvme/host/core.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7bf228df6001..b9315f0abf80 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -724,10 +724,8 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
 		struct nvme_ns *ns = req->q->disk->private_data;
 
 		logging_enabled = ns->head->passthru_err_log_enabled;
-		req->timeout = NVME_IO_TIMEOUT;
 	} else { /* no queuedata implies admin queue */
 		logging_enabled = nr->ctrl->passthru_err_log_enabled;
-		req->timeout = NVME_ADMIN_TIMEOUT;
 	}
 
 	if (!logging_enabled)
-- 
2.47.3




Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Christof Hellmis
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
Re: [PATCH v2] nvme: Let the blocklayer set timeouts for requests
Posted by Keith Busch 2 months ago
On Thu, Dec 04, 2025 at 02:11:50PM +0000, Heyne, Maximilian wrote:
> @@ -724,10 +724,8 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
>  		struct nvme_ns *ns = req->q->disk->private_data;
>  
>  		logging_enabled = ns->head->passthru_err_log_enabled;
> -		req->timeout = NVME_IO_TIMEOUT;
>  	} else { /* no queuedata implies admin queue */
>  		logging_enabled = nr->ctrl->passthru_err_log_enabled;
> -		req->timeout = NVME_ADMIN_TIMEOUT;
>  	}

I was trying to think of any in-kernel path using __submit_sync_cmd with
an IO queue, and quick search shows there's just one: zns report zones.

Everything else uses the admin queue, which doesn't have a sysfs tunable
for its request_queue's default timeout. All we have is the nvme module
parameter, which is writable after loading. Since that's the only way a
user can modify the default time for that queue, I think we need to
leave that req->timeout value as-is.
Re: [PATCH v2] nvme: Let the blocklayer set timeouts for requests
Posted by Heyne, Maximilian 2 months ago
On Thu, Dec 04, 2025 at 09:13:35AM -0700, Keith Busch wrote:
> On Thu, Dec 04, 2025 at 02:11:50PM +0000, Heyne, Maximilian wrote:
> > @@ -724,10 +724,8 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
> >  		struct nvme_ns *ns = req->q->disk->private_data;
> >  
> >  		logging_enabled = ns->head->passthru_err_log_enabled;
> > -		req->timeout = NVME_IO_TIMEOUT;
> >  	} else { /* no queuedata implies admin queue */
> >  		logging_enabled = nr->ctrl->passthru_err_log_enabled;
> > -		req->timeout = NVME_ADMIN_TIMEOUT;
> >  	}
> 
> I was trying to think of any in-kernel path using __submit_sync_cmd with
> an IO queue, and quick search shows there's just one: zns report zones.
> 
> Everything else uses the admin queue, which doesn't have a sysfs tunable
> for its request_queue's default timeout. All we have is the nvme module
> parameter, which is writable after loading. Since that's the only way a
> user can modify the default time for that queue, I think we need to
> leave that req->timeout value as-is.

Ok sound like a v3 is needed where I only delete the line with
NVME_IO_TIMEOUT but leave the NVME_ADMIN_TIMEOUT and add a comment about
it. Will prepare such a patch.

Thanks for you reviews.



Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Christof Hellmis
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
Re: [PATCH v2] nvme: Let the blocklayer set timeouts for requests
Posted by Heyne, Maximilian 2 months ago
On Thu, Dec 04, 2025 at 07:34:26PM +0000, Heyne, Maximilian wrote:
> On Thu, Dec 04, 2025 at 09:13:35AM -0700, Keith Busch wrote:
> > On Thu, Dec 04, 2025 at 02:11:50PM +0000, Heyne, Maximilian wrote:
> > > @@ -724,10 +724,8 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
> > >  		struct nvme_ns *ns = req->q->disk->private_data;
> > >  
> > >  		logging_enabled = ns->head->passthru_err_log_enabled;
> > > -		req->timeout = NVME_IO_TIMEOUT;
> > >  	} else { /* no queuedata implies admin queue */
> > >  		logging_enabled = nr->ctrl->passthru_err_log_enabled;
> > > -		req->timeout = NVME_ADMIN_TIMEOUT;
> > >  	}
> > 
> > I was trying to think of any in-kernel path using __submit_sync_cmd with
> > an IO queue, and quick search shows there's just one: zns report zones.
> > 
> > Everything else uses the admin queue, which doesn't have a sysfs tunable
> > for its request_queue's default timeout. All we have is the nvme module
> > parameter, which is writable after loading. Since that's the only way a
> > user can modify the default time for that queue, I think we need to
> > leave that req->timeout value as-is.
> 
> Ok sound like a v3 is needed where I only delete the line with
> NVME_IO_TIMEOUT but leave the NVME_ADMIN_TIMEOUT and add a comment about
> it. Will prepare such a patch.
> 

I thought about this a bit more. Considering that the module parameters
can be written to produces even more inconsistencies. 

With my proposed change we will have the following situation:
1) when a device is probed current settings for IO and admin timeout from
   the module parameters will be the respective default timeouts
2) almost all admin commands to the device will default to the initial
   admin timeout
3) almost all IO commands will adhere to whatever is set via sysfs or
   the initial default

-> changes to the module parameters will (mostly) not affect already
   probed devices

When we keep initializing the admin timeouts to the module parameter as
you suggest (so half of my patch), we'll have the following situation:

1) same as above
2) some admin commands will use the module parameters admin timeout,
   some (ioctl's via nvme_submit_user_cmd) will use the admin queue's timeout set in 1).
   This can be made consistent if we fix nvme_submit_user_cmd.
3) same as above; IO timeout not affected by module parameter changes.

-> changes to the module parameters affect only the admin commands but
   (mostly) not the IO commands which is inconsistent behavior.

Therefore, I wonder whether people are actually making changes to the
module parameters at runtime to adjust all device's admin timeouts. In
that case it's at least broken for ioctl's currently.
Should we make that timeout runtime configurable per-device too instead?

In summary, I think my patch as-is leads to better consistency.



Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Christof Hellmis
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
Re: [PATCH v2] nvme: Let the blocklayer set timeouts for requests
Posted by Sagi Grimberg 1 month, 3 weeks ago

On 05/12/2025 9:33, Heyne, Maximilian wrote:
> On Thu, Dec 04, 2025 at 07:34:26PM +0000, Heyne, Maximilian wrote:
>> On Thu, Dec 04, 2025 at 09:13:35AM -0700, Keith Busch wrote:
>>> On Thu, Dec 04, 2025 at 02:11:50PM +0000, Heyne, Maximilian wrote:
>>>> @@ -724,10 +724,8 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
>>>>   		struct nvme_ns *ns = req->q->disk->private_data;
>>>>   
>>>>   		logging_enabled = ns->head->passthru_err_log_enabled;
>>>> -		req->timeout = NVME_IO_TIMEOUT;
>>>>   	} else { /* no queuedata implies admin queue */
>>>>   		logging_enabled = nr->ctrl->passthru_err_log_enabled;
>>>> -		req->timeout = NVME_ADMIN_TIMEOUT;
>>>>   	}
>>> I was trying to think of any in-kernel path using __submit_sync_cmd with
>>> an IO queue, and quick search shows there's just one: zns report zones.
>>>
>>> Everything else uses the admin queue, which doesn't have a sysfs tunable
>>> for its request_queue's default timeout. All we have is the nvme module
>>> parameter, which is writable after loading. Since that's the only way a
>>> user can modify the default time for that queue, I think we need to
>>> leave that req->timeout value as-is.
>> Ok sound like a v3 is needed where I only delete the line with
>> NVME_IO_TIMEOUT but leave the NVME_ADMIN_TIMEOUT and add a comment about
>> it. Will prepare such a patch.
>>
> I thought about this a bit more. Considering that the module parameters
> can be written to produces even more inconsistencies.
>
> With my proposed change we will have the following situation:
> 1) when a device is probed current settings for IO and admin timeout from
>     the module parameters will be the respective default timeouts
> 2) almost all admin commands to the device will default to the initial
>     admin timeout
> 3) almost all IO commands will adhere to whatever is set via sysfs or
>     the initial default
>
> -> changes to the module parameters will (mostly) not affect already
>     probed devices
>
> When we keep initializing the admin timeouts to the module parameter as
> you suggest (so half of my patch), we'll have the following situation:
>
> 1) same as above
> 2) some admin commands will use the module parameters admin timeout,
>     some (ioctl's via nvme_submit_user_cmd) will use the admin queue's timeout set in 1).
>     This can be made consistent if we fix nvme_submit_user_cmd.
> 3) same as above; IO timeout not affected by module parameter changes.
>
> -> changes to the module parameters affect only the admin commands but
>     (mostly) not the IO commands which is inconsistent behavior.
>
> Therefore, I wonder whether people are actually making changes to the
> module parameters at runtime to adjust all device's admin timeouts. In
> that case it's at least broken for ioctl's currently.
> Should we make that timeout runtime configurable per-device too instead?
>
> In summary, I think my patch as-is leads to better consistency.

Perhaps you can simply add admin_timeout sysfs file for the controller 
that would alter
the set->timeout and take the value from there in nvme_init_request
Re: [PATCH v2] nvme: Let the blocklayer set timeouts for requests
Posted by Maurizio Lombardi 1 month, 3 weeks ago
On Fri Dec 12, 2025 at 2:43 PM CET, Sagi Grimberg wrote:
>
> Perhaps you can simply add admin_timeout sysfs file for the controller 
> that would alter
> the set->timeout and take the value from there in nvme_init_request

Curiously, this is something I was looking at recently.

Could it be done by calling blk_queue_rq_timeout() in sysfs and removing
"req->timeout = NVME_ADMIN_TIMEOUT;" from nvme_init_request() ?

From 1a36c2fcafc2502298d25da64cf3740721560f30 Mon Sep 17 00:00:00 2001
From: Maurizio Lombardi <mlombard@redhat.com>
Date: Tue, 11 Nov 2025 09:13:58 +0100
Subject: [PATCH] nvme: add sysfs attribute to change admin timeout per nvme
 controller

Currently, there is no method to adjust the timeout values
on a per controller basis with nvme admin queues.
Add an admin_timeout attribute to nvme so that different
nvme controllers which may have different timeout
requirements can have custom admin timeouts set.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
---
 drivers/nvme/host/sysfs.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 29430949ce2f..da83b13d2cdd 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -601,6 +601,36 @@ static ssize_t dctype_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(dctype);

+static ssize_t nvme_admin_timeout_show(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+	return sysfs_emit(buf, "%u\n",
+			jiffies_to_msecs(ctrl->admin_q->rq_timeout) / 1000);
+}
+
+static ssize_t nvme_admin_timeout_store(struct device *dev,
+			struct device_attribute *attr,
+			const char *buf, size_t count)
+{
+	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+	u16 timeout;
+	int err;
+
+	err = kstrtou16(buf, 10, &timeout);
+	if (err || !timeout)
+		return -EINVAL;
+
+	blk_queue_rq_timeout(ctrl->admin_q, secs_to_jiffies(timeout));
+
+	return count;
+}
+
+static struct device_attribute dev_attr_admin_timeout =  \
+	__ATTR(admin_timeout, S_IRUGO | S_IWUSR, \
+	nvme_admin_timeout_show, nvme_admin_timeout_store);
+
 #ifdef CONFIG_NVME_HOST_AUTH
 static ssize_t nvme_ctrl_dhchap_secret_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
@@ -742,6 +772,7 @@ static struct attribute *nvme_dev_attrs[] = {
 	&dev_attr_kato.attr,
 	&dev_attr_cntrltype.attr,
 	&dev_attr_dctype.attr,
+	&dev_attr_admin_timeout.attr,
 #ifdef CONFIG_NVME_HOST_AUTH
 	&dev_attr_dhchap_secret.attr,
 	&dev_attr_dhchap_ctrl_secret.attr,
--
2.47.3