[PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count

Daniel Wagner posted 3 patches 1 year ago
There is a newer version of this series
[PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
Posted by Daniel Wagner 1 year ago
When the set feature attempts fails with any NVME status code set in
nvme_set_queue_count, the function still report success. Though the
numbers of queues set to 0. This is done to support controllers in
degraded state (the admin queue is still up and running but no IO
queues).

Though there is an exception. When nvme_set_features reports an host
path error, nvme_set_queue_count should propagate this error as the
connectivity is lost, which means also the admin queue is not working
anymore.

Fixes: 9a0be7abb62f ("nvme: refactor set_queue_count")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/nvme/host/core.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2a07c2c540b26c8cbe886711abaf6f0afbe6c4df..aa2a7c7d4d0ae7bd1f7fb704e55c0a8d724b9d56 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1677,7 +1677,12 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
 
 	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, NULL, 0,
 			&result);
-	if (status < 0)
+	/*
+	 * It's either a kernel error or the host observed a connection
+	 * lost. In either case it's not possible communicate with the
+	 * controller and thus enter the error code path.
+	 */
+	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
 		return status;
 
 	/*

-- 
2.47.0
Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
Posted by Sagi Grimberg 11 months, 3 weeks ago
Makes sense to me,
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
Posted by Hannes Reinecke 1 year ago
On 11/29/24 10:28, Daniel Wagner wrote:
> When the set feature attempts fails with any NVME status code set in
> nvme_set_queue_count, the function still report success. Though the
> numbers of queues set to 0. This is done to support controllers in
> degraded state (the admin queue is still up and running but no IO
> queues).
> 
> Though there is an exception. When nvme_set_features reports an host
> path error, nvme_set_queue_count should propagate this error as the
> connectivity is lost, which means also the admin queue is not working
> anymore.
> 
> Fixes: 9a0be7abb62f ("nvme: refactor set_queue_count")
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   drivers/nvme/host/core.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 2a07c2c540b26c8cbe886711abaf6f0afbe6c4df..aa2a7c7d4d0ae7bd1f7fb704e55c0a8d724b9d56 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1677,7 +1677,12 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
>   
>   	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, NULL, 0,
>   			&result);
> -	if (status < 0)
> +	/*
> +	 * It's either a kernel error or the host observed a connection
> +	 * lost. In either case it's not possible communicate with the
> +	 * controller and thus enter the error code path.
> +	 */
> +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
>   		return status;
>   
>   	/*
> 
Hmm. Maybe checking for NVME_SC_DNR, too?

Otherwise:

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
Posted by Daniel Wagner 12 months ago
On Fri, Nov 29, 2024 at 12:10:33PM +0100, Hannes Reinecke wrote:
> > +	/*
> > +	 * It's either a kernel error or the host observed a connection
> > +	 * lost. In either case it's not possible communicate with the
> > +	 * controller and thus enter the error code path.
> > +	 */
> > +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
> >   		return status;
> >   	/*
> > 
> Hmm. Maybe checking for NVME_SC_DNR, too?

if no one complains I'll update the check to:

	if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
	    status == NVME_SC_HOST_PATH_ERROR)
		return status;

okay?
Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
Posted by Sagi Grimberg 11 months, 3 weeks ago


On 17/12/2024 10:35, Daniel Wagner wrote:
> On Fri, Nov 29, 2024 at 12:10:33PM +0100, Hannes Reinecke wrote:
>>> +	/*
>>> +	 * It's either a kernel error or the host observed a connection
>>> +	 * lost. In either case it's not possible communicate with the
>>> +	 * controller and thus enter the error code path.
>>> +	 */
>>> +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
>>>    		return status;
>>>    	/*
>>>
>> Hmm. Maybe checking for NVME_SC_DNR, too?
> if no one complains I'll update the check to:
>
> 	if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
> 	    status == NVME_SC_HOST_PATH_ERROR)
> 		return status;
>
> okay?
Why do we care about the DNR? are you going to retry based on it?
Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
Posted by Daniel Wagner 11 months, 1 week ago
On Tue, Dec 24, 2024 at 12:35:23PM +0200, Sagi Grimberg wrote:
> On 17/12/2024 10:35, Daniel Wagner wrote:
> > On Fri, Nov 29, 2024 at 12:10:33PM +0100, Hannes Reinecke wrote:
> > > > +	/*
> > > > +	 * It's either a kernel error or the host observed a connection
> > > > +	 * lost. In either case it's not possible communicate with the
> > > > +	 * controller and thus enter the error code path.
> > > > +	 */
> > > > +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
> > > >    		return status;
> > > >    	/*
> > > > 
> > > Hmm. Maybe checking for NVME_SC_DNR, too?
> > if no one complains I'll update the check to:
> > 
> > 	if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
> > 	    status == NVME_SC_HOST_PATH_ERROR)
> > 		return status;
> > 
> > okay?
> Why do we care about the DNR? are you going to retry based on it?

I don't know if we should care. Hannes brought this up.
Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
Posted by Sagi Grimberg 11 months, 1 week ago


On 07/01/2025 16:40, Daniel Wagner wrote:
> On Tue, Dec 24, 2024 at 12:35:23PM +0200, Sagi Grimberg wrote:
>> On 17/12/2024 10:35, Daniel Wagner wrote:
>>> On Fri, Nov 29, 2024 at 12:10:33PM +0100, Hannes Reinecke wrote:
>>>>> +	/*
>>>>> +	 * It's either a kernel error or the host observed a connection
>>>>> +	 * lost. In either case it's not possible communicate with the
>>>>> +	 * controller and thus enter the error code path.
>>>>> +	 */
>>>>> +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
>>>>>     		return status;
>>>>>     	/*
>>>>>
>>>> Hmm. Maybe checking for NVME_SC_DNR, too?
>>> if no one complains I'll update the check to:
>>>
>>> 	if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
>>> 	    status == NVME_SC_HOST_PATH_ERROR)
>>> 		return status;
>>>
>>> okay?
>> Why do we care about the DNR? are you going to retry based on it?
> I don't know if we should care. Hannes brought this up.

I don't see why should we care.
Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
Posted by Hannes Reinecke 12 months ago
On 12/17/24 09:35, Daniel Wagner wrote:
> On Fri, Nov 29, 2024 at 12:10:33PM +0100, Hannes Reinecke wrote:
>>> +	/*
>>> +	 * It's either a kernel error or the host observed a connection
>>> +	 * lost. In either case it's not possible communicate with the
>>> +	 * controller and thus enter the error code path.
>>> +	 */
>>> +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
>>>    		return status;
>>>    	/*
>>>
>> Hmm. Maybe checking for NVME_SC_DNR, too?
> 
> if no one complains I'll update the check to:
> 
> 	if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
> 	    status == NVME_SC_HOST_PATH_ERROR)
> 		return status;
> 
> okay?

Which really cries out for a helper. But otherwise okay.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
Posted by Daniel Wagner 12 months ago
On Tue, Dec 17, 2024 at 10:45:45AM +0100, Hannes Reinecke wrote:
> On 12/17/24 09:35, Daniel Wagner wrote:
> > On Fri, Nov 29, 2024 at 12:10:33PM +0100, Hannes Reinecke wrote:
> > > > +	/*
> > > > +	 * It's either a kernel error or the host observed a connection
> > > > +	 * lost. In either case it's not possible communicate with the
> > > > +	 * controller and thus enter the error code path.
> > > > +	 */
> > > > +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
> > > >    		return status;
> > > >    	/*
> > > > 
> > > Hmm. Maybe checking for NVME_SC_DNR, too?
> > 
> > if no one complains I'll update the check to:
> > 
> > 	if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
> > 	    status == NVME_SC_HOST_PATH_ERROR)
> > 		return status;
> > 
> > okay?
> 
> Which really cries out for a helper. But otherwise okay.

As far I remember, Christoph is not a big fan of single users of tiny
helper functions. Anyway, what about this:

+static bool nvme_num_queue_updated(status)
+{
+       /*
+        * It's either a kernel error or the host observed a connection
+        * lost. In either case it's not possible communicate with the
+        * controller and thus enter the error code path.
+        */
+
+       if (status < 0 || status & NVME_STATUS_DNR ||
+           status == NVME_SC_HOST_PATH_ERROR)
+               return false;
+
+       return true;
+}
+
 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
 {
        u32 q_count = (*count - 1) | ((*count - 1) << 16);
@@ -1701,13 +1716,8 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)

        status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, NULL, 0,
                        &result);
-       /*
-        * It's either a kernel error or the host observed a connection
-        * lost. In either case it's not possible communicate with the
-        * controller and thus enter the error code path.
-        */
-       if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
-           status == NVME_SC_HOST_PATH_ERROR)
+
+       if (!nvme_num_queue_updated(status))
                return status;

        /*
Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
Posted by Hannes Reinecke 12 months ago
On 12/17/24 15:01, Daniel Wagner wrote:
> On Tue, Dec 17, 2024 at 10:45:45AM +0100, Hannes Reinecke wrote:
>> On 12/17/24 09:35, Daniel Wagner wrote:
>>> On Fri, Nov 29, 2024 at 12:10:33PM +0100, Hannes Reinecke wrote:
>>>>> +	/*
>>>>> +	 * It's either a kernel error or the host observed a connection
>>>>> +	 * lost. In either case it's not possible communicate with the
>>>>> +	 * controller and thus enter the error code path.
>>>>> +	 */
>>>>> +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
>>>>>     		return status;
>>>>>     	/*
>>>>>
>>>> Hmm. Maybe checking for NVME_SC_DNR, too?
>>>
>>> if no one complains I'll update the check to:
>>>
>>> 	if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
>>> 	    status == NVME_SC_HOST_PATH_ERROR)
>>> 		return status;
>>>
>>> okay?
>>
>> Which really cries out for a helper. But otherwise okay.
> 
> As far I remember, Christoph is not a big fan of single users of tiny
> helper functions. Anyway, what about this:
> 
> +static bool nvme_num_queue_updated(status)
> +{
> +       /*
> +        * It's either a kernel error or the host observed a connection
> +        * lost. In either case it's not possible communicate with the
> +        * controller and thus enter the error code path.
> +        */
> +
> +       if (status < 0 || status & NVME_STATUS_DNR ||
> +           status == NVME_SC_HOST_PATH_ERROR)
> +               return false;
> +
> +       return true;
> +}
> +
>   int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
>   {
>          u32 q_count = (*count - 1) | ((*count - 1) << 16);
> @@ -1701,13 +1716,8 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
> 
>          status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, NULL, 0,
>                          &result);
> -       /*
> -        * It's either a kernel error or the host observed a connection
> -        * lost. In either case it's not possible communicate with the
> -        * controller and thus enter the error code path.
> -        */
> -       if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
> -           status == NVME_SC_HOST_PATH_ERROR)
> +
> +       if (!nvme_num_queue_updated(status))
>                  return status;
> 
>          /*

Naming is terrible. The helper does not check whether the number of 
queues updated, so why the name?

Better keep it inline, then.

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich