[v1] TP8028 Rapid Path Failure Recovery

[RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state

Posted by Mohamed Khalfella 2 months, 2 weeks ago

Add NVME_CTRL_RECOVERING as a new controller state to be used when
impacted controller is being recovered. A LIVE controller enters
RECOVERING state when an IO error is encountered. While recovering
inflight IOs will not be canceled if they timeout. These IOs will be
canceled after recovery finishes. Also, while recovering a controller
can not be reset or deleted. This is intentional because reset or delete
will result in canceling inflight IOs. When recovery finishes, the
impacted controller transitions from RECOVERING state to RESETTING state.
Reset codepath takes care of queues teardown and inflight requests
cancellation.

Note, there is no transition from RECOVERING to RESETTING added to
nvme_change_ctrl_state(). The reason is that user should not be allowed
to reset or delete a controller that is being recovered.

Add NVME_CTRL_RECOVERED controller flag. This flag is set on a controller
about to schedule delayed work for time based recovery.

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/nvme/host/core.c  | 10 ++++++++++
 drivers/nvme/host/nvme.h  |  2 ++
 drivers/nvme/host/sysfs.c |  1 +
 3 files changed, 13 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index aa007a7b9606..f5b84bc327d3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -574,6 +574,15 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 			break;
 		}
 		break;
+	case NVME_CTRL_RECOVERING:
+		switch (old_state) {
+		case NVME_CTRL_LIVE:
+			changed = true;
+			fallthrough;
+		default:
+			break;
+		}
+		break;
 	case NVME_CTRL_RESETTING:
 		switch (old_state) {
 		case NVME_CTRL_NEW:
@@ -761,6 +770,7 @@ blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
 	if (state != NVME_CTRL_DELETING_NOIO &&
 	    state != NVME_CTRL_DELETING &&
 	    state != NVME_CTRL_DEAD &&
+	    state != NVME_CTRL_RECOVERING &&
 	    !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
 	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
 		return BLK_STS_RESOURCE;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 5195a9abfadf..cde427353e0a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -251,6 +251,7 @@ static inline u16 nvme_req_qid(struct request *req)
 enum nvme_ctrl_state {
 	NVME_CTRL_NEW,
 	NVME_CTRL_LIVE,
+	NVME_CTRL_RECOVERING,
 	NVME_CTRL_RESETTING,
 	NVME_CTRL_CONNECTING,
 	NVME_CTRL_DELETING,
@@ -275,6 +276,7 @@ enum nvme_ctrl_flags {
 	NVME_CTRL_SKIP_ID_CNS_CS	= 4,
 	NVME_CTRL_DIRTY_CAPABILITY	= 5,
 	NVME_CTRL_FROZEN		= 6,
+	NVME_CTRL_RECOVERED		= 7,
 };
 
 struct nvme_ctrl {
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index ae36249ad61e..55f907fb6c86 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -443,6 +443,7 @@ static ssize_t nvme_sysfs_show_state(struct device *dev,
 	static const char *const state_name[] = {
 		[NVME_CTRL_NEW]		= "new",
 		[NVME_CTRL_LIVE]	= "live",
+		[NVME_CTRL_RECOVERING]	= "recovering",
 		[NVME_CTRL_RESETTING]	= "resetting",
 		[NVME_CTRL_CONNECTING]	= "connecting",
 		[NVME_CTRL_DELETING]	= "deleting",
-- 
2.51.2

Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state

Posted by Sagi Grimberg 1 month, 2 weeks ago


On 26/11/2025 4:11, Mohamed Khalfella wrote:
> Add NVME_CTRL_RECOVERING as a new controller state to be used when
> impacted controller is being recovered. A LIVE controller enters
> RECOVERING state when an IO error is encountered. While recovering
> inflight IOs will not be canceled if they timeout. These IOs will be
> canceled after recovery finishes. Also, while recovering a controller
> can not be reset or deleted. This is intentional because reset or delete
> will result in canceling inflight IOs. When recovery finishes, the
> impacted controller transitions from RECOVERING state to RESETTING state.
> Reset codepath takes care of queues teardown and inflight requests
> cancellation.

Is RECOVERING really capturing the nature of this state? Maybe RESETTLING?
or QUIESCING?

>
> Note, there is no transition from RECOVERING to RESETTING added to
> nvme_change_ctrl_state(). The reason is that user should not be allowed
> to reset or delete a controller that is being recovered.
>
> Add NVME_CTRL_RECOVERED controller flag. This flag is set on a controller
> about to schedule delayed work for time based recovery.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>   drivers/nvme/host/core.c  | 10 ++++++++++
>   drivers/nvme/host/nvme.h  |  2 ++
>   drivers/nvme/host/sysfs.c |  1 +
>   3 files changed, 13 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index aa007a7b9606..f5b84bc327d3 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -574,6 +574,15 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>   			break;
>   		}
>   		break;
> +	case NVME_CTRL_RECOVERING:
> +		switch (old_state) {
> +		case NVME_CTRL_LIVE:
> +			changed = true;
> +			fallthrough;
> +		default:
> +			break;
> +		}
> +		break;

That is a strange transition...

Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state

Posted by Mohamed Khalfella 1 month, 2 weeks ago

On Thu 2025-12-25 15:29:52 +0200, Sagi Grimberg wrote:
> 
> 
> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> > Add NVME_CTRL_RECOVERING as a new controller state to be used when
> > impacted controller is being recovered. A LIVE controller enters
> > RECOVERING state when an IO error is encountered. While recovering
> > inflight IOs will not be canceled if they timeout. These IOs will be
> > canceled after recovery finishes. Also, while recovering a controller
> > can not be reset or deleted. This is intentional because reset or delete
> > will result in canceling inflight IOs. When recovery finishes, the
> > impacted controller transitions from RECOVERING state to RESETTING state.
> > Reset codepath takes care of queues teardown and inflight requests
> > cancellation.
> 
> Is RECOVERING really capturing the nature of this state? Maybe RESETTLING?
> or QUIESCING?

Naming is hard. QUIESCING sounds better, I will renaming it to
QUIESCING.

> 
> >
> > Note, there is no transition from RECOVERING to RESETTING added to
> > nvme_change_ctrl_state(). The reason is that user should not be allowed
> > to reset or delete a controller that is being recovered.
> >
> > Add NVME_CTRL_RECOVERED controller flag. This flag is set on a controller
> > about to schedule delayed work for time based recovery.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> >   drivers/nvme/host/core.c  | 10 ++++++++++
> >   drivers/nvme/host/nvme.h  |  2 ++
> >   drivers/nvme/host/sysfs.c |  1 +
> >   3 files changed, 13 insertions(+)
> >
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index aa007a7b9606..f5b84bc327d3 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -574,6 +574,15 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
> >   			break;
> >   		}
> >   		break;
> > +	case NVME_CTRL_RECOVERING:
> > +		switch (old_state) {
> > +		case NVME_CTRL_LIVE:
> > +			changed = true;
> > +			fallthrough;
> > +		default:
> > +			break;
> > +		}
> > +		break;
> 
> That is a strange transition...

Why is it strange?

We transition to RECOVERING state only if controller is LIVE. This is
when we expect to have inflight user IOs to be quiesced by CCR. We do
not care about inflight requests in other states.

Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state

Posted by Sagi Grimberg 1 month, 2 weeks ago


On 25/12/2025 19:17, Mohamed Khalfella wrote:
> On Thu 2025-12-25 15:29:52 +0200, Sagi Grimberg wrote:
>>
>> On 26/11/2025 4:11, Mohamed Khalfella wrote:
>>> Add NVME_CTRL_RECOVERING as a new controller state to be used when
>>> impacted controller is being recovered. A LIVE controller enters
>>> RECOVERING state when an IO error is encountered. While recovering
>>> inflight IOs will not be canceled if they timeout. These IOs will be
>>> canceled after recovery finishes. Also, while recovering a controller
>>> can not be reset or deleted. This is intentional because reset or delete
>>> will result in canceling inflight IOs. When recovery finishes, the
>>> impacted controller transitions from RECOVERING state to RESETTING state.
>>> Reset codepath takes care of queues teardown and inflight requests
>>> cancellation.
>> Is RECOVERING really capturing the nature of this state? Maybe RESETTLING?
>> or QUIESCING?
> Naming is hard. QUIESCING sounds better, I will renaming it to
> QUIESCING.

I actually think that FENCING is probably best to describe what the 
state is used for...

Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state

Posted by Mohamed Khalfella 1 month, 1 week ago

On Sat 2025-12-27 11:55:01 +0200, Sagi Grimberg wrote:
> 
> 
> On 25/12/2025 19:17, Mohamed Khalfella wrote:
> > On Thu 2025-12-25 15:29:52 +0200, Sagi Grimberg wrote:
> >>
> >> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> >>> Add NVME_CTRL_RECOVERING as a new controller state to be used when
> >>> impacted controller is being recovered. A LIVE controller enters
> >>> RECOVERING state when an IO error is encountered. While recovering
> >>> inflight IOs will not be canceled if they timeout. These IOs will be
> >>> canceled after recovery finishes. Also, while recovering a controller
> >>> can not be reset or deleted. This is intentional because reset or delete
> >>> will result in canceling inflight IOs. When recovery finishes, the
> >>> impacted controller transitions from RECOVERING state to RESETTING state.
> >>> Reset codepath takes care of queues teardown and inflight requests
> >>> cancellation.
> >> Is RECOVERING really capturing the nature of this state? Maybe RESETTLING?
> >> or QUIESCING?
> > Naming is hard. QUIESCING sounds better, I will renaming it to
> > QUIESCING.
> 
> I actually think that FENCING is probably best to describe what the 
> state is used for...

FENCING is used in HA clusters with persistent reservations. I find it
confusing to use it here. Let me know if you have strong preference.

Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state

Posted by Mohamed Khalfella 1 month, 1 week ago

On Wed 2025-12-31 14:36:53 -0800, Mohamed Khalfella wrote:
> On Sat 2025-12-27 11:55:01 +0200, Sagi Grimberg wrote:
> > 
> > 
> > On 25/12/2025 19:17, Mohamed Khalfella wrote:
> > > On Thu 2025-12-25 15:29:52 +0200, Sagi Grimberg wrote:
> > >>
> > >> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> > >>> Add NVME_CTRL_RECOVERING as a new controller state to be used when
> > >>> impacted controller is being recovered. A LIVE controller enters
> > >>> RECOVERING state when an IO error is encountered. While recovering
> > >>> inflight IOs will not be canceled if they timeout. These IOs will be
> > >>> canceled after recovery finishes. Also, while recovering a controller
> > >>> can not be reset or deleted. This is intentional because reset or delete
> > >>> will result in canceling inflight IOs. When recovery finishes, the
> > >>> impacted controller transitions from RECOVERING state to RESETTING state.
> > >>> Reset codepath takes care of queues teardown and inflight requests
> > >>> cancellation.
> > >> Is RECOVERING really capturing the nature of this state? Maybe RESETTLING?
> > >> or QUIESCING?
> > > Naming is hard. QUIESCING sounds better, I will renaming it to
> > > QUIESCING.
> > 
> > I actually think that FENCING is probably best to describe what the 
> > state is used for...
> 
> FENCING is used in HA clusters with persistent reservations. I find it
> confusing to use it here. Let me know if you have strong preference.

Nevermind, I will rename it to FENCING.

Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state

Posted by Sagi Grimberg 1 month, 2 weeks ago


On 25/12/2025 19:17, Mohamed Khalfella wrote:
> On Thu 2025-12-25 15:29:52 +0200, Sagi Grimberg wrote:
>>
>> On 26/11/2025 4:11, Mohamed Khalfella wrote:
>>> Add NVME_CTRL_RECOVERING as a new controller state to be used when
>>> impacted controller is being recovered. A LIVE controller enters
>>> RECOVERING state when an IO error is encountered. While recovering
>>> inflight IOs will not be canceled if they timeout. These IOs will be
>>> canceled after recovery finishes. Also, while recovering a controller
>>> can not be reset or deleted. This is intentional because reset or delete
>>> will result in canceling inflight IOs. When recovery finishes, the
>>> impacted controller transitions from RECOVERING state to RESETTING state.
>>> Reset codepath takes care of queues teardown and inflight requests
>>> cancellation.
>> Is RECOVERING really capturing the nature of this state? Maybe RESETTLING?
>> or QUIESCING?
> Naming is hard. QUIESCING sounds better, I will renaming it to
> QUIESCING.
>
>>> Note, there is no transition from RECOVERING to RESETTING added to
>>> nvme_change_ctrl_state(). The reason is that user should not be allowed
>>> to reset or delete a controller that is being recovered.
>>>
>>> Add NVME_CTRL_RECOVERED controller flag. This flag is set on a controller
>>> about to schedule delayed work for time based recovery.
>>>
>>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
>>> ---
>>>    drivers/nvme/host/core.c  | 10 ++++++++++
>>>    drivers/nvme/host/nvme.h  |  2 ++
>>>    drivers/nvme/host/sysfs.c |  1 +
>>>    3 files changed, 13 insertions(+)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index aa007a7b9606..f5b84bc327d3 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -574,6 +574,15 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>>>    			break;
>>>    		}
>>>    		break;
>>> +	case NVME_CTRL_RECOVERING:
>>> +		switch (old_state) {
>>> +		case NVME_CTRL_LIVE:
>>> +			changed = true;
>>> +			fallthrough;
>>> +		default:
>>> +			break;
>>> +		}
>>> +		break;
>> That is a strange transition...
> Why is it strange?
>
> We transition to RECOVERING state only if controller is LIVE. This is
> when we expect to have inflight user IOs to be quiesced by CCR. We do
> not care about inflight requests in other states.

Sorry, got confused myself - I read it as the other way around...
I am missing RECOVERING -> RESETTING transition in this patch.

Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state

Posted by Mohamed Khalfella 1 month, 1 week ago

On Sat 2025-12-27 11:52:31 +0200, Sagi Grimberg wrote:
> 
> 
> On 25/12/2025 19:17, Mohamed Khalfella wrote:
> > On Thu 2025-12-25 15:29:52 +0200, Sagi Grimberg wrote:
> >>
> >> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> >>> Add NVME_CTRL_RECOVERING as a new controller state to be used when
> >>> impacted controller is being recovered. A LIVE controller enters
> >>> RECOVERING state when an IO error is encountered. While recovering
> >>> inflight IOs will not be canceled if they timeout. These IOs will be
> >>> canceled after recovery finishes. Also, while recovering a controller
> >>> can not be reset or deleted. This is intentional because reset or delete
> >>> will result in canceling inflight IOs. When recovery finishes, the
> >>> impacted controller transitions from RECOVERING state to RESETTING state.
> >>> Reset codepath takes care of queues teardown and inflight requests
> >>> cancellation.
> >> Is RECOVERING really capturing the nature of this state? Maybe RESETTLING?
> >> or QUIESCING?
> > Naming is hard. QUIESCING sounds better, I will renaming it to
> > QUIESCING.
> >
> >>> Note, there is no transition from RECOVERING to RESETTING added to
> >>> nvme_change_ctrl_state(). The reason is that user should not be allowed
> >>> to reset or delete a controller that is being recovered.
> >>>
> >>> Add NVME_CTRL_RECOVERED controller flag. This flag is set on a controller
> >>> about to schedule delayed work for time based recovery.
> >>>
> >>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> >>> ---
> >>>    drivers/nvme/host/core.c  | 10 ++++++++++
> >>>    drivers/nvme/host/nvme.h  |  2 ++
> >>>    drivers/nvme/host/sysfs.c |  1 +
> >>>    3 files changed, 13 insertions(+)
> >>>
> >>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >>> index aa007a7b9606..f5b84bc327d3 100644
> >>> --- a/drivers/nvme/host/core.c
> >>> +++ b/drivers/nvme/host/core.c
> >>> @@ -574,6 +574,15 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
> >>>    			break;
> >>>    		}
> >>>    		break;
> >>> +	case NVME_CTRL_RECOVERING:
> >>> +		switch (old_state) {
> >>> +		case NVME_CTRL_LIVE:
> >>> +			changed = true;
> >>> +			fallthrough;
> >>> +		default:
> >>> +			break;
> >>> +		}
> >>> +		break;
> >> That is a strange transition...
> > Why is it strange?
> >
> > We transition to RECOVERING state only if controller is LIVE. This is
> > when we expect to have inflight user IOs to be quiesced by CCR. We do
> > not care about inflight requests in other states.
> 
> Sorry, got confused myself - I read it as the other way around...
> I am missing RECOVERING -> RESETTING transition in this patch.

This is in patch 8 ("nvme: Implement cross-controller reset recovery").
It was not added to nvme_change_ctrl_state() because we do not want the
controller to be reset while in RECOVERING state.

Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state

Posted by Randy Jennings 1 month, 3 weeks ago

On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> Add NVME_CTRL_RECOVERING as a new controller state to be used when
> impacted controller is being recovered. A LIVE controller enters
> RECOVERING state when an IO error is encountered. While recovering
> inflight IOs will not be canceled if they timeout. These IOs will be
> canceled after recovery finishes. Also, while recovering a controller
> can not be reset or deleted. This is intentional because reset or delete
> will result in canceling inflight IOs. When recovery finishes, the
> impacted controller transitions from RECOVERING state to RESETTING state.
> Reset codepath takes care of queues teardown and inflight requests
> cancellation.
>
> Note, there is no transition from RECOVERING to RESETTING added to
> nvme_change_ctrl_state(). The reason is that user should not be allowed
> to reset or delete a controller that is being recovered.
>
> Add NVME_CTRL_RECOVERED controller flag. This flag is set on a controller
> about to schedule delayed work for time based recovery.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>

Reviewed-by: Randy Jennings <randyj@purestorage.com>

Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state

Posted by Randy Jennings 1 month, 3 weeks ago

On Thu, Dec 18, 2025 at 3:18 PM Randy Jennings <randyj@purestorage.com> wrote:
>
> On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
> <mkhalfella@purestorage.com> wrote:
> >
> > Reset codepath takes care of queues teardown and inflight requests
> > cancellation.
Note: Tearing down the connection (with the queues) after going through
CCR-based recovery or time-based recovery is late.  CCR will trigger a
disconnect on the host side, and, because we stop traffic to the nvme
controller, KATO should kick in if CCR does not, so I accept your
argument that  tearing down the connections first could be considered
an optimization that can get implemented later.

Sincerely,
Randy Jennings