[RFC PATCH v1 0/7] nvme-tcp: Implement TP4129 (KATO Corrections and Clarifications)
Posted by Mohamed Khalfella 8 months, 4 weeks ago
Hello,

This RFC patchset implements TP4129 (KATO Corrections and Clarifications)
for nvme-tcp. nvme-tcp was chosen as an example to demonstrate the
approach taken in the patchset. Support for the other fabric transports,
nvme-rdma and nvme-fc, will be added later, incorporating the feedback
received on this RFC.

TP4129 requires the nvme controller not to immediately cancel inflight
requests when the connection between the initiator and the target is
lost. Instead, inflight requests need to be held for a duration long
enough to allow the target to learn about the connection loss and
quiesce pending commands. Only then can pending requests on the
initiator side be canceled and possibly retried safely on another path.
The main issue TP4129 addresses is ABA corruption that could happen if
inflight requests are retried immediately on another path.

The request hold timeout has two components (a rough sketch of the
calculation follows the list):

- The KATO component is the time sufficient for the target to learn about
  the connection loss. It depends on whether command based or traffic based
  keepalive is used. Per TP4129 this component is 3 * KATO for traffic based
  keepalive and 2 * KATO for command based keepalive.

- CQT is the time needed by the target controller to quiesce inflight nvme
  commands after the controller learns about the connection loss.
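
As a standalone illustration of that calculation (not the code added by the
patchset; the function name, parameter names, and millisecond units are
assumptions):

#include <stdbool.h>

/*
 * Sketch of the request hold time described above.  kato_ms is the Keep
 * Alive Timeout and cqt_ms is the controller's Command Quiesce Time,
 * both assumed to already be expressed in milliseconds.
 */
static unsigned long req_hold_timeout_ms(unsigned long kato_ms,
                                         unsigned long cqt_ms,
                                         bool traffic_based_keepalive)
{
        /* TP4129: 3 * KATO for traffic based, 2 * KATO for command based keepalive */
        unsigned long kato_component = (traffic_based_keepalive ? 3 : 2) * kato_ms;

        /* add the time the target needs to quiesce commands it still owns */
        return kato_component + cqt_ms;
}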

On controller reset or delete, inflight requests are canceled immediately
if the controller was disabled correctly. Otherwise, the requests are held
until it is safe to release them, as sketched below.
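
A minimal illustration of that decision, using hypothetical names rather
than the helpers added by this series:

#include <stdbool.h>

/*
 * How long to hold inflight requests at teardown time.  Returns 0 when
 * the controller was disabled cleanly, meaning the requests can be
 * canceled (and possibly retried elsewhere) right away.
 */
static unsigned long teardown_hold_ms(bool ctrl_disabled_cleanly,
                                      unsigned long hold_timeout_ms)
{
        /*
         * A clean shutdown means the target no longer owns any commands,
         * so there is nothing to wait for; otherwise hold for the full
         * TP4129 timeout before releasing the requests.
         */
        return ctrl_disabled_cleanly ? 0 : hold_timeout_ms;
}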

Jyoti Rani (1):
  nvme-core: Read CQT wait from identify controller response

Mohamed Khalfella (6):
  nvmef: Add nvmef_req_hold_timeout_ms() to calculate kato request hold
    time
  nvme-tcp: Move freeing tagset out of nvme_tcp_teardown_io_queues()
  nvme-tcp: Move freeing admin tagset out of
    nvme_tcp_teardown_admin_queue()
  nvme-tcp: Split nvme_tcp_teardown_io_queues() into two functions
  nvme-core: Add support for holding inflight requests
  nvme-tcp: Do not immediately cancel inflight requests during recovery

 drivers/nvme/host/core.c    | 62 +++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/fabrics.h |  7 +++++
 drivers/nvme/host/nvme.h    |  5 +++
 drivers/nvme/host/tcp.c     | 56 +++++++++++++++++++++------------
 include/linux/nvme.h        |  4 ++-
 5 files changed, 114 insertions(+), 20 deletions(-)

-- 
2.48.1
Re: [RFC PATCH v1 0/7] nvme-tcp: Implement TP4129 (KATO Corrections and Clarifications)
Posted by Stuart Hayes 1 month ago
On 2025-03-24 17:48, Mohamed Khalfella <mkhalfella@purestorage.com> wrote:

Verified that the patch fixes the data integrity issue for which TP4129 was written.
Tested-by: Shai Dagieli <Shai.Dagieli@dell.com>

Summary:
The test included two parts:
1. Part 1: testing with the original kernel, without the patch, to verify the test
   can consistently reproduce the data integrity issue.
   a. Result: the data integrity issue was reproduced.
2. Part 2: testing with a kernel version that includes the patch.
   a. Results:
      i.  The data integrity issue was not reproduced with the patch.
      ii. The messages seen in the storage log and dmesg are as expected based on
          the test error injection and the patch's expected behavior and messages.
3. Conclusion: the patch fixes the data integrity issue.
4. Additional comments:
   a. TP4129 defines an extended delay before an IO for which the host did not
      receive a command completion can be retried on a different path. Based
      on the test logs, the extended delay is also applied in other cases where
      the IO is retried on another path, even when the host did receive a
      command completion. For example, the extended delay before retry was seen
      when the status returned in the command completion was 'Status Code Type = 3h'
      (Path Related Status) and 'Status Code = 00h' (Internal Path Error). Applying
      the extended delay in cases where it is not required increases the potential
      performance impact of this patch and should be fixed.
      i. Recommendation: modify the patch to limit the extended delay to the relevant
         cases: IO retried on a different path when the host did NOT receive a
         command completion status (a minimal illustration follows below).
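
As a sketch of the recommended distinction (a hypothetical helper, not
driver code):

#include <stdbool.h>

/*
 * Only commands for which no completion was ever received need the
 * TP4129 hold; a command the target explicitly completed (for example
 * with SCT 3h / SC 00h, Internal Path Error) is no longer owned by the
 * target and can be retried on another path immediately.
 */
static bool needs_tp4129_hold(bool completion_received)
{
        return !completion_received;
}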

Execution overview
- Test ran continuously for almost 24 hours with periodic SDT disruptions and host I/O validation.
 
Findings
- No data integrity (DI) errors or I/O errors were observed in the Vdbench error logs (errorlog.html remained clean).
- No host panics or abnormal behavior were reported.
- dmesg shows expected TP4129 recovery messages, for example:
[18:52:28] nvme nvme2: holding inflight requests for 15000ms
[18:52:43] nvme nvme2: releasing inflight requests
[18:52:43] nvme_cancel_request: 123 callbacks suppressed
[18:52:43] nvme nvme2: Cancelling I/O ...