[PATCH] nvme-tcp: wait socket wmem to drain in queue stop

Posted by Michael Liang 10 months, 1 week ago
This patch addresses a data corruption issue observed in nvme-tcp during
testing.

Issue description:
In an NVMe native multipath setup, when an I/O timeout occurs, all inflight
I/Os are canceled almost immediately after the kernel socket is shut down.
These canceled I/Os are reported as host path errors, triggering a failover
that succeeds on a different path.

However, at this point, the original I/O may still be outstanding in the
host's network transmission path (e.g., the NIC's TX queue). From the
user-space application's perspective, the buffers associated with these I/Os
are considered complete, since the I/Os were acknowledged on the other path,
and may be reused for new I/O requests.

Because nvme-tcp enables zero-copy by default in the transmission path,
this can lead to corrupted data being sent to the original target, ultimately
causing data corruption.

We can reproduce this data corruption by injecting delay on one path and
triggering an I/O timeout.

To prevent this issue, this change ensures that all inflight transmissions are
fully completed from the host's perspective before returning from queue stop.
This aligns with the behavior of queue stopping in other NVMe fabric transports.

Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Michael Liang <mliang@purestorage.com>
---
 drivers/nvme/host/tcp.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 26c459f0198d..837684918aa1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1944,10 +1944,26 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
 	cancel_work_sync(&queue->io_work);
 }
 
+static void nvme_tcp_stop_queue_wait(struct nvme_tcp_queue *queue)
+{
+	int timeout = 100;
+
+	while (timeout > 0) {
+		if (!sk_wmem_alloc_get(queue->sock->sk))
+			return;
+		msleep(2);
+		timeout -= 2;
+	}
+	dev_warn(queue->ctrl->ctrl.device,
+		 "qid %d: wait draining sock wmem allocation timeout\n",
+		 nvme_tcp_queue_id(queue));
+}
+
 static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
 {
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+	bool was_alive = false;
 
 	if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
 		return;
@@ -1956,11 +1972,14 @@ static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
 		atomic_dec(&nvme_tcp_cpu_queues[queue->io_cpu]);
 
 	mutex_lock(&queue->queue_lock);
-	if (test_and_clear_bit(NVME_TCP_Q_LIVE, &queue->flags))
+	was_alive = test_and_clear_bit(NVME_TCP_Q_LIVE, &queue->flags);
+	if (was_alive)
 		__nvme_tcp_stop_queue(queue);
 	/* Stopping the queue will disable TLS */
 	queue->tls_enabled = false;
 	mutex_unlock(&queue->queue_lock);
+	if (was_alive)
+		nvme_tcp_stop_queue_wait(queue);
 }
 
 static void nvme_tcp_setup_sock_ops(struct nvme_tcp_queue *queue)
-- 
2.34.1

[PATCH v2 0/1] nvme-tcp: wait socket wmem to drain in queue stop
Posted by Michael Liang 9 months, 3 weeks ago
Changes in v2:
 - Always wait in TCP queue stop regardless of the queue's liveness, which
   fixes concurrent queue stops caused by I/O timeouts from multiple namespaces
   of the same controller.
 - Link to v1: https://lore.kernel.org/linux-nvme/20250405054848.3773471-1-mliang@purestorage.com/

Michael Liang (1):
  nvme-tcp: wait socket wmem to drain in queue stop

 drivers/nvme/host/tcp.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

-- 
2.34.1
[PATCH v2 1/1] nvme-tcp: wait socket wmem to drain in queue stop
Posted by Michael Liang 9 months, 3 weeks ago
This patch addresses a data corruption issue observed in nvme-tcp during
testing.

Issue description:
In an NVMe native multipath setup, when an I/O timeout occurs, all inflight
I/Os are canceled almost immediately after the kernel socket is shut down.
These canceled I/Os are reported as host path errors, triggering a failover
that succeeds on a different path.

However, at this point, the original I/O may still be outstanding in the
host's network transmission path (e.g., the NIC's TX queue). From the
user-space application's perspective, the buffers associated with these I/Os
are considered complete, since the I/Os were acknowledged on the other path,
and may be reused for new I/O requests.

Because nvme-tcp enables zero-copy by default in the transmission path,
this can lead to corrupted data being sent to the original target, ultimately
causing data corruption.

We can reproduce this data corruption by injecting delay on one path and
triggering an I/O timeout.

To prevent this issue, this change ensures that all inflight transmissions are
fully completed from the host's perspective before returning from queue
stop. To handle concurrent I/O timeouts from multiple namespaces under
the same controller, always wait in queue stop regardless of the queue's state.

This aligns with the behavior of queue stopping in other NVMe fabric transports.

Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Michael Liang <mliang@purestorage.com>
---
 drivers/nvme/host/tcp.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 26c459f0198d..62d73684e61e 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1944,6 +1944,21 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
 	cancel_work_sync(&queue->io_work);
 }
 
+static void nvme_tcp_stop_queue_wait(struct nvme_tcp_queue *queue)
+{
+	int timeout = 100;
+
+	while (timeout > 0) {
+		if (!sk_wmem_alloc_get(queue->sock->sk))
+			return;
+		msleep(2);
+		timeout -= 2;
+	}
+	dev_warn(queue->ctrl->ctrl.device,
+		 "qid %d: wait draining sock wmem allocation timeout\n",
+		 nvme_tcp_queue_id(queue));
+}
+
 static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
 {
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
@@ -1961,6 +1976,7 @@ static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
 	/* Stopping the queue will disable TLS */
 	queue->tls_enabled = false;
 	mutex_unlock(&queue->queue_lock);
+	nvme_tcp_stop_queue_wait(queue);
 }
 
 static void nvme_tcp_setup_sock_ops(struct nvme_tcp_queue *queue)
-- 
2.34.1

Re: [PATCH v2 1/1] nvme-tcp: wait socket wmem to drain in queue stop
Posted by Sagi Grimberg 9 months, 3 weeks ago

On 4/17/25 10:13, Michael Liang wrote:
> This patch addresses a data corruption issue observed in nvme-tcp during
> testing.
>
> Issue description:
> In an NVMe native multipath setup, when an I/O timeout occurs, all inflight
> I/Os are canceled almost immediately after the kernel socket is shut down.
> These canceled I/Os are reported as host path errors, triggering a failover
> that succeeds on a different path.
>
> However, at this point, the original I/O may still be outstanding in the
> host's network transmission path (e.g., the NIC’s TX queue). From the
> user-space app's perspective, the buffer associated with the I/O is considered
> completed since they're acked on the different path and may be reused for new
> I/O requests.
>
> Because nvme-tcp enables zero-copy by default in the transmission path,
> this can lead to corrupted data being sent to the original target, ultimately
> causing data corruption.
>
> We can reproduce this data corruption by injecting delay on one path and
> triggering i/o timeout.
>
> To prevent this issue, this change ensures that all inflight transmissions are
> fully completed from host's perspective before returning from queue
> stop. To handle concurrent I/O timeout from multiple namespaces under
> the same controller, always wait in queue stop regardless of queue's state.
>
> This aligns with the behavior of queue stopping in other NVMe fabric transports.
>
> Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> Reviewed-by: Randy Jennings <randyj@purestorage.com>
> Signed-off-by: Michael Liang <mliang@purestorage.com>
> ---
>   drivers/nvme/host/tcp.c | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 26c459f0198d..62d73684e61e 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1944,6 +1944,21 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
>   	cancel_work_sync(&queue->io_work);
>   }
>   
> +static void nvme_tcp_stop_queue_wait(struct nvme_tcp_queue *queue)
> +{
> +	int timeout = 100;
> +
> +	while (timeout > 0) {
> +		if (!sk_wmem_alloc_get(queue->sock->sk))
> +			return;
> +		msleep(2);
> +		timeout -= 2;
> +	}
> +	dev_warn(queue->ctrl->ctrl.device,
> +		 "qid %d: wait draining sock wmem allocation timeout\n",
> +		 nvme_tcp_queue_id(queue));
> +}
> +
>   static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
>   {
>   	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
> @@ -1961,6 +1976,7 @@ static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
>   	/* Stopping the queue will disable TLS */
>   	queue->tls_enabled = false;
>   	mutex_unlock(&queue->queue_lock);
> +	nvme_tcp_stop_queue_wait(queue);
>   }
>   
>   static void nvme_tcp_setup_sock_ops(struct nvme_tcp_queue *queue)

This makes sense. But I do not want to pay this price serially.
As the concern is just failover, lets do something like:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 5041cbfd8272..d482a8fe2c4b 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2031,6 +2031,8 @@ static void nvme_tcp_stop_io_queues(struct nvme_ctrl *ctrl)
 
 	for (i = 1; i < ctrl->queue_count; i++)
 		nvme_tcp_stop_queue(ctrl, i);
+	for (i = 1; i < ctrl->queue_count; i++)
+		nvme_tcp_stop_queue_wait(&ctrl->queues[i]);
 }
 
 static int nvme_tcp_start_io_queues(struct nvme_ctrl *ctrl,
@@ -2628,8 +2630,10 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
 {
 	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
 	struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;
+	int idx = nvme_tcp_queue_id(req->queue);
 
-	nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
+	nvme_tcp_stop_queue(ctrl, idx);
+	nvme_tcp_stop_queue_wait(&ctrl->queues[idx]);
 	nvmf_complete_timed_out_request(rq);
 }
Re: [PATCH v2 1/1] nvme-tcp: wait socket wmem to drain in queue stop
Posted by Sagi Grimberg 9 months, 3 weeks ago

On 4/18/25 14:30, Sagi Grimberg wrote:
>
>
> On 4/17/25 10:13, Michael Liang wrote:
>> This patch addresses a data corruption issue observed in nvme-tcp during
>> testing.
>>
>> Issue description:
>> In an NVMe native multipath setup, when an I/O timeout occurs, all 
>> inflight
>> I/Os are canceled almost immediately after the kernel socket is shut 
>> down.
>> These canceled I/Os are reported as host path errors, triggering a 
>> failover
>> that succeeds on a different path.
>>
>> However, at this point, the original I/O may still be outstanding in the
>> host's network transmission path (e.g., the NIC’s TX queue). From the
>> user-space app's perspective, the buffer associated with the I/O is 
>> considered
>> completed since they're acked on the different path and may be reused 
>> for new
>> I/O requests.
>>
>> Because nvme-tcp enables zero-copy by default in the transmission path,
>> this can lead to corrupted data being sent to the original target, 
>> ultimately
>> causing data corruption.
>>
>> We can reproduce this data corruption by injecting delay on one path and
>> triggering i/o timeout.
>>
>> To prevent this issue, this change ensures that all inflight 
>> transmissions are
>> fully completed from host's perspective before returning from queue
>> stop. To handle concurrent I/O timeout from multiple namespaces under
>> the same controller, always wait in queue stop regardless of queue's 
>> state.
>>
>> This aligns with the behavior of queue stopping in other NVMe fabric 
>> transports.
>>
>> Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
>> Reviewed-by: Randy Jennings <randyj@purestorage.com>
>> Signed-off-by: Michael Liang <mliang@purestorage.com>
>> ---
>>   drivers/nvme/host/tcp.c | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 26c459f0198d..62d73684e61e 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1944,6 +1944,21 @@ static void __nvme_tcp_stop_queue(struct 
>> nvme_tcp_queue *queue)
>>       cancel_work_sync(&queue->io_work);
>>   }
>>   +static void nvme_tcp_stop_queue_wait(struct nvme_tcp_queue *queue)
>> +{
>> +    int timeout = 100;
>> +
>> +    while (timeout > 0) {
>> +        if (!sk_wmem_alloc_get(queue->sock->sk))
>> +            return;
>> +        msleep(2);
>> +        timeout -= 2;
>> +    }
>> +    dev_warn(queue->ctrl->ctrl.device,
>> +         "qid %d: wait draining sock wmem allocation timeout\n",
>> +         nvme_tcp_queue_id(queue));
>> +}
>> +
>>   static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
>>   {
>>       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>> @@ -1961,6 +1976,7 @@ static void nvme_tcp_stop_queue(struct 
>> nvme_ctrl *nctrl, int qid)
>>       /* Stopping the queue will disable TLS */
>>       queue->tls_enabled = false;
>>       mutex_unlock(&queue->queue_lock);
>> +    nvme_tcp_stop_queue_wait(queue);
>>   }
>>     static void nvme_tcp_setup_sock_ops(struct nvme_tcp_queue *queue)
>
> This makes sense. But I do not want to pay this price serially.
> As the concern is just failover, lets do something like:
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 5041cbfd8272..d482a8fe2c4b 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2031,6 +2031,8 @@ static void nvme_tcp_stop_io_queues(struct nvme_ctrl *ctrl)
> 
>  	for (i = 1; i < ctrl->queue_count; i++)
>  		nvme_tcp_stop_queue(ctrl, i);
> +	for (i = 1; i < ctrl->queue_count; i++)
> +		nvme_tcp_stop_queue_wait(&ctrl->queues[i]);
>  }
> 
>  static int nvme_tcp_start_io_queues(struct nvme_ctrl *ctrl,
> @@ -2628,8 +2630,10 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
>  {
>  	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
>  	struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;
> +	int idx = nvme_tcp_queue_id(req->queue);
> 
> -	nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
> +	nvme_tcp_stop_queue(ctrl, idx);
> +	nvme_tcp_stop_queue_wait(&ctrl->queues[idx]);
>  	nvmf_complete_timed_out_request(rq);
>  }

Or perhaps something like:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 5041cbfd8272..3e206a2cbbf3 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1944,7 +1944,7 @@ static void __nvme_tcp_stop_queue(struct 
nvme_tcp_queue *queue)
         cancel_work_sync(&queue->io_work);
  }

-static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
+static void nvme_tcp_stop_queue_nowait(struct nvme_ctrl *nctrl, int qid)
  {
         struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
         struct nvme_tcp_queue *queue = &ctrl->queues[qid];
@@ -1963,6 +1963,29 @@ static void nvme_tcp_stop_queue(struct nvme_ctrl 
*nctrl, int qid)
         mutex_unlock(&queue->queue_lock);
  }

+static void nvme_tcp_wait_queue(struct nvme_ctrl *nctrl, int qid)
+{
+       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+       struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+       int timeout = 100;
+
+       while (timeout > 0) {
+               if (!sk_wmem_alloc_get(queue->sock->sk))
+                       return;
+               msleep(2);
+               timeout -= 2;
+       }
+       dev_warn(queue->ctrl->ctrl.device,
+                "qid %d: timeout draining sock wmem allocation expired\n",
+                nvme_tcp_queue_id(queue));
+}
+
+static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
+{
+       nvme_tcp_stop_queue_nowait(nctrl, qid);
+       nvme_tcp_wait_queue(nctrl, qid);
+}
+
  static void nvme_tcp_setup_sock_ops(struct nvme_tcp_queue *queue)
  {
write_lock_bh(&queue->sock->sk->sk_callback_lock);
@@ -2030,7 +2053,9 @@ static void nvme_tcp_stop_io_queues(struct 
nvme_ctrl *ctrl)
         int i;

         for (i = 1; i < ctrl->queue_count; i++)
-               nvme_tcp_stop_queue(ctrl, i);
+               nvme_tcp_stop_queue_nowait(ctrl, i);
+       for (i = 1; i < ctrl->queue_count; i++)
+               nvme_tcp_wait_queue(ctrl, i);
  }

  static int nvme_tcp_start_io_queues(struct nvme_ctrl *ctrl,
--
Re: [PATCH v2 1/1] nvme-tcp: wait socket wmem to drain in queue stop
Posted by Michael Liang 9 months, 3 weeks ago
On Fri, Apr 18, 2025 at 02:49:25PM +0300, Sagi Grimberg wrote:
> 
> 
> On 4/18/25 14:30, Sagi Grimberg wrote:
> > 
> > 
> > On 4/17/25 10:13, Michael Liang wrote:
> > > This patch addresses a data corruption issue observed in nvme-tcp during
> > > testing.
> > > 
> > > Issue description:
> > > In an NVMe native multipath setup, when an I/O timeout occurs, all
> > > inflight
> > > I/Os are canceled almost immediately after the kernel socket is shut
> > > down.
> > > These canceled I/Os are reported as host path errors, triggering a
> > > failover
> > > that succeeds on a different path.
> > > 
> > > However, at this point, the original I/O may still be outstanding in the
> > > host's network transmission path (e.g., the NIC’s TX queue). From the
> > > user-space app's perspective, the buffer associated with the I/O is
> > > considered
> > > completed since they're acked on the different path and may be
> > > reused for new
> > > I/O requests.
> > > 
> > > Because nvme-tcp enables zero-copy by default in the transmission path,
> > > this can lead to corrupted data being sent to the original target,
> > > ultimately
> > > causing data corruption.
> > > 
> > > We can reproduce this data corruption by injecting delay on one path and
> > > triggering i/o timeout.
> > > 
> > > To prevent this issue, this change ensures that all inflight
> > > transmissions are
> > > fully completed from host's perspective before returning from queue
> > > stop. To handle concurrent I/O timeout from multiple namespaces under
> > > the same controller, always wait in queue stop regardless of queue's
> > > state.
> > > 
> > > This aligns with the behavior of queue stopping in other NVMe fabric
> > > transports.
> > > 
> > > Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > > Reviewed-by: Randy Jennings <randyj@purestorage.com>
> > > Signed-off-by: Michael Liang <mliang@purestorage.com>
> > > ---
> > >   drivers/nvme/host/tcp.c | 16 ++++++++++++++++
> > >   1 file changed, 16 insertions(+)
> > > 
> > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > > index 26c459f0198d..62d73684e61e 100644
> > > --- a/drivers/nvme/host/tcp.c
> > > +++ b/drivers/nvme/host/tcp.c
> > > @@ -1944,6 +1944,21 @@ static void __nvme_tcp_stop_queue(struct
> > > nvme_tcp_queue *queue)
> > >       cancel_work_sync(&queue->io_work);
> > >   }
> > >   +static void nvme_tcp_stop_queue_wait(struct nvme_tcp_queue *queue)
> > > +{
> > > +    int timeout = 100;
> > > +
> > > +    while (timeout > 0) {
> > > +        if (!sk_wmem_alloc_get(queue->sock->sk))
> > > +            return;
> > > +        msleep(2);
> > > +        timeout -= 2;
> > > +    }
> > > +    dev_warn(queue->ctrl->ctrl.device,
> > > +         "qid %d: wait draining sock wmem allocation timeout\n",
> > > +         nvme_tcp_queue_id(queue));
> > > +}
> > > +
> > >   static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
> > >   {
> > >       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
> > > @@ -1961,6 +1976,7 @@ static void nvme_tcp_stop_queue(struct
> > > nvme_ctrl *nctrl, int qid)
> > >       /* Stopping the queue will disable TLS */
> > >       queue->tls_enabled = false;
> > >       mutex_unlock(&queue->queue_lock);
> > > +    nvme_tcp_stop_queue_wait(queue);
> > >   }
> > >     static void nvme_tcp_setup_sock_ops(struct nvme_tcp_queue *queue)
> > 
> > This makes sense. But I do not want to pay this price serially.
> > As the concern is just failover, lets do something like:
> > --
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index 5041cbfd8272..d482a8fe2c4b 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -2031,6 +2031,8 @@ static void nvme_tcp_stop_io_queues(struct nvme_ctrl *ctrl)
> > 
> >  	for (i = 1; i < ctrl->queue_count; i++)
> >  		nvme_tcp_stop_queue(ctrl, i);
> > +	for (i = 1; i < ctrl->queue_count; i++)
> > +		nvme_tcp_stop_queue_wait(&ctrl->queues[i]);
> >  }
> > 
> >  static int nvme_tcp_start_io_queues(struct nvme_ctrl *ctrl,
> > @@ -2628,8 +2630,10 @@ static void nvme_tcp_complete_timed_out(struct request *rq)
> >  {
> >  	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> >  	struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;
> > +	int idx = nvme_tcp_queue_id(req->queue);
> > 
> > -	nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
> > +	nvme_tcp_stop_queue(ctrl, idx);
> > +	nvme_tcp_stop_queue_wait(&ctrl->queues[idx]);
> >  	nvmf_complete_timed_out_request(rq);
> >  }
> 
> Or perhaps something like:
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 5041cbfd8272..3e206a2cbbf3 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1944,7 +1944,7 @@ static void __nvme_tcp_stop_queue(struct
> nvme_tcp_queue *queue)
>         cancel_work_sync(&queue->io_work);
>  }
> 
> -static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
> +static void nvme_tcp_stop_queue_nowait(struct nvme_ctrl *nctrl, int qid)
>  {
>         struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>         struct nvme_tcp_queue *queue = &ctrl->queues[qid];
> @@ -1963,6 +1963,29 @@ static void nvme_tcp_stop_queue(struct nvme_ctrl
> *nctrl, int qid)
>         mutex_unlock(&queue->queue_lock);
>  }
> 
> +static void nvme_tcp_wait_queue(struct nvme_ctrl *nctrl, int qid)
> +{
> +       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
> +       struct nvme_tcp_queue *queue = &ctrl->queues[qid];
> +       int timeout = 100;
> +
> +       while (timeout > 0) {
> +               if (!sk_wmem_alloc_get(queue->sock->sk))
> +                       return;
> +               msleep(2);
> +               timeout -= 2;
> +       }
> +       dev_warn(queue->ctrl->ctrl.device,
> +                "qid %d: timeout draining sock wmem allocation expired\n",
> +                nvme_tcp_queue_id(queue));
> +}
> +
> +static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
> +{
> +       nvme_tcp_stop_queue_nowait(nctrl, qid);
> +       nvme_tcp_wait_queue(nctrl, qid);
> +}
> +
>  static void nvme_tcp_setup_sock_ops(struct nvme_tcp_queue *queue)
>  {
> write_lock_bh(&queue->sock->sk->sk_callback_lock);
> @@ -2030,7 +2053,9 @@ static void nvme_tcp_stop_io_queues(struct nvme_ctrl
> *ctrl)
>         int i;
> 
>         for (i = 1; i < ctrl->queue_count; i++)
> -               nvme_tcp_stop_queue(ctrl, i);
> +               nvme_tcp_stop_queue_nowait(ctrl, i);
> +       for (i = 1; i < ctrl->queue_count; i++)
> +               nvme_tcp_wait_queue(ctrl, i);
>  }
> 
>  static int nvme_tcp_start_io_queues(struct nvme_ctrl *ctrl,
> --
Yes, good idea to stop first and then wait for all. Will verify this patch.

Thanks,
Michael Liang
Re: [PATCH] nvme-tcp: wait socket wmem to drain in queue stop
Posted by Sagi Grimberg 10 months ago

On 05/04/2025 8:48, Michael Liang wrote:
> This patch addresses a data corruption issue observed in nvme-tcp during
> testing.
>
> Issue description:
> In an NVMe native multipath setup, when an I/O timeout occurs, all inflight
> I/Os are canceled almost immediately after the kernel socket is shut down.
> These canceled I/Os are reported as host path errors, triggering a failover
> that succeeds on a different path.
>
> However, at this point, the original I/O may still be outstanding in the
> host's network transmission path (e.g., the NIC’s TX queue). From the
> user-space app's perspective, the buffer associated with the I/O is considered
> completed since they're acked on the different path and may be reused for new
> I/O requests.
>
> Because nvme-tcp enables zero-copy by default in the transmission path,
> this can lead to corrupted data being sent to the original target, ultimately
> causing data corruption.

This is unexpected.

1. Before retrying the command, the host shuts down the socket.
2. The host sets sk_lingertime to 0, which means that as soon as the socket
is shut down, the packet should not be able to be transmitted again on the
socket, zero-copy or not. Perhaps there is something not handled correctly
with linger=0? Perhaps you should try with linger=<some-timeout>?
Re: [PATCH] nvme-tcp: wait socket wmem to drain in queue stop
Posted by Michael Liang 9 months, 3 weeks ago
On Mon, Apr 14, 2025 at 01:25:05AM +0300, Sagi Grimberg wrote:
> 
> 
> On 05/04/2025 8:48, Michael Liang wrote:
> > This patch addresses a data corruption issue observed in nvme-tcp during
> > testing.
> > 
> > Issue description:
> > In an NVMe native multipath setup, when an I/O timeout occurs, all inflight
> > I/Os are canceled almost immediately after the kernel socket is shut down.
> > These canceled I/Os are reported as host path errors, triggering a failover
> > that succeeds on a different path.
> > 
> > However, at this point, the original I/O may still be outstanding in the
> > host's network transmission path (e.g., the NIC’s TX queue). From the
> > user-space app's perspective, the buffer associated with the I/O is considered
> > completed since they're acked on the different path and may be reused for new
> > I/O requests.
> > 
> > Because nvme-tcp enables zero-copy by default in the transmission path,
> > this can lead to corrupted data being sent to the original target, ultimately
> > causing data corruption.
> 
> This is unexpected.
> 
> 1. before retrying the command, the host shuts down the socket.
> 2. the host sets sk_lingertime to 0, which means that
> as soon as the socket is shutdown - the packet should not be able to
> transmit again
> on the socket, zero-copy or not. Perhaps there is something not handled
> correctly
> with linger=0? perhaps you should try with linger=<some-timeout> ?
I did notice that the linger time is explicitly set to 0 in nvme-tcp, but it
doesn't behave as expected in this case, for two main reasons:
1. We're invoking socket shutdown, not socket close. During shutdown, tcp_shutdown()
in net/ipv4/tcp.c is called. This changes the socket state and may send a FIN
if needed; however, it doesn't consider the linger setting at all.
2. Furthermore, while tcp_close() does check the linger time, we experimented with
using socket close instead of shutdown, yet the same data corruption issue persisted.

The root cause is that once data is handed off to the lower-level device driver for
transmission, neither socket shutdown nor close can cancel it. With further tracing,
I realized that the socket could be freed a while after close, when the outstanding
TX skb is released by the NIC. sk_wmem_alloc is what tracks the outstanding data
handed to the lower-level driver, so it's necessary to wait until it drops to zero
in queue stop.
Re: [PATCH] nvme-tcp: wait socket wmem to drain in queue stop
Posted by Randy Jennings 10 months ago
On Fri, Apr 4, 2025 at 10:49 PM Michael Liang <mliang@purestorage.com> wrote:
>
> This patch addresses a data corruption issue observed in nvme-tcp during
> testing.
>
> Issue description:
> In an NVMe native multipath setup, when an I/O timeout occurs, all inflight
> I/Os are canceled almost immediately after the kernel socket is shut down.
> These canceled I/Os are reported as host path errors, triggering a failover
> that succeeds on a different path.
>
> However, at this point, the original I/O may still be outstanding in the
> host's network transmission path (e.g., the NIC’s TX queue). From the
> user-space app's perspective, the buffer associated with the I/O is considered
> completed since they're acked on the different path and may be reused for new
> I/O requests.
>
> Because nvme-tcp enables zero-copy by default in the transmission path,
> this can lead to corrupted data being sent to the original target, ultimately
> causing data corruption.
>
> We can reproduce this data corruption by injecting delay on one path and
> triggering i/o timeout.
>
> To prevent this issue, this change ensures that all inflight transmissions are
> fully completed from host's perspective before returning from queue stop.
> This aligns with the behavior of queue stopping in other NVMe fabric transports.
>
> Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> Reviewed-by: Randy Jennings <randyj@purestorage.com>
> Signed-off-by: Michael Liang <mliang@purestorage.com>

Through additional testing, we have recreated the corruption with this
patch.  We had a previous iteration of the patch that ran some time
without the corruption, and we convinced ourselves internally that a
portion of that version should not be needed.  So, unfortunately, it
looks like this patch is not sufficient to prevent the data
corruption.  We do believe the issue is still with the zero-copy and
too-quick retransmission (our tests showed that data that was only in
the buffer while userspace controlled the buffer was transmitted on
the wire), but we are still investigating.

Sincerely,
Randy Jennings
Re: [PATCH] nvme-tcp: wait socket wmem to drain in queue stop
Posted by Michael Liang 9 months, 3 weeks ago
On Tue, Apr 08, 2025 at 02:07:24PM -0700, Randy Jennings wrote:
> On Fri, Apr 4, 2025 at 10:49 PM Michael Liang <mliang@purestorage.com> wrote:
> >
> > This patch addresses a data corruption issue observed in nvme-tcp during
> > testing.
> >
> > Issue description:
> > In an NVMe native multipath setup, when an I/O timeout occurs, all inflight
> > I/Os are canceled almost immediately after the kernel socket is shut down.
> > These canceled I/Os are reported as host path errors, triggering a failover
> > that succeeds on a different path.
> >
> > However, at this point, the original I/O may still be outstanding in the
> > host's network transmission path (e.g., the NIC’s TX queue). From the
> > user-space app's perspective, the buffer associated with the I/O is considered
> > completed since they're acked on the different path and may be reused for new
> > I/O requests.
> >
> > Because nvme-tcp enables zero-copy by default in the transmission path,
> > this can lead to corrupted data being sent to the original target, ultimately
> > causing data corruption.
> >
> > We can reproduce this data corruption by injecting delay on one path and
> > triggering i/o timeout.
> >
> > To prevent this issue, this change ensures that all inflight transmissions are
> > fully completed from host's perspective before returning from queue stop.
> > This aligns with the behavior of queue stopping in other NVMe fabric transports.
> >
> > Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > Reviewed-by: Randy Jennings <randyj@purestorage.com>
> > Signed-off-by: Michael Liang <mliang@purestorage.com>
> 
> Through additional testing, we have recreated the corruption with this
> patch.  We had a previous iteration of the patch that ran some time
> without the corruption, and we convinced ourselves internally that a
> portion of that version should not be needed.  So, unfortunately, it
> looks like this patch is not sufficient to prevent the data
> corruption.  We do believe the issue is still with the zero-copy and
> too-quick retransmission (our tests showed that data that was only in
> the buffer while userspace controlled the buffer was transmitted on
> the wire), but we are still investigating.
> 
> Sincerely,
> Randy Jennings
We identified what was missing in the patch. In cases where concurrent I/O
timeouts occur on multiple namespaces under the same controller, there's a
race condition when stopping the same I/O queue. Due to the current patch's
queue liveness check, only the first timeout handler waits, while the others
proceed to fail I/Os immediately.

To address this race, we've modified the logic to always wait during queue
stop, regardless of the queue state. This resolves the issue. We'll post an
updated patch shortly.

Thanks,
Michael Liang
Re: [PATCH] nvme-tcp: wait socket wmem to drain in queue stop
Posted by Chaitanya Kulkarni 10 months ago
On 4/4/25 22:48, Michael Liang wrote:
> +static void nvme_tcp_stop_queue_wait(struct nvme_tcp_queue *queue)
> +{
> +	int timeout = 100;
> +

Is there a guarantee that the above will work for all setups?
Using a configurable timeout value helps create a more generic
fix; do we need to consider that here?

> +	while (timeout > 0) {
> +		if (!sk_wmem_alloc_get(queue->sock->sk))
> +			return;
> +		msleep(2);
> +		timeout -= 2;
> +	}
> +	dev_warn(queue->ctrl->ctrl.device,
> +		 "qid %d: wait draining sock wmem allocation timeout\n",
> +		 nvme_tcp_queue_id(queue));
> +}
> +

-ck


Re: [PATCH] nvme-tcp: wait socket wmem to drain in queue stop
Posted by Michael Liang 9 months, 3 weeks ago
On Tue, Apr 08, 2025 at 09:00:00PM +0000, Chaitanya Kulkarni wrote:
> On 4/4/25 22:48, Michael Liang wrote:
> > +static void nvme_tcp_stop_queue_wait(struct nvme_tcp_queue *queue)
> > +{
> > +	int timeout = 100;
> > +
> 
> is there a guarantee that above will work for all the setups?
> using configurable timeout values helps creating more generic
> fix, do we need to consider that here ?
The value here primarily reflects the latency between __tcp_transmit_skb()
and the freeing of the skb in the TX completion path. For most scenarios,
100ms should be sufficient. While it's theoretically possible to see higher
latencies, such cases might not be typical or practical for NVMe-TCP (please
correct me if I'm wrong).

That said, I'm open to making this timeout configurable if needed, perhaps
via a module parameter?

> > +	while (timeout > 0) {
> > +		if (!sk_wmem_alloc_get(queue->sock->sk))
> > +			return;
> > +		msleep(2);
> > +		timeout -= 2;
> > +	}
> > +	dev_warn(queue->ctrl->ctrl.device,
> > +		 "qid %d: wait draining sock wmem allocation timeout\n",
> > +		 nvme_tcp_queue_id(queue));
> > +}
> > +
> 
> -ck
> 
> 

Thanks,
Michael