[v1] nvme/tcp: handle tls partially sent records in write_space()

[PATCH] nvme/tcp: handle tls partially sent records in write_space()

Posted by Wilfred Mallawa 4 months ago

From: Wilfred Mallawa <wilfred.mallawa@wdc.com>

With TLS enabled, records that are encrypted and appended to the TLS TX
list can fail to see a retry if the underlying TCP socket is busy, for
example, when tcp_sendmsg_locked() returns EAGAIN. The NVMe TCP driver is
not aware of this, as the TLS layer successfully generated a record.

Typically, the TLS write_space() callback would ensure such records are
retried, but in the NVMe TCP host driver, write_space() invokes
nvme_tcp_write_space() instead. As a result, a partially sent record in
the TLS TX list is never retried and eventually times out.

This patch aims to address the above by first publicly exposing
tls_is_partially_sent_record(), then using it in the NVMe TCP host
driver to invoke the TLS write_space() handler where appropriate.

Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Fixes: be8e82caa685 ("nvme-tcp: enable TLS handshake upcall")
---
 drivers/nvme/host/tcp.c | 8 ++++++++
 include/net/tls.h       | 5 +++++
 net/tls/tls.h           | 5 -----
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 1413788ca7d5..e3d02c33243b 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1076,11 +1076,18 @@ static void nvme_tcp_data_ready(struct sock *sk)
 static void nvme_tcp_write_space(struct sock *sk)
 {
 	struct nvme_tcp_queue *queue;
+	struct tls_context *ctx = tls_get_ctx(sk);
 
 	read_lock_bh(&sk->sk_callback_lock);
 	queue = sk->sk_user_data;
+
 	if (likely(queue && sk_stream_is_writeable(sk))) {
 		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+		/* Ensure pending TLS partial records are retried */
+		if (nvme_tcp_queue_tls(queue) &&
+		    tls_is_partially_sent_record(ctx))
+			queue->write_space(sk);
+
 		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 	}
 	read_unlock_bh(&sk->sk_callback_lock);
@@ -1306,6 +1313,7 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
 static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
 {
 	struct nvme_tcp_request *req;
+	struct tls_context *ctx = tls_get_ctx(queue->sock->sk);
 	unsigned int noreclaim_flag;
 	int ret = 1;
 
diff --git a/include/net/tls.h b/include/net/tls.h
index 857340338b69..9c61a2de44bf 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -373,6 +373,11 @@ static inline struct tls_context *tls_get_ctx(const struct sock *sk)
 	return (__force void *)icsk->icsk_ulp_data;
 }
 
+static inline bool tls_is_partially_sent_record(struct tls_context *ctx)
+{
+	return !!ctx->partially_sent_record;
+}
+
 static inline struct tls_sw_context_rx *tls_sw_ctx_rx(
 		const struct tls_context *tls_ctx)
 {
diff --git a/net/tls/tls.h b/net/tls/tls.h
index 2f86baeb71fc..7839a2effe31 100644
--- a/net/tls/tls.h
+++ b/net/tls/tls.h
@@ -271,11 +271,6 @@ int tls_push_partial_record(struct sock *sk, struct tls_context *ctx,
 			    int flags);
 void tls_free_partial_record(struct sock *sk, struct tls_context *ctx);
 
-static inline bool tls_is_partially_sent_record(struct tls_context *ctx)
-{
-	return !!ctx->partially_sent_record;
-}
-
 static inline bool tls_is_pending_open_record(struct tls_context *tls_ctx)
 {
 	return tls_ctx->pending_open_record_frags;
-- 
2.51.0

Re: [PATCH] nvme/tcp: handle tls partially sent records in write_space()

Posted by kernel test robot 4 months ago

Hi Wilfred,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net/main]
[also build test WARNING on net-next/main linus/master linux-nvme/for-next v6.17 next-20251009]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Wilfred-Mallawa/nvme-tcp-handle-tls-partially-sent-records-in-write_space/20251009-193029
base:   net/main
patch link:    https://lore.kernel.org/r/20251007004634.38716-2-wilfred.opensource%40gmail.com
patch subject: [PATCH] nvme/tcp: handle tls partially sent records in write_space()
config: s390-randconfig-002-20251010 (https://download.01.org/0day-ci/archive/20251010/202510100505.gzOzGPbI-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 39f292ffa13d7ca0d1edff27ac8fd55024bb4d19)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251010/202510100505.gzOzGPbI-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510100505.gzOzGPbI-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/nvme/host/tcp.c:1316:22: warning: unused variable 'ctx' [-Wunused-variable]
    1316 |         struct tls_context *ctx = tls_get_ctx(queue->sock->sk);
         |                             ^~~
   1 warning generated.


vim +/ctx +1316 drivers/nvme/host/tcp.c

  1312	
  1313	static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
  1314	{
  1315		struct nvme_tcp_request *req;
> 1316		struct tls_context *ctx = tls_get_ctx(queue->sock->sk);
  1317		unsigned int noreclaim_flag;
  1318		int ret = 1;
  1319	
  1320		if (!queue->request) {
  1321			queue->request = nvme_tcp_fetch_request(queue);
  1322			if (!queue->request)
  1323				return 0;
  1324		}
  1325		req = queue->request;
  1326	
  1327		noreclaim_flag = memalloc_noreclaim_save();
  1328		if (req->state == NVME_TCP_SEND_CMD_PDU) {
  1329			ret = nvme_tcp_try_send_cmd_pdu(req);
  1330			if (ret <= 0)
  1331				goto done;
  1332			if (!nvme_tcp_has_inline_data(req))
  1333				goto out;
  1334		}
  1335	
  1336		if (req->state == NVME_TCP_SEND_H2C_PDU) {
  1337			ret = nvme_tcp_try_send_data_pdu(req);
  1338			if (ret <= 0)
  1339				goto done;
  1340		}
  1341	
  1342		if (req->state == NVME_TCP_SEND_DATA) {
  1343			ret = nvme_tcp_try_send_data(req);
  1344			if (ret <= 0)
  1345				goto done;
  1346		}
  1347	
  1348		if (req->state == NVME_TCP_SEND_DDGST)
  1349			ret = nvme_tcp_try_send_ddgst(req);
  1350	done:
  1351		if (ret == -EAGAIN) {
  1352			ret = 0;
  1353		} else if (ret < 0) {
  1354			dev_err(queue->ctrl->ctrl.device,
  1355				"failed to send request %d\n", ret);
  1356			nvme_tcp_fail_request(queue->request);
  1357			nvme_tcp_done_send_req(queue);
  1358		}
  1359	out:
  1360		memalloc_noreclaim_restore(noreclaim_flag);
  1361		return ret;
  1362	}
  1363	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

Re: [PATCH] nvme/tcp: handle tls partially sent records in write_space()

Posted by Hannes Reinecke 4 months ago

On 10/7/25 02:46, Wilfred Mallawa wrote:
> From: Wilfred Mallawa <wilfred.mallawa@wdc.com>
> 
> With TLS enabled, records that are encrypted and appended to the TLS TX
> list can fail to see a retry if the underlying TCP socket is busy, for
> example, when tcp_sendmsg_locked() returns EAGAIN. The NVMe TCP driver is
> not aware of this, as the TLS layer successfully generated a record.
>
> Typically, the TLS write_space() callback would ensure such records are
> retried, but in the NVMe TCP host driver, write_space() invokes
> nvme_tcp_write_space() instead. As a result, a partially sent record in
> the TLS TX list is never retried and eventually times out.
>
> This patch aims to address the above by first publicly exposing
> tls_is_partially_sent_record(), then using it in the NVMe TCP host
> driver to invoke the TLS write_space() handler where appropriate.
> 
> Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
> Fixes: be8e82caa685 ("nvme-tcp: enable TLS handshake upcall")
> ---
>   drivers/nvme/host/tcp.c | 8 ++++++++
>   include/net/tls.h       | 5 +++++
>   net/tls/tls.h           | 5 -----
>   3 files changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 1413788ca7d5..e3d02c33243b 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1076,11 +1076,18 @@ static void nvme_tcp_data_ready(struct sock *sk)
>   static void nvme_tcp_write_space(struct sock *sk)
>   {
>   	struct nvme_tcp_queue *queue;
> +	struct tls_context *ctx = tls_get_ctx(sk);
>   
>   	read_lock_bh(&sk->sk_callback_lock);
>   	queue = sk->sk_user_data;
> +
>   	if (likely(queue && sk_stream_is_writeable(sk))) {
>   		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> +		/* Ensure pending TLS partial records are retried */
> +		if (nvme_tcp_queue_tls(queue) &&
> +		    tls_is_partially_sent_record(ctx))
> +			queue->write_space(sk);
> +
>   		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>   	}
>   	read_unlock_bh(&sk->sk_callback_lock);

I wonder: Do we really need to check for a partially assembled record,
or wouldn't it be easier to call queue->write_space() every time here?
We sure would end up with executing the callback more often, but if no
data is present it shouldn't do any harm.

IE just use

if (nvme_tcp_queue_tls(queue))
     queue->write_space(sk);
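
For readers following the thread, here is a minimal userspace sketch of the simplified callback suggested above. It is not kernel code: struct mock_sock, struct mock_queue and both mock_* functions are hypothetical stand-ins for struct sock, struct nvme_tcp_queue, the saved TLS write_space handler and nvme_tcp_write_space(). It only assumes, as the suggestion does, that calling the saved handler is harmless when no record is pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sock. */
struct mock_sock {
	bool writeable;              /* models sk_stream_is_writeable() */
	bool partial_record_pending; /* models a record stuck in the TLS TX list */
	int tls_retries;             /* times the saved TLS handler ran */
	int io_work_queued;          /* times io_work was scheduled */
};

/* Hypothetical stand-in for struct nvme_tcp_queue. */
struct mock_queue {
	bool tls_enabled;                          /* models nvme_tcp_queue_tls() */
	void (*write_space)(struct mock_sock *sk); /* saved original callback */
};

/* Models the saved TLS handler: retries a pending record, no-op otherwise. */
static void mock_tls_write_space(struct mock_sock *sk)
{
	sk->tls_retries++;
	sk->partial_record_pending = false;
}

/* Models the simplified callback: always chain to the saved handler on TLS queues. */
static void mock_nvme_tcp_write_space(struct mock_queue *queue,
				      struct mock_sock *sk)
{
	if (queue && sk->writeable) {
		if (queue->tls_enabled) {
			assert(queue->write_space != NULL);
			queue->write_space(sk);
		}
		sk->io_work_queued++;
	}
}
```

In this model the saved handler runs on every writeable notification for a TLS queue, so a stuck record gets retried without the driver having to inspect TLS state. The cost is one extra call per notification when nothing is pending.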

> @@ -1306,6 +1313,7 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
>   static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
>   {
>   	struct nvme_tcp_request *req;
> +	struct tls_context *ctx = tls_get_ctx(queue->sock->sk);
>   	unsigned int noreclaim_flag;
>   	int ret = 1;
>   

And we need this why?

> diff --git a/include/net/tls.h b/include/net/tls.h
> index 857340338b69..9c61a2de44bf 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
> @@ -373,6 +373,11 @@ static inline struct tls_context *tls_get_ctx(const struct sock *sk)
>   	return (__force void *)icsk->icsk_ulp_data;
>   }
>   
> +static inline bool tls_is_partially_sent_record(struct tls_context *ctx)
> +{
> +	return !!ctx->partially_sent_record;
> +}
> +
>   static inline struct tls_sw_context_rx *tls_sw_ctx_rx(
>   		const struct tls_context *tls_ctx)
>   {
> diff --git a/net/tls/tls.h b/net/tls/tls.h
> index 2f86baeb71fc..7839a2effe31 100644
> --- a/net/tls/tls.h
> +++ b/net/tls/tls.h
> @@ -271,11 +271,6 @@ int tls_push_partial_record(struct sock *sk, struct tls_context *ctx,
>   			    int flags);
>   void tls_free_partial_record(struct sock *sk, struct tls_context *ctx);
>   
> -static inline bool tls_is_partially_sent_record(struct tls_context *ctx)
> -{
> -	return !!ctx->partially_sent_record;
> -}
> -
>   static inline bool tls_is_pending_open_record(struct tls_context *tls_ctx)
>   {
>   	return tls_ctx->pending_open_record_frags;

See above. If we were calling ->write_space unconditionally we
wouldn't even need this export.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

Re: [PATCH] nvme/tcp: handle tls partially sent records in write_space()

Posted by Wilfred Mallawa 4 months ago

On Tue, 2025-10-07 at 07:19 +0200, Hannes Reinecke wrote:
> On 10/7/25 02:46, Wilfred Mallawa wrote:
> > From: Wilfred Mallawa <wilfred.mallawa@wdc.com>
> > 
> 
[...]
> I wonder: Do we really need to check for a partially assembled
> record,
> or wouldn't it be easier to call queue->write_space() every time
> here?
> We sure would end up with executing the callback more often, but if
> no
> data is present it shouldn't do any harm.
> 
> IE just use
> 
> if (nvme_tcp_queue_tls(queue))
>      queue->write_space(sk);

Hey Hannes,

This was my initial approach, but I figured using
tls_is_partially_sent_record() might be slightly more efficient. But if
we think that's negligible, happy to go with this approach (omitting
the partial record check).

Wilfred

> 
> > @@ -1306,6 +1313,7 @@ static int nvme_tcp_try_send_ddgst(struct
> > nvme_tcp_request *req)
> >   static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
> >   {
> >   	struct nvme_tcp_request *req;
> > +	struct tls_context *ctx = tls_get_ctx(queue->sock->sk);
> >   	unsigned int noreclaim_flag;
> >   	int ret = 1;
> >   
> 
> And we need this why?
> 
> > diff --git a/include/net/tls.h b/include/net/tls.h
> > index 857340338b69..9c61a2de44bf 100644
> > --- a/include/net/tls.h
> > +++ b/include/net/tls.h
> > @@ -373,6 +373,11 @@ static inline struct tls_context
> > *tls_get_ctx(const struct sock *sk)
> >   	return (__force void *)icsk->icsk_ulp_data;
> >   }
> >   
> > +static inline bool tls_is_partially_sent_record(struct tls_context
> > *ctx)
> > +{
> > +	return !!ctx->partially_sent_record;
> > +}
> > +
> >   static inline struct tls_sw_context_rx *tls_sw_ctx_rx(
> >   		const struct tls_context *tls_ctx)
> >   {
> > diff --git a/net/tls/tls.h b/net/tls/tls.h
> > index 2f86baeb71fc..7839a2effe31 100644
> > --- a/net/tls/tls.h
> > +++ b/net/tls/tls.h
> > @@ -271,11 +271,6 @@ int tls_push_partial_record(struct sock *sk,
> > struct tls_context *ctx,
> >   			    int flags);
> >   void tls_free_partial_record(struct sock *sk, struct tls_context
> > *ctx);
> >   
> > -static inline bool tls_is_partially_sent_record(struct tls_context
> > *ctx)
> > -{
> > -	return !!ctx->partially_sent_record;
> > -}
> > -
> >   static inline bool tls_is_pending_open_record(struct tls_context
> > *tls_ctx)
> >   {
> >   	return tls_ctx->pending_open_record_frags;
> 
> See above. If we were calling ->write_space unconditionally we
> wouldn't even need this export.
> 
> Cheers,
> 
> Hannes

Re: [PATCH] nvme/tcp: handle tls partially sent records in write_space()

Posted by Hannes Reinecke 4 months ago

On 10/7/25 11:24, Wilfred Mallawa wrote:
> On Tue, 2025-10-07 at 07:19 +0200, Hannes Reinecke wrote:
>> On 10/7/25 02:46, Wilfred Mallawa wrote:
>>> From: Wilfred Mallawa <wilfred.mallawa@wdc.com>
>>>
>>
> [...]
>> I wonder: Do we really need to check for a partially assembled
>> record,
>> or wouldn't it be easier to call queue->write_space() every time
>> here?
>> We sure would end up with executing the callback more often, but if
>> no
>> data is present it shouldn't do any harm.
>>
>> IE just use
>>
>> if (nvme_tcp_queue_tls(queue))
>>       queue->write_space(sk);
> 
> Hey Hannes,
> 
> This was my initial approach, but I figured using
> tls_is_partially_sent_record() might be slightly more efficient. But if
> we think that's negligible, happy to go with this approach (omitting
> the partial record check).
> 
Please do.
Performance testing on NVMe-TCP is notoriously tricky, so for now we
really should not assume anything here.
And it's making the patch _vastly_ simpler, _and_ we don't have to
involve the networking folks here.
We have a similar patch for the data_ready() function in nvmet_tcp(),
and that seemed to work, too.
Nit: we don't unset the 'NOSPACE' flag there. Can you check if that's
really required? And, if it is, fixup nvmet_tcp() to unset it?
Or, if not, modify your patch to not clear it?

Cheers,

Hannes

Re: [PATCH] nvme/tcp: handle tls partially sent records in write_space()

Posted by Wilfred Mallawa 4 months ago

On Tue, 2025-10-07 at 11:51 +0200, Hannes Reinecke wrote:
> On 10/7/25 11:24, Wilfred Mallawa wrote:
> > On Tue, 2025-10-07 at 07:19 +0200, Hannes Reinecke wrote:
> > > On 10/7/25 02:46, Wilfred Mallawa wrote:
> > > > From: Wilfred Mallawa <wilfred.mallawa@wdc.com>
> > > > 
> > > 
> > [...]
> > > I wonder: Do we really need to check for a partially assembled
> > > record,
> > > or wouldn't it be easier to call queue->write_space() every time
> > > here?
> > > We sure would end up with executing the callback more often, but
> > > if
> > > no
> > > data is present it shouldn't do any harm.
> > > 
> > > IE just use
> > > 
> > > if (nvme_tcp_queue_tls(queue))
> > >       queue->write_space(sk);
> > 
> > Hey Hannes,
> > 
> > This was my initial approach, but I figured using
> > tls_is_partially_sent_record() might be slightly more efficient.
> > But if
> > we think that's negligible, happy to go with this approach
> > (omitting
> > the partial record check).
> > 
> Please do.
> Performance testing on NVMe-TCP is notoriously tricky, so for now we
> really should not assume anything here.
> And it's making the patch _vastly_ simpler, _and_ we don't have to
> involve the networking folks here.

Okay, will send a V2 with this approach.

> We have a similar patch for the data_ready() function in nvmet_tcp(),
> and that seemed to work, too.
> Nit: we don't unset the 'NOSPACE' flag there. Can you check if that's
> really required? 
> And, if it is, fixup nvmet_tcp() to unset it?
> Or, if not, modify your patch to not clear it?

I don't see why we would need to clear the NOSPACE flag in
data_ready()? My understanding is that this flag is used when the send
buffer is full.

I would think the clear_bit() is necessary in write_space() since it
would typically get done in something like sk_stream_write_space()? 
However, running some quick FIOs with the clear_bit() removed, things
seem to work. Not sure if removing it has any further implications
though...
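
To make the question concrete, below is a small userspace model of how I understand the flag to behave: the send path sets it when the send buffer is full, and a write_space-style callback clears it once space is available again. Everything here is a hypothetical stand-in (struct mock_socket, MOCK_SOCK_NOSPACE, the mock_* helpers), and the writeability test is deliberately simplified compared with sk_stream_is_writeable().

```c
#include <stdbool.h>

#define MOCK_SOCK_NOSPACE 0 /* hypothetical bit index, not the kernel value */

/* Hypothetical stand-in for the socket state relevant to the flag. */
struct mock_socket {
	unsigned long flags;
	unsigned int sndbuf; /* send buffer capacity in bytes */
	unsigned int queued; /* bytes currently queued */
};

/* Simplified: writeable whenever any space is left. */
static bool mock_is_writeable(const struct mock_socket *s)
{
	return s->queued < s->sndbuf;
}

/* Send path: on a full buffer, set the flag and report failure (models EAGAIN). */
static bool mock_send(struct mock_socket *s, unsigned int len)
{
	if (s->queued + len > s->sndbuf) {
		s->flags |= 1UL << MOCK_SOCK_NOSPACE;
		return false;
	}
	s->queued += len;
	return true;
}

/* Peer acknowledged data, so queued bytes are released. */
static void mock_ack(struct mock_socket *s, unsigned int len)
{
	s->queued -= len > s->queued ? s->queued : len;
}

/* write_space-style callback: clear the flag once the socket is writeable. */
static void mock_write_space(struct mock_socket *s)
{
	if (mock_is_writeable(s))
		s->flags &= ~(1UL << MOCK_SOCK_NOSPACE);
}
```

The model says nothing about whether any code path actually depends on the flag being cleared for an in-kernel socket user, which is the open question here.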

Regards,
Wilfred


> Cheers,
> 
> Hannes

Re: [PATCH] nvme/tcp: handle tls partially sent records in write_space()

Posted by Hannes Reinecke 4 months ago

On 10/8/25 04:11, Wilfred Mallawa wrote:
> On Tue, 2025-10-07 at 11:51 +0200, Hannes Reinecke wrote:
>> On 10/7/25 11:24, Wilfred Mallawa wrote:
>>> On Tue, 2025-10-07 at 07:19 +0200, Hannes Reinecke wrote:
>>>> On 10/7/25 02:46, Wilfred Mallawa wrote:
>>>>> From: Wilfred Mallawa <wilfred.mallawa@wdc.com>
>>>>>
>>>>
>>> [...]
>>>> I wonder: Do we really need to check for a partially assembled
>>>> record,
>>>> or wouldn't it be easier to call queue->write_space() every time
>>>> here?
>>>> We sure would end up with executing the callback more often, but
>>>> if
>>>> no
>>>> data is present it shouldn't do any harm.
>>>>
>>>> IE just use
>>>>
>>>> if (nvme_tcp_queue_tls(queue))
>>>>        queue->write_space(sk);
>>>
>>> Hey Hannes,
>>>
>>> This was my initial approach, but I figured using
>>> tls_is_partially_sent_record() might be slightly more efficient.
>>> But if
>>> we think that's negligible, happy to go with this approach
>>> (omitting
>>> the partial record check).
>>>
>> Please do.
>> Performance testing on NVMe-TCP is notoriously tricky, so for now we
>> really should not assume anything here.
>> And it's making the patch _vastly_ simpler, _and_ we don't have to
>> involve the networking folks here.
> 
> Okay, will send a V2 with this approach.
> 
>> We have a similar patch for the data_ready() function in nvmet_tcp(),
>> and that seemed to work, too.
>> Nit: we don't unset the 'NOSPACE' flag there. Can you check if that's
>> really required?
>> And, if it is, fixup nvmet_tcp() to unset it?
>> Or, if not, modify your patch to not clear it?
> 
> I don't see why we would need to clear the NOSPACE flag in
> data_ready()? My understanding is that this flag is used when the send
> buffer is full.
> 
> I would think the clear_bit() is necessary in write_space() since it
> would typically get done in something like sk_stream_write_space()?
> However, running some quick FIOs with the clear_bit() removed, things
> seem to work. Not sure if removing it has any further implications
> though...
> 
I am not sure, either. Code analysis suggests that we don't need to
do that, but then we're the first ever to explore that area.
So I would think we don't need to worry (as nvmet-tcp doesn't do that,
either). Sounds like a question for LPC.
So let's drop the 'NOSPACE' flag handling to get the
partial records fixed, and address the NOSPACE issue separately.

Cheers,

Hannes