nvme_tcp_init_connection() attempts to receive an ICResp PDU but only
checks that the return value from recvmsg() is non-negative. If the
sender closes the TCP connection or sends fewer than 128 bytes, this
check will pass even though the full PDU wasn't received.
Ensure the full ICResp PDU is received by checking that recvmsg()
returns the expected 128 bytes.
Additionally set the MSG_WAITALL flag for recvmsg(), as a sender could
split the ICResp over multiple TCP frames. Without MSG_WAITALL,
recvmsg() could return prematurely with only part of the PDU.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
---
v4: keep recvmsg() error return value
v3: fix return value to indicate error
v2: add Fixes tag
drivers/nvme/host/tcp.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index e9ff6babc540..56679eb8c0d6 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1446,15 +1446,18 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
 	iov.iov_len = sizeof(*icresp);
 	if (nvme_tcp_queue_tls(queue)) {
 		msg.msg_control = cbuf;
 		msg.msg_controllen = sizeof(cbuf);
 	}
+	msg.msg_flags = MSG_WAITALL;
 	ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
 			iov.iov_len, msg.msg_flags);
-	if (ret < 0) {
+	if (ret < sizeof(*icresp)) {
 		pr_warn("queue %d: failed to receive icresp, error %d\n",
 			nvme_tcp_queue_id(queue), ret);
+		if (ret >= 0)
+			ret = -ECONNRESET;
 		goto free_icresp;
 	}
 	ret = -ENOTCONN;
 	if (nvme_tcp_queue_tls(queue)) {
 		ctype = tls_get_record_type(queue->sock->sk,
--
2.45.2
On Fri, Jan 24, 2025 at 11:43:10AM -0700, Caleb Sander Mateos wrote:
> nvme_tcp_init_connection() attempts to receive an ICResp PDU but only
> checks that the return value from recvmsg() is non-negative. If the
> sender closes the TCP connection or sends fewer than 128 bytes, this
> check will pass even though the full PDU wasn't received.
>
> Ensure the full ICResp PDU is received by checking that recvmsg()
> returns the expected 128 bytes.

Thanks, applied to nvme-6.14.
On 1/24/25 19:43, Caleb Sander Mateos wrote:
> nvme_tcp_init_connection() attempts to receive an ICResp PDU but only
> checks that the return value from recvmsg() is non-negative. If the
> sender closes the TCP connection or sends fewer than 128 bytes, this
> check will pass even though the full PDU wasn't received.
>
> Ensure the full ICResp PDU is received by checking that recvmsg()
> returns the expected 128 bytes.
>
> Additionally set the MSG_WAITALL flag for recvmsg(), as a sender could
> split the ICResp over multiple TCP frames. Without MSG_WAITALL,
> recvmsg() could return prematurely with only part of the PDU.
>
> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
> ---
> v4: keep recvmsg() error return value
> v3: fix return value to indicate error
> v2: add Fixes tag
>
> drivers/nvme/host/tcp.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index e9ff6babc540..56679eb8c0d6 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1446,15 +1446,18 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
> iov.iov_len = sizeof(*icresp);
> if (nvme_tcp_queue_tls(queue)) {
> msg.msg_control = cbuf;
> msg.msg_controllen = sizeof(cbuf);
> }
> + msg.msg_flags = MSG_WAITALL;
> ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
> iov.iov_len, msg.msg_flags);
But won't we have to wait for a TCP timeout now if the sender sends less
than 128 bytes? With this patch we always wait for 128 bytes, and
possibly wait for TCP timeout if not.
Testcase for this would be nice ...
And I need to check if secure concatenation is affected here; with
secure concatenation we need to peek at the first packet to check
if it's an ICRESP or a TLS negotiation.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
On Sun, Jan 26, 2025 at 11:37 PM Hannes Reinecke <hare@suse.de> wrote:
>
> On 1/24/25 19:43, Caleb Sander Mateos wrote:
> > nvme_tcp_init_connection() attempts to receive an ICResp PDU but only
> > checks that the return value from recvmsg() is non-negative. If the
> > sender closes the TCP connection or sends fewer than 128 bytes, this
> > check will pass even though the full PDU wasn't received.
> >
> > Ensure the full ICResp PDU is received by checking that recvmsg()
> > returns the expected 128 bytes.
> >
> > Additionally set the MSG_WAITALL flag for recvmsg(), as a sender could
> > split the ICResp over multiple TCP frames. Without MSG_WAITALL,
> > recvmsg() could return prematurely with only part of the PDU.
> >
> > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
> > ---
> > v4: keep recvmsg() error return value
> > v3: fix return value to indicate error
> > v2: add Fixes tag
> >
> > drivers/nvme/host/tcp.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index e9ff6babc540..56679eb8c0d6 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -1446,15 +1446,18 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
> > iov.iov_len = sizeof(*icresp);
> > if (nvme_tcp_queue_tls(queue)) {
> > msg.msg_control = cbuf;
> > msg.msg_controllen = sizeof(cbuf);
> > }
> > + msg.msg_flags = MSG_WAITALL;
> > ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
> > iov.iov_len, msg.msg_flags);
>
> But won't we have to wait for a TCP timeout now if the sender sends less
> than 128 bytes? With this patch we always wait for 128 bytes, and
> possibly wait for TCP timeout if not.
Yes, if the NVMe/TCP controller sends less than 128 bytes, we need to
wait for it to send the remainder of the ICResp PDU. That's just how
the NVMe/TCP protocol works. If we want to protect against
buggy/malicious controllers that don't send a full ICResp, we need a
timeout mechanism. That's the purpose of the existing
`queue->sock->sk->sk_rcvtimeo = 10 * HZ;` in nvme_tcp_alloc_queue().
Note that recvmsg() timing out was already possible in the original
code if the controller didn't send anything on the TCP connection
after accepting it.
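To illustrate (a minimal userspace sketch of my own, not driver code, assuming SO_RCVTIMEO from userspace behaves like the kernel's sk_rcvtimeo for a blocking receive), a receive timeout still bounds recv() even when MSG_WAITALL is set:
```
import socket
import struct

# Hypothetical illustration, not part of the nvme-tcp driver: a peer sends
# only 8 of the expected 128 bytes and then stalls. With SO_RCVTIMEO set,
# the blocking MSG_WAITALL recv() is still bounded by the timeout.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('127.0.0.1', 0))
srv.listen()
cli = socket.create_connection(srv.getsockname())
peer, _ = srv.accept()

peer.sendall(b'\x00' * 8)  # partial "PDU", then silence

# 2-second receive timeout (struct timeval packed as two native longs)
cli.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, struct.pack('ll', 2, 0))
try:
    data = cli.recv(128, socket.MSG_WAITALL)
    print('recv returned after the timeout with', len(data), 'of 128 bytes')
except OSError as e:
    # if nothing at all had arrived, the timeout surfaces as an error instead
    print('recv failed:', e)
```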
> Testcase for this would be nice ...
>
> And I need to check if secure concatenation is affected here; with
> secure concatenation we need to peek at the first packet to check
> if it's an ICRESP or a TLS negotiation.
Are you saying that with secure concatenation we don't know in advance
whether the connection is using TLS between the TCP and NVMe/TCP
protocol layers? Wouldn't the host already need to know that when it
sent its ICReq PDU?
If TLS is being used, my assumption was that `struct kvec iov` will
receive the decrypted (NVMe/TCP protocol) bytes rather than the
encrypted (TLS protocol) bytes. If that's not the case, then I think
there would be another existing bug, as the code interprets the
received bytes as an ICResp PDU regardless of whether TLS is in use.
Thanks,
Caleb
On 1/27/25 18:38, Caleb Sander wrote:
> On Sun, Jan 26, 2025 at 11:37 PM Hannes Reinecke <hare@suse.de> wrote:
>>
>> On 1/24/25 19:43, Caleb Sander Mateos wrote:
>>> nvme_tcp_init_connection() attempts to receive an ICResp PDU but only
>>> checks that the return value from recvmsg() is non-negative. If the
>>> sender closes the TCP connection or sends fewer than 128 bytes, this
>>> check will pass even though the full PDU wasn't received.
>>>
>>> Ensure the full ICResp PDU is received by checking that recvmsg()
>>> returns the expected 128 bytes.
>>>
>>> Additionally set the MSG_WAITALL flag for recvmsg(), as a sender could
>>> split the ICResp over multiple TCP frames. Without MSG_WAITALL,
>>> recvmsg() could return prematurely with only part of the PDU.
>>>
>>> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
>>> Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
>>> ---
>>> v4: keep recvmsg() error return value
>>> v3: fix return value to indicate error
>>> v2: add Fixes tag
>>>
>>> drivers/nvme/host/tcp.c | 5 ++++-
>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index e9ff6babc540..56679eb8c0d6 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1446,15 +1446,18 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
>>> iov.iov_len = sizeof(*icresp);
>>> if (nvme_tcp_queue_tls(queue)) {
>>> msg.msg_control = cbuf;
>>> msg.msg_controllen = sizeof(cbuf);
>>> }
>>> + msg.msg_flags = MSG_WAITALL;
>>> ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
>>> iov.iov_len, msg.msg_flags);
>>
>> But won't we have to wait for a TCP timeout now if the sender sends less
>> than 128 bytes? With this patch we always wait for 128 bytes, and
>> possibly wait for TCP timeout if not.
>
> Yes, if the NVMe/TCP controller sends less than 128 bytes, we need to
> wait for it to send the remainder of the ICResp PDU. That's just how
> the NVMe/TCP protocol works. If we want to protect against
> buggy/malicious controllers that don't send a full ICResp, we need a
> timeout mechanism. That's the purpose of the existing
> `queue->sock->sk->sk_rcvtimeo = 10 * HZ;` in nvme_tcp_alloc_queue().
> Note that recvmsg() timing out was already possible in the original
> code if the controller didn't send anything on the TCP connection
> after accepting it.
>
Hmm. Checking the code, 'rcvtimeo' is only evaluated if MSG_WAITALL
is _not_ set. Makes me wonder why we do set it...
But that's beside the point.
>> Testcase for this would be nice ...
>>
>> And I need to check if secure concatenation is affected here; with
>> secure concatenation we need to peek at the first packet to check
>> if it's an ICRESP or a TLS negotiation.
>
> Are you saying that with secure concatenation we don't know in advance
> whether the connection is using TLS between the TCP and NVMe/TCP
> protocol layers? Wouldn't the host already need to know that when it
> sent its ICReq PDU?
No, the host doesn't need to know. TLS is enabled by the lower
layers.
But upon further checking, I guess it'll be okay with secure
concatenation.
Nevertheless, I would vastly prefer to have a receive loop here
instead of waiting to receive the full amount as per MSG_WAITALL.
The entire tcp code is using nonblocking calls, so I'd rather
keep it that way and implement a receive loop here.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
On Tue, Jan 28, 2025 at 12:28 AM Hannes Reinecke <hare@suse.de> wrote:
>
> On 1/27/25 18:38, Caleb Sander wrote:
> > On Sun, Jan 26, 2025 at 11:37 PM Hannes Reinecke <hare@suse.de> wrote:
> >>
> >> On 1/24/25 19:43, Caleb Sander Mateos wrote:
> >>> nvme_tcp_init_connection() attempts to receive an ICResp PDU but only
> >>> checks that the return value from recvmsg() is non-negative. If the
> >>> sender closes the TCP connection or sends fewer than 128 bytes, this
> >>> check will pass even though the full PDU wasn't received.
> >>>
> >>> Ensure the full ICResp PDU is received by checking that recvmsg()
> >>> returns the expected 128 bytes.
> >>>
> >>> Additionally set the MSG_WAITALL flag for recvmsg(), as a sender could
> >>> split the ICResp over multiple TCP frames. Without MSG_WAITALL,
> >>> recvmsg() could return prematurely with only part of the PDU.
> >>>
> >>> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> >>> Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
> >>> ---
> >>> v4: keep recvmsg() error return value
> >>> v3: fix return value to indicate error
> >>> v2: add Fixes tag
> >>>
> >>> drivers/nvme/host/tcp.c | 5 ++++-
> >>> 1 file changed, 4 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> >>> index e9ff6babc540..56679eb8c0d6 100644
> >>> --- a/drivers/nvme/host/tcp.c
> >>> +++ b/drivers/nvme/host/tcp.c
> >>> @@ -1446,15 +1446,18 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
> >>> iov.iov_len = sizeof(*icresp);
> >>> if (nvme_tcp_queue_tls(queue)) {
> >>> msg.msg_control = cbuf;
> >>> msg.msg_controllen = sizeof(cbuf);
> >>> }
> >>> + msg.msg_flags = MSG_WAITALL;
> >>> ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
> >>> iov.iov_len, msg.msg_flags);
> >>
> >> But won't we have to wait for a TCP timeout now if the sender sends less
> >> than 128 bytes? With this patch we always wait for 128 bytes, and
> >> possibly wait for TCP timeout if not.
> >
> > Yes, if the NVMe/TCP controller sends less than 128 bytes, we need to
> > wait for it to send the remainder of the ICResp PDU. That's just how
> > the NVMe/TCP protocol works. If we want to protect against
> > buggy/malicious controllers that don't send a full ICResp, we need a
> > timeout mechanism. That's the purpose of the existing
> > `queue->sock->sk->sk_rcvtimeo = 10 * HZ;` in nvme_tcp_alloc_queue().
> > Note that recvmsg() timing out was already possible in the original
> > code if the controller didn't send anything on the TCP connection
> > after accepting it.
> >
> Hmm. Checking the code, 'rcvtimeo' is only evaluated if MSG_WAITALL
> is _not_ set. Makes me wonder why we do set it...
> But that's beside the point.
I am not seeing where sk_rcvtimeo is ignored when MSG_WAITALL is used,
can you point me to the code?
It is certainly ignored for *non-blocking* recvmsg() (MSG_DONTWAIT),
but I don't see why it would be ignored for MSG_WAITALL. man 7 socket
also suggests it applies to all blocking socket I/O.
static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
{
	return noblock ? 0 : sk->sk_rcvtimeo;
}

In tcp_recvmsg_locked():
	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
>
> >> Testcase for this would be nice ...
> >>
> >> And I need to check if secure concatenation is affected here; with
> >> secure concatenation we need to peek at the first packet to check
> >> if it's an ICRESP or a TLS negotiation.
> >
> > Are you saying that with secure concatenation we don't know in advance
> > whether the connection is using TLS between the TCP and NVMe/TCP
> > protocol layers? Wouldn't the host already need to know that when it
> > sent its ICReq PDU?
>
> No, the host doesn't need to know. TLS is enabled by the lower
> layers.
>
> But upon further checking, I guess it'll be okay with secure
> concatenation.
>
> Nevertheless, I would vastly prefer to have a receive loop here
> instead of waiting to receive the full amount as per MSG_WAITALL.
> The entire tcp code is using nonblocking calls, so I'd rather
> keep it that way and implement a receive loop here.
The driver uses non-blocking socket I/O in the I/O path, but not the
connect path. nvme_tcp_init_connection() is already using blocking
socket I/O to send the ICReq PDU and receive the ICResp. I suppose we
could change the connect path to register poll interest and use
non-blocking operations, but that seems like a more involved code
change and orthogonal to the issue of receiving the full ICResp.
Best,
Caleb
On 29/01/2025 0:02, Caleb Sander wrote:
> On Tue, Jan 28, 2025 at 12:28 AM Hannes Reinecke <hare@suse.de> wrote:
>> On 1/27/25 18:38, Caleb Sander wrote:
>>> On Sun, Jan 26, 2025 at 11:37 PM Hannes Reinecke <hare@suse.de> wrote:
>>>> On 1/24/25 19:43, Caleb Sander Mateos wrote:
>>>>> nvme_tcp_init_connection() attempts to receive an ICResp PDU but only
>>>>> checks that the return value from recvmsg() is non-negative. If the
>>>>> sender closes the TCP connection or sends fewer than 128 bytes, this
>>>>> check will pass even though the full PDU wasn't received.
>>>>>
>>>>> Ensure the full ICResp PDU is received by checking that recvmsg()
>>>>> returns the expected 128 bytes.
>>>>>
>>>>> Additionally set the MSG_WAITALL flag for recvmsg(), as a sender could
>>>>> split the ICResp over multiple TCP frames. Without MSG_WAITALL,
>>>>> recvmsg() could return prematurely with only part of the PDU.
>>>>>
>>>>> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
>>>>> Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
>>>>> ---
>>>>> v4: keep recvmsg() error return value
>>>>> v3: fix return value to indicate error
>>>>> v2: add Fixes tag
>>>>>
>>>>> drivers/nvme/host/tcp.c | 5 ++++-
>>>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>>> index e9ff6babc540..56679eb8c0d6 100644
>>>>> --- a/drivers/nvme/host/tcp.c
>>>>> +++ b/drivers/nvme/host/tcp.c
>>>>> @@ -1446,15 +1446,18 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
>>>>> iov.iov_len = sizeof(*icresp);
>>>>> if (nvme_tcp_queue_tls(queue)) {
>>>>> msg.msg_control = cbuf;
>>>>> msg.msg_controllen = sizeof(cbuf);
>>>>> }
>>>>> + msg.msg_flags = MSG_WAITALL;
>>>>> ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
>>>>> iov.iov_len, msg.msg_flags);
>>>> But won't we have to wait for a TCP timeout now if the sender sends less
>>>> than 128 bytes? With this patch we always wait for 128 bytes, and
>>>> possibly wait for TCP timeout if not.
>>> Yes, if the NVMe/TCP controller sends less than 128 bytes, we need to
>>> wait for it to send the remainder of the ICResp PDU. That's just how
>>> the NVMe/TCP protocol works. If we want to protect against
>>> buggy/malicious controllers that don't send a full ICResp, we need a
>>> timeout mechanism. That's the purpose of the existing
>>> `queue->sock->sk->sk_rcvtimeo = 10 * HZ;` in nvme_tcp_alloc_queue().
>>> Note that recvmsg() timing out was already possible in the original
>>> code if the controller didn't send anything on the TCP connection
>>> after accepting it.
>>>
>> Hmm. Checking the code, 'rcvtimeo' is only evaluated if MSG_WAITALL
>> is _not_ set. Makes me wonder why we do set it...
>> But that's beside the point.
> I am not seeing where sk_rcvtimeo is ignored when MSG_WAITALL is used,
> can you point me to the code?
> It is certainly ignored for *non-blocking* recvmsg() (MSG_DONTWAIT),
> but I don't see why it would be ignored for MSG_WAITALL. man 7 socket
> also suggests it applies to all blocking socket I/O.
>
> static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
> {
> return noblock ? 0 : sk->sk_rcvtimeo;
> }
>
> In tcp_recvmsg_locked():
> timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
>
>>>> Testcase for this would be nice ...
>>>>
>>>> And I need to check if secure concatenation is affected here; with
>>>> secure concatenation we need to peek at the first packet to check
>>>> if it's an ICRESP or a TLS negotiation.
>>> Are you saying that with secure concatenation we don't know in advance
>>> whether the connection is using TLS between the TCP and NVMe/TCP
>>> protocol layers? Wouldn't the host already need to know that when it
>>> sent its ICReq PDU?
>> No, the host doesn't need to know. TLS is enabled by the lower
>> layers.
>>
>> But upon further checking, I guess it'll be okay with secure
>> concatenation.
>>
>> Nevertheless, I would vastly prefer to have a receive loop here
>> instead of waiting to receive the full amount as per MSG_WAITALL.
>> The entire tcp code is using nonblocking calls, so I'd rather
>> keep it that way and implement a receive loop here.
> The driver uses non-blocking socket I/O in the I/O path, but not the
> connect path. nvme_tcp_init_connection() is already using blocking
> socket I/O to send the ICReq PDU and receive the ICResp. I suppose we
> could change the connect path to register poll interest and use
> non-blocking operations, but that seems like a more involved code
> change and orthogonal to the issue of receiving the full ICResp.
I agree. The driver blocks on the ICResp recvmsg(), so there is no reason to
use an async interface.
Caleb, can you please make sure to test this patch with TLS?
Do you have a reliable way to reproduce this?
On Fri, Jan 31, 2025 at 12:17 AM Sagi Grimberg <sagi@grimberg.me> wrote:
...
> Caleb, can you please make sure to test this patch with TLS?
Can you point me to some documentation for that? I tried setting up a
nvmet_tcp port to use TLS and connecting to it as described in the
cover letter for the patch series (https://lwn.net/Articles/941139/).
But the TLS Server Hello seems to fail with EACCES. Any idea what I'm
doing wrong?
$ modprobe nvmet_tcp
$ modprobe null_blk nr_devices=1
$ mkdir /sys/kernel/config/nvmet/subsystems/nqn.nvmet
$ echo 1 > /sys/kernel/config/nvmet/subsystems/nqn.nvmet/attr_allow_any_host
$ mkdir /sys/kernel/config/nvmet/subsystems/nqn.nvmet/namespaces/1
$ echo /dev/nullb0 >
/sys/kernel/config/nvmet/subsystems/nqn.nvmet/namespaces/1/device_path
$ echo 1 > /sys/kernel/config/nvmet/subsystems/nqn.nvmet/namespaces/1/enable
$ mkdir /sys/kernel/config/nvmet/ports/1
$ echo tcp > /sys/kernel/config/nvmet/ports/1/addr_trtype
$ echo ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam
$ echo 127.0.0.1 > /sys/kernel/config/nvmet/ports/1/addr_traddr
$ echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
$ echo required > /sys/kernel/config/nvmet/ports/1/addr_treq
$ echo tls1.3 > /sys/kernel/config/nvmet/ports/1/addr_tsas
$ ln -s /sys/kernel/config/nvmet/subsystems/nqn.nvmet
/sys/kernel/config/nvmet/ports/1/subsystems
$ nvme gen-tls-key --subsysnqn nqn.nvmet --insert
Inserted TLS key 005e8a74
$ nvme gen-tls-key --subsysnqn nqn.2014-08.org.nvmexpress.discovery --insert
Inserted TLS key 22d676b8
$ tlshd &
$ nvme discover --transport tcp --traddr 127.0.0.1 --trsvcid 4420 --tls
Failed to write to /dev/nvme-fabrics: Input/output error
failed to add controller, error failed to write to nvme-fabrics device
With debug logs enabled, I see the following:
$ dmesg | tail -6
[ 440.405298] nvme nvme0: connecting queue 0
[ 440.405403] nvmet_tcp: queue 0: TLS ServerHello
[ 440.405433] nvme nvme0: queue 0: start TLS with key 11b456f9
[ 440.407836] nvmet_tcp: queue 0: TLS handshake done, key 0, status -13
[ 440.422881] nvme nvme0: queue 0: TLS handshake done, key 0, status -13
[ 440.422932] nvme nvme0: queue 0: TLS handshake complete, error 13
A tcpdump shows the host sending a TLS Client Hello packet and the
target immediately closing the connection.
> Do you have a reliable way to reproduce this?
Sure, here's a fake Python NVMe/TCP controller that immediately closes
each connection after receiving the ICReq PDU:
```
import socket

def recv_all(socket, length):
    result = b''
    while len(result) < length:
        result += socket.recv(length - len(result))
    return result

listen_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listen_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listen_sock.bind(('', 4420))
listen_sock.listen()
while True:
    client_sock, _ = listen_sock.accept()
    recv_all(client_sock, 128)  # ICReq
    client_sock.close()
```
Attempting to connect to it reports an error about the ICResp PDU type
field even though no ICResp was sent:
$ nvme connect --transport tcp --traddr 192.168.1.12 --nqn nqn.abc
Failed to write to /dev/nvme-fabrics: Invalid argument
could not add new controller: invalid arguments/configuration
$ dmesg | tail -1
[1351639.614853] nvme_tcp: queue 0: bad type returned 0
Here's a valid scenario where the controller sends the ICResp Common
Header and PDU Specific Header separately (using TCP_NODELAY to ensure
the sends are not coalesced):
```
import socket

def recv_all(socket, length):
    result = b''
    while len(result) < length:
        result += socket.recv(length - len(result))
    return result

def send_all(socket, data):
    while data:
        data = data[socket.send(data):]

listen_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listen_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listen_sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
listen_sock.bind(('', 4420))
listen_sock.listen()
while True:
    client_sock, _ = listen_sock.accept()
    recv_all(client_sock, 128)  # ICReq
    common_header = bytes([
        0x01,  # PDU-type
        0,  # FLAGS
        128,  # HLEN
        0,  # PDO
        128, 0, 0, 0,  # PLEN
    ])
    send_all(client_sock, common_header)
    ic_resp = bytes([
        0, 0,  # PFV
        0,  # CPDA
        0,  # DGST
        0xFC, 0xFF, 0xFF, 0xFF,  # MAXH2CDATA
    ] + [0] * 112)
    send_all(client_sock, ic_resp)
    client_sock.close()
```
The host refuses to connect, complaining that the MAXH2CDATA field in
the ICResp is invalid. But that is because it only received the Common
Header.
$ nvme connect --transport tcp --traddr 192.168.1.12 --nqn nqn.abc
Failed to write to /dev/nvme-fabrics: Invalid argument
could not add new controller: invalid arguments/configuration
$ dmesg | tail -1
[1351960.082011] nvme_tcp: queue 0: invalid maxh2cdata returned 0
With the patch applied, the controller closing the connection without
sending an ICResp PDU correctly results in a "Connection reset by peer"
error:
$ nvme connect --transport tcp --traddr 192.168.1.12 --nqn nqn.abc
Failed to write to /dev/nvme-fabrics: Connection reset by peer
could not add new controller: failed to write to nvme-fabrics device
$ dmesg | tail -1
[ 450.050463] nvme_tcp: queue 0: failed to receive icresp, error 0
And when the controller sends the Common Header separately from the
PDU Specific Header, nvme_tcp_init_connection() now succeeds. The
connection attempt instead times out waiting for a response to the
Fabrics Connect command, since the fake NVMe/TCP controller doesn't
implement that:
$ nvme connect --transport tcp --traddr 192.168.1.12 --nqn nqn.abc
Failed to write to /dev/nvme-fabrics: Input/output error
could not add new controller: failed to write to nvme-fabrics device
$ dmesg | tail -3
[ 644.728894] nvme nvme0: I/O tag 0 (0000) type 4 opcode 0x7f
(Connect) QID 0 timeout
[ 644.728974] nvme nvme0: Connect command failed, error wo/DNR bit: 881
[ 644.728999] nvme nvme0: failed to connect queue: 0 ret=881
Thanks,
Caleb
On 2/3/25 01:19, Caleb Sander wrote:
> On Fri, Jan 31, 2025 at 12:17 AM Sagi Grimberg <sagi@grimberg.me> wrote:
> ...
>> Caleb, can you please make sure to test this patch with TLS?
>
> Can you point me to some documentation for that? I tried setting up a
> nvmet_tcp port to use TLS and connecting to it as described in the
> cover letter for the patch series (https://lwn.net/Articles/941139/).
> But the TLS Server Hello seems to fail with EACCES. Any idea what I'm
> doing wrong?
> $ modprobe nvmet_tcp
> $ modprobe null_blk nr_devices=1
> $ mkdir /sys/kernel/config/nvmet/subsystems/nqn.nvmet
> $ echo 1 > /sys/kernel/config/nvmet/subsystems/nqn.nvmet/attr_allow_any_host
> $ mkdir /sys/kernel/config/nvmet/subsystems/nqn.nvmet/namespaces/1
> $ echo /dev/nullb0 >
> /sys/kernel/config/nvmet/subsystems/nqn.nvmet/namespaces/1/device_path
> $ echo 1 > /sys/kernel/config/nvmet/subsystems/nqn.nvmet/namespaces/1/enable
> $ mkdir /sys/kernel/config/nvmet/ports/1
> $ echo tcp > /sys/kernel/config/nvmet/ports/1/addr_trtype
> $ echo ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam
> $ echo 127.0.0.1 > /sys/kernel/config/nvmet/ports/1/addr_traddr
> $ echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
> $ echo required > /sys/kernel/config/nvmet/ports/1/addr_treq
> $ echo tls1.3 > /sys/kernel/config/nvmet/ports/1/addr_tsas
> $ ln -s /sys/kernel/config/nvmet/subsystems/nqn.nvmet
> /sys/kernel/config/nvmet/ports/1/subsystems
> $ nvme gen-tls-key --subsysnqn nqn.nvmet --insert
> Inserted TLS key 005e8a74
> $ nvme gen-tls-key --subsysnqn nqn.2014-08.org.nvmexpress.discovery --insert
> Inserted TLS key 22d676b8
> $ tlshd &
> $ nvme discover --transport tcp --traddr 127.0.0.1 --trsvcid 4420 --tls
> Failed to write to /dev/nvme-fabrics: Input/output error
> failed to add controller, error failed to write to nvme-fabrics device
>
> With debug logs enabled, I see the following:
> $ dmesg | tail -6
> [ 440.405298] nvme nvme0: connecting queue 0
> [ 440.405403] nvmet_tcp: queue 0: TLS ServerHello
> [ 440.405433] nvme nvme0: queue 0: start TLS with key 11b456f9
> [ 440.407836] nvmet_tcp: queue 0: TLS handshake done, key 0, status -13
> [ 440.422881] nvme nvme0: queue 0: TLS handshake done, key 0, status -13
> [ 440.422932] nvme nvme0: queue 0: TLS handshake complete, error 13
>
> A tcpdump shows the host sending a TLS Client Hello packet and the
> target immediately closing the connection.
>
The PSK identification has to contain the host NQN _and_ the target NQN.
So you need to call gen-tls-key with both in order to generate a PSK
with an identity which can be found for a connection attempt.
>> Do you have a reliable way to reproduce this?
>
> Sure, here's a fake Python NVMe/TCP controller that immediately closes
> each connection after receiving the ICReq PDU:
> ```
> import socket
>
> def recv_all(socket, length):
> result = b''
> while len(result) < length:
> result += socket.recv(length - len(result))
> return result
>
> listen_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> listen_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
> listen_sock.bind(('', 4420))
> listen_sock.listen()
> while True:
> client_sock, _ = listen_sock.accept()
> recv_all(client_sock, 128) # ICReq
> client_sock.close()
> ```
> Attempting to connect to it reports an error about the ICResp PDU type
> field even though no ICResp was sent:
> $ nvme connect --transport tcp --traddr 192.168.1.12 --nqn nqn.abc
> Failed to write to /dev/nvme-fabrics: Invalid argument
> could not add new controller: invalid arguments/configuration
> $ dmesg | tail -1
> [1351639.614853] nvme_tcp: queue 0: bad type returned 0
>
> Here's a valid scenario where the controller sends the ICResp Common
> Header and PDU Specific Header separately (using TCP_NODELAY to ensure
> the sends are not coalesced):
> ```
> import socket
>
> def recv_all(socket, length):
> result = b''
> while len(result) < length:
> result += socket.recv(length - len(result))
> return result
>
> def send_all(socket, data):
> while data:
> data = data[socket.send(data):]
>
> listen_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> listen_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
> listen_sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
> listen_sock.bind(('', 4420))
> listen_sock.listen()
> while True:
> client_sock, _ = listen_sock.accept()
> recv_all(client_sock, 128) # ICReq
> common_header = bytes([
> 0x01, # PDU-type
> 0, # FLAGS
> 128, # HLEN
> 0, # PDO
> 128, 0, 0, 0, # PLEN
> ])
> send_all(client_sock, common_header)
> ic_resp = bytes([
> 0, 0, # PFV
> 0, # CPDA
> 0, # DGST
> 0xFC, 0xFF, 0xFF, 0xFF, # MAXH2CDATA
> ] + [0] * 112)
> send_all(client_sock, ic_resp)
> client_sock.close()
> ```
> The host refuses to connect, complaining that the MAXH2CDATA field in
> the ICResp is invalid. But that is because it only received the Common
> Header.
> $ nvme connect --transport tcp --traddr 192.168.1.12 --nqn nqn.abc
> Failed to write to /dev/nvme-fabrics: Invalid argument
> could not add new controller: invalid arguments/configuration
> $ dmesg | tail -1
> [1351960.082011] nvme_tcp: queue 0: invalid maxh2cdata returned 0
>
> With the patch applied, the controller closing the connection without
> sending an ICResp PDU correctly results in a "Connection reset by peer"
> error:
> $ nvme connect --transport tcp --traddr 192.168.1.12 --nqn nqn.abc
> Failed to write to /dev/nvme-fabrics: Connection reset by peer
> could not add new controller: failed to write to nvme-fabrics device
> $ dmesg | tail -1
> [ 450.050463] nvme_tcp: queue 0: failed to receive icresp, error 0
>
> And when the controller sends the Common Header separately from the
> PDU Specific Header, nvme_tcp_init_connection() now succeeds. The
> connection attempt instead times out waiting for a response to the
> Fabrics Connect command, since the fake NVMe/TCP controller doesn't
> implement that:
> $ nvme connect --transport tcp --traddr 192.168.1.12 --nqn nqn.abc
> Failed to write to /dev/nvme-fabrics: Input/output error
> could not add new controller: failed to write to nvme-fabrics device
> $ dmesg | tail -3
> [ 644.728894] nvme nvme0: I/O tag 0 (0000) type 4 opcode 0x7f
> (Connect) QID 0 timeout
> [ 644.728974] nvme nvme0: Connect command failed, error wo/DNR bit: 881
> [ 644.728999] nvme nvme0: failed to connect queue: 0 ret=881
>
Thanks.
I'll see to give it a spin.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
On Sun, Feb 2, 2025 at 11:20 PM Hannes Reinecke <hare@suse.de> wrote:
...
> The PSK identification has to contain the host NQN _and_ the target NQN.
> So you need to call gen-tls-key with both in order to generate a PSK
> with an identity which can be found for a connection attempt.
Got it. With "--hostnqn `cat /etc/nvme/hostnqn`" added to the nvme
gen-tls-key commands, the discovery and connection succeed. I see the
same behavior with and without my patch.
$ nvme connect-all --transport tcp --traddr 127.0.0.1 --trsvcid 4420 --tls
$ nvme list
Node          Generic     SN                    Model  Namespace  Usage                  Format       FW Rev
------------  ----------  --------------------  -----  ---------  ---------------------  -----------  --------
/dev/nvme1n1  /dev/ng1n1  d95671fe6e0ebf8c199c  Linux  0x1        268.44 GB / 268.44 GB  512 B + 0 B  5.14.0-5
Hannes, is there any other information you need or do you think this
patch can be merged?
Thanks,
Caleb
Gentle ping on this patch. Sagi reviewed it, and I don't hear any
additional comments from Hannes.

Thanks,
Caleb

On Tue, Feb 4, 2025 at 2:20 PM Caleb Sander <csander@purestorage.com> wrote:
>
> On Sun, Feb 2, 2025 at 11:20 PM Hannes Reinecke <hare@suse.de> wrote:
> ...
> > The PSK identification has to contain the host NQN _and_ the target NQN.
> > So you need to call gen-tls-key with both in order to generate a PSK
> > with an identity which can be found for a connection attempt.
>
> Got it. With "--hostnqn `cat /etc/nvme/hostnqn`" added to the nvme
> gen-tls-key commands, the discovery and connection succeed. I see the
> same behavior with and without my patch.
>
> $ nvme connect-all --transport tcp --traddr 127.0.0.1 --trsvcid 4420 --tls
> $ nvme list
> Node          Generic     SN                    Model  Namespace  Usage                  Format       FW Rev
> ------------  ----------  --------------------  -----  ---------  ---------------------  -----------  --------
> /dev/nvme1n1  /dev/ng1n1  d95671fe6e0ebf8c199c  Linux  0x1        268.44 GB / 268.44 GB  512 B + 0 B  5.14.0-5
>
> Hannes, is there any other information you need or do you think this
> patch can be merged?
>
> Thanks,
> Caleb
On 10/02/2025 19:24, Caleb Sander wrote:
> Gentle ping on this patch. Sagi reviewed it, and I don't hear any
> additional comments from Hannes.

Caleb, let's resend it with review tags from me and Hannes.
On 2/10/25 18:24, Caleb Sander wrote:
> Gentle ping on this patch. Sagi reviewed it, and I don't hear any
> additional comments from Hannes.
>
Because you convinced me :-)

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,
Hannes
--
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                         +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
On Mon, Jan 27, 2025 at 18:39, Caleb Sander <csander@purestorage.com> wrote:
> > > + msg.msg_flags = MSG_WAITALL;
> > > ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
> > > iov.iov_len, msg.msg_flags);
> >
> > But won't we have to wait for a TCP timeout now if the sender sends less
> > than 128 bytes? With this patch we always wait for 128 bytes, and
> > possibly wait for TCP timeout if not.
>
> Yes, if the NVMe/TCP controller sends less than 128 bytes, we need to
> wait for it to send the remainder of the ICResp PDU.

I guess what Hannes wanted to ask is whether it makes sense to check if
ret is a number greater than 0 but less than 128, given that the
MSG_WAITALL flag has been set.

Maurizio
On 1/27/25 08:37, Hannes Reinecke wrote:
> On 1/24/25 19:43, Caleb Sander Mateos wrote:
>> nvme_tcp_init_connection() attempts to receive an ICResp PDU but only
>> checks that the return value from recvmsg() is non-negative. If the
>> sender closes the TCP connection or sends fewer than 128 bytes, this
>> check will pass even though the full PDU wasn't received.
>>
>> Ensure the full ICResp PDU is received by checking that recvmsg()
>> returns the expected 128 bytes.
>>
>> Additionally set the MSG_WAITALL flag for recvmsg(), as a sender could
>> split the ICResp over multiple TCP frames. Without MSG_WAITALL,
>> recvmsg() could return prematurely with only part of the PDU.
>>
>> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
>> Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
>> ---
>> v4: keep recvmsg() error return value
>> v3: fix return value to indicate error
>> v2: add Fixes tag
>>
>> drivers/nvme/host/tcp.c | 5 ++++-
>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index e9ff6babc540..56679eb8c0d6 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1446,15 +1446,18 @@ static int nvme_tcp_init_connection(struct
>> nvme_tcp_queue *queue)
>> iov.iov_len = sizeof(*icresp);
>> if (nvme_tcp_queue_tls(queue)) {
>> msg.msg_control = cbuf;
>> msg.msg_controllen = sizeof(cbuf);
>> }
>> + msg.msg_flags = MSG_WAITALL;
>> ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
>> iov.iov_len, msg.msg_flags);
>
> But won't we have to wait for a TCP timeout now if the sender sends less
> than 128 bytes? With this patch we always wait for 128 bytes, and
> possibly wait for TCP timeout if not.
> Testcase for this would be nice ...
>
> And I need to check if secure concatenation is affected here; with
> secure concatenation we need to peek at the first packet to check
> if it's an ICRESP or a TLS negotiation.
>
And wouldn't this patch be sufficient here?
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 841238f38fdd..85b1328a8757 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1451,12 +1451,13 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
 	}
 	ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
 			iov.iov_len, msg.msg_flags);
+	if (ret == 0)
+		ret = -ENOTCONN;
 	if (ret < 0) {
 		pr_warn("queue %d: failed to receive icresp, error %d\n",
 			nvme_tcp_queue_id(queue), ret);
 		goto free_icresp;
 	}
-	ret = -ENOTCONN;
 	if (nvme_tcp_queue_tls(queue)) {
 		ctype = tls_get_record_type(queue->sock->sk,
 			(struct cmsghdr *)cbuf);
icresp validity is checked later in the code, so if we haven't received
a full icresp we _should_ fail those tests. And then we don't really
have to check how many bytes we've received.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
On Mon, Jan 27, 2025 at 12:09 AM Hannes Reinecke <hare@suse.de> wrote:
>
> On 1/27/25 08:37, Hannes Reinecke wrote:
> > On 1/24/25 19:43, Caleb Sander Mateos wrote:
> >> nvme_tcp_init_connection() attempts to receive an ICResp PDU but only
> >> checks that the return value from recvmsg() is non-negative. If the
> >> sender closes the TCP connection or sends fewer than 128 bytes, this
> >> check will pass even though the full PDU wasn't received.
> >>
> >> Ensure the full ICResp PDU is received by checking that recvmsg()
> >> returns the expected 128 bytes.
> >>
> >> Additionally set the MSG_WAITALL flag for recvmsg(), as a sender could
> >> split the ICResp over multiple TCP frames. Without MSG_WAITALL,
> >> recvmsg() could return prematurely with only part of the PDU.
> >>
> >> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> >> Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
> >> ---
> >> v4: keep recvmsg() error return value
> >> v3: fix return value to indicate error
> >> v2: add Fixes tag
> >>
> >> drivers/nvme/host/tcp.c | 5 ++++-
> >> 1 file changed, 4 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> >> index e9ff6babc540..56679eb8c0d6 100644
> >> --- a/drivers/nvme/host/tcp.c
> >> +++ b/drivers/nvme/host/tcp.c
> >> @@ -1446,15 +1446,18 @@ static int nvme_tcp_init_connection(struct
> >> nvme_tcp_queue *queue)
> >> iov.iov_len = sizeof(*icresp);
> >> if (nvme_tcp_queue_tls(queue)) {
> >> msg.msg_control = cbuf;
> >> msg.msg_controllen = sizeof(cbuf);
> >> }
> >> + msg.msg_flags = MSG_WAITALL;
> >> ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
> >> iov.iov_len, msg.msg_flags);
> >
> > But won't we have to wait for a TCP timeout now if the sender sends less
> > than 128 bytes? With this patch we always wait for 128 bytes, and
> > possibly wait for TCP timeout if not.
> > Testcase for this would be nice ...
> >
> > And I need to check if secure concatenation is affected here; with
> > secure concatenation we need to peek at the first packet to check
> > if it's an ICRESP or a TLS negotiation.
> >
> And wouldn't this patch be sufficient here?
If the controller sends the ICResp across multiple TCP packets, the
recvmsg() will only receive the first one. There are 2 different
problematic outcomes:
- The validation of the icresp buffer may fail because it hasn't been
fully initialized from the PDU
- Even if the validation passes, the remaining bytes of the PDU are
still in the socket buffer. Subsequent receive operations will return
those bytes instead of the following PDUs they were meant to receive.
It is important to receive the full ICResp to avoid connection
establishment failure, and to ensure there are no remaining bytes in
the socket buffer.
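To make the second outcome concrete, here is a rough standalone sketch (my own illustration with made-up byte values, not driver code) of how the leftover bytes of one PDU end up being returned by the read that was meant for the next one:
```
import socket

# Hypothetical illustration: a stream socket pair standing in for the
# host/controller connection.
ctrl, host = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

ctrl.sendall(b'A' * 8)                # only the 8-byte common header so far
first = host.recv(128)                # no MSG_WAITALL: returns the 8 bytes available
print('first read:', len(first), 'bytes')

ctrl.sendall(b'B' * 120 + b'C' * 8)   # rest of PDU 1, then the start of PDU 2
second = host.recv(128)               # starts with leftover PDU 1 bytes, not PDU 2
print('second read begins with:', second[:4])  # b'BBBB': the stream is desynchronized
```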
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 841238f38fdd..85b1328a8757 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1451,12 +1451,13 @@ static int nvme_tcp_init_connection(struct
> nvme_tcp_queue *queue)
> }
> ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
> iov.iov_len, msg.msg_flags);
> + if (ret == 0)
> + ret = -ENOTCONN;
It is possible for ret to be positive but less than sizeof(ICResp).
For example, a controller is allowed to send the 8-byte Common Header
and the 120-byte ICResp PDU Specific Header in separate TCP packets.
In that case, recvmsg() would only receive the Common Header and
return 8. We need to receive the full 128 bytes of the ICResp PDU
before processing it and receiving subsequent PDUs from the socket.
> if (ret < 0) {
> pr_warn("queue %d: failed to receive icresp, error %d\n",
> nvme_tcp_queue_id(queue), ret);
> goto free_icresp;
> }
> - ret = -ENOTCONN;
> if (nvme_tcp_queue_tls(queue)) {
> ctype = tls_get_record_type(queue->sock->sk,
> (struct cmsghdr *)cbuf);
>
> icresp validity is checked later in the code, so if we haven't received
> a full icresp we _should_ fail those tests. And then we don't really
> have to check how many bytes we've received.
The ICResp is only validated up to the MAXH2CDATA field. The
validation would fail if that field is 0, so it is true that we would
catch a recvmsg() that received 12 bytes or fewer. However, the
remaining 115 bytes of the ICResp PDU are not validated, so a passing
validation does not ensure all 128 bytes were received from the
socket.
Thanks,
Caleb