When io_uring recv/send with MSG_WAITALL accumulates partial data
through done_io and then encounters an error or EOF, req_set_fail()
sets REQ_F_FAIL despite the CQE result being positive (done_io bytes).
io_disarm_next() then sees REQ_F_FAIL and cancels all linked operations
with -ECANCELED, even though the user-visible result indicates success.
This manifests in two code paths:
1) Direct completion: io_recv/io_send fall through to req_set_fail()
when ret < min_ret, even if done_io > 0. The CQE shows done_io
(positive) but REQ_F_FAIL severs the link chain.
2) io-wq fallback: after APOLL_MAX_RETRY (128) poll retries, the
request moves to io-wq. io_recv returns IOU_RETRY from the
MSG_WAITALL retry path, io-wq fails the request with -EAGAIN, and
io_req_defer_failed -> io_sendrecv_fail overwrites cqe.res with
done_io but leaves REQ_F_FAIL set.
Fix this by:
- Not calling req_set_fail() when done_io > 0 in io_recv, io_recvmsg,
io_send, io_sendmsg, io_send_zc, io_sendmsg_zc
- Clearing REQ_F_FAIL in io_sendrecv_fail() when done_io > 0
This makes MSG_WAITALL partial completions consistent with
non-MSG_WAITALL behavior, where positive results never sever the
IO_LINK chain.
Reproducer: MSG_WAITALL recv via IO_LINK -> write on a UNIX socketpair
where the sender closes after partial data. The recv CQE shows positive
bytes but the linked write gets -ECANCELED.
Fixes: 0031275d119e ("io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL")
Cc: stable@vger.kernel.org
Signed-off-by: Hannes Furmans <hannes@stillwind.ai>
---
io_uring/net.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)
diff --git a/io_uring/net.c b/io_uring/net.c
index 8576c6cb2236..ebe51db34af8 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -576,7 +576,8 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
}
if (ret == -ERESTARTSYS)
ret = -EINTR;
- req_set_fail(req);
+ if (!sr->done_io)
+ req_set_fail(req);
}
io_req_msg_cleanup(req, issue_flags);
if (ret >= 0)
@@ -688,7 +689,8 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
}
if (ret == -ERESTARTSYS)
ret = -EINTR;
- req_set_fail(req);
+ if (!sr->done_io)
+ req_set_fail(req);
}
if (ret >= 0)
ret += sr->done_io;
@@ -1074,7 +1076,8 @@ int io_recvmsg(struct io_kiocb *req, unsigned int issue_flags)
}
if (ret == -ERESTARTSYS)
ret = -EINTR;
- req_set_fail(req);
+ if (!sr->done_io)
+ req_set_fail(req);
} else if ((flags & MSG_WAITALL) && (kmsg->msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
req_set_fail(req);
}
@@ -1220,7 +1223,8 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
}
if (ret == -ERESTARTSYS)
ret = -EINTR;
- req_set_fail(req);
+ if (!sr->done_io)
+ req_set_fail(req);
} else if ((flags & MSG_WAITALL) && (kmsg->msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
out_free:
req_set_fail(req);
@@ -1498,7 +1502,8 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
}
if (ret == -ERESTARTSYS)
ret = -EINTR;
- req_set_fail(req);
+ if (!zc->done_io)
+ req_set_fail(req);
}
if (ret >= 0)
@@ -1570,7 +1575,8 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags)
}
if (ret == -ERESTARTSYS)
ret = -EINTR;
- req_set_fail(req);
+ if (!sr->done_io)
+ req_set_fail(req);
}
if (ret >= 0)
@@ -1595,8 +1601,10 @@ void io_sendrecv_fail(struct io_kiocb *req)
{
struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
- if (sr->done_io)
+ if (sr->done_io) {
req->cqe.res = sr->done_io;
+ req->flags &= ~REQ_F_FAIL;
+ }
if ((req->flags & REQ_F_NEED_CLEANUP) &&
(req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC))
--
2.53.0
Hi Hannes,
On 26.02.26 at 23:03, Hannes Furmans wrote:
> When io_uring recv/send with MSG_WAITALL accumulates partial data
> through done_io and then encounters an error or EOF, req_set_fail()
> sets REQ_F_FAIL despite the CQE result being positive (done_io bytes).
> io_disarm_next() then sees REQ_F_FAIL and cancels all linked operations
> with -ECANCELED, even though the user-visible result indicates success.
>
> This manifests in two code paths:
>
> 1) Direct completion: io_recv/io_send fall through to req_set_fail()
> when ret < min_ret, even if done_io > 0. The CQE shows done_io
> (positive) but REQ_F_FAIL severs the link chain.
>
> 2) io-wq fallback: after APOLL_MAX_RETRY (128) poll retries, the
> request moves to io-wq. io_recv returns IOU_RETRY from the
> MSG_WAITALL retry path, io-wq fails the request with -EAGAIN, and
> io_req_defer_failed -> io_sendrecv_fail overwrites cqe.res with
> done_io but leaves REQ_F_FAIL set.
>
> Fix this by:
> - Not calling req_set_fail() when done_io > 0 in io_recv, io_recvmsg,
> io_send, io_sendmsg, io_send_zc, io_sendmsg_zc
> - Clearing REQ_F_FAIL in io_sendrecv_fail() when done_io > 0
>
> This makes MSG_WAITALL partial completions consistent with
> non-MSG_WAITALL behavior, where positive results never sever the
> IO_LINK chain.
>
> Reproducer: MSG_WAITALL recv via IO_LINK -> write on a UNIX socketpair
> where the sender closes after partial data. The recv CQE shows positive
> bytes but the linked write gets -ECANCELED.
>
> Fixes: 0031275d119e ("io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL")
That's by design: if a MSG_WAITALL call fails, it means
not all of the data the caller expected arrived or was sent.
When there's a LINK after that, the linked operation likely
relies on all expected data being processed! Otherwise
the message stream can get out of sync and cause corruption.
Let's assume I want to send a message header with
IO_SEND linked with an IO_SPLICE to send the payload.
If IO_SEND returns short, the situation needs to be
recovered by the caller instead of letting the
IO_SPLICE give more data to the socket.
So the current behavior is exactly what MSG_WAITALL
gives you. If you don't want that why are you using it
at all?
metze
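The point about short MSG_WAITALL results can also be seen with plain blocking sockets, outside io_uring. The following is a minimal sketch, assuming a Linux AF_UNIX stream socketpair; short_waitall_recv() is a hypothetical helper written for illustration and is not part of the patch or the kernel:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): ask for `want` bytes with
 * MSG_WAITALL while the peer sends only `sent` bytes and then closes.
 * Returns what recv() reported, or -1 on setup failure.
 */
static ssize_t short_waitall_recv(size_t want, size_t sent)
{
	char buf[256];
	int sv[2];
	ssize_t n;

	if (want > sizeof(buf) || sent > sizeof(buf))
		return -1;
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	memset(buf, 'x', sizeof(buf));
	if (sent > 0 && write(sv[1], buf, sent) != (ssize_t)sent) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}
	close(sv[1]);	/* peer goes away before the full message is sent */

	/* MSG_WAITALL waits for `want` bytes but still returns short on EOF */
	n = recv(sv[0], buf, want, MSG_WAITALL);
	close(sv[0]);
	return n;
}
```

A caller that asked for 64 bytes and got fewer holds a positive result, yet the message is incomplete. That is the situation in which a linked follow-up operation should not run.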
Hi Stefan,
On 27.02.26 at 14:59, Stefan Metzmacher wrote:
> That's by design: if a MSG_WAITALL call fails, it means
> not all of the data the caller expected arrived or was sent.
> When there's a LINK after that, the linked operation likely
> relies on all expected data being processed! Otherwise
> the message stream can get out of sync and cause corruption.
You're right — a short MSG_WAITALL read should sever the IO_LINK
chain. The v1 patch was wrong to guard req_set_fail() on done_io > 0.
> Let's assume I want to send a message header with
> IO_SEND linked with an IO_SPLICE to send the payload.
>
> If IO_SEND returns short the situation needs to be
> recovered by the caller instead of letting the
> IO_SPLICE give more data to the socket.
Agreed, the linked operation expects the complete data.
> So the current behavior is exactly what MSG_WAITALL
> gives you. If you don't want that why are you using it
> at all?
The actual bug is narrower. I traced the root cause with kTLS.
When IORING_OP_RECV is used with MSG_WAITALL on a kTLS socket,
the recv completes successfully (ret >= min_ret, full requested
amount received). But kTLS calls put_cmsg(SOL_TLS,
TLS_GET_RECORD_TYPE) for every first record of a recvmsg call
(tls_sw.c:1843). Since io_recv sets up the msghdr with
msg_control=NULL and msg_controllen=0, put_cmsg sets MSG_CTRUNC.
Then io_recv hits the else-if branch:
    } else if ((flags & MSG_WAITALL) &&
               (msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
        req_set_fail(req);
    }
This sets REQ_F_FAIL on a fully successful recv. The CQE shows
the full byte count, but the linked write gets -ECANCELED.
I confirmed this with ftrace — the recv completes with
result=67108864 (exactly 64MB requested), then
io_uring_fail_link fires immediately after from an io-wq worker.
I also confirmed with a plain recvmsg debug tool that kTLS
returns msg_flags=0x88 (MSG_EOR | MSG_CTRUNC) on every call.
Your commit 0031275d119e says "For IORING_OP_RECVMSG we also
check for the MSG_TRUNC and MSG_CTRUNC flags" but the code
applies the check to IORING_OP_RECV as well. MSG_CTRUNC is
meaningful for IORING_OP_RECVMSG (user provides a cmsg buffer).
It's meaningless for IORING_OP_RECV which never has a cmsg
buffer.
I'll send a v2 that only removes MSG_CTRUNC from the io_recv
check.
Thanks,
Hannes
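For readers decoding the msg_flags=0x88 value quoted above, here is a small sketch of a debug helper. describe_msg_flags() is a hypothetical name, and the 0x88 reading assumes the Linux values MSG_CTRUNC=0x08 and MSG_EOR=0x80; other systems may number these flags differently:

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical debug helper: name the msg_flags bits discussed in this thread. */
static const char *describe_msg_flags(int msg_flags, char *out, size_t len)
{
	static const struct {
		int bit;
		const char *name;
	} names[] = {
		{ MSG_CTRUNC, "MSG_CTRUNC" },
		{ MSG_TRUNC,  "MSG_TRUNC"  },
		{ MSG_EOR,    "MSG_EOR"    },
	};
	size_t i;

	if (len == 0)
		return out;
	out[0] = '\0';
	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (!(msg_flags & names[i].bit))
			continue;
		/* separate consecutive names with " | " */
		if (out[0] != '\0')
			strncat(out, " | ", len - strlen(out) - 1);
		strncat(out, names[i].name, len - strlen(out) - 1);
	}
	return out;
}
```

Bits outside the three listed flags are ignored by this sketch.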
On 27.02.26 at 17:14, Hannes Furmans wrote:
> Hi Stefan,
>
> On 27.02.26 at 14:59, Stefan Metzmacher wrote:
>> That's by design: if a MSG_WAITALL call fails, it means
>> not all of the data the caller expected arrived or was sent.
>> When there's a LINK after that, the linked operation likely
>> relies on all expected data being processed! Otherwise
>> the message stream can get out of sync and cause corruption.
>
> You're right — a short MSG_WAITALL read should sever the IO_LINK
> chain. The v1 patch was wrong to guard req_set_fail() on done_io > 0.
>
>> Let's assume I want to send a message header with
>> IO_SEND linked with an IO_SPLICE to send the payload.
>>
>> If IO_SEND returns short the situation needs to be
>> recovered by the caller instead of letting the
>> IO_SPLICE give more data to the socket.
>
> Agreed, the linked operation expects the complete data.
>
>> So the current behavior is exactly what MSG_WAITALL
>> gives you. If you don't want that why are you using it
>> at all?
>
> The actual bug is narrower. I traced the root cause with kTLS.
>
> When IORING_OP_RECV is used with MSG_WAITALL on a kTLS socket,
> the recv completes successfully (ret >= min_ret, full requested
> amount received). But kTLS calls put_cmsg(SOL_TLS,
> TLS_GET_RECORD_TYPE) for every first record of a recvmsg call
> (tls_sw.c:1843). Since io_recv sets up the msghdr with
> msg_control=NULL and msg_controllen=0, put_cmsg sets MSG_CTRUNC.
>
> Then io_recv hits the else-if branch:
>
> } else if ((flags & MSG_WAITALL) &&
> (msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
> req_set_fail(req);
> }
>
> This sets REQ_F_FAIL on a fully successful recv. The CQE shows
> the full byte count, but the linked write gets -ECANCELED.
>
> I confirmed this with ftrace — the recv completes with
> result=67108864 (exactly 64MB requested), then
> io_uring_fail_link fires immediately after from an io-wq worker.
> I also confirmed with a plain recvmsg debug tool that kTLS
> returns msg_flags=0x88 (MSG_EOR | MSG_CTRUNC) on every call.
>
> Your commit 0031275d119e says "For IORING_OP_RECVMSG we also
> check for the MSG_TRUNC and MSG_CTRUNC flags" but the code
> applies the check to IORING_OP_RECV as well. MSG_CTRUNC is
> meaningful for IORING_OP_RECVMSG (user provides a cmsg buffer).
> It's meaningless for IORING_OP_RECV which never has a cmsg
> buffer.
>
> I'll send a v2 that only removes MSG_CTRUNC from the io_recv
> check.
Sounds good :-)
Thanks!
metze
IORING_OP_RECV sets up the msghdr with msg_control=NULL and
msg_controllen=0, as it has no cmsg support. Any socket layer that
calls put_cmsg() will find no buffer space and set MSG_CTRUNC in
msg_flags. This is expected — the caller didn't ask for control data.
However, io_recv checks:
    if ((flags & MSG_WAITALL) && (msg_flags & (MSG_TRUNC | MSG_CTRUNC)))
        req_set_fail(req);
This sets REQ_F_FAIL on a fully successful recv (ret >= min_ret) when
MSG_CTRUNC is set, which causes io_disarm_next() to cancel all linked
operations with -ECANCELED. The recv CQE shows the full requested byte
count, yet linked operations are cancelled.
This is triggered by kTLS, which calls put_cmsg(SOL_TLS,
TLS_GET_RECORD_TYPE) for every record in tls_record_content_type()
(tls_sw.c), but it affects any protocol that delivers cmsg data on
the kernel side.
The MSG_CTRUNC check was introduced by commit 0031275d119e ("io_uring:
call req_set_fail_links() on short send[msg]()/recv[msg]() with
MSG_WAITALL") whose commit message states "For IORING_OP_RECVMSG we
also check for the MSG_TRUNC and MSG_CTRUNC flags", but the code
applied the check to IORING_OP_RECV as well. MSG_CTRUNC is meaningful
for IORING_OP_RECVMSG where the user provides a cmsg buffer —
truncation there means lost metadata. It is meaningless for
IORING_OP_RECV which never provides a cmsg buffer.
Remove MSG_CTRUNC from the io_recv check. The io_recvmsg check is
left unchanged as MSG_CTRUNC is meaningful there.
Fixes: 0031275d119e ("io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL")
Cc: stable@vger.kernel.org
Signed-off-by: Hannes Furmans <hannes@stillwind.ai>
---
v2: v1 incorrectly guarded req_set_fail() for all done_io > 0 cases.
Stefan Metzmacher correctly pointed out that short MSG_WAITALL
reads should still sever the link chain.
Root-caused via ftrace + msg_flags inspection on a real kTLS
connection (TLS 1.3, AES-128-GCM, S3 download):
ftrace shows io_uring_fail_link firing immediately after
io_uring_complete with result=67108864 (full 64MB), from io-wq:
iou-wrk-52242 io_uring_complete: req ..., result 67108864
iou-wrk-52242 io_uring_fail_link: opcode RECV, link ...
A debug recvmsg on the same kTLS socket shows:
recvmsg: ret=67108864 msg_flags=0x88 (MSG_EOR | MSG_CTRUNC)
MSG_CTRUNC is always set because kTLS calls put_cmsg() but
IORING_OP_RECV provides no cmsg buffer.
io_uring/net.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/io_uring/net.c b/io_uring/net.c
index 8576c6cb2236..8baaf74e8f8d 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1221,7 +1221,7 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
if (ret == -ERESTARTSYS)
ret = -EINTR;
req_set_fail(req);
- } else if ((flags & MSG_WAITALL) && (kmsg->msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
+ } else if ((flags & MSG_WAITALL) && (kmsg->msg.msg_flags & MSG_TRUNC)) {
out_free:
req_set_fail(req);
}
--
2.53.0
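The effect of the v2 change on the truncation branch can be modelled as a small userspace predicate. This is only a sketch: enum recv_kind and truncation_fails_link() are hypothetical names, and the model covers just the else-if branch taken once a receive has already reached min_ret, not the short-read path:

```c
#include <stdbool.h>
#include <sys/socket.h>

/* Hypothetical model of the post-v2 truncation check; not kernel code. */
enum recv_kind { KIND_RECV, KIND_RECVMSG };

static bool truncation_fails_link(enum recv_kind kind, int flags, int msg_flags)
{
	/* MSG_TRUNC (lost payload) matters for both opcodes */
	int mask = MSG_TRUNC;

	/* MSG_CTRUNC only matters when the caller supplied a cmsg buffer */
	if (kind == KIND_RECVMSG)
		mask |= MSG_CTRUNC;

	return (flags & MSG_WAITALL) && (msg_flags & mask);
}
```

With this model, a completed RECV that reports MSG_EOR | MSG_CTRUNC, as in the kTLS trace, no longer fails the link, while RECVMSG keeps treating MSG_CTRUNC as a failure.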
Gentle ping on this. This is a one-line fix for a real bug where IORING_OP_RECV on kTLS sockets spuriously fails linked ops because put_cmsg() sets MSG_CTRUNC when no cmsg buffer is provided.
Stefan indicated the approach looks correct. Would be great to get this into 7.0 if possible, as we’re in the RC window and this is a straightforward bug fix.
> On 27. Feb 2026, at 17:27, Hannes Furmans <hannes@stillwind.ai> wrote:
>
> IORING_OP_RECV sets up the msghdr with msg_control=NULL and
> msg_controllen=0, as it has no cmsg support. Any socket layer that
> calls put_cmsg() will find no buffer space and set MSG_CTRUNC in
> msg_flags. This is expected — the caller didn't ask for control data.
>
> However, io_recv checks:
>
> if ((flags & MSG_WAITALL) && (msg_flags & (MSG_TRUNC | MSG_CTRUNC)))
> req_set_fail(req);
>
> This sets REQ_F_FAIL on a fully successful recv (ret >= min_ret) when
> MSG_CTRUNC is set, which causes io_disarm_next() to cancel all linked
> operations with -ECANCELED. The recv CQE shows the full requested byte
> count, yet linked operations are cancelled.
>
> This is triggered by kTLS, which calls put_cmsg(SOL_TLS,
> TLS_GET_RECORD_TYPE) for every record in tls_record_content_type()
> (tls_sw.c), but it affects any protocol that delivers cmsg data on
> the kernel side.
>
> The MSG_CTRUNC check was introduced by commit 0031275d119e ("io_uring:
> call req_set_fail_links() on short send[msg]()/recv[msg]() with
> MSG_WAITALL") whose commit message states "For IORING_OP_RECVMSG we
> also check for the MSG_TRUNC and MSG_CTRUNC flags", but the code
> applied the check to IORING_OP_RECV as well. MSG_CTRUNC is meaningful
> for IORING_OP_RECVMSG where the user provides a cmsg buffer —
> truncation there means lost metadata. It is meaningless for
> IORING_OP_RECV which never provides a cmsg buffer.
>
> Remove MSG_CTRUNC from the io_recv check. The io_recvmsg check is
> left unchanged as MSG_CTRUNC is meaningful there.
>
> Fixes: 0031275d119e ("io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL")
> Cc: stable@vger.kernel.org
> Signed-off-by: Hannes Furmans <hannes@stillwind.ai>
> ---
> v2: v1 incorrectly guarded req_set_fail() for all done_io > 0 cases.
> Stefan Metzmacher correctly pointed out that short MSG_WAITALL
> reads should still sever the link chain.
>
> Root-caused via ftrace + msg_flags inspection on a real kTLS
> connection (TLS 1.3, AES-128-GCM, S3 download):
>
> ftrace shows io_uring_fail_link firing immediately after
> io_uring_complete with result=67108864 (full 64MB), from io-wq:
>
> iou-wrk-52242 io_uring_complete: req ..., result 67108864
> iou-wrk-52242 io_uring_fail_link: opcode RECV, link ...
>
> A debug recvmsg on the same kTLS socket shows:
>
> recvmsg: ret=67108864 msg_flags=0x88 (MSG_EOR | MSG_CTRUNC)
>
> MSG_CTRUNC is always set because kTLS calls put_cmsg() but
> IORING_OP_RECV provides no cmsg buffer.
>
> io_uring/net.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/io_uring/net.c b/io_uring/net.c
> index 8576c6cb2236..8baaf74e8f8d 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -1221,7 +1221,7 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
> if (ret == -ERESTARTSYS)
> ret = -EINTR;
> req_set_fail(req);
> - } else if ((flags & MSG_WAITALL) && (kmsg->msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC))) {
> + } else if ((flags & MSG_WAITALL) && (kmsg->msg.msg_flags & MSG_TRUNC)) {
> out_free:
> req_set_fail(req);
> }
> --
> 2.53.0
>