The kTLS TX path can hand an open record to a sk_msg verdict
program before encryption. If the verdict applies fewer bytes
than the open record contains, tls_push_record() splits
ctx->open_rec into the record being encrypted and a remainder.
The synchronous path reattaches that remainder before continuing.

With an async AEAD provider, crypto_aead_encrypt() can return
-EINPROGRESS after ctx->open_rec has been unhooked but before the
split remainder is reattached. The remainder is no longer
reachable through ctx->open_rec or ctx->tx_list, silently dropping
transmitted data and leaking the unreachable tls_rec. The same
composition also entangles the user-page zerocopy lifetime rules
with an async completion path.

A sockmap cannot be attached to a socket after an inet ULP is
installed: sk_psock_init() returns -EINVAL when
inet_csk_has_ulp() is true. So the supported ordering for
sockmap + kTLS TX is sockmap first, TLS_TX setup second. When
TLS_TX setup sees an existing sk_psock, allocate the AEAD with
CRYPTO_ALG_ASYNC masked out and latch the TX zerocopy gate
(sw_ctx_tx->async_capable) so the buggy composition becomes
structurally unreachable. Ordinary kTLS sockets without sk_msg
BPF attached are unaffected and continue to use async-capable
providers.

Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Cc: stable@vger.kernel.org # 4.20+
Signed-off-by: Christopher Lusk <clusk@northecho.dev>
Assisted-by: Codex:gpt-5.5
Assisted-by: Claude:claude-opus-4-7
---

Changes since v2 [1]:
- Per netdev maintainer guidance [2], replace the Option-C
  drain-on-error fix with a setup-time surface narrowing in
  tls_set_sw_offload(): when a sockmap is already attached at
  TLS_TX setup, request a synchronous AEAD (CRYPTO_ALG_ASYNC in
  the allocation mask) and set sw_ctx_tx->async_capable = 1.
  Both moves are needed: latching async_capable alone disables
  zerocopy but tls_do_encryption() can still return -EINPROGRESS
  on the copy path; selecting a sync provider removes that return
  path for sk_msg-attached sockets.
- Drop the selftest from the series per Jakub's note that the
  existing sockmap + TLS coverage at
  tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c exercises
  this configuration [3]. That suite covers sockmap + kTLS
  policy paths broadly; the specific async-pcrypt pass-then-drop
  failure mode from the v2 reproducer was validated for v3 on
  QEMU/KVM with a KASAN+LOCKDEP-instrumented kernel against net
  base 2156a29aecff before send.
- Single-patch series.

Changes since v1:
- v1's remainder-rooting fix was incomplete; Sashiko AI review
  surfaced a real UAF in the v2 follow-up that John Fastabend
  endorsed on the v1 thread [4]. The surface-narrowing approach
  in v3 makes both failure modes unreachable by avoiding the
  async + sk_msg composition entirely rather than patching each
  continuation point.

[1] https://lore.kernel.org/all/20260521025840.976378-1-clusk@northecho.dev/
[2] https://lore.kernel.org/all/20260525133028.58494274@kernel.org/
[3] https://lore.kernel.org/all/20260525133048.2dc6d8d3@kernel.org/
[4] https://lore.kernel.org/all/huduxtn6parzgiaf5cyiyrrvjjvx6jsdedowvrd4nkwmuyeind@j6migjgofh2i/

 net/tls/tls_sw.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 964ebc268..0000000 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2867,7 +2867,20 @@ int tls_set_sw_offload(struct sock *sk, int tx,
 	rec_seq = crypto_info_rec_seq(src_crypto_info, cipher_desc);
 
 	if (!*aead) {
-		*aead = crypto_alloc_aead(cipher_desc->cipher_name, 0, 0);
+		u32 mask = 0;
+
+		if (tx) {
+			struct sk_psock *psock;
+
+			psock = sk_psock_get(sk);
+			if (psock) {
+				mask = CRYPTO_ALG_ASYNC;
+				sw_ctx_tx->async_capable = 1;
+				sk_psock_put(sk, psock);
+			}
+		}
+
+		*aead = crypto_alloc_aead(cipher_desc->cipher_name, 0, mask);
 		if (IS_ERR(*aead)) {
 			rc = PTR_ERR(*aead);
 			*aead = NULL;
--
2.54.0

On 5/26/26 10:51 AM, Christopher Lusk wrote:
> The kTLS TX path can hand an open record to a sk_msg verdict
> program before encryption. If the verdict applies fewer bytes
> than the open record contains, tls_push_record() splits
> ctx->open_rec into the record being encrypted and a remainder.
> The synchronous path reattaches that remainder before continuing.
>
> With an async AEAD provider, crypto_aead_encrypt() can return
> -EINPROGRESS after ctx->open_rec has been unhooked but before the
> split remainder is reattached. The remainder is no longer
> reachable through ctx->open_rec or ctx->tx_list, silently dropping
> transmitted data and leaking the unreachable tls_rec. The same
> composition also entangles the user-page zerocopy lifetime rules
> with an async completion path.
>
> A sockmap cannot be attached to a socket after an inet ULP is
> installed: sk_psock_init() returns -EINVAL when
> inet_csk_has_ulp() is true. So the supported ordering for
> sockmap + kTLS TX is sockmap first, TLS_TX setup second. When
> TLS_TX setup sees an existing sk_psock, allocate the AEAD with
> CRYPTO_ALG_ASYNC masked out and latch the TX zerocopy gate
> (sw_ctx_tx->async_capable) so the buggy composition becomes
> structurally unreachable. Ordinary kTLS sockets without sk_msg
> BPF attached are unaffected and continue to use async-capable
> providers.
>
> Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
> Cc: stable@vger.kernel.org # 4.20+
> Signed-off-by: Christopher Lusk <clusk@northecho.dev>
> Assisted-by: Codex:gpt-5.5
> Assisted-by: Claude:claude-opus-4-7
> ---
>
> Changes since v2 [1]:
> - Per netdev maintainer guidance [2], replace the Option-C
> drain-on-error fix with a setup-time surface narrowing in
> tls_set_sw_offload(): when a sockmap is already attached at
> TLS_TX setup, request a synchronous AEAD (CRYPTO_ALG_ASYNC in
> the allocation mask) and set sw_ctx_tx->async_capable = 1.
> Both moves are needed: latching async_capable alone disables
> zerocopy but tls_do_encryption() can still return -EINPROGRESS
> on the copy path; selecting a sync provider removes that return
> path for sk_msg-attached sockets.
> - Drop the selftest from the series per Jakub's note that the
> existing sockmap + TLS coverage at
> tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c exercises
> this configuration [3]. That suite covers sockmap + kTLS
> policy paths broadly; the specific async-pcrypt pass-then-drop
> failure mode from the v2 reproducer was validated for v3 on
> QEMU/KVM with a KASAN+LOCKDEP-instrumented kernel against net
> base 2156a29aecff before send.
> - Single-patch series.
>
> Changes since v1:
> - v1's remainder-rooting fix was incomplete; Sashiko AI review
> surfaced a real UAF in the v2 follow-up that John Fastabend
> endorsed on the v1 thread [4]. The surface-narrowing approach
> in v3 makes both failure modes unreachable by avoiding the
> async + sk_msg composition entirely rather than patching each
> continuation point.
>
> [1] https://lore.kernel.org/all/20260521025840.976378-1-clusk@northecho.dev/
> [2] https://lore.kernel.org/all/20260525133028.58494274@kernel.org/
> [3] https://lore.kernel.org/all/20260525133048.2dc6d8d3@kernel.org/
> [4] https://lore.kernel.org/all/huduxtn6parzgiaf5cyiyrrvjjvx6jsdedowvrd4nkwmuyeind@j6migjgofh2i/
>
> net/tls/tls_sw.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> index 964ebc268..0000000 100644
> --- a/net/tls/tls_sw.c
> +++ b/net/tls/tls_sw.c
> @@ -2867,7 +2867,20 @@ int tls_set_sw_offload(struct sock *sk, int tx,
>  	rec_seq = crypto_info_rec_seq(src_crypto_info, cipher_desc);
>
>  	if (!*aead) {
> -		*aead = crypto_alloc_aead(cipher_desc->cipher_name, 0, 0);
> +		u32 mask = 0;
> +
> +		if (tx) {
> +			struct sk_psock *psock;
> +
> +			psock = sk_psock_get(sk);
> +			if (psock) {
> +				mask = CRYPTO_ALG_ASYNC;
> +				sw_ctx_tx->async_capable = 1;
> +				sk_psock_put(sk, psock);
> +			}
> +		}
> +
> +		*aead = crypto_alloc_aead(cipher_desc->cipher_name, 0, mask);
>  		if (IS_ERR(*aead)) {
>  			rc = PTR_ERR(*aead);
>  			*aead = NULL;
> --
> 2.54.0

If async_capable is set to 1, the zerocopy path in tls_sw_sendmsg() is
skipped. Unfortunately, kTLS with bpf_msg_pop_data() does not work
correctly on this copy path.

tls_clone_plaintext_msg() aliases msg_pl onto msg_en's plaintext area
(in-place encryption).

BPF runs bpf_msg_pop_data(msg, 0, 2). This shifts msg_pl's SG entry
forward by 2 bytes. The two SGs now point to the same page at different
offsets: the physical memory overlaps, but the start addresses differ.

I think selecting a sync provider via mask = CRYPTO_ALG_ASYNC is
sufficient to remove the -EINPROGRESS return path.

Maybe it is time to remove skmsg from kTLS? (Disable it by default first,
then re-enable via a new ktls module_param?)
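
The mask semantics referred to above can be illustrated with a small
userspace model. This is only a sketch under stated assumptions: the
struct, the function name and the MODEL_ALG_ASYNC constant are
hypothetical stand-ins, and the matching rule is a simplification of how
the crypto API compares an algorithm's flags against the type and mask
passed to crypto_alloc_aead().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the kernel's CRYPTO_ALG_ASYNC flag bit. */
#define MODEL_ALG_ASYNC 0x00000080u

/* Hypothetical stand-in for a registered algorithm and its flags. */
struct model_alg {
	const char *name;
	uint32_t flags;
};

/*
 * An algorithm is eligible when its flags agree with 'type' on every
 * bit selected by 'mask'. Bits outside 'mask' are "don't care".
 */
static bool model_alg_matches(const struct model_alg *alg,
			      uint32_t type, uint32_t mask)
{
	return ((alg->flags ^ type) & mask) == 0;
}
```

With type = 0 and mask = 0 both sync and async providers are eligible;
with type = 0 and mask = MODEL_ALG_ASYNC only providers whose async bit
is clear remain, which is the effect the patch relies on.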

On Tue, 26 May 2026 14:44:24 +0800 Jiayuan Chen wrote:
> If async_capable is set to 1, the zerocopy path in tls_sw_sendmsg() is
> skipped.
> Unfortunately ktls with bpf_msg_pop_data() does not work correctly under
> this copy path.
>
> tls_clone_plaintext_msg() aliases msg_pl onto msg_en's plaintext area
> (in-place encryption).
>
> BPF runs bpf_msg_pop_data(msg, 0, 2). This shifts msg_pl's SG entry
> forward by 2 bytes.
> The two SGs now point to the same page at different offsets. Physical
> memory overlaps but the start of address differ.

Ugh, do you mean that the memcopy path is broken? There are other
conditions under which we may fall into it than just !async_capable :(
Small send with MSG_MORE is probably the easiest?

So we need to fix that one way or the other.

> I think selecting a sync provider via mask = CRYPTO_ALG_ASYNC is
> sufficient to remove the -EINPROGRESS return path.
>
> May be time to remove skmsg from ktls? (disable by default first,
> re-enable via a new ktls module_param?)

Yes, we asked John F off-list to get his attention and I think there's
only a vague plan to start using kTLS + sockmap, no current user
(sorry if I misread / misremembered).

module params aren't a great API. If we want to deprecate it let's just
remove the integration in net-next. You have my vote..

On Tue, 26 May 2026 16:11:01 -0700 Jakub Kicinski wrote:
> module params aren't a great API. If we want to deprecate it let's just
> remove the integration in net-next. You have my vote..

Happy to draft the net-next removal series if that's useful. Let me
know the scope you'd like (sk_msg verdict path in tls_sw.c + the
sockmap-attach side + selftest cleanup; or a wider sweep), and whether
the stable trees should get a narrower fix as a separate backport for
the 4.20+ tail.

Christopher

2026-05-26, 16:11:01 -0700, Jakub Kicinski wrote:
> On Tue, 26 May 2026 14:44:24 +0800 Jiayuan Chen wrote:
> > May be time to remove skmsg from ktls? (disable by default first,
> > re-enable via a new ktls module_param?)
>
> Yes, we asked John F off-list to get his attention and I think there's
> only a vague plan to start using kTLS + sockmap, no current user
> (sorry if I misread / misremembered).

That was also what I got from this.

> module params aren't a great API. If we want to deprecate it let's just
> remove the integration in net-next. You have my vote..

+1 and keeping the code around means we still have to maintain it and
deal with the extra complexity.

-- 
Sabrina

On 5/27/26 7:11 AM, Jakub Kicinski wrote:
> On Tue, 26 May 2026 14:44:24 +0800 Jiayuan Chen wrote:
>> If async_capable is set to 1, the zerocopy path in tls_sw_sendmsg() is
>> skipped.
>> Unfortunately ktls with bpf_msg_pop_data() does not work correctly under
>> this copy path.
>>
>> tls_clone_plaintext_msg() aliases msg_pl onto msg_en's plaintext area
>> (in-place encryption).
>>
>> BPF runs bpf_msg_pop_data(msg, 0, 2). This shifts msg_pl's SG entry
>> forward by 2 bytes.
>> The two SGs now point to the same page at different offsets. Physical
>> memory overlaps but the start of address differ.
> Ugh, do you mean that the memcopy path is broken? There are other
> conditions under which we may fall into it than just !async_capable :(
> Small send with MSG_MORE is probably the easiest?
>
> So we need to fix that one way or the other.

Yes, the memcopy path is broken, but only when combined with sockmap's
pop helper.

msg_pl and msg_en share the underlying page:

       msg_pl             msg_pl end
       ^                  ^
|------|------------------|-------|
| hdr  |    plaintext     |  tag  |
|------|------------------|-------|
^                                 ^
|                                 |
msg_en                            msg_en end

Before encryption, sge->offset += prot->prepend_size is applied
to msg_en so that the encryption's dst and src point to the same
block of memory.

But once pop has run (i.e. msg_pl's start advances), the encryption's
dst and src are no longer the same.

crypto_ctr_crypt():
When dst and src have the same address, crypto saves the encryption
result into a temporary buffer and then writes it back to dst.

When dst and src have different addresses, the crypto module treats
them as two separate buffers and stops considering in-place mode.

It's complicated to process pop/push + head/mid/tail...

>> I think selecting a sync provider via mask = CRYPTO_ALG_ASYNC is
>> sufficient to remove the -EINPROGRESS return path.
>>
>> May be time to remove skmsg from ktls? (disable by default first,
>> re-enable via a new ktls module_param?)
> Yes, we asked John F off-list to get his attention and I think there's
> only a vague plan to start using kTLS + sockmap, no current user
> (sorry if I misread / misremembered).
>
> module params aren't a great API. If we want to deprecate it let's just
> remove the integration in net-next. You have my vote..
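
The aliasing problem described in this message can be modelled in a few
lines of userspace C. This is a sketch, not kernel code: struct
model_sge and the helper names are hypothetical simplifications of a
scatterlist entry, and the offsets in the usage note are arbitrary. It
only shows that after a pop at the front of the plaintext entry, src and
dst no longer start at the same location even though their byte ranges
still overlap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one scatterlist entry. */
struct model_sge {
	const void *page;   /* identity of the backing page */
	size_t offset;      /* start of the data within the page */
	size_t length;      /* number of bytes described */
};

/* Drop 'len' bytes from the front of an entry, as a pop at offset 0 would. */
static void model_pop_front(struct model_sge *sge, size_t len)
{
	if (len > sge->length)
		len = sge->length;
	sge->offset += len;
	sge->length -= len;
}

/* "In place" here means src and dst start at exactly the same location. */
static bool model_same_start(const struct model_sge *src,
			     const struct model_sge *dst)
{
	return src->page == dst->page && src->offset == dst->offset;
}

/* True when the two byte ranges share at least one byte of the same page. */
static bool model_overlaps(const struct model_sge *a,
			   const struct model_sge *b)
{
	return a->page == b->page &&
	       a->offset < b->offset + b->length &&
	       b->offset < a->offset + a->length;
}
```

For example, with both entries starting at offset 13 of the same page,
model_same_start() holds; after model_pop_front(&pl, 2) it no longer
does, while model_overlaps() still reports a shared range.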

On Wed, May 27, 2026 at 01:09:44PM +0800, Jiayuan Chen wrote:
>
>On 5/27/26 7:11 AM, Jakub Kicinski wrote:
>>On Tue, 26 May 2026 14:44:24 +0800 Jiayuan Chen wrote:
>>>If async_capable is set to 1, the zerocopy path in tls_sw_sendmsg() is
>>>skipped.
>>>Unfortunately ktls with bpf_msg_pop_data() does not work correctly under
>>>this copy path.
>>>
>>>tls_clone_plaintext_msg() aliases msg_pl onto msg_en's plaintext area
>>>(in-place encryption).
>>>
>>>BPF runs bpf_msg_pop_data(msg, 0, 2). This shifts msg_pl's SG entry
>>>forward by 2 bytes.
>>>The two SGs now point to the same page at different offsets. Physical
>>>memory overlaps but the start of address differ.
>>Ugh, do you mean that the memcopy path is broken? There are other
>>conditions under which we may fall into it than just !async_capable :(
>>Small send with MSG_MORE is probably the easiest?
>>
>>So we need to fix that one way or the other.
>
>
>Yes, the memcopy path is broken, but only when combined with sockmap's
>pop helper.
>
>
>msg_pl and msg_en share the underlying page:
>
>       msg_pl             msg_pl end
>       ^                  ^
>|------|------------------|-------|
>| hdr  |    plaintext     |  tag  |
>|------|------------------|-------|
>^                                 ^
>|                                 |
>msg_en                            msg_en end
>
>Before encryption, sge->offset += prot->prepend_size is applied
>to msg_en so that the encryption's dst and src point to the same
>block of memory.
>
>But once pop has run (i.e. msg_pl's start advances), the encryption's
>dst and src are no longer the same.
>
>crypto_ctr_crypt():
>When dst and src have the same address, crypto saves the encryption
>result into a temporary buffer and then writes it back to dst.
>
>When dst and src have different addresses, the crypto module treats
>them as two separate buffers and stops considering in-place mode.
>
>it's complicated to process pop/push + head/mid/tail...

For our use case (not deployed yet, but deployed in the non-kTLS case)
all we do is observe data and possibly drop the skb if it has malicious
HTTP headers, for example. All this push/pop/... in the middle of the
kTLS stack is painful.

One option: we start rejecting these helpers? That would resolve most of
the pain, I suspect. The original thought was we do have use cases
now for userspace proxy where we insert headers.

>
>>>I think selecting a sync provider via mask = CRYPTO_ALG_ASYNC is
>>>sufficient to remove the -EINPROGRESS return path.
>>>
>>>May be time to remove skmsg from ktls? (disable by default first,
>>>re-enable via a new ktls module_param?)
>>Yes, we asked John F off-list to get his attention and I think there's
>>only a vague plan to start using kTLS + sockmap, no current user
>>(sorry if I misread / misremembered).

I'm not against a cleaner solution here.

Another idea: We just add a simple sockops BPF hook with the sk_buff?
No updating sg lists, manipulating data packet sizes and so on. That
would solve the vast majority of any future use case if we have a user
that really started running kTLS and wanted the security stack to keep
working. Even openssl usage of kTLS has really ground to a halt after it
was initially added, as far as I can tell.

Something like this is already on the list for the recv side of TCP:

[PATCH v3 bpf-next 10/11] bpf: tcp: Add SOCK_OPS rcvlowat hook

>>
>>module params aren't a great API. If we want to deprecate it let's just
>>remove the integration in net-next. You have my vote..

On Wed, 27 May 2026 12:16:02 -0700 John Fastabend wrote:
> One option we start rejecting these helpers? That would resolve most
> the pain I suspect. The original thought was we do have use cases
> now for userspace proxy where we insert headers.

Rejecting the helpers would solve all the recent security issues, IIRC.
I couldn't think of a clean way to do that, are you thinking adding
a bit into the skmsg like "from ktls" or "fixed stream" (kinda like
we have at_ingress)?

> >>Yes, we asked John F off-list to get his attention and I think there's
> >>only a vague plan to start using kTLS + sockmap, no current user
> >>(sorry if I misread / misremembered).
>
> I'm not against a cleaner solution here.
>
> Another idea: We just add a simple sockops BPF hook with the sk_buff?
> No updating sg lists, manipulating data packet sizes and so on.

TBH I don't think the existing solution is particularly unclean.
It's just complex enough that it'd benefit from getting removed
and re-added, cause the re-add would undergo the modern LLM reviewer
bashing that should hopefully shake out most of the bugs.
Trying to do this surgery now, as urgent fixes is quite constraining.
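
The "fixed stream" bit floated above could look roughly like the
following userspace model. Everything here is hypothetical: struct
model_msg, the fixed_stream field, model_pop_data() and the choice of
-EOPNOTSUPP are assumptions made for illustration, not an existing
sk_msg field or the behaviour of bpf_msg_pop_data().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct sk_msg. */
struct model_msg {
	size_t size;        /* bytes currently in the message */
	bool fixed_stream;  /* hypothetical bit: layout must not change */
};

/* Model of a pop helper that refuses to resize a fixed-stream message. */
static int model_pop_data(struct model_msg *msg, size_t start, size_t len)
{
	if (msg->fixed_stream)
		return -EOPNOTSUPP;	/* error code chosen for illustration */
	/* len is checked first, so size - len cannot underflow. */
	if (len > msg->size || start > msg->size - len)
		return -EINVAL;
	msg->size -= len;
	return 0;
}
```

A message coming from kTLS would have the bit set when it is handed to
the verdict program, so resizing helpers fail early while read-only
inspection and pass/drop verdicts keep working.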