From nobody Sun Feb 8 05:40:30 2026
From: Jiayuan Chen
To: bpf@vger.kernel.org
Cc: Jiayuan Chen, Jakub Sitnicki, John Fastabend, "David S. Miller",
 Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Neal Cardwell,
 Kuniyuki Iwashima, David Ahern, Andrii Nakryiko, Eduard Zingerman,
 Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
 Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
 Shuah Khan, Michal Luczaj, Stefano Garzarella, Cong Wang,
 netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-kselftest@vger.kernel.org
Subject: [PATCH bpf-next v6 1/3] bpf, sockmap: Fix incorrect copied_seq calculation
Date: Thu, 8 Jan 2026 23:00:30 +0800
Message-ID: <20260108150102.12563-2-jiayuan.chen@linux.dev>
In-Reply-To: <20260108150102.12563-1-jiayuan.chen@linux.dev>
References: <20260108150102.12563-1-jiayuan.chen@linux.dev>

A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.

The issue is that when reading from ingress_msg, we update tp->copied_seq
by default. However, if the data is not from its own protocol stack,
tcp->rcv_nxt is not increased.
Later, if we convert this socket to a native socket, reading from this
socket may fail because copied_seq might be significantly larger than
rcv_nxt.

This fix also addresses the syzkaller-reported bug referenced in the
Closes tag.

This patch marks the skmsg objects in ingress_msg. When reading, we update
copied_seq only if the data is from its own protocol stack.

                                                     FD1:read()
                                                     --  FD1->copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Reviewed-by: Jakub Sitnicki
Signed-off-by: Jiayuan Chen
---
 include/linux/skmsg.h |  2 ++
 net/core/skmsg.c      | 25 ++++++++++++++++++++++---
 net/ipv4/tcp_bpf.c    |  5 +++--
 3 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 49847888c287..dfdc158ab88c 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -141,6 +141,8 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 			      struct sk_msg *msg, u32 bytes);
 int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 		   int len, int flags);
+int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+		     int len, int flags, int *copied_from_self);
 bool sk_msg_is_readable(struct sock *sk);
 
 static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 2ac7731e1e0a..3d147837b82c 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -409,14 +409,14 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 }
 EXPORT_SYMBOL_GPL(sk_msg_memcopy_from_iter);
 
-/* Receive sk_msg from psock->ingress_msg to @msg. */
-int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
-		   int len, int flags)
+int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+		     int len, int flags, int *copied_from_self)
 {
 	struct iov_iter *iter = &msg->msg_iter;
 	int peek = flags & MSG_PEEK;
 	struct sk_msg *msg_rx;
 	int i, copied = 0;
+	bool from_self;
 
 	msg_rx = sk_psock_peek_msg(psock);
 	while (copied != len) {
@@ -425,6 +425,7 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 		if (unlikely(!msg_rx))
 			break;
 
+		from_self = msg_rx->sk == sk;
 		i = msg_rx->sg.start;
 		do {
 			struct page *page;
@@ -443,6 +444,9 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 			}
 
 			copied += copy;
+			if (from_self && copied_from_self)
+				*copied_from_self += copy;
+
 			if (likely(!peek)) {
 				sge->offset += copy;
 				sge->length -= copy;
@@ -487,6 +491,14 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 out:
 	return copied;
 }
+EXPORT_SYMBOL_GPL(__sk_msg_recvmsg);
+
+/* Receive sk_msg from psock->ingress_msg to @msg. */
+int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+		   int len, int flags)
+{
+	return __sk_msg_recvmsg(sk, psock, msg, len, flags, NULL);
+}
 EXPORT_SYMBOL_GPL(sk_msg_recvmsg);
 
 bool sk_msg_is_readable(struct sock *sk)
@@ -616,6 +628,12 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
 	if (unlikely(!msg))
 		return -EAGAIN;
 	skb_set_owner_r(skb, sk);
+
+	/* This is used in tcp_bpf_recvmsg_parser() to determine whether the
+	 * data originates from the socket's own protocol stack. No need to
+	 * refcount sk because msg's lifetime is bound to sk via the ingress_msg.
+	 */
+	msg->sk = sk;
 	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref);
 	if (err < 0)
 		kfree(msg);
@@ -909,6 +927,7 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock,
 	sk_msg_compute_data_pointers(msg);
 	msg->sk = sk;
 	ret = bpf_prog_run_pin_on_cpu(prog, msg);
+	msg->sk = NULL;
 	ret = sk_psock_map_verd(ret, msg->sk_redir);
 	psock->apply_bytes = msg->apply_bytes;
 	if (ret == __SK_REDIRECT) {
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index a268e1595b22..5c698fd7fbf8 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -226,6 +226,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 	int peek = flags & MSG_PEEK;
 	struct sk_psock *psock;
 	struct tcp_sock *tcp;
+	int copied_from_self = 0;
 	int copied = 0;
 	u32 seq;
 
@@ -262,7 +263,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 	}
 
 msg_bytes_ready:
-	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
+	copied = __sk_msg_recvmsg(sk, psock, msg, len, flags, &copied_from_self);
 	/* The typical case for EFAULT is the socket was gracefully
 	 * shutdown with a FIN pkt. So check here the other case is
 	 * some error on copy_page_to_iter which would be unexpected.
@@ -277,7 +278,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 			goto out;
 		}
 	}
-	seq += copied;
+	seq += copied_from_self;
 	if (!copied) {
 		long timeo;
 		int data;
-- 
2.43.0

From nobody Sun Feb 8 05:40:30 2026
From: Jiayuan Chen
To: bpf@vger.kernel.org
Cc: Jiayuan Chen, John Fastabend, Jakub Sitnicki, "David S. Miller",
 Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Neal Cardwell,
 Kuniyuki Iwashima, David Ahern, Andrii Nakryiko, Eduard Zingerman,
 Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
 Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
 Shuah Khan, Michal Luczaj, Cong Wang, netdev@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org
Subject: [PATCH bpf-next v6 2/3] bpf, sockmap: Fix FIONREAD for sockmap
Date: Thu, 8 Jan 2026 23:00:31 +0800
Message-ID: <20260108150102.12563-3-jiayuan.chen@linux.dev>
In-Reply-To: <20260108150102.12563-1-jiayuan.chen@linux.dev>
References: <20260108150102.12563-1-jiayuan.chen@linux.dev>

A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.

Therefore, for sockmap, relying solely on copied_seq and rcv_nxt to
calculate FIONREAD is not enough.

This patch adds a new msg_tot_len field in the psock structure to record
the data length in ingress_msg.
Additionally, we implement new ioctl interfaces for TCP and UDP to
intercept FIONREAD operations.

Unix and VSOCK sockets have similar issues, but fixing them is outside the
scope of this patch as it would require more intrusive changes.

Previous work by John Fastabend made some efforts towards FIONREAD support:
commit e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq")
Although the current patch is based on the previous work by John Fastabend,
it is acceptable for our Fixes tag to point to the same commit.

                                                     FD1:read()
                                                     --  FD1->copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Signed-off-by: Jiayuan Chen
---
 include/linux/skmsg.h | 68 +++++++++++++++++++++++++++++++++++++++++--
 net/core/skmsg.c      |  3 ++
 net/ipv4/tcp_bpf.c    | 21 +++++++++++++
 net/ipv4/udp_bpf.c    | 20 ++++++++++---
 4 files changed, 106 insertions(+), 6 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index dfdc158ab88c..829b281d6c9c 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -97,6 +97,8 @@ struct sk_psock {
 	struct sk_buff_head		ingress_skb;
 	struct list_head		ingress_msg;
 	spinlock_t			ingress_lock;
+	/** @msg_tot_len: Total bytes queued in ingress_msg list. */
+	u32				msg_tot_len;
 	unsigned long			state;
 	struct list_head		link;
 	spinlock_t			link_lock;
@@ -321,6 +323,27 @@ static inline void sock_drop(struct sock *sk, struct sk_buff *skb)
 	kfree_skb(skb);
 }
 
+static inline u32 sk_psock_get_msg_len_nolock(struct sk_psock *psock)
+{
+	/* Used by ioctl to read msg_tot_len only; lock-free for performance */
+	return READ_ONCE(psock->msg_tot_len);
+}
+
+static inline void sk_psock_msg_len_add_locked(struct sk_psock *psock, int diff)
+{
+	/* Use WRITE_ONCE to ensure correct read in sk_psock_get_msg_len_nolock().
+	 * ingress_lock should be held to prevent concurrent updates to msg_tot_len
+	 */
+	WRITE_ONCE(psock->msg_tot_len, psock->msg_tot_len + diff);
+}
+
+static inline void sk_psock_msg_len_add(struct sk_psock *psock, int diff)
+{
+	spin_lock_bh(&psock->ingress_lock);
+	sk_psock_msg_len_add_locked(psock, diff);
+	spin_unlock_bh(&psock->ingress_lock);
+}
+
 static inline bool sk_psock_queue_msg(struct sk_psock *psock,
 				      struct sk_msg *msg)
 {
@@ -329,6 +352,7 @@ static inline bool sk_psock_queue_msg(struct sk_psock *psock,
 	spin_lock_bh(&psock->ingress_lock);
 	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
 		list_add_tail(&msg->list, &psock->ingress_msg);
+		sk_psock_msg_len_add_locked(psock, msg->sg.size);
 		ret = true;
 	} else {
 		sk_msg_free(psock->sk, msg);
@@ -345,18 +369,25 @@ static inline struct sk_msg *sk_psock_dequeue_msg(struct sk_psock *psock)
 
 	spin_lock_bh(&psock->ingress_lock);
 	msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
-	if (msg)
+	if (msg) {
 		list_del(&msg->list);
+		sk_psock_msg_len_add_locked(psock, -msg->sg.size);
+	}
 	spin_unlock_bh(&psock->ingress_lock);
 	return msg;
 }
 
+static inline struct sk_msg *sk_psock_peek_msg_locked(struct sk_psock *psock)
+{
+	return list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
+}
+
 static inline struct sk_msg *sk_psock_peek_msg(struct sk_psock *psock)
 {
 	struct sk_msg *msg;
 
 	spin_lock_bh(&psock->ingress_lock);
-	msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
+	msg = sk_psock_peek_msg_locked(psock);
 	spin_unlock_bh(&psock->ingress_lock);
 	return msg;
 }
@@ -523,6 +554,39 @@ static inline bool sk_psock_strp_enabled(struct sk_psock *psock)
 	return !!psock->saved_data_ready;
 }
 
+/* for tcp only, sk is locked */
+static inline ssize_t sk_psock_msg_inq(struct sock *sk)
+{
+	struct sk_psock *psock;
+	ssize_t inq = 0;
+
+	psock = sk_psock_get(sk);
+	if (likely(psock)) {
+		inq = sk_psock_get_msg_len_nolock(psock);
+		sk_psock_put(sk, psock);
+	}
+	return inq;
+}
+
+/* for udp only, sk is not locked */
+static inline ssize_t sk_msg_first_len(struct sock *sk)
+{
+	struct sk_psock *psock;
+	struct sk_msg *msg;
+	ssize_t inq = 0;
+
+	psock = sk_psock_get(sk);
+	if (likely(psock)) {
+		spin_lock_bh(&psock->ingress_lock);
+		msg = sk_psock_peek_msg_locked(psock);
+		if (msg)
+			inq = msg->sg.size;
+		spin_unlock_bh(&psock->ingress_lock);
+		sk_psock_put(sk, psock);
+	}
+	return inq;
+}
+
 #if IS_ENABLED(CONFIG_NET_SOCK_MSG)
 
 #define BPF_F_STRPARSER	(1UL << 1)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 3d147837b82c..57a94e9fb8c1 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -455,6 +455,7 @@ int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg
 					atomic_sub(copy, &sk->sk_rmem_alloc);
 				}
 				msg_rx->sg.size -= copy;
+				sk_psock_msg_len_add(psock, -copy);
 
 				if (!sge->length) {
 					sk_msg_iter_var_next(i);
@@ -819,9 +820,11 @@ static void __sk_psock_purge_ingress_msg(struct sk_psock *psock)
 		list_del(&msg->list);
 		if (!msg->skb)
 			atomic_sub(msg->sg.size, &psock->sk->sk_rmem_alloc);
+		sk_psock_msg_len_add(psock, -msg->sg.size);
 		sk_msg_free(psock->sk, msg);
 		kfree(msg);
 	}
+	WARN_ON_ONCE(psock->msg_tot_len);
 }
 
 static void __sk_psock_zap_ingress(struct sk_psock *psock)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 5c698fd7fbf8..1660b4efe5d2 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -10,6 +10,7 @@
 
 #include
 #include
+#include
 
 void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
 {
@@ -332,6 +333,25 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 	return copied;
 }
 
+static int tcp_bpf_ioctl(struct sock *sk, int cmd, int *karg)
+{
+	bool slow;
+
+	/* we only care about FIONREAD */
+	if (cmd != SIOCINQ)
+		return tcp_ioctl(sk, cmd, karg);
+
+	/* works similarly to tcp_ioctl */
+	if (sk->sk_state == TCP_LISTEN)
+		return -EINVAL;
+
+	slow = lock_sock_fast(sk);
+	*karg = sk_psock_msg_inq(sk);
+	unlock_sock_fast(sk, slow);
+
+	return 0;
+}
+
 static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			   int flags, int *addr_len)
 {
@@ -610,6 +630,7 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
 	prot[TCP_BPF_BASE].close		= sock_map_close;
 	prot[TCP_BPF_BASE].recvmsg		= tcp_bpf_recvmsg;
 	prot[TCP_BPF_BASE].sock_is_readable	= sk_msg_is_readable;
+	prot[TCP_BPF_BASE].ioctl		= tcp_bpf_ioctl;
 
 	prot[TCP_BPF_TX]			= prot[TCP_BPF_BASE];
 	prot[TCP_BPF_TX].sendmsg		= tcp_bpf_sendmsg;
diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
index 0735d820e413..424f664df71b 100644
--- a/net/ipv4/udp_bpf.c
+++ b/net/ipv4/udp_bpf.c
@@ -5,6 +5,7 @@
 #include
 #include
 #include
+#include
 
 #include "udp_impl.h"
 
@@ -111,12 +112,23 @@ enum {
 static DEFINE_SPINLOCK(udpv6_prot_lock);
 static struct proto udp_bpf_prots[UDP_BPF_NUM_PROTS];
 
+static int udp_bpf_ioctl(struct sock *sk, int cmd, int *karg)
+{
+	if (cmd != SIOCINQ)
+		return udp_ioctl(sk, cmd, karg);
+
+	/* works similarly to udp_ioctl. */
+	*karg = sk_msg_first_len(sk);
+	return 0;
+}
+
 static void udp_bpf_rebuild_protos(struct proto *prot, const struct proto *base)
 {
-	*prot        = *base;
-	prot->close  = sock_map_close;
-	prot->recvmsg = udp_bpf_recvmsg;
-	prot->sock_is_readable = sk_msg_is_readable;
+	*prot			= *base;
+	prot->close		= sock_map_close;
+	prot->recvmsg		= udp_bpf_recvmsg;
+	prot->sock_is_readable	= sk_msg_is_readable;
+	prot->ioctl		= udp_bpf_ioctl;
 }
 
 static void udp_bpf_check_v6_needs_rebuild(struct proto *ops)
-- 
2.43.0

From nobody Sun Feb 8 05:40:30 2026
From: Jiayuan Chen
To: bpf@vger.kernel.org
Cc: Jiayuan Chen, John Fastabend, Jakub Sitnicki, "David S. Miller",
 Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Neal Cardwell,
 Kuniyuki Iwashima, David Ahern, Andrii Nakryiko, Eduard Zingerman,
 Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
 Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
 Shuah Khan, Stefano Garzarella, Michal Luczaj, Cong Wang,
 netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-kselftest@vger.kernel.org
Subject: [PATCH bpf-next v6 3/3] bpf, selftest: Add tests for FIONREAD and copied_seq
Date: Thu, 8 Jan 2026 23:00:32 +0800
Message-ID: <20260108150102.12563-4-jiayuan.chen@linux.dev>
In-Reply-To: <20260108150102.12563-1-jiayuan.chen@linux.dev>
References: <20260108150102.12563-1-jiayuan.chen@linux.dev>

This commit adds two new test functions: one to reproduce the bug reported
by syzkaller [1], and another to cover the calculation of copied_seq.

The tests primarily involve installing and uninstalling sockmap on
sockets, then reading data to verify proper functionality.

Additionally, extend the do_test_sockmap_skb_verdict_fionread() function
to support UDP FIONREAD testing.

[1] https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983

Signed-off-by: Jiayuan Chen
---
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 205 +++++++++++++++++-
 .../bpf/progs/test_sockmap_pass_prog.c        |  14 ++
 2 files changed, 213 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
index 1e3e4392dcca..928659030ef6 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
@@ -1,7 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 // Copyright (c) 2020 Cloudflare
 #include
-#include
+#include
+#include
 #include
 
 #include "test_progs.h"
@@ -22,6 +23,15 @@
 #define TCP_REPAIR_ON		1
 #define TCP_REPAIR_OFF_NO_WP	-1	/* Turn off without window probes */
 
+/**
+ * SOL_TCP is defined in (glibc), but the copybuf_address
+ * field of tcp_zerocopy_receive is not yet included in older versions.
+ * This workaround remains necessary until the glibc update propagates.
+ */
+#ifndef SOL_TCP
+#define SOL_TCP 6
+#endif
+
 static int connected_socket_v4(void)
 {
 	struct sockaddr_in addr = {
@@ -536,13 +546,14 @@ static void test_sockmap_skb_verdict_shutdown(void)
 }
 
 
-static void test_sockmap_skb_verdict_fionread(bool pass_prog)
+static void do_test_sockmap_skb_verdict_fionread(int sotype, bool pass_prog)
 {
 	int err, map, verdict, c0 = -1, c1 = -1, p0 = -1, p1 = -1;
 	int expected, zero = 0, sent, recvd, avail;
 	struct test_sockmap_pass_prog *pass = NULL;
 	struct test_sockmap_drop_prog *drop = NULL;
 	char buf[256] = "0123456789";
+	int split_len = sizeof(buf) / 2;
 
 	if (pass_prog) {
 		pass = test_sockmap_pass_prog__open_and_load();
@@ -550,7 +561,10 @@ static void test_sockmap_skb_verdict_fionread(bool pass_prog)
 			return;
 		verdict = bpf_program__fd(pass->progs.prog_skb_verdict);
 		map = bpf_map__fd(pass->maps.sock_map_rx);
-		expected = sizeof(buf);
+		if (sotype == SOCK_DGRAM)
+			expected = split_len; /* FIONREAD for UDP is different from TCP */
+		else
+			expected = sizeof(buf);
 	} else {
 		drop = test_sockmap_drop_prog__open_and_load();
 		if (!ASSERT_OK_PTR(drop, "open_and_load"))
@@ -566,7 +580,7 @@ static void test_sockmap_skb_verdict_fionread(bool pass_prog)
 	if (!ASSERT_OK(err, "bpf_prog_attach"))
 		goto out;
 
-	err = create_socket_pairs(AF_INET, SOCK_STREAM, &c0, &c1, &p0, &p1);
+	err = create_socket_pairs(AF_INET, sotype, &c0, &c1, &p0, &p1);
 	if (!ASSERT_OK(err, "create_socket_pairs()"))
 		goto out;
 
@@ -574,8 +588,9 @@ static void test_sockmap_skb_verdict_fionread(bool pass_prog)
 	if (!ASSERT_OK(err, "bpf_map_update_elem(c1)"))
 		goto out_close;
 
-	sent = xsend(p1, &buf, sizeof(buf), 0);
-	ASSERT_EQ(sent, sizeof(buf), "xsend(p0)");
+	sent = xsend(p1, &buf, split_len, 0);
+	sent += xsend(p1, &buf, sizeof(buf) - split_len, 0);
+	ASSERT_EQ(sent, sizeof(buf), "xsend(p1)");
 	err = ioctl(c1, FIONREAD, &avail);
 	ASSERT_OK(err, "ioctl(FIONREAD) error");
 	ASSERT_EQ(avail, expected, "ioctl(FIONREAD)");
@@ -597,6 +612,12 @@ static void test_sockmap_skb_verdict_fionread(bool pass_prog)
 		test_sockmap_drop_prog__destroy(drop);
 }
 
+static void test_sockmap_skb_verdict_fionread(bool pass_prog)
+{
+	do_test_sockmap_skb_verdict_fionread(SOCK_STREAM, pass_prog);
+	do_test_sockmap_skb_verdict_fionread(SOCK_DGRAM, pass_prog);
+}
+
 static void test_sockmap_skb_verdict_change_tail(void)
 {
 	struct test_sockmap_change_tail *skel;
@@ -1042,6 +1063,172 @@ static void test_sockmap_vsock_unconnected(void)
 	xclose(map);
 }
 
+/* it is used to reproduce WARNING */
+static void test_sockmap_zc(void)
+{
+	int map, err, sent, recvd, zero = 0, one = 1, on = 1;
+	char buf[10] = "0123456789", rcv[11], addr[100];
+	struct test_sockmap_pass_prog *skel = NULL;
+	int c0 = -1, p0 = -1, c1 = -1, p1 = -1;
+	struct tcp_zerocopy_receive zc;
+	socklen_t zc_len = sizeof(zc);
+	struct bpf_program *prog;
+
+	skel = test_sockmap_pass_prog__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	if (create_socket_pairs(AF_INET, SOCK_STREAM, &c0, &c1, &p0, &p1))
+		goto end;
+
+	prog = skel->progs.prog_skb_verdict_ingress;
+	map = bpf_map__fd(skel->maps.sock_map_rx);
+
+	err = bpf_prog_attach(bpf_program__fd(prog), map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (!ASSERT_OK(err, "bpf_prog_attach"))
+		goto end;
+
+	err = bpf_map_update_elem(map, &zero, &p0, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem"))
+		goto end;
+
+	err = bpf_map_update_elem(map, &one, &p1, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem"))
+		goto end;
+
+	sent = xsend(c0, buf, sizeof(buf), 0);
+	if (!ASSERT_EQ(sent, sizeof(buf), "xsend"))
+		goto end;
+
+	/* trigger tcp_bpf_recvmsg_parser and inc copied_seq of p1 */
+	recvd = recv_timeout(p1, rcv, sizeof(rcv), MSG_DONTWAIT, 1);
+	if (!ASSERT_EQ(recvd, sent, "recv_timeout(p1)"))
+		goto end;
+
+	/* uninstall sockmap of p1 */
+	bpf_map_delete_elem(map, &one);
+
+	/* trigger tcp stack and the rcv_nxt of p1 is less than copied_seq */
+	sent = xsend(c1, buf, sizeof(buf) - 1, 0);
+	if (!ASSERT_EQ(sent, sizeof(buf) - 1, "xsend"))
+		goto end;
+
+	err = setsockopt(p1, SOL_SOCKET, SO_ZEROCOPY, &on, sizeof(on));
+	if (!ASSERT_OK(err, "setsockopt"))
+		goto end;
+
+	memset(&zc, 0, sizeof(zc));
+	zc.copybuf_address = (__u64)((unsigned long)addr);
+	zc.copybuf_len = sizeof(addr);
+
+	err = getsockopt(p1, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len);
+	if (!ASSERT_OK(err, "getsockopt"))
+		goto end;
+
+end:
+	if (c0 >= 0)
+		close(c0);
+	if (p0 >= 0)
+		close(p0);
+	if (c1 >= 0)
+		close(c1);
+	if (p1 >= 0)
+		close(p1);
+	test_sockmap_pass_prog__destroy(skel);
+}
+
+/* it is used to check whether copied_seq of sk is correct */
+static void test_sockmap_copied_seq(bool strp)
+{
+	int i, map, err, sent, recvd, zero = 0, one = 1;
+	struct test_sockmap_pass_prog *skel = NULL;
+	int c0 = -1, p0 = -1, c1 = -1, p1 = -1;
+	char buf[10] = "0123456789", rcv[11];
+	struct bpf_program *prog;
+
+	skel = test_sockmap_pass_prog__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	if (create_socket_pairs(AF_INET, SOCK_STREAM, &c0, &c1, &p0, &p1))
+		goto end;
+
+	prog = skel->progs.prog_skb_verdict_ingress;
+	map = bpf_map__fd(skel->maps.sock_map_rx);
+
+	err = bpf_prog_attach(bpf_program__fd(prog), map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (!ASSERT_OK(err, "bpf_prog_attach verdict"))
+		goto end;
+
+	if (strp) {
+		prog = skel->progs.prog_skb_verdict_ingress_strp;
+		err = bpf_prog_attach(bpf_program__fd(prog), map, BPF_SK_SKB_STREAM_PARSER, 0);
+		if (!ASSERT_OK(err, "bpf_prog_attach parser"))
+			goto end;
+	}
+
+	err = bpf_map_update_elem(map, &zero, &p0, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem(p0)"))
+		goto end;
+
+	err = bpf_map_update_elem(map, &one, &p1, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem(p1)"))
+		goto end;
+
+	/* just trigger sockmap: data sent by c0 will be received by p1 */
+	sent = xsend(c0, buf, sizeof(buf), 0);
+	if (!ASSERT_EQ(sent, sizeof(buf), "xsend(c0), bpf"))
+		goto end;
+
+	/* do partial read */
+	recvd = recv_timeout(p1, rcv, 1, MSG_DONTWAIT, 1);
+	recvd += recv_timeout(p1, rcv + 1, sizeof(rcv) - 1, MSG_DONTWAIT, 1);
+	if (!ASSERT_EQ(recvd, sent, "recv_timeout(p1), bpf") ||
+	    !ASSERT_OK(memcmp(buf, rcv, recvd), "data mismatch"))
+		goto end;
+
+	/* uninstall sockmap of p1 and p0 */
+	err = bpf_map_delete_elem(map, &one);
+	if (!ASSERT_OK(err, "bpf_map_delete_elem(1)"))
+		goto end;
+
+	err = bpf_map_delete_elem(map, &zero);
+	if (!ASSERT_OK(err, "bpf_map_delete_elem(0)"))
+		goto end;
+
+	/* now all sockets become plain socket, they should still work */
+	for (i = 0; i < 5; i++) {
+		/* test copied_seq of p1 by running tcp native stack */
+		sent = xsend(c1, buf, sizeof(buf), 0);
+		if (!ASSERT_EQ(sent, sizeof(buf), "xsend(c1), native"))
+			goto end;
+
+		recvd = recv(p1, rcv, sizeof(rcv), MSG_DONTWAIT);
+		if (!ASSERT_EQ(recvd, sent, "recv_timeout(p1), native"))
+			goto end;
+
+		/* p0 previously redirected skb to p1, we also check copied_seq of p0 */
+		sent = xsend(c0, buf, sizeof(buf), 0);
+		if (!ASSERT_EQ(sent, sizeof(buf), "xsend(c0), native"))
+			goto end;
+
+		recvd = recv(p0, rcv, sizeof(rcv), MSG_DONTWAIT);
+		if (!ASSERT_EQ(recvd, sent, "recv_timeout(p0), native"))
+			goto end;
+	}
+
+end:
+	if (c0 >= 0)
+		close(c0);
+	if (p0 >= 0)
+		close(p0);
+	if (c1 >= 0)
+		close(c1);
+	if (p1 >= 0)
+		close(p1);
+	test_sockmap_pass_prog__destroy(skel);
+}
+
 void test_sockmap_basic(void)
 {
 	if (test__start_subtest("sockmap create_update_free"))
@@ -1108,4 +1295,10 @@ void test_sockmap_basic(void)
 		test_sockmap_skb_verdict_vsock_poll();
 	if (test__start_subtest("sockmap vsock unconnected"))
 		test_sockmap_vsock_unconnected();
+	if (test__start_subtest("sockmap with zc"))
+		test_sockmap_zc();
+	if (test__start_subtest("sockmap recover"))
+		test_sockmap_copied_seq(false);
+	if (test__start_subtest("sockmap recover with strp"))
+		test_sockmap_copied_seq(true);
 }
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_pass_prog.c b/tools/testing/selftests/bpf/progs/test_sockmap_pass_prog.c
index 69aacc96db36..ef9edca184ea 100644
--- a/tools/testing/selftests/bpf/progs/test_sockmap_pass_prog.c
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_pass_prog.c
@@ -44,4 +44,18 @@ int prog_skb_parser(struct __sk_buff *skb)
 	return SK_PASS;
 }
 
+SEC("sk_skb/stream_verdict")
+int prog_skb_verdict_ingress(struct __sk_buff *skb)
+{
+	int one = 1;
+
+	return bpf_sk_redirect_map(skb, &sock_map_rx, one, BPF_F_INGRESS);
+}
+
+SEC("sk_skb/stream_parser")
+int prog_skb_verdict_ingress_strp(struct __sk_buff *skb)
+{
+	return skb->len;
+}
+
 char _license[] SEC("license") = "GPL";
-- 
2.43.0
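
Illustrative note (not part of the series): the accounting described in
patches 1/3 and 2/3 can be sketched as a small userspace model. This is
only a sketch under stated assumptions. Every name below (model_sock,
model_msg, model_enqueue, model_read, model_fionread_stream,
model_fionread_dgram) is hypothetical and is not kernel API. The model
assumes rcv_nxt advances only for data that arrived via the socket's own
stack, and it ignores locking, MSG_PEEK and scatterlists entirely.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_QUEUE_MAX 8	/* hypothetical fixed queue depth */

/* One queued message; from_self models "msg->sk == sk" in the patch. */
struct model_msg {
	size_t len;		/* bytes still unread */
	int from_self;		/* 1 if enqueued by the socket's own stack */
};

/* Hypothetical stand-in for a socket plus its psock. */
struct model_sock {
	unsigned int rcv_nxt;	 /* advanced only for own-stack data */
	unsigned int copied_seq; /* advanced by reads of own-stack data */
	struct model_msg queue[MODEL_QUEUE_MAX];
	size_t head, tail;
	size_t msg_tot_len;	 /* models psock->msg_tot_len */
};

/* Enqueue len bytes; only own-stack data moves rcv_nxt. */
static void model_enqueue(struct model_sock *sk, size_t len, int from_self)
{
	assert(sk->tail < MODEL_QUEUE_MAX);
	sk->queue[sk->tail].len = len;
	sk->queue[sk->tail].from_self = from_self;
	sk->tail++;
	sk->msg_tot_len += len;
	if (from_self)
		sk->rcv_nxt += (unsigned int)len;
}

/* Read up to len bytes. Returns all bytes copied, but advances
 * copied_seq only by the bytes that came from the socket itself.
 */
static size_t model_read(struct model_sock *sk, size_t len)
{
	size_t copied = 0, copied_from_self = 0;

	while (copied != len && sk->head < sk->tail) {
		struct model_msg *m = &sk->queue[sk->head];
		size_t want = len - copied;
		size_t copy = want < m->len ? want : m->len;

		copied += copy;
		if (m->from_self)
			copied_from_self += copy;
		m->len -= copy;
		sk->msg_tot_len -= copy;
		if (!m->len)
			sk->head++;
	}
	sk->copied_seq += (unsigned int)copied_from_self;
	return copied;
}

/* Stream-style FIONREAD: total bytes queued. */
static size_t model_fionread_stream(const struct model_sock *sk)
{
	return sk->msg_tot_len;
}

/* Datagram-style FIONREAD: length of the first queued message only. */
static size_t model_fionread_dgram(const struct model_sock *sk)
{
	return sk->head < sk->tail ? sk->queue[sk->head].len : 0;
}
```

In this model, queueing 10 redirected bytes followed by 4 own-stack bytes
and then reading everything leaves copied_seq equal to rcv_nxt, which is
the property the copied_seq fix is meant to preserve once the socket
leaves the sockmap. The two FIONREAD helpers mirror why the selftest
expects split_len for SOCK_DGRAM and sizeof(buf) for SOCK_STREAM.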