[PATCH mptcp-net v2] mptcp: bpf: fix NULL deref and UAF in bpf_mptcp_sock_from_subflow()

Kalpan Jani posted 1 patch 1 week, 1 day ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/multipath-tcp/mptcp_net-next tags/patchew/20260626125058.868855-1-kalpan.jani@mpiricsoftware.com
There is a newer version of this series
kernel/trace/bpf_trace.c |  2 +-
net/mptcp/bpf.c          | 14 ++++++++++++--
net/mptcp/protocol.c     |  6 +++++-
net/mptcp/subflow.c      |  9 +++++++--
4 files changed, 25 insertions(+), 6 deletions(-)
[PATCH mptcp-net v2] mptcp: bpf: fix NULL deref and UAF in bpf_mptcp_sock_from_subflow()
Posted by Kalpan Jani 1 week, 1 day ago
bpf_mptcp_sock_from_subflow() is reachable from BPF programs via
bpf_skc_to_mptcp_sock() on an arbitrary socket, without the subflow
socket lock held. It assumes sk_is_mptcp(sk) implies a valid subflow
context whose ->conn points to a live parent mptcp_sock. That invariant
does not hold in three windows:

- Fallback: subflow_ulp_fallback() clears icsk_ulp_data via
  rcu_assign_pointer(..., NULL) before clearing tcp_sk(sk)->is_mptcp, so a
  concurrent reader can observe is_mptcp == 1 with a NULL context and
  dereference mptcp_subflow_ctx(sk)->conn through a NULL pointer.

- Init: subflow_ulp_init() sets is_mptcp = 1 while ->conn is still NULL;
  ->conn is only assigned later in subflow_syn_recv_sock() or
  mptcp_subflow_create_socket(). On CONFIG_DEBUG_NET, mptcp_sk()
  dereferences its argument in a WARN_ON(), so mptcp_sk(NULL) faults.

- Teardown: subflow_ulp_release() drops the subflow-owned parent
  reference with sock_put(ctx->conn) without clearing ctx->conn. The
  established parent msk is not SOCK_RCU_FREE (mptcp_sk_clone_init()
  resets the flag), so it is freed synchronously and a lockless reader can
  dereference a dangling ->conn: a use-after-free.

icsk_ulp_data is an __rcu pointer and this helper can run locklessly, so
load the context with rcu_dereference_check() (lockdep_sock_is_held()
covers the locked struct_ops callers), load ->conn once with READ_ONCE(),
pair the ->conn stores with WRITE_ONCE(), reject a NULL context or NULL
->conn, and clear ->conn before dropping the parent reference.

A NULL check alone cannot fix the teardown case: a reader may load a
non-NULL ->conn just before it is cleared and dereference the parent
after sock_put() has freed it. Give the parent msk RCU-grace lifetime by
setting SOCK_RCU_FREE in the shared __mptcp_init_sock(), covering both the
active and passive paths, and dropping the reset in mptcp_sk_clone_init().
Non-sleepable BPF runs in an RCU read-side critical section, so a reader
that obtains the msk is then guaranteed it stays allocated for the
program's duration.

That lifetime guarantee relies on classic RCU, which does not hold off RCU
Tasks Trace, so it does not extend to sleepable programs. Unlike the other
bpf_skc_to_*() casts, which return the same object as the trusted input
sk, this helper returns a different object, so stop exposing it to
sleepable programs in tracing_prog_func_proto().

Fixes: 3bc253c2e652 ("bpf: Add bpf_skc_to_mptcp_sock_proto")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/622
Assisted-by: Claude:claude-opus-4.8.
Signed-off-by: Kalpan Jani <kalpan.jani@mpiricsoftware.com>
---
Changes in v2:
- Lockless access fixes (Li Xiasong): load the subflow context with
  rcu_dereference_check() instead of a plain dereference, and read ->conn
  once with READ_ONCE().
- Fix the teardown use-after-free that v1 did not address: clear ->conn
  before dropping the parent reference in subflow_ulp_release(), and give
  the parent msk RCU-grace lifetime via SOCK_RCU_FREE so a lockless reader
  cannot dereference a freed parent.
- Stop exposing the helper to sleepable BPF programs, where classic RCU
  gives no lifetime guarantee.

v1: https://lore.kernel.org/all/20260612072643.2313900-1-kalpan.jani@mpiricsoftware.com/
  

 kernel/trace/bpf_trace.c |  2 +-
 net/mptcp/bpf.c          | 14 ++++++++++++--
 net/mptcp/protocol.c     |  6 +++++-
 net/mptcp/subflow.c      |  9 +++++++--
 4 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index a02bd258677e..ea489edbecec 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1715,7 +1715,7 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_skc_to_unix_sock:
 		return &bpf_skc_to_unix_sock_proto;
 	case BPF_FUNC_skc_to_mptcp_sock:
-		return &bpf_skc_to_mptcp_sock_proto;
+		return prog->sleepable ? NULL : &bpf_skc_to_mptcp_sock_proto;
 	case BPF_FUNC_sk_storage_get:
 		return &bpf_sk_storage_get_tracing_proto;
 	case BPF_FUNC_sk_storage_delete:
diff --git a/net/mptcp/bpf.c b/net/mptcp/bpf.c
index 08bb037f0951..138e6a145135 100644
--- a/net/mptcp/bpf.c
+++ b/net/mptcp/bpf.c
@@ -193,8 +193,18 @@ static struct bpf_struct_ops bpf_mptcp_sched_ops = {
 
 struct mptcp_sock *bpf_mptcp_sock_from_subflow(struct sock *sk)
 {
-	if (sk && sk_fullsock(sk) && sk_is_tcp(sk) && sk_is_mptcp(sk))
-		return mptcp_sk(mptcp_subflow_ctx(sk)->conn);
+	struct mptcp_subflow_context *ctx;
+	struct sock *conn;
+
+	if (sk && sk_fullsock(sk) && sk_is_tcp(sk) && sk_is_mptcp(sk)) {
+		ctx = rcu_dereference_check(inet_csk(sk)->icsk_ulp_data,
+					    lockdep_sock_is_held(sk));
+		if (ctx) {
+			conn = READ_ONCE(ctx->conn);
+			if (conn)
+				return mptcp_sk(conn);
+		}
+	}
 
 	return NULL;
 }
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index a4f7e99b30db..fa85ad0dc5a8 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3177,6 +3177,11 @@ static void __mptcp_init_sock(struct sock *sk)
 	mptcp_pm_data_init(msk);
 	spin_lock_init(&msk->fallback_lock);
 
+	/* the returned parent msk may be handed to lockless readers
+	 * (bpf_skc_to_mptcp_sock()); free it after an RCU grace period
+	 */
+	sock_set_flag(sk, SOCK_RCU_FREE);
+
 	/* re-use the csk retrans timer for MPTCP-level retrans */
 	timer_setup(&sk->mptcp_retransmit_timer, mptcp_retransmit_timer, 0);
 	timer_setup(&msk->sk.mptcp_tout_timer, mptcp_tout_timer, 0);
@@ -3712,7 +3717,6 @@ struct sock *mptcp_sk_clone_init(const struct sock *sk,
 	/* passive msk is created after the first/MPC subflow */
 	msk->subflow_id = 2;
 
-	sock_reset_flag(nsk, SOCK_RCU_FREE);
 	security_inet_csk_clone(nsk, req);
 
 	/* this can't race with mptcp_close(), as the msk is
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 8e386899ceb9..8eefd64dcf7d 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -920,7 +920,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk,
 
 			/* move the msk reference ownership to the subflow */
 			subflow_req->msk = NULL;
-			ctx->conn = (struct sock *)owner;
+			WRITE_ONCE(ctx->conn, (struct sock *)owner);
 
 			if (subflow_use_different_sport(owner, sk)) {
 				pr_debug("ack inet_sport=%d %d\n",
@@ -1827,7 +1827,7 @@ int mptcp_subflow_create_socket(struct sock *sk, unsigned short family,
 
 	*new_sock = sf;
 	sock_hold(sk);
-	subflow->conn = sk;
+	WRITE_ONCE(subflow->conn, sk);
 	mptcp_subflow_ops_override(sf->sk);
 
 	return 0;
@@ -2024,6 +2024,11 @@ static void subflow_ulp_release(struct sock *ssk)
 		if (!release && !test_and_set_bit(MPTCP_WORK_CLOSE_SUBFLOW,
 						  &mptcp_sk(sk)->flags))
 			mptcp_schedule_work(sk);
+
+		/* hide the parent from lockless readers (e.g. BPF) before
+		 * dropping the subflow-owned reference
+		 */
+		WRITE_ONCE(ctx->conn, NULL);
 		sock_put(sk);
 	}
 
-- 
2.43.0
Re: [PATCH mptcp-net v2] mptcp: bpf: fix NULL deref and UAF in bpf_mptcp_sock_from_subflow()
Posted by MPTCP CI 1 week, 1 day ago
Hi Kalpan,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal (except selftest_mptcp_join): Unstable: 1 failed test(s): selftest_simult_flows ⚠️ 
- KVM Validation: normal (only selftest_mptcp_join): Success! ✅
- KVM Validation: debug (except selftest_mptcp_join): Success! ✅
- KVM Validation: debug (only selftest_mptcp_join): Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/28240266592

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/712483977074
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=1117109


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts that have been already done to have a
stable tests suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)