[v9] l2tp: fix double dst_release() on sk_dst_cache race

[PATCH net-next v9] l2tp: fix double dst_release() on sk_dst_cache race

Posted by Mikhail Lobanov 6 days, 6 hours ago

A reproducible rcuref - imbalanced put() warning is observed under
IPv6 L2TP (pppol2tp) traffic with blackhole routes, indicating an
imbalance in dst reference counting for routes cached in
sk->sk_dst_cache and pointing to a subtle lifetime/synchronization
issue between the helpers that validate and drop cached dst entries.

rcuref - imbalanced put()
WARNING: CPU: 0 PID: 899 at lib/rcuref.c:266 rcuref_put_slowpath+0x1ce/0x240 lib/rcuref.c:266
Call Trace:
 <TASK>
 dst_release+0x291/0x310 net/core/dst.c:167
 __sk_dst_check+0x2d4/0x350 net/core/sock.c:604
 __inet6_csk_dst_check net/ipv6/inet6_connection_sock.c:76 [inline]
 inet6_csk_route_socket+0x6ed/0x10c0 net/ipv6/inet6_connection_sock.c:104
 inet6_csk_xmit+0x12f/0x740 net/ipv6/inet6_connection_sock.c:121
 l2tp_xmit_queue net/l2tp/l2tp_core.c:1214 [inline]
 l2tp_xmit_core net/l2tp/l2tp_core.c:1309 [inline]
 l2tp_xmit_skb+0x1404/0x1910 net/l2tp/l2tp_core.c:1325
 pppol2tp_sendmsg+0x3ca/0x550 net/l2tp/l2tp_ppp.c:302
 __sys_sendmmsg+0x188/0x450 net/socket.c:2749
 __x64_sys_sendmmsg+0x98/0x100 net/socket.c:2775
 do_syscall_64+0x64/0x140 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
 </TASK>

The race occurs between the lockless UDPv6 transmit path
(udpv6_sendmsg() -> sk_dst_check()) and the locked L2TP/pppol2tp
transmit path (pppol2tp_sendmsg() -> l2tp_xmit_skb() ->
... -> inet6_csk_xmit() -> __sk_dst_check()), when both handle
the same obsolete dst from sk->sk_dst_cache: the UDPv6 side takes
an extra reference and atomically steals and releases the cached
dst, while the L2TP side, using a stale cached pointer, still
calls dst_release() on it, and together these updates produce
an extra final dst_release() on that dst, triggering
rcuref - imbalanced put().

The Race Condition:

Initial:
  sk->sk_dst_cache = dst
  ref(dst) = 1

Thread 1: sk_dst_check()                Thread 2: __sk_dst_check()
------------------------               ---------------------------
sk_dst_get(sk):
  rcu_read_lock()
  dst = rcu_dereference(sk->sk_dst_cache)
  rcuref_get(dst) succeeds
  rcu_read_unlock()
  // ref = 2

                                            dst = __sk_dst_check()
                                    // reads same dst from sk->sk_dst_cache
                                    // ref still = 2 (no explicit get)

[both see dst obsolete & check() == NULL]

sk_dst_reset(sk):
  old = xchg(&sk->sk_dst_cache, NULL)
    // old = dst
  dst_release(old)
    // drop cached ref
    // ref: 2 -> 1

                                  RCU_INIT_POINTER(sk->sk_dst_cache, NULL)
                                  // cache already NULL after xchg
                                            dst_release(dst)
                                              // ref: 1 -> 0

  dst_release(dst)
  // tries to drop its own ref after final put
  // rcuref_put_slowpath() -> "rcuref - imbalanced put()"

Make L2TP's IPv6 transmit path stop using inet6_csk_xmit()
(and thus __sk_dst_check()) and instead open-code the same
routing and transmit sequence using ip6_sk_dst_lookup_flow()
and ip6_xmit(). The new code builds a flowi6 from the socket
fields in the same way as inet6_csk_route_socket(), then calls
ip6_sk_dst_lookup_flow(), which internally relies on the lockless
sk_dst_check()/sk_dst_reset() pattern shared with UDPv6, and
attaches the resulting dst to the skb before invoking ip6_xmit().
This makes both the UDPv6 and L2TP IPv6 paths use the same
dst-cache handling logic for a given socket and removes the
possibility that sk_dst_check() and __sk_dst_check() concurrently
drop the same cached dst and trigger the rcuref - imbalanced put()
warning under concurrent traffic.

Use a helper to pre-route IPv4 L2TP packets via sk_dst_check()
and ip_route_output_ports(), attach the resulting dst to the
skb, and then hand the skb to ip_queue_xmit(). With skb->dst
already set, __ip_queue_xmit() skips its __sk_dst_check()-based
dst cache handling, so IPv4 L2TP uses the same lockless
sk_dst_check() helper as UDPv4 for a given socket. This avoids
mixed sk_dst_check()/__sk_dst_check() users of sk->sk_dst_cache
and closes the same class of double dst_release() race on IPv4.

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Fixes: b0270e91014d ("ipv4: add a sock pointer to ip_queue_xmit()")
Signed-off-by: Mikhail Lobanov <m.lobanov@rosa.ru>
---
Changes in v9:
 - Rebase on net-next; no functional change to the fix.
 - The kmemleak reported by CI on v8 is the AEAD transform of an IPsec
   ESP SA, allocated by the XFRM control plane (xfrm_add_sa ->
   __xfrm_init_state -> esp_init_state -> esp_init_aead, comm "ip") and
   freed by the xfrm_state destructor esp_destroy() -> crypto_free_aead().
   This change only rewrites the L2TP data-plane transmit path and never
   allocates, frees, or changes the refcounting of that object, so it is
   not the source of the leak. Not reproducible here (KASAN +
   DEBUG_KMEMLEAK; adding/removing rfc4106(gcm(aes)) ESP SAs in a loop ->
   0 unreferenced objects) nor by Li Xiasong. Detailed analysis in reply
   to the v8 thread.

v8: https://lore.kernel.org/netdev/20251215145537.5085-1-m.lobanov@rosa.ru/

 net/l2tp/l2tp_core.c | 101 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 98 insertions(+), 3 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 0710281dd95a..342e65db6eb8 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1202,19 +1202,114 @@ static int l2tp_build_l2tpv3_header(struct l2tp_session *session, void *buf)
 	return bufp - optr;
 }

+#if IS_ENABLED(CONFIG_IPV6)
+static int l2tp_xmit_ipv6(struct sock *sk, struct sk_buff *skb)
+{
+	struct ipv6_pinfo *np = inet6_sk(sk);
+	struct inet_sock *inet = inet_sk(sk);
+	struct in6_addr *final_p, final;
+	struct ipv6_txoptions *opt;
+	struct dst_entry *dst;
+	struct flowi6 fl6;
+	int err;
+
+	memset(&fl6, 0, sizeof(fl6));
+	fl6.flowi6_proto = sk->sk_protocol;
+	fl6.daddr        = sk->sk_v6_daddr;
+	fl6.saddr        = np->saddr;
+	fl6.flowlabel    = np->flow_label;
+	IP6_ECN_flow_xmit(sk, fl6.flowlabel);
+
+	fl6.flowi6_oif   = READ_ONCE(sk->sk_bound_dev_if);
+	fl6.flowi6_mark  = READ_ONCE(sk->sk_mark);
+	fl6.fl6_sport    = inet->inet_sport;
+	fl6.fl6_dport    = inet->inet_dport;
+	fl6.flowi6_uid   = sk_uid(sk);
+
+	security_sk_classify_flow(sk, flowi6_to_flowi_common(&fl6));
+
+	rcu_read_lock();
+	opt = rcu_dereference(np->opt);
+	final_p = fl6_update_dst(&fl6, opt, &final);
+
+	dst = ip6_sk_dst_lookup_flow(sk, &fl6, final_p, true);
+	if (IS_ERR(dst)) {
+		rcu_read_unlock();
+		kfree_skb(skb);
+		return NET_XMIT_DROP;
+	}
+
+	skb_dst_set(skb, dst);
+	fl6.daddr = sk->sk_v6_daddr;
+
+	err = ip6_xmit(sk, skb, &fl6, READ_ONCE(sk->sk_mark),
+		       opt, np->tclass,
+		       READ_ONCE(sk->sk_priority));
+	rcu_read_unlock();
+	return err;
+}
+#endif
+
+static int l2tp_xmit_ipv4(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct net *net = sock_net(sk);
+	struct ip_options_rcu *inet_opt;
+	struct flowi4 *fl4;
+	struct rtable *rt;
+	__u8 tos;
+	int err;
+
+	rcu_read_lock();
+	inet_opt = rcu_dereference(inet->inet_opt);
+	fl4 = &fl->u.ip4;
+	tos = READ_ONCE(inet->tos);
+
+	rt = dst_rtable(sk_dst_check(sk, 0));
+	if (!rt) {
+		__be32 daddr = inet->inet_daddr;
+
+		if (inet_opt && inet_opt->opt.srr)
+			daddr = inet_opt->opt.faddr;
+
+		rt = ip_route_output_ports(net, fl4, sk,
+					   daddr, inet->inet_saddr,
+					   inet->inet_dport,
+					   inet->inet_sport,
+					   sk->sk_protocol,
+					   tos & INET_DSCP_MASK,
+					   READ_ONCE(sk->sk_bound_dev_if));
+		if (IS_ERR(rt)) {
+			rcu_read_unlock();
+			IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+			kfree_skb_reason(skb, SKB_DROP_REASON_IP_OUTNOROUTES);
+			return -EHOSTUNREACH;
+		}
+
+		sk_setup_caps(sk, &rt->dst);
+	}
+
+	skb_dst_set_noref(skb, &rt->dst);
+	rcu_read_unlock();
+
+	err = ip_queue_xmit(sk, skb, fl);
+	return err;
+}
+
 /* Queue the packet to IP for output: tunnel socket lock must be held */
 static int l2tp_xmit_queue(struct l2tp_tunnel *tunnel, struct sk_buff *skb, struct flowi *fl)
 {
 	int err;
+	struct sock *sk = tunnel->sock;

 	skb->ignore_df = 1;
 	skb_dst_drop(skb);
 #if IS_ENABLED(CONFIG_IPV6)
-	if (l2tp_sk_is_v6(tunnel->sock))
-		err = inet6_csk_xmit(tunnel->sock, skb, NULL);
+	if (l2tp_sk_is_v6(sk))
+		err = l2tp_xmit_ipv6(sk, skb);
 	else
 #endif
-		err = ip_queue_xmit(tunnel->sock, skb, fl);
+		err = l2tp_xmit_ipv4(sk, skb, fl);

 	return err >= 0 ? NET_XMIT_SUCCESS : NET_XMIT_DROP;
 }
--
2.43.0

Re: [PATCH net-next v9] l2tp: fix double dst_release() on sk_dst_cache race

Posted by Paolo Abeni 3 days, 14 hours ago

On 6/1/26 11:33 PM, Mikhail Lobanov wrote:
> +static int l2tp_xmit_ipv4(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
> +{
> +	struct inet_sock *inet = inet_sk(sk);
> +	struct net *net = sock_net(sk);
> +	struct ip_options_rcu *inet_opt;

Since a repost is needed, please additionaly fix the variable
declaration order above ...

> +	struct flowi4 *fl4;
> +	struct rtable *rt;
> +	__u8 tos;
> +	int err;
> +
> +	rcu_read_lock();
> +	inet_opt = rcu_dereference(inet->inet_opt);
> +	fl4 = &fl->u.ip4;
> +	tos = READ_ONCE(inet->tos);
> +
> +	rt = dst_rtable(sk_dst_check(sk, 0));
> +	if (!rt) {
> +		__be32 daddr = inet->inet_daddr;
> +
> +		if (inet_opt && inet_opt->opt.srr)
> +			daddr = inet_opt->opt.faddr;
> +
> +		rt = ip_route_output_ports(net, fl4, sk,
> +					   daddr, inet->inet_saddr,
> +					   inet->inet_dport,
> +					   inet->inet_sport,
> +					   sk->sk_protocol,
> +					   tos & INET_DSCP_MASK,
> +					   READ_ONCE(sk->sk_bound_dev_if));
> +		if (IS_ERR(rt)) {
> +			rcu_read_unlock();
> +			IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
> +			kfree_skb_reason(skb, SKB_DROP_REASON_IP_OUTNOROUTES);
> +			return -EHOSTUNREACH;
> +		}
> +
> +		sk_setup_caps(sk, &rt->dst);
> +	}
> +
> +	skb_dst_set_noref(skb, &rt->dst);
> +	rcu_read_unlock();
> +
> +	err = ip_queue_xmit(sk, skb, fl);
> +	return err;
> +}
> +
>  /* Queue the packet to IP for output: tunnel socket lock must be held */
>  static int l2tp_xmit_queue(struct l2tp_tunnel *tunnel, struct sk_buff *skb, struct flowi *fl)
>  {
>  	int err;
> +	struct sock *sk = tunnel->sock;

... and here, too.

Thanks,

Paolo

Re: [PATCH net-next v9] l2tp: fix double dst_release() on sk_dst_cache race

Posted by Paolo Abeni 3 days, 14 hours ago

From: AI Reviewer <ai@example.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
l2tp: fix double dst_release() on sk_dst_cache race

This commit reworks the L2TP transmit path to stop using
inet6_csk_xmit() for IPv6 and to pre-route IPv4 packets via a new
helper, with the goal of avoiding a double dst_release() race between
sk_dst_check() and __sk_dst_check() on sk->sk_dst_cache observed under
pppol2tp traffic. New helpers l2tp_xmit_ipv6() and l2tp_xmit_ipv4()
replace the former inet6_csk_xmit() / ip_queue_xmit() calls in
l2tp_xmit_queue().

A few questions on the new helpers below.

> diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
> index f940914959b1..54e837fec201 100644
> --- a/net/l2tp/l2tp_core.c
> +++ b/net/l2tp/l2tp_core.c
> @@ -1205,19 +1205,114 @@ static int l2tp_build_l2tpv3_header(struct l2tp_session *session, void *buf)
>  	return bufp - optr;
>  }
>  
> +#if IS_ENABLED(CONFIG_IPV6)
> +static int l2tp_xmit_ipv6(struct sock *sk, struct sk_buff *skb)
> +{
> +	struct ipv6_pinfo *np = inet6_sk(sk);
> +	struct inet_sock *inet = inet_sk(sk);
> +	struct in6_addr *final_p, final;
> +	struct ipv6_txoptions *opt;
> +	struct dst_entry *dst;
> +	struct flowi6 fl6;
> +	int err;

[ ... ]

> +	dst = ip6_sk_dst_lookup_flow(sk, &fl6, final_p, true);
> +	if (IS_ERR(dst)) {
> +		rcu_read_unlock();
> +		kfree_skb(skb);
> +		return NET_XMIT_DROP;
> +	}

[Medium]
Should this error path return a negative errno rather than
NET_XMIT_DROP?

l2tp_xmit_queue() converts the result of this function with:

	return err >= 0 ? NET_XMIT_SUCCESS : NET_XMIT_DROP;

NET_XMIT_DROP is 0x01 (a positive value), so the wrapper takes the
err >= 0 branch and returns NET_XMIT_SUCCESS even though the skb has
already been freed via kfree_skb(). l2tp_xmit_skb() then bumps
session/tunnel tx_packets and tx_bytes (rather than tx_errors) for a
packet that never went out.

The IPv4 sibling l2tp_xmit_ipv4() returns -EHOSTUNREACH on the
analogous failure, and the inet6_csk_xmit() being replaced returned
PTR_ERR(dst) (negative) here. Returning PTR_ERR(dst) or -EHOSTUNREACH
would let the wrapper map this to NET_XMIT_DROP and have
l2tp_xmit_skb() count it as an error.

> +
> +	skb_dst_set(skb, dst);
> +	fl6.daddr = sk->sk_v6_daddr;
> +
> +	err = ip6_xmit(sk, skb, &fl6, READ_ONCE(sk->sk_mark),
> +		       opt, np->tclass,
> +		       READ_ONCE(sk->sk_priority));
> +	rcu_read_unlock();
> +	return err;
> +}
> +#endif
> +
> +static int l2tp_xmit_ipv4(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
> +{
> +	struct inet_sock *inet = inet_sk(sk);
> +	struct net *net = sock_net(sk);
> +	struct ip_options_rcu *inet_opt;
> +	struct flowi4 *fl4;
> +	struct rtable *rt;
> +	__u8 tos;
> +	int err;
> +
> +	rcu_read_lock();
> +	inet_opt = rcu_dereference(inet->inet_opt);
> +	fl4 = &fl->u.ip4;
> +	tos = READ_ONCE(inet->tos);
> +
> +	rt = dst_rtable(sk_dst_check(sk, 0));

[High]
Does the cache-hit path leak a reference on the cached dst?

sk_dst_check() ultimately calls sk_dst_get(), which does
rcuref_get(&dst->__rcuref) and returns a dst with a caller-owned +1
reference.

> +	if (!rt) {
> +		__be32 daddr = inet->inet_daddr;
> +
> +		if (inet_opt && inet_opt->opt.srr)
> +			daddr = inet_opt->opt.faddr;
> +
> +		rt = ip_route_output_ports(net, fl4, sk,
> +					   daddr, inet->inet_saddr,
> +					   inet->inet_dport,
> +					   inet->inet_sport,
> +					   sk->sk_protocol,
> +					   tos & INET_DSCP_MASK,
> +					   READ_ONCE(sk->sk_bound_dev_if));
> +		if (IS_ERR(rt)) {
> +			rcu_read_unlock();
> +			IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
> +			kfree_skb_reason(skb, SKB_DROP_REASON_IP_OUTNOROUTES);
> +			return -EHOSTUNREACH;
> +		}
> +
> +		sk_setup_caps(sk, &rt->dst);
> +	}
> +
> +	skb_dst_set_noref(skb, &rt->dst);
> +	rcu_read_unlock();
> +
> +	err = ip_queue_xmit(sk, skb, fl);
> +	return err;
> +}

[High]
Continuing on the cache-hit path: skb_dst_set_noref() stores the
pointer with SKB_DST_NOREF and explicitly does not consume a
reference, and ip_queue_xmit()/__ip_queue_xmit() only drop noref-
flagged dsts via skb_dst_drop()/refdst_drop without releasing a
reference. The +1 reference returned by sk_dst_check() is never put.

For comparison, net/ipv4/udp.c:udp_sendmsg() uses the same idiom and
explicitly calls ip_rt_put(rt) at its out: label to release this
reference. The previous behaviour relied on __ip_queue_xmit() using
__sk_dst_check() (the no-ref variant), which avoids this issue. Should
this helper also call ip_rt_put(rt) (or equivalent) on its way out?

Note that sk_dst_reset() drops only the cache's own ref, so the
leaked refs persist past socket close and the cached dst is never
freed, which appears to make this an unbounded leak under user-driven
sendmsg loops.

[Critical]
On the cache-miss path, can the rtable be freed before
__ip_queue_xmit() dereferences it?

ip_route_output_ports() returns rt with a +1 reference. sk_setup_caps()
calls sk_dst_set(), which xchg's sk_dst_cache and dst_releases the old
entry but does not take a new reference on the new rt; the caller's +1
ref is consumed by being installed into the cache slot.

After sk_setup_caps(), rt's only liveness anchor is sk->sk_dst_cache.
The function then does:

	skb_dst_set_noref(skb, &rt->dst);
	rcu_read_unlock();

	err = ip_queue_xmit(sk, skb, fl);

__ip_queue_xmit() takes its own rcu_read_lock(), starting a new
critical section. In the gap between rcu_read_unlock() here and the
new rcu_read_lock() inside __ip_queue_xmit(), an RCU grace period can
complete on this CPU.

A concurrent sk_dst_reset() (from a route update, sockopt, or another
lockless caller observing the dst as obsolete, similar to the
udpv6_sendmsg pattern described in the commit message) can xchg
sk_dst_cache to NULL and dst_release(rt), dropping the refcount 1->0
and queueing the rtable for free via call_rcu_hurry / dst_destroy_rcu.
If the grace period elapses in the rcu_read_unlock window, the rtable
is freed before __ip_queue_xmit() dereferences skb_rtable(skb)
(rt_uses_gateway, rt->dst, ...) at the packet_routed label.

skb_dst_set_noref() requires an unbroken RCU read-side critical
section spanning the entire skb consumption, which is the invariant
__ip_queue_xmit() itself maintains (its rcu_read_lock() spans
skb_dst_set_noref through ip_local_out). Should this helper either
hold rcu_read_lock() across the ip_queue_xmit() call, or take a real
reference (skb_dst_set() with dst_hold(), or skip the noref path) so
that rt cannot be freed during the handoff?
-- 
This is an AI-generated review.