unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
the socket is properly set up, which would include having a defined peer.
IOW, there's a window when unix_stream_bpf_update_proto() can be called on
a socket which still has unix_peer(sk) == NULL.
T0 bpf                                  T1 connect
------                                  ----------

                                        WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
...
sk_pair = unix_peer(sk)
sock_hold(sk_pair)
                                        sock_hold(newsk)
                                        smp_mb__after_atomic()
                                        unix_peer(sk) = newsk
BUG: kernel NULL pointer dereference, address: 0000000000000080
RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
Call Trace:
sock_map_link+0x564/0x8b0
sock_map_update_common+0x6e/0x340
sock_map_update_elem_sys+0x17d/0x240
__sys_bpf+0x26db/0x3250
__x64_sys_bpf+0x21/0x30
do_syscall_64+0x6b/0x3a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The initial idea was to move the peer assignment _before_ the sk_state
update[1], but that involved an additional memory barrier, and changing the
hot path was rejected. Then a check during the proto update was
considered[2], but a follow-up discussion[3] concluded the root cause is
sockmap taking a wrong lock. Or, more specifically, an insufficient one[4].
Thus, teach sockmap about the af_unix-specific locking: af_unix protects its
critical sections with unix_state_lock(), which operates on unix_sock::lock.
[1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
[2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
[3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
[4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 7ba6a7f24ccd..6109fbe6f99c 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -12,6 +12,7 @@
#include <linux/list.h>
#include <linux/jhash.h>
#include <linux/sock_diag.h>
+#include <net/af_unix.h>
#include <net/udp.h>
struct bpf_stab {
@@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
}
static void sock_map_sk_acquire(struct sock *sk)
- __acquires(&sk->sk_lock.slock)
{
lock_sock(sk);
+
+ if (sk_is_unix(sk))
+ unix_state_lock(sk);
+
rcu_read_lock();
}
static void sock_map_sk_release(struct sock *sk)
- __releases(&sk->sk_lock.slock)
{
rcu_read_unlock();
+
+ if (sk_is_unix(sk))
+ unix_state_unlock(sk);
+
release_sock(sk);
}
+static inline void sock_map_sk_acquire_fast(struct sock *sk)
+{
+ local_bh_disable();
+ bh_lock_sock(sk);
+
+ if (sk_is_unix(sk))
+ unix_state_lock(sk);
+}
+
+static inline void sock_map_sk_release_fast(struct sock *sk)
+{
+ if (sk_is_unix(sk))
+ unix_state_unlock(sk);
+
+ bh_unlock_sock(sk);
+ local_bh_enable();
+}
+
static void sock_map_add_link(struct sk_psock *psock,
struct sk_psock_link *link,
struct bpf_map *map, void *link_raw)
@@ -604,16 +629,14 @@ static long sock_map_update_elem(struct bpf_map *map, void *key,
if (!sock_map_sk_is_suitable(sk))
return -EOPNOTSUPP;
- local_bh_disable();
- bh_lock_sock(sk);
+ sock_map_sk_acquire_fast(sk);
if (!sock_map_sk_state_allowed(sk))
ret = -EOPNOTSUPP;
else if (map->map_type == BPF_MAP_TYPE_SOCKMAP)
ret = sock_map_update_common(map, *(u32 *)key, sk, flags);
else
ret = sock_hash_update_common(map, key, sk, flags);
- bh_unlock_sock(sk);
- local_bh_enable();
+ sock_map_sk_release_fast(sk);
return ret;
}
--
2.52.0
On 3/6/26 7:30 AM, Michal Luczaj wrote:
> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
> socket is properly set up, which would include having a defined peer. IOW,
> there's a window when unix_stream_bpf_update_proto() can be called on
> socket which still has unix_peer(sk) == NULL.
>
> T0 bpf T1 connect
> ------ ----------
>
> WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
> sock_map_sk_state_allowed(sk)
> ...
> sk_pair = unix_peer(sk)
> sock_hold(sk_pair)
> sock_hold(newsk)
> smp_mb__after_atomic()
> unix_peer(sk) = newsk
>
> BUG: kernel NULL pointer dereference, address: 0000000000000080
> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
> Call Trace:
> sock_map_link+0x564/0x8b0
> sock_map_update_common+0x6e/0x340
> sock_map_update_elem_sys+0x17d/0x240
> __sys_bpf+0x26db/0x3250
> __x64_sys_bpf+0x21/0x30
> do_syscall_64+0x6b/0x3a0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Initial idea was to move peer assignment _before_ the sk_state update[1],
> but that involved an additional memory barrier, and changing the hot path
> was rejected. Then a check during proto update was considered[2], but a
> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
> lock. Or, more specifically, an insufficient lock[4].
>
> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
> critical sections under unix_state_lock() operating on unix_sock::lock.
>
> [1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
> [2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
> [3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
>
> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
> net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
> 1 file changed, 29 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 7ba6a7f24ccd..6109fbe6f99c 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -12,6 +12,7 @@
> #include <linux/list.h>
> #include <linux/jhash.h>
> #include <linux/sock_diag.h>
> +#include <net/af_unix.h>
> #include <net/udp.h>
>
> struct bpf_stab {
> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
> }
>
> static void sock_map_sk_acquire(struct sock *sk)
> - __acquires(&sk->sk_lock.slock)
> {
> lock_sock(sk);
> +
> + if (sk_is_unix(sk))
> + unix_state_lock(sk);
> +
> rcu_read_lock();
> }
>
This introduces a new ordering constraint: lock_sock() before
unix_state_lock(). Kuniyuki flagged in the v2 review that taking
lock_sock() inside unix_state_lock() in the future would create an
ABBA deadlock, which is exactly why the order was settled this way. However,
the thread did not reach a conclusion on whether that constraint should be
documented in the code.
Since there is nothing enforcing this ordering mechanically, a brief comment
at sock_map_sk_acquire() would help future readers avoid introducing the
inverse ordering.
> static void sock_map_sk_release(struct sock *sk)
> - __releases(&sk->sk_lock.slock)
> {
> rcu_read_unlock();
> +
> + if (sk_is_unix(sk))
> + unix_state_unlock(sk);
> +
> release_sock(sk);
> }
>
> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
> +{
> + local_bh_disable();
> + bh_lock_sock(sk);
> +
> + if (sk_is_unix(sk))
> + unix_state_lock(sk);
> +}
> +
v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
sockets took only unix_state_lock() in the fast path, skipping
bh_lock_sock() entirely:
/* v2 */
if (sk_is_unix(sk))
unix_state_lock(sk);
else
bh_lock_sock(sk);
v3 takes both for AF_UNIX sockets.
bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
actually needs protection here — sk_state and unix_peer() — lives under
unix_sock::lock. Since unix_state_lock() is already sufficient to close
the race against unix_stream_connect(), is bh_lock_sock() still doing
anything useful for AF_UNIX sockets in this path?
The v3 changelog attributes "Keep lock_sock() along with unix_state_lock()"
to Kuniyuki, but that comment appears to concern sock_map_sk_acquire()
(the slow path); the fast path behaviour change was not explicitly discussed
in the v2 thread.
If bh_lock_sock() is redundant for AF_UNIX here, reverting to the v2
approach would reduce unnecessary lock contention. If it is still needed,
a comment explaining what it protects for AF_UNIX sockets would help.
> +static inline void sock_map_sk_release_fast(struct sock *sk)
> +{
> + if (sk_is_unix(sk))
> + unix_state_unlock(sk);
> +
> + bh_unlock_sock(sk);
> + local_bh_enable();
> +}
> +
> static void sock_map_add_link(struct sk_psock *psock,
> struct sk_psock_link *link,
> struct bpf_map *map, void *link_raw)
> @@ -604,16 +629,14 @@ static long sock_map_update_elem(struct bpf_map *map, void *key,
> if (!sock_map_sk_is_suitable(sk))
> return -EOPNOTSUPP;
>
> - local_bh_disable();
> - bh_lock_sock(sk);
> + sock_map_sk_acquire_fast(sk);
> if (!sock_map_sk_state_allowed(sk))
> ret = -EOPNOTSUPP;
> else if (map->map_type == BPF_MAP_TYPE_SOCKMAP)
> ret = sock_map_update_common(map, *(u32 *)key, sk, flags);
> else
> ret = sock_hash_update_common(map, key, sk, flags);
> - bh_unlock_sock(sk);
> - local_bh_enable();
> + sock_map_sk_release_fast(sk);
> return ret;
> }
>
>
On 3/6/26 06:01, Jiayuan Chen wrote:
>
> On 3/6/26 7:30 AM, Michal Luczaj wrote:
>> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
>> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
>> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
>> socket is properly set up, which would include having a defined peer. IOW,
>> there's a window when unix_stream_bpf_update_proto() can be called on
>> socket which still has unix_peer(sk) == NULL.
>>
>> T0 bpf T1 connect
>> ------ ----------
>>
>> WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
>> sock_map_sk_state_allowed(sk)
>> ...
>> sk_pair = unix_peer(sk)
>> sock_hold(sk_pair)
>> sock_hold(newsk)
>> smp_mb__after_atomic()
>> unix_peer(sk) = newsk
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000080
>> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
>> Call Trace:
>> sock_map_link+0x564/0x8b0
>> sock_map_update_common+0x6e/0x340
>> sock_map_update_elem_sys+0x17d/0x240
>> __sys_bpf+0x26db/0x3250
>> __x64_sys_bpf+0x21/0x30
>> do_syscall_64+0x6b/0x3a0
>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> Initial idea was to move peer assignment _before_ the sk_state update[1],
>> but that involved an additional memory barrier, and changing the hot path
>> was rejected. Then a check during proto update was considered[2], but a
>> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
>> lock. Or, more specifically, an insufficient lock[4].
>>
>> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
>> critical sections under unix_state_lock() operating on unix_sock::lock.
>>
>> [1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
>> [2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
>> [3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
>> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
>>
>> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
>> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
>> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>> ---
>> net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
>> 1 file changed, 29 insertions(+), 6 deletions(-)
>>
>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>> index 7ba6a7f24ccd..6109fbe6f99c 100644
>> --- a/net/core/sock_map.c
>> +++ b/net/core/sock_map.c
>> @@ -12,6 +12,7 @@
>> #include <linux/list.h>
>> #include <linux/jhash.h>
>> #include <linux/sock_diag.h>
>> +#include <net/af_unix.h>
>> #include <net/udp.h>
>>
>> struct bpf_stab {
>> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
>> }
>>
>> static void sock_map_sk_acquire(struct sock *sk)
>> - __acquires(&sk->sk_lock.slock)
>> {
>> lock_sock(sk);
>> +
>> + if (sk_is_unix(sk))
>> + unix_state_lock(sk);
>> +
>> rcu_read_lock();
>> }
>>
>
> This introduces a new ordering constraint: lock_sock() before
> unix_state_lock(). Kuniyuki flagged in the v2 review that taking
> lock_sock() inside unix_state_lock() in the future would create an
> ABBA deadlock, which is exactly why the order was settled this way. However,
> the thread did not reach a conclusion on whether that constraint should be
> documented in the code.
>
> Since there is nothing enforcing this ordering mechanically, a brief comment
> at sock_map_sk_acquire() would help future readers avoid introducing the
> inverse ordering.
Sure, will do.
>> static void sock_map_sk_release(struct sock *sk)
>> - __releases(&sk->sk_lock.slock)
>> {
>> rcu_read_unlock();
>> +
>> + if (sk_is_unix(sk))
>> + unix_state_unlock(sk);
>> +
>> release_sock(sk);
>> }
>>
>> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
>> +{
>> + local_bh_disable();
>> + bh_lock_sock(sk);
>> +
>> + if (sk_is_unix(sk))
>> + unix_state_lock(sk);
>> +}
>> +
>
>
> v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
> sockets took only unix_state_lock() in the fast path, skipping
> bh_lock_sock() entirely:
>
> /* v2 */
> if (sk_is_unix(sk))
> unix_state_lock(sk);
> else
> bh_lock_sock(sk);
>
> v3 takes both for AF_UNIX sockets.
>
> bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
> actually needs protection here — sk_state and unix_peer() — lives under
> unix_sock::lock. Since unix_state_lock() is already sufficient to close
> the race against unix_stream_connect(), is bh_lock_sock() still doing
> anything useful for AF_UNIX sockets in this path?
Yeah, good point. I think I was just trying to keep the _fast variant
aligned with the sleepable one, which I agree might be unnecessary.
In a parallel thread I've asked Kuniyuki if it might be better to
completely drop patch 2/5, which would change how we interact with
sock_map_close(). Let's see how it goes.
On 3/6/26 6:09 AM, Michal Luczaj wrote:
> On 3/6/26 06:01, Jiayuan Chen wrote:
>>
>> On 3/6/26 7:30 AM, Michal Luczaj wrote:
>>> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
>>> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
>>> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
>>> socket is properly set up, which would include having a defined peer. IOW,
>>> there's a window when unix_stream_bpf_update_proto() can be called on
>>> socket which still has unix_peer(sk) == NULL.
>>>
>>> T0 bpf T1 connect
>>> ------ ----------
>>>
>>> WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
>>> sock_map_sk_state_allowed(sk)
>>> ...
>>> sk_pair = unix_peer(sk)
>>> sock_hold(sk_pair)
>>> sock_hold(newsk)
>>> smp_mb__after_atomic()
>>> unix_peer(sk) = newsk
>>>
>>> BUG: kernel NULL pointer dereference, address: 0000000000000080
>>> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
>>> Call Trace:
>>> sock_map_link+0x564/0x8b0
>>> sock_map_update_common+0x6e/0x340
>>> sock_map_update_elem_sys+0x17d/0x240
>>> __sys_bpf+0x26db/0x3250
>>> __x64_sys_bpf+0x21/0x30
>>> do_syscall_64+0x6b/0x3a0
>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> Initial idea was to move peer assignment _before_ the sk_state update[1],
>>> but that involved an additional memory barrier, and changing the hot path
>>> was rejected. Then a check during proto update was considered[2], but a
>>> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
>>> lock. Or, more specifically, an insufficient lock[4].
>>>
>>> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
>>> critical sections under unix_state_lock() operating on unix_sock::lock.
>>>
>>> [1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
>>> [2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
>>> [3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
>>> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
>>>
>>> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
>>> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
>>> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
>>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>>> ---
>>> net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
>>> 1 file changed, 29 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>>> index 7ba6a7f24ccd..6109fbe6f99c 100644
>>> --- a/net/core/sock_map.c
>>> +++ b/net/core/sock_map.c
>>> @@ -12,6 +12,7 @@
>>> #include <linux/list.h>
>>> #include <linux/jhash.h>
>>> #include <linux/sock_diag.h>
>>> +#include <net/af_unix.h>
>>> #include <net/udp.h>
>>>
>>> struct bpf_stab {
>>> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
>>> }
>>>
>>> static void sock_map_sk_acquire(struct sock *sk)
>>> - __acquires(&sk->sk_lock.slock)
>>> {
>>> lock_sock(sk);
>>> +
>>> + if (sk_is_unix(sk))
>>> + unix_state_lock(sk);
>>> +
>>> rcu_read_lock();
>>> }
>>>
>>
>> This introduces a new ordering constraint: lock_sock() before
>> unix_state_lock(). Kuniyuki flagged in the v2 review that taking
>> lock_sock() inside unix_state_lock() in the future would create an
>> ABBA deadlock, which is exactly why the order was settled this way. However,
>> the thread did not reach a conclusion on whether that constraint should be
>> documented in the code.
>>
>> Since there is nothing enforcing this ordering mechanically, a brief comment
>> at sock_map_sk_acquire() would help future readers avoid introducing the
>> inverse ordering.
>
> Sure, will do.
>
>>> static void sock_map_sk_release(struct sock *sk)
>>> - __releases(&sk->sk_lock.slock)
>>> {
>>> rcu_read_unlock();
>>> +
>>> + if (sk_is_unix(sk))
>>> + unix_state_unlock(sk);
>>> +
>>> release_sock(sk);
>>> }
>>>
>>> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
>>> +{
>>> + local_bh_disable();
>>> + bh_lock_sock(sk);
>>> +
>>> + if (sk_is_unix(sk))
>>> + unix_state_lock(sk);
>>> +}
>>> +
>>
>>
>> v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
>> sockets took only unix_state_lock() in the fast path, skipping
>> bh_lock_sock() entirely:
>>
>> /* v2 */
>> if (sk_is_unix(sk))
>> unix_state_lock(sk);
>> else
>> bh_lock_sock(sk);
>>
>> v3 takes both for AF_UNIX sockets.
>>
>> bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
>> actually needs protection here — sk_state and unix_peer() — lives under
>> unix_sock::lock. Since unix_state_lock() is already sufficient to close
>> the race against unix_stream_connect(), is bh_lock_sock() still doing
>> anything useful for AF_UNIX sockets in this path?
>
> Yeah, good point. I think I was just trying to keep the _fast variant
> aligned with the sleepy one. Which I agree might be unnecessary.
I hope the common use case is not calling
bpf_map_update_elem(sockmap) only.
Besides, from looking at may_update_sockmap(), I don't know if it is
even doable (or useful) to bpf_map_update_elem(unix_sk) in
tc/flow_dissector/xdp. One possible path is SOCK_FILTER, when looking
at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
when bpf_map_update_elem(sockmap) support was added, iirc.
The only path left is bpf_iter, which I believe was the primary use case
when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
where lock_sock() has already been done. It is more for
reading-correctness though. This just came to my mind.
has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
been using it to conditionally take lock_sock() or not. [ I would still
keep patch 3 though. ]
[1]: https://lore.kernel.org/bpf/20200821102948.21918-1-lmb@cloudflare.com/
>
> In a parallel thread I've asked Kuniyuki if it might be better to
> completely drop patch 2/5, which would change how we interact with
> sock_map_close(). Lets see how it goes.
>
If patch 2 is dropped, lock_sock() is always needed for unix_sk?
On 3/10/26 23:20, Martin KaFai Lau wrote:
> On 3/6/26 6:09 AM, Michal Luczaj wrote:
>> On 3/6/26 06:01, Jiayuan Chen wrote:
>>>
>>> On 3/6/26 7:30 AM, Michal Luczaj wrote:
>>>> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
>>>> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
>>>> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
>>>> socket is properly set up, which would include having a defined peer. IOW,
>>>> there's a window when unix_stream_bpf_update_proto() can be called on
>>>> socket which still has unix_peer(sk) == NULL.
>>>>
>>>> T0 bpf T1 connect
>>>> ------ ----------
>>>>
>>>> WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
>>>> sock_map_sk_state_allowed(sk)
>>>> ...
>>>> sk_pair = unix_peer(sk)
>>>> sock_hold(sk_pair)
>>>> sock_hold(newsk)
>>>> smp_mb__after_atomic()
>>>> unix_peer(sk) = newsk
>>>>
>>>> BUG: kernel NULL pointer dereference, address: 0000000000000080
>>>> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
>>>> Call Trace:
>>>> sock_map_link+0x564/0x8b0
>>>> sock_map_update_common+0x6e/0x340
>>>> sock_map_update_elem_sys+0x17d/0x240
>>>> __sys_bpf+0x26db/0x3250
>>>> __x64_sys_bpf+0x21/0x30
>>>> do_syscall_64+0x6b/0x3a0
>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>
>>>> Initial idea was to move peer assignment _before_ the sk_state update[1],
>>>> but that involved an additional memory barrier, and changing the hot path
>>>> was rejected. Then a check during proto update was considered[2], but a
>>>> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
>>>> lock. Or, more specifically, an insufficient lock[4].
>>>>
>>>> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
>>>> critical sections under unix_state_lock() operating on unix_sock::lock.
>>>>
>>>> [1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
>>>> [2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
>>>> [3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
>>>> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
>>>>
>>>> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
>>>> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
>>>> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
>>>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>>>> ---
>>>> net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
>>>> 1 file changed, 29 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>>>> index 7ba6a7f24ccd..6109fbe6f99c 100644
>>>> --- a/net/core/sock_map.c
>>>> +++ b/net/core/sock_map.c
>>>> @@ -12,6 +12,7 @@
>>>> #include <linux/list.h>
>>>> #include <linux/jhash.h>
>>>> #include <linux/sock_diag.h>
>>>> +#include <net/af_unix.h>
>>>> #include <net/udp.h>
>>>>
>>>> struct bpf_stab {
>>>> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
>>>> }
>>>>
>>>> static void sock_map_sk_acquire(struct sock *sk)
>>>> - __acquires(&sk->sk_lock.slock)
>>>> {
>>>> lock_sock(sk);
>>>> +
>>>> + if (sk_is_unix(sk))
>>>> + unix_state_lock(sk);
>>>> +
>>>> rcu_read_lock();
>>>> }
>>>>
>>>
>>> This introduces a new ordering constraint: lock_sock() before
>>> unix_state_lock(). Kuniyuki flagged in the v2 review that taking
>>> lock_sock() inside unix_state_lock() in the future would create an
>>> ABBA deadlock, which is exactly why the order was settled this way. However,
>>> the thread did not reach a conclusion on whether that constraint should be
>>> documented in the code.
>>>
>>> Since there is nothing enforcing this ordering mechanically, a brief comment
>>> at sock_map_sk_acquire() would help future readers avoid introducing the
>>> inverse ordering.
>>
>> Sure, will do.
>>
>>>> static void sock_map_sk_release(struct sock *sk)
>>>> - __releases(&sk->sk_lock.slock)
>>>> {
>>>> rcu_read_unlock();
>>>> +
>>>> + if (sk_is_unix(sk))
>>>> + unix_state_unlock(sk);
>>>> +
>>>> release_sock(sk);
>>>> }
>>>>
>>>> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
>>>> +{
>>>> + local_bh_disable();
>>>> + bh_lock_sock(sk);
>>>> +
>>>> + if (sk_is_unix(sk))
>>>> + unix_state_lock(sk);
>>>> +}
>>>> +
>>>
>>>
>>> v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
>>> sockets took only unix_state_lock() in the fast path, skipping
>>> bh_lock_sock() entirely:
>>>
>>> /* v2 */
>>> if (sk_is_unix(sk))
>>> unix_state_lock(sk);
>>> else
>>> bh_lock_sock(sk);
>>>
>>> v3 takes both for AF_UNIX sockets.
>>>
>>> bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
>>> actually needs protection here — sk_state and unix_peer() — lives under
>>> unix_sock::lock. Since unix_state_lock() is already sufficient to close
>>> the race against unix_stream_connect(), is bh_lock_sock() still doing
>>> anything useful for AF_UNIX sockets in this path?
>>
>> Yeah, good point. I think I was just trying to keep the _fast variant
>> aligned with the sleepy one. Which I agree might be unnecessary.
>
> I hope the common use case should not be calling
> bpf_map_update_elem(sockmap) only.
>
> Beside, from looking at the may_update_sockmap(), I don't know if it is
> even doable (or useful) to bpf_map_update_elem(unix_sk) in
> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
> when the bpf_map_update_elem(sockmap) support was added iirc.
What about a situation when unix_sk is stored in a sockmap, then tc prog
looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
useful, but seems doable.
> The only path left is bpf_iter, which I believe was the primary use case
> when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
> to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
> where lock_sock() has already been done. It is more for
> reading-correctness though. This just came to my mind.
> has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
> been using it to conditionally take lock_sock() or not.
[ One clarification: bh_lock_sock() is a sock_map_update_elem() thing,
which can only be called by a bpf prog. IOW, has_current_bpf_ctx() is
always `true` in sock_map_update_elem(), right? ]
Let me know if I'm correctly rephrasing your idea: assume all bpf-context
callers hold the socket locked or keep it "stable" (meaning: "sk won't
surprise the sockmap update with some breaking state change coming from
another thread"). As you said, most bpf iters already take lock_sock(), and I
have a patch that fixes sock_{map,hash}_seq_show(). Then we could try
dropping that bh_lock_sock().
> [ I would still keep patch 3 though. ]
Right.
> [1]: https://lore.kernel.org/bpf/20200821102948.21918-1-lmb@cloudflare.com/
>
>>
>> In a parallel thread I've asked Kuniyuki if it might be better to
>> completely drop patch 2/5, which would change how we interact with
>> sock_map_close(). Lets see how it goes.
>>
>
> If patch 2 is dropped, lock_sock() is always needed for unix_sk?
For sock_map_update_elem_sys() I wanted to take lock_sock()+unix_state_lock(),
following Kuniyuki's suggestion to keep the locking pattern/order (which
repeats when a unix bpf iter prog invokes bpf_map_update_elem() ->
sock_map_update_elem()). For sock_map_update_elem(), no: we can't sleep there.
On 3/15/26 4:58 PM, Michal Luczaj wrote:
>> Beside, from looking at the may_update_sockmap(), I don't know if it is
>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
>> when the bpf_map_update_elem(sockmap) support was added iirc.
>
> What about a situation when unix_sk is stored in a sockmap, then tc prog
> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
> useful, but seems doable.
[ Sorry for the late reply ]
It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
from tc :(
Then unix_state_lock() in its current form cannot be safely acquired in
sock_map_update_elem(). It is currently a spin_lock() instead of
spin_lock_bh().
>
>> The only path left is bpf_iter, which I believe was the primary use case
>> when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
>> to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
>> where lock_sock() has already been done. It is more for
>> reading-correctness though. This just came to my mind.
>> has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
>> been using it to conditionally take lock_sock() or not.
>
> [ One clarification: bh_lock_sock() is a sock_map_update_elem() thing,
> which can only be called by a bpf prog. IOW, has_current_bpf_ctx() is
> always `true` in sock_map_update_elem(), right? ]
For all the bpf prog types allowed by may_update_sockmap() to do
bpf_map_update_elem(sockmap), only BPF_TRACE_ITER should have
has_current_bpf_ctx() == true. The tc prog (and others allowed in
may_update_sockmap()) will have has_current_bpf_ctx() == false when
calling sock_map_update_elem().
The tc case of bpf_map_update_elem(unix_sk) is unfortunate and requires
going back to the drawing board. I think checking unix_peer(sk) for NULL
without acquiring unix_state_lock() is needed for the
sock_map_update_elem() path, since changing unix_state_lock() for this
unknown use case seems overkill.
Whether sock_map_update_elem_"sys"() needs unix_state_lock() is up for
debate.
For bpf_iter_unix_seq_show(), one thought is to add unix_state_lock()
there before running the bpf iter prog. iiuc, it is what Kuniyuki has in
mind also to allow bpf iter prog having a stable view of unix_sock. This
could be a followup.
[fwiw, it was why I first thought of has_current_bpf_ctx() to avoid
sock_map_update_elem() taking unix_state_lock() again if
bpf_iter_unix_seq_show() acquires unix_state_lock() earlier. I later
concluded (but proved to be incorrect) that tc cannot call
bpf_map_update_elem(unix_sk).]
>
> Let me know if I'm correctly rephrasing your idea: assume all bpf-context
> callers hold the socket locked or keep it "stable" (meaning: "sk won't
> surprise sockmap update by some breaking state change coming from another
> thread"). As you said, most bpf iters already take the sock_lock(), and I
Right, all bpf iters (udp, tcp, unix) have acquired lock_sock() before
running the bpf iter prog. The only exception afaik is the netlink bpf
iter, but it cannot be added to a sock_map.
> have a patch that fixes sock_{map,hash}_seq_show(). Then we could try
> dropping that bh_lock_sock().
>
>> [ I would still keep patch 3 though. ]
>
> Right.
>
>> [1]: https://lore.kernel.org/bpf/20200821102948.21918-1-lmb@cloudflare.com/
>>
>>>
>>> In a parallel thread I've asked Kuniyuki if it might be better to
>>> completely drop patch 2/5, which would change how we interact with
>>> sock_map_close(). Lets see how it goes.
>>>
>>
>> If patch 2 is dropped, lock_sock() is always needed for unix_sk?
>
> For sock_map_update_elem_sys() I wanted to lock_sock()+unix_state_lock()
> following Kuniyuki's suggestion to keep locking pattern/order (that repeats
> when unix bpf iter prog invokes bpf_map_update_elem() ->
> sock_map_update_elem()). For sock_map_update_elem() not, we can't sleep there.
On 3/26/26 07:26, Martin KaFai Lau wrote:
> On 3/15/26 4:58 PM, Michal Luczaj wrote:
>>> Beside, from looking at the may_update_sockmap(), I don't know if it is
>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
>>> when the bpf_map_update_elem(sockmap) support was added iirc.
>>
>> What about a situation when unix_sk is stored in a sockmap, then tc prog
>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
>> useful, but seems doable.
>
> [ Sorry for the late reply ]
>
> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
> from tc :(
>
> Then unix_state_lock() in its current form cannot be safely acquired in
> sock_map_update_elem(). It is currently a spin_lock() instead of
> spin_lock_bh().
Is there a specific deadlock you have in your mind?
>>> The only path left is bpf_iter, which I believe was the primary use case
>>> when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
>>> to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
>>> where lock_sock() has already been done. It is more for
>>> reading-correctness though. This just came to my mind.
>>> has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
>>> been using it to conditionally take lock_sock() or not.
>>
>> [ One clarification: bh_lock_sock() is a sock_map_update_elem() thing,
>> which can only be called by a bpf prog. IOW, has_current_bpf_ctx() is
>> always `true` in sock_map_update_elem(), right? ]
>
> For all the bpf prog types allowed by may_update_sockmap() to do
> bpf_map_update_elem(sockmap), only BPF_TRACE_ITER should have
> has_current_bpf_ctx() == true. The tc prog (and others allowed in
> may_update_sockmap()) will have has_current_bpf_ctx() == false when
> calling sock_map_update_elem().
OK, so let's take test_sockmap_update.c:copy_sock_map(). It is a tc prog
and it calls bpf_map_update_elem() -> sock_map_update_elem(), right? But
running `test_progs -t "sockmap_basic/sockmap update"` shows (pr_warn() in
sock_map_update_elem()) that has_current_bpf_ctx() == true. That's expected
and has_current_bpf_ctx() would be false if sock_map_update_elem() was run
via a hook?
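For context, the conditional-locking pattern mentioned above is roughly
the following sketch, paraphrased from memory of sockopt_lock_sock() in
net/core/sock.c (not a verbatim kernel excerpt):

```c
/* Sketch of the sockopt_lock_sock() pattern: skip lock_sock() when a
 * BPF program is on the stack, since those callers (e.g. a bpf_iter
 * prog) are expected to already hold the socket lock.
 */
static void sockopt_lock_sock(struct sock *sk)
{
	if (has_current_bpf_ctx())
		return;		/* BPF caller already owns the lock */

	lock_sock(sk);
}
```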
> The tc case of bpf_map_update_elem(unix_sk) is unfortunate and requires
> going back to the drawing board. I think checking unix_peer(sk) for NULL
> without acquiring unix_state_lock() is needed for the
> sock_map_update_elem() path, since changing unix_state_lock() for this
> unknown use case seems overkill.
>
> Whether sock_map_update_elem_"sys"() needs unix_state_lock() is up for
> debate.
All right, I'll re-spin the series reverting back to v1.
> For bpf_iter_unix_seq_show(), one thought is to add unix_state_lock()
> there before running the bpf iter prog. iiuc, it is what Kuniyuki has in
> mind also to allow bpf iter prog having a stable view of unix_sock. This
> could be a followup.
> [fwiw, it was why I first thought of has_current_bpf_ctx() to avoid
> sock_map_update_elem() taking unix_state_lock() again if
> bpf_iter_unix_seq_show() acquires unix_state_lock() earlier. I later
> concluded (but proved to be incorrect) that tc cannot call
> bpf_map_update_elem(unix_sk).]
>
>>
>> Let me know if I'm correctly rephrasing your idea: assume all bpf-context
>> callers hold the socket locked or keep it "stable" (meaning: "sk won't
>> surprise sockmap update by some breaking state change coming from another
>> thread"). As you said, most bpf iters already take the sock_lock(), and I
>
> Right, all bpf iter (udp, tcp, unix) has acquired the lock_sock() before
> running the bpf iter prog. afaik, the only exception is netlink bpf iter
> but it cannot be added to sock_map afaik.
And sock_{map,hash}_seq_show() (being a part of bpf iter machinery) needs
to take lock_sock() just as well? Would that require a special-casing
(unix_state_lock()) for af_unix?
>> have a patch that fixes sock_{map,hash}_seq_show(). Then we could try
>> dropping that bh_lock_sock().
>>
>>> [ I would still keep patch 3 though. ]
>>
>> Right.
>>
>>> [1]: https://lore.kernel.org/bpf/20200821102948.21918-1-lmb@cloudflare.com/
>>>
>>>>
>>>> In a parallel thread I've asked Kuniyuki if it might be better to
>>>> completely drop patch 2/5, which would change how we interact with
>>>> sock_map_close(). Lets see how it goes.
>>>>
>>>
>>> If patch 2 is dropped, lock_sock() is always needed for unix_sk?
>>
>> For sock_map_update_elem_sys() I wanted to lock_sock()+unix_state_lock()
>> following Kuniyuki's suggestion to keep locking pattern/order (that repeats
>> when unix bpf iter prog invokes bpf_map_update_elem() ->
>> sock_map_update_elem()). For sock_map_update_elem() not, we can't sleep there.
>
On 3/30/26 4:03 PM, Michal Luczaj wrote:
> On 3/26/26 07:26, Martin KaFai Lau wrote:
>> On 3/15/26 4:58 PM, Michal Luczaj wrote:
>>>> Beside, from looking at the may_update_sockmap(), I don't know if it is
>>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
>>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
>>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
>>>> when the bpf_map_update_elem(sockmap) support was added iirc.
>>>
>>> What about a situation when unix_sk is stored in a sockmap, then tc prog
>>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
>>> useful, but seems doable.
>>
>> [ Sorry for the late reply ]
>>
>> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
>> from tc :(
>>
>> Then unix_state_lock() in its current form cannot be safely acquired in
>> sock_map_update_elem(). It is currently a spin_lock() instead of
>> spin_lock_bh().
>
> Is there a specific deadlock you have in your mind?
e.g. unix_stream_connect() is taking unix_state_lock(). Can a tc's
ingress bpf prog call unix_state_lock()?
>
>>>> The only path left is bpf_iter, which I believe was the primary use case
>>>> when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
>>>> to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
>>>> where lock_sock() has already been done. It is more for
>>>> reading-correctness though. This just came to my mind.
>>>> has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
>>>> been using it to conditionally take lock_sock() or not.
>>>
>>> [ One clarification: bh_lock_sock() is a sock_map_update_elem() thing,
>>> which can only be called by a bpf prog. IOW, has_current_bpf_ctx() is
>>> always `true` in sock_map_update_elem(), right? ]
>>
>> For all the bpf prog types allowed by may_update_sockmap() to do
>> bpf_map_update_elem(sockmap), only BPF_TRACE_ITER should have
>> has_current_bpf_ctx() == true. The tc prog (and others allowed in
>> may_update_sockmap()) will have has_current_bpf_ctx() == false when
>> calling sock_map_update_elem().
>
> OK, so let's take test_sockmap_update.c:copy_sock_map(). It is a tc prog
> and it calls bpf_map_update_elem() -> sock_map_update_elem(), right? But
> running `test_progs -t "sockmap_basic/sockmap update"` shows (pr_warn() in
> sock_map_update_elem()) that has_current_bpf_ctx() == true. That's expected
I think it is because of the bpf_prog_test_run_skb() code path used by
the test_sockmap_update() test. This would need to be addressed if
has_current_bpf_ctx() was used in sock_map_update_elem().
> and has_current_bpf_ctx() would be false if sock_map_update_elem() was ran
> via a hook?
It should be false when the bpf prog is run by tc instead of
bpf_prog_test_run_skb().
>>> Let me know if I'm correctly rephrasing your idea: assume all bpf-context
>>> callers hold the socket locked or keep it "stable" (meaning: "sk won't
>>> surprise sockmap update by some breaking state change coming from another
>>> thread"). As you said, most bpf iters already take the sock_lock(), and I
>>
>> Right, all bpf iter (udp, tcp, unix) has acquired the lock_sock() before
>> running the bpf iter prog. afaik, the only exception is netlink bpf iter
>> but it cannot be added to sock_map afaik.
>
> And sock_{map,hash}_seq_show() (being a part of bpf iter machinery) needs
> to take lock_sock() just as well? Would that require a special-casing
> (unix_state_lock()) for af_unix?
I would think so for lock_sock(), considering that the current
bh_lock_sock() usage without a !sock_owned_by_user() check is incorrect
in sock_map_update_elem(). [ this probably should be a separate issue
for another patch ]
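To illustrate the bh_lock_sock()/sock_owned_by_user() point: a rough
sketch of the usual receive-path pattern, not code taken from
sock_map_update_elem() (do_update() is a placeholder):

```c
/* bh_lock_sock() only takes sk->sk_lock.slock. A process-context
 * caller that did lock_sock() still "owns" the socket and may be
 * mid-update, so the spinlock alone does not serialize with it.
 * The classic pattern (cf. TCP's receive path) checks ownership:
 */
bh_lock_sock(sk);
if (!sock_owned_by_user(sk)) {
	/* no process-context owner: safe to touch socket state */
	do_update(sk);		/* placeholder */
} else {
	/* owner is active: defer the work or bail out */
}
bh_unlock_sock(sk);
```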
Some more side-tracking... from looking at the code, the bpf_iter of
sock_{map,hash} can do bpf_map_lookup_elem(&sock_map, ...). This
bpf_iter program will probably fail to load because bpf_sk_release()
is not available.
I still don't have a good idea what to do with a tc prog calling
sock_map_update_elem().
On 3/31/26 02:20, Martin KaFai Lau wrote:
> On 3/30/26 4:03 PM, Michal Luczaj wrote:
>> On 3/26/26 07:26, Martin KaFai Lau wrote:
>>> On 3/15/26 4:58 PM, Michal Luczaj wrote:
>>>>> Beside, from looking at the may_update_sockmap(), I don't know if it is
>>>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
>>>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
>>>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
>>>>> when the bpf_map_update_elem(sockmap) support was added iirc.
>>>>
>>>> What about a situation when unix_sk is stored in a sockmap, then tc prog
>>>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
>>>> useful, but seems doable.
>>>
>>> [ Sorry for the late reply ]
>>>
>>> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
>>> from tc :(
>>>
>>> Then unix_state_lock() in its current form cannot be safely acquired in
>>> sock_map_update_elem(). It is currently a spin_lock() instead of
>>> spin_lock_bh().
>>
>> Is there a specific deadlock you have in your mind?
>
> e.g. unix_stream_connect() is taking unix_state_lock(). Can a tc's
> ingress bpf prog call unix_state_lock()?
Ah, right, that's the problem, thanks for explaining.
But, as I've asked in the parallel thread, do we really need to take the
unix_state_lock() in sock_map_update_elem()? Taking it in
sock_map_update_elem_sys() fixes the null-ptr-deref and does not lead to a
deadlock. Taking unix_state_lock() in sock_map_update_elem() seems
unnecessary. Well, at least under the assumption progs can only access
unix_sk via the sockmap lookup.
>> ...
>> And sock_{map,hash}_seq_show() (being a part of bpf iter machinery) needs
>> to take lock_sock() just as well? Would that require a special-casing
>> (unix_state_lock()) for af_unix?
>
> I would think so for lock_sock() considering the current bh_lock_sock
> without !sock_owned_by_user() usage is incorrect in
> sock_map_update_elem(). [ this probably should be a separate issue for
> another patch ]
All right, leaving that for later.
> Some more side-tracking... from looking at the code, the bpf_iter of
> sock_{map,hash} can do bpf_map_lookup_elem(&sock_map, ...). This
> bpr_iter program probably will be failed to load because the
> bpf_sk_release() is not available.
I think the ability to bpf_map_lookup_elem(sockmap) was added way before bpf
iter in
https://lore.kernel.org/bpf/20200429181154.479310-2-jakub@cloudflare.com/
and iter "allows" it somewhat accidentally. Would you rather have it
explicitly dropped?
> I still don't have good idea what to do with the tc's prog calling
> sock_map_update_elem().
On Wed, 01 Apr 2026 00:43:58 +0200, Michal Luczaj wrote:
> On 3/31/26 02:20, Martin KaFai Lau wrote:
> > On 3/30/26 4:03 PM, Michal Luczaj wrote:
> >> On 3/26/26 07:26, Martin KaFai Lau wrote:
> >>> On 3/15/26 4:58 PM, Michal Luczaj wrote:
> >>>>> Beside, from looking at the may_update_sockmap(), I don't know if it is
> >>>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
> >>>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
> >>>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
> >>>>> when the bpf_map_update_elem(sockmap) support was added iirc.
> >>>>
> >>>> What about a situation when unix_sk is stored in a sockmap, then tc prog
> >>>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
> >>>> useful, but seems doable.
> >>>
> >>> [ Sorry for the late reply ]
> >>>
> >>> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
> >>> from tc :(
> >>>
> >>> Then unix_state_lock() in its current form cannot be safely acquired in
> >>> sock_map_update_elem(). It is currently a spin_lock() instead of
> >>> spin_lock_bh().
> >>
> >> Is there a specific deadlock you have in your mind?
> >
> > e.g. unix_stream_connect() is taking unix_state_lock(). Can a tc's
> > ingress bpf prog call unix_state_lock()?
>
> Ah, right, that's the problem, thanks for explaining.
>
> But, as I've asked in the parallel thread, do we really need to take the
> unix_state_lock() in sock_map_update_elem()? Taking it in
> sock_map_update_elem_sys() fixes the null-ptr-deref and does not lead to a
> deadlock. Taking unix_state_lock() in sock_map_update_elem() seems
> unnecessary. Well, at least under the assumption progs can only access
> unix_sk via the sockmap lookup.
right, sock_map_update_elem_sys() should be safe to take
unix_state_lock().
If it is fixed by testing unix_peer(), is the TCPF_ESTABLISHED test
in sock_map_sk_state_allowed() still useful and needed? Also,
please explain in detail in the commit message why testing for NULL
without unix_state_lock() is enough. For example, for the BPF iterator on
sock_map, my understanding is that unix_release_sock() can still happen
while the BPF iterator is iterating over a unix_sock. I guess a future
unix_state_lock() in the iterator's seq_show() should be useful.
It will also be useful to mention what was discovered about TC + lookup
+ update_elem(&sock_map, ...) and why it is not safe to take
unix_state_lock() in that path. Thanks.
>
> >> ...
> >> And sock_{map,hash}_seq_show() (being a part of bpf iter machinery) needs
> >> to take lock_sock() just as well? Would that require a special-casing
> >> (unix_state_lock()) for af_unix?
> >
> > I would think so for lock_sock() considering the current bh_lock_sock
> > without !sock_owned_by_user() usage is incorrect in
> > sock_map_update_elem(). [ this probably should be a separate issue for
> > another patch ]
>
> All right, leaving that for later.
>
> > Some more side-tracking... from looking at the code, the bpf_iter of
> > sock_{map,hash} can do bpf_map_lookup_elem(&sock_map, ...). This
> > bpr_iter program probably will be failed to load because the
> > bpf_sk_release() is not available.
>
> I think ability to bpf_map_lookup_elem(sockmap) was added way before bpf
> iter in
> https://lore.kernel.org/bpf/20200429181154.479310-2-jakub@cloudflare.com/
> and iter "allows" it somewhat accidentally. Would you rather have it
> explicitly dropped?
It is fine to leave it as is. It would be a bit nicer to reject map_lookup
on sockmap instead of giving the confusing error that bpf_sk_release
is not available. That is for another day. I was just wondering what else
could go wrong with map_lookup.
On Mon, Mar 30, 2026 at 4:04 PM Michal Luczaj <mhal@rbox.co> wrote:
>
> On 3/26/26 07:26, Martin KaFai Lau wrote:
> > On 3/15/26 4:58 PM, Michal Luczaj wrote:
> >>> Beside, from looking at the may_update_sockmap(), I don't know if it is
> >>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
> >>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
> >>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
> >>> when the bpf_map_update_elem(sockmap) support was added iirc.
> >>
> >> What about a situation when unix_sk is stored in a sockmap, then tc prog
> >> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
> >> useful, but seems doable.
> >
> > [ Sorry for the late reply ]
> >
> > It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
> > from tc :(
> >
> > Then unix_state_lock() in its current form cannot be safely acquired in
> > sock_map_update_elem(). It is currently a spin_lock() instead of
> > spin_lock_bh().
>
> Is there a specific deadlock you have in your mind?
lockdep would complain if we used the same lock from
different contexts.
e.g.)
Process context holding unix_state_lock() with the normal spin_lock()
-> BH interrupt
-> tc prog trying to hold the same lock with spin_lock(). (_bh())
-> deadlock
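Concretely, unix_state_lock() is a plain spin_lock() wrapper (roughly as
in include/net/af_unix.h), so the interleaving above is an AA deadlock on
a single CPU:

```c
/* Roughly as defined in include/net/af_unix.h: */
#define unix_state_lock(s)	spin_lock(&unix_sk(s)->lock)

/* The deadlock, annotated:
 *
 *   unix_state_lock(sk);      // process context, BHs still enabled
 *       <softirq fires on the same CPU>
 *       tc prog -> bpf_map_update_elem() -> sock_map_update_elem()
 *           unix_state_lock(sk);  // spins forever: lock held above
 *
 * Avoiding this would require spin_lock_bh() in every
 * unix_state_lock() user, i.e. touching the af_unix hot path.
 */
```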
>
> >>> The only path left is bpf_iter, which I believe was the primary use case
> >>> when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
> >>> to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
> >>> where lock_sock() has already been done. It is more for
> >>> reading-correctness though. This just came to my mind.
> >>> has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
> >>> been using it to conditionally take lock_sock() or not.
> >>
> >> [ One clarification: bh_lock_sock() is a sock_map_update_elem() thing,
> >> which can only be called by a bpf prog. IOW, has_current_bpf_ctx() is
> >> always `true` in sock_map_update_elem(), right? ]
> >
> > For all the bpf prog types allowed by may_update_sockmap() to do
> > bpf_map_update_elem(sockmap), only BPF_TRACE_ITER should have
> > has_current_bpf_ctx() == true. The tc prog (and others allowed in
> > may_update_sockmap()) will have has_current_bpf_ctx() == false when
> > calling sock_map_update_elem().
>
> OK, so let's take test_sockmap_update.c:copy_sock_map(). It is a tc prog
> and it calls bpf_map_update_elem() -> sock_map_update_elem(), right? But
> running `test_progs -t "sockmap_basic/sockmap update"` shows (pr_warn() in
> sock_map_update_elem()) that has_current_bpf_ctx() == true. That's expected
> and has_current_bpf_ctx() would be false if sock_map_update_elem() was ran
> via a hook?
>
> > The tc case of bpf_map_update_elem(unix_sk) is unfortunate and requires
> > going back to the drawing board. I think checking unix_peer(sk) for NULL
> > without acquiring unix_state_lock() is needed for the
> > sock_map_update_elem() path, since changing unix_state_lock() for this
> > unknown use case seems overkill.
> >
> > Whether sock_map_update_elem_"sys"() needs unix_state_lock() is up for
> > debate.
>
> All right, I'll re-spin the series reverting back to v1.
Sounds good.
>
> > For bpf_iter_unix_seq_show(), one thought is to add unix_state_lock()
> > there before running the bpf iter prog. iiuc, it is what Kuniyuki has in
> > mind also to allow bpf iter prog having a stable view of unix_sock. This
> > could be a followup.
> > [fwiw, it was why I first thought of has_current_bpf_ctx() to avoid
> > sock_map_update_elem() taking unix_state_lock() again if
> > bpf_iter_unix_seq_show() acquires unix_state_lock() earlier. I later
> > concluded (but proved to be incorrect) that tc cannot call
> > bpf_map_update_elem(unix_sk).]
> >
> >>
> >> Let me know if I'm correctly rephrasing your idea: assume all bpf-context
> >> callers hold the socket locked or keep it "stable" (meaning: "sk won't
> >> surprise sockmap update by some breaking state change coming from another
> >> thread"). As you said, most bpf iters already take the sock_lock(), and I
> >
> > Right, all bpf iter (udp, tcp, unix) has acquired the lock_sock() before
> > running the bpf iter prog. afaik, the only exception is netlink bpf iter
> > but it cannot be added to sock_map afaik.
>
> And sock_{map,hash}_seq_show() (being a part of bpf iter machinery) needs
> to take lock_sock() just as well? Would that require a special-casing
> (unix_state_lock()) for af_unix?
Right, lock_sock() + unix_state_lock() + SOCK_DEAD check
should be best.
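That combination could look like the following sketch for a unix
seq_show() path (placement and the bpf_iter_run_prog() call site are
assumptions; lock_sock() first keeps the ordering consistent with the
sockmap update path):

```c
/* Hypothetical bpf_iter unix seq_show() locking, per the suggestion
 * above. Skips sockets already torn down by unix_release_sock().
 */
lock_sock(sk);			/* sleepable iterator context */
unix_state_lock(sk);		/* stable unix_sock view for the prog */

if (!sock_flag(sk, SOCK_DEAD))
	ret = bpf_iter_run_prog(prog, &ctx);	/* call site assumed */

unix_state_unlock(sk);
release_sock(sk);
```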
>
> >> have a patch that fixes sock_{map,hash}_seq_show(). Then we could try
> >> dropping that bh_lock_sock().
> >>
> >>> [ I would still keep patch 3 though. ]
> >>
> >> Right.
> >>
> >>> [1]: https://lore.kernel.org/bpf/20200821102948.21918-1-lmb@cloudflare.com/
> >>>
> >>>>
> >>>> In a parallel thread I've asked Kuniyuki if it might be better to
> >>>> completely drop patch 2/5, which would change how we interact with
> >>>> sock_map_close(). Lets see how it goes.
> >>>>
> >>>
> >>> If patch 2 is dropped, lock_sock() is always needed for unix_sk?
> >>
> >> For sock_map_update_elem_sys() I wanted to lock_sock()+unix_state_lock()
> >> following Kuniyuki's suggestion to keep locking pattern/order (that repeats
> >> when unix bpf iter prog invokes bpf_map_update_elem() ->
> >> sock_map_update_elem()). For sock_map_update_elem() not, we can't sleep there.
> >
>
On 3/31/26 01:27, Kuniyuki Iwashima wrote:
> On Mon, Mar 30, 2026 at 4:04 PM Michal Luczaj <mhal@rbox.co> wrote:
>> On 3/26/26 07:26, Martin KaFai Lau wrote:
>>> On 3/15/26 4:58 PM, Michal Luczaj wrote:
>>>>> Beside, from looking at the may_update_sockmap(), I don't know if it is
>>>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
>>>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
>>>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
>>>>> when the bpf_map_update_elem(sockmap) support was added iirc.
>>>>
>>>> What about a situation when unix_sk is stored in a sockmap, then tc prog
>>>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
>>>> useful, but seems doable.
>>>
>>> [ Sorry for the late reply ]
>>>
>>> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
>>> from tc :(
>>>
>>> Then unix_state_lock() in its current form cannot be safely acquired in
>>> sock_map_update_elem(). It is currently a spin_lock() instead of
>>> spin_lock_bh().
>>
>> Is there a specific deadlock you have in your mind?
>
> lockdep would complain if we used the same lock from
> different contexts.
>
> e.g.)
> Process context holding unix_state_lock() with the normal spin_lock()
> -> BH interrupt
> -> tc prog trying to hold the same lock with spin_lock(). (_bh())
> -> deadlock
OK, I get it, thanks.
So here's one more idea: the null-ptr-deref issue is connect() racing
against sock_map_update_elem_*SYS*() coming from user-space, not the
can-be-called-from-BH sock_map_update_elem() variant. So can't we assume
that for any sock_map_update_elem(unix_sk) invoked by a tc prog, unix_sk
will always be "stable", i.e. in a state that cannot lead to that
null-ptr-deref?
IOW, if for a tc prog the only way to get hold of unix_sk is look it up in
a sockmap, then (by the fact that unix_sk _is_ in the sockmap) unix_sk will
be already safe to use by sock_map_update_elem() without taking the af_unix
state lock.
Long story short: take the unix state lock in sock_map_update_elem_sys(),
don't bother in sock_map_update_elem()?
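A hypothetical sketch of that split, loosely modeled on
net/core/sock_map.c (simplified: fd/key handling collapsed into u32s,
and the AF_UNIX branch is invented for illustration, not actual kernel
code):

```c
/* Only the syscall path takes the af_unix state lock; the in-BPF
 * update path relies on the socket already being in the map and
 * therefore "stable".
 */
static int sock_map_update_elem_sys(struct bpf_map *map, void *key,
				    void *value, u64 flags)
{
	u32 ufd = *(u32 *)value;
	struct socket *sock;
	struct sock *sk;
	int ret;

	sock = sockfd_lookup(ufd, &ret);
	if (!sock)
		return ret;
	sk = sock->sk;

	lock_sock(sk);			/* process context: may sleep */
	if (sk->sk_family == AF_UNIX)
		unix_state_lock(sk);	/* vs unix_stream_connect() */

	ret = sock_map_update_common(map, *(u32 *)key, sk, flags);

	if (sk->sk_family == AF_UNIX)
		unix_state_unlock(sk);
	release_sock(sk);
	sockfd_put(sock);
	return ret;
}
```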
>> ...
>> And sock_{map,hash}_seq_show() (being a part of bpf iter machinery) needs
>> to take lock_sock() just as well? Would that require a special-casing
>> (unix_state_lock()) for af_unix?
>
> Right, lock_sock() + unix_state_lock() + SOCK_DEAD check
> should be best.
OK, got it.
On Tue, Mar 31, 2026 at 3:44 PM Michal Luczaj <mhal@rbox.co> wrote:
>
> On 3/31/26 01:27, Kuniyuki Iwashima wrote:
> > On Mon, Mar 30, 2026 at 4:04 PM Michal Luczaj <mhal@rbox.co> wrote:
> >> On 3/26/26 07:26, Martin KaFai Lau wrote:
> >>> On 3/15/26 4:58 PM, Michal Luczaj wrote:
> >>>>> Beside, from looking at the may_update_sockmap(), I don't know if it is
> >>>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
> >>>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
> >>>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
> >>>>> when the bpf_map_update_elem(sockmap) support was added iirc.
> >>>>
> >>>> What about a situation when unix_sk is stored in a sockmap, then tc prog
> >>>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
> >>>> useful, but seems doable.
> >>>
> >>> [ Sorry for the late reply ]
> >>>
> >>> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
> >>> from tc :(
> >>>
> >>> Then unix_state_lock() in its current form cannot be safely acquired in
> >>> sock_map_update_elem(). It is currently a spin_lock() instead of
> >>> spin_lock_bh().
> >>
> >> Is there a specific deadlock you have in your mind?
> >
> > lockdep would complain if we used the same lock from
> > different contexts.
> >
> > e.g.)
> > Process context holding unix_state_lock() with the normal spin_lock()
> > -> BH interrupt
> > -> tc prog trying to hold the same lock with spin_lock(). (_bh())
> > -> deadlock
>
> OK, I get it, thanks.
>
> So here's one more idea: the null-ptr-deref issue is connect() racing
> against sock_map_update_elem_*SYS*() coming from user-space, not the
> can-be-called-from-BH sock_map_update_elem() variant. So can't we assume
> that for any sock_map_update_elem(unix_sk) invoked by a tc prog, unix_sk
> will always be "stable", i.e. in a state that cannot lead to that
> null-ptr-deref?
>
> IOW, if for a tc prog the only way to get hold of unix_sk is look it up in
> a sockmap, then (by the fact that unix_sk _is_ in the sockmap) unix_sk will
> be already safe to use by sock_map_update_elem() without taking the af_unix
> state lock.
>
> Long story short: take the unix state lock in sock_map_update_elem_sys(),
> don't bother in sock_map_update_elem()?
but it will prevent bpf iter from taking unix_state_lock().
I don't see a good reason to introduce a new locking rule by unnecessarily
wrapping the entire sockmap update with unix_state_lock() even though
the root bug can be avoided by a simple null check in a leaf function.
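The leaf-function check referred to here would sit in
unix_stream_bpf_update_proto(); a fragment sketch (the -ESTALE choice
and surrounding names are assumptions, this is not the final patch):

```c
/* Inside unix_stream_bpf_update_proto(), before using the peer:
 * unix_stream_connect() publishes TCP_ESTABLISHED before assigning
 * unix_peer(sk), so sockmap can observe an established socket with a
 * NULL peer. Bail out instead of dereferencing it.
 */
sk_pair = unix_peer(sk);
if (!sk_pair)
	return -ESTALE;	/* raced with connect(); errno is a guess */

sock_hold(sk_pair);
psock->sk_pair = sk_pair;
```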
>
> >> ...
> >> And sock_{map,hash}_seq_show() (being a part of bpf iter machinery) needs
> >> to take lock_sock() just as well? Would that require a special-casing
> >> (unix_state_lock()) for af_unix?
> >
> > Right, lock_sock() + unix_state_lock() + SOCK_DEAD check
> > should be best.
>
> OK, got it.
>
On 4/1/26 01:18, Kuniyuki Iwashima wrote:
> On Tue, Mar 31, 2026 at 3:44 PM Michal Luczaj <mhal@rbox.co> wrote:
>>
>> On 3/31/26 01:27, Kuniyuki Iwashima wrote:
>>> On Mon, Mar 30, 2026 at 4:04 PM Michal Luczaj <mhal@rbox.co> wrote:
>>>> On 3/26/26 07:26, Martin KaFai Lau wrote:
>>>>> On 3/15/26 4:58 PM, Michal Luczaj wrote:
>>>>>>> Beside, from looking at the may_update_sockmap(), I don't know if it is
>>>>>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
>>>>>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
>>>>>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
>>>>>>> when the bpf_map_update_elem(sockmap) support was added iirc.
>>>>>>
>>>>>> What about a situation when unix_sk is stored in a sockmap, then tc prog
>>>>>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
>>>>>> useful, but seems doable.
>>>>>
>>>>> [ Sorry for the late reply ]
>>>>>
>>>>> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
>>>>> from tc :(
>>>>>
>>>>> Then unix_state_lock() in its current form cannot be safely acquired in
>>>>> sock_map_update_elem(). It is currently a spin_lock() instead of
>>>>> spin_lock_bh().
>>>>
>>>> Is there a specific deadlock you have in your mind?
>>>
>>> lockdep would complain if we used the same lock from
>>> different contexts.
>>>
>>> e.g.)
>>> Process context holding unix_state_lock() with the normal spin_lock()
>>> -> BH interrupt
>>> -> tc prog trying to hold the same lock with spin_lock(). (_bh())
>>> -> deadlock
>>
>> OK, I get it, thanks.
>>
>> So here's one more idea: the null-ptr-deref issue is connect() racing
>> against sock_map_update_elem_*SYS*() coming from user-space, not the
>> can-be-called-from-BH sock_map_update_elem() variant. So can't we assume
>> that for any sock_map_update_elem(unix_sk) invoked by a tc prog, unix_sk
>> will always be "stable", i.e. in a state that cannot lead to that
>> null-ptr-deref?
>>
>> IOW, if for a tc prog the only way to get hold of unix_sk is look it up in
>> a sockmap, then (by the fact that unix_sk _is_ in the sockmap) unix_sk will
>> be already safe to use by sock_map_update_elem() without taking the af_unix
>> state lock.
>>
>> Long story short: take the unix state lock in sock_map_update_elem_sys(),
>> don't bother in sock_map_update_elem()?
>
> but it will prevents bpf iter from taking unix_state_lock().

How come? bpf iter can take unix_state_lock() and the prog is free to
invoke sock_map_update_elem(). It's the _sys variant that would be taking
unix_state_lock() by itself. Or is there some other locking complexity I'm
missing?

> I don't see a good reason to introduce a new locking rule by unnecessarily
> wrapping the entire sockmap update with unix_state_lock() even though
> the root bug can be avoided by a simple null check in a leaf function.

Yeah, I get that point. I just assumed the root bug was indeed sockmap
update missing a lock.