unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
the socket is properly set up, which would include having a defined peer.
IOW, there's a window when unix_stream_bpf_update_proto() can be called on
a socket which still has unix_peer(sk) == NULL.
T0 bpf                                  T1 connect
------                                  ----------

                                        WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
...
sk_pair = unix_peer(sk)
sock_hold(sk_pair)
                                        sock_hold(newsk)
                                        smp_mb__after_atomic()
                                        unix_peer(sk) = newsk
BUG: kernel NULL pointer dereference, address: 0000000000000080
RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
Call Trace:
sock_map_link+0x564/0x8b0
sock_map_update_common+0x6e/0x340
sock_map_update_elem_sys+0x17d/0x240
__sys_bpf+0x26db/0x3250
__x64_sys_bpf+0x21/0x30
do_syscall_64+0x6b/0x3a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The initial idea was to move the peer assignment _before_ the sk_state
update[1], but that involved an additional memory barrier, and changing the
hot path was rejected. Then a check during the proto update was
considered[2], but a follow-up discussion[3] concluded the root cause is
sockmap taking a wrong lock. Or, more specifically, an insufficient one[4].
Thus, teach sockmap about the af_unix-specific locking: af_unix protects its
critical sections with unix_state_lock(), which operates on unix_sock::lock.
[1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
[2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
[3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
[4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 7ba6a7f24ccd..6109fbe6f99c 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -12,6 +12,7 @@
#include <linux/list.h>
#include <linux/jhash.h>
#include <linux/sock_diag.h>
+#include <net/af_unix.h>
#include <net/udp.h>
struct bpf_stab {
@@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
}
static void sock_map_sk_acquire(struct sock *sk)
- __acquires(&sk->sk_lock.slock)
{
lock_sock(sk);
+
+ if (sk_is_unix(sk))
+ unix_state_lock(sk);
+
rcu_read_lock();
}
static void sock_map_sk_release(struct sock *sk)
- __releases(&sk->sk_lock.slock)
{
rcu_read_unlock();
+
+ if (sk_is_unix(sk))
+ unix_state_unlock(sk);
+
release_sock(sk);
}
+static inline void sock_map_sk_acquire_fast(struct sock *sk)
+{
+ local_bh_disable();
+ bh_lock_sock(sk);
+
+ if (sk_is_unix(sk))
+ unix_state_lock(sk);
+}
+
+static inline void sock_map_sk_release_fast(struct sock *sk)
+{
+ if (sk_is_unix(sk))
+ unix_state_unlock(sk);
+
+ bh_unlock_sock(sk);
+ local_bh_enable();
+}
+
static void sock_map_add_link(struct sk_psock *psock,
struct sk_psock_link *link,
struct bpf_map *map, void *link_raw)
@@ -604,16 +629,14 @@ static long sock_map_update_elem(struct bpf_map *map, void *key,
if (!sock_map_sk_is_suitable(sk))
return -EOPNOTSUPP;
- local_bh_disable();
- bh_lock_sock(sk);
+ sock_map_sk_acquire_fast(sk);
if (!sock_map_sk_state_allowed(sk))
ret = -EOPNOTSUPP;
else if (map->map_type == BPF_MAP_TYPE_SOCKMAP)
ret = sock_map_update_common(map, *(u32 *)key, sk, flags);
else
ret = sock_hash_update_common(map, key, sk, flags);
- bh_unlock_sock(sk);
- local_bh_enable();
+ sock_map_sk_release_fast(sk);
return ret;
}
--
2.52.0
On 3/6/26 7:30 AM, Michal Luczaj wrote:
> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
> socket is properly set up, which would include having a defined peer. IOW,
> there's a window when unix_stream_bpf_update_proto() can be called on
> socket which still has unix_peer(sk) == NULL.
>
> T0 bpf T1 connect
> ------ ----------
>
> WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
> sock_map_sk_state_allowed(sk)
> ...
> sk_pair = unix_peer(sk)
> sock_hold(sk_pair)
> sock_hold(newsk)
> smp_mb__after_atomic()
> unix_peer(sk) = newsk
>
> BUG: kernel NULL pointer dereference, address: 0000000000000080
> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
> Call Trace:
> sock_map_link+0x564/0x8b0
> sock_map_update_common+0x6e/0x340
> sock_map_update_elem_sys+0x17d/0x240
> __sys_bpf+0x26db/0x3250
> __x64_sys_bpf+0x21/0x30
> do_syscall_64+0x6b/0x3a0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Initial idea was to move peer assignment _before_ the sk_state update[1],
> but that involved an additional memory barrier, and changing the hot path
> was rejected. Then a check during proto update was considered[2], but a
> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
> lock. Or, more specifically, an insufficient lock[4].
>
> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
> critical sections under unix_state_lock() operating on unix_sock::lock.
>
> [1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
> [2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
> [3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
>
> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
> net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
> 1 file changed, 29 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 7ba6a7f24ccd..6109fbe6f99c 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -12,6 +12,7 @@
> #include <linux/list.h>
> #include <linux/jhash.h>
> #include <linux/sock_diag.h>
> +#include <net/af_unix.h>
> #include <net/udp.h>
>
> struct bpf_stab {
> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
> }
>
> static void sock_map_sk_acquire(struct sock *sk)
> - __acquires(&sk->sk_lock.slock)
> {
> lock_sock(sk);
> +
> + if (sk_is_unix(sk))
> + unix_state_lock(sk);
> +
> rcu_read_lock();
> }
>
This introduces a new ordering constraint: lock_sock() before
unix_state_lock(). Kuniyuki flagged in the v2 review that taking
lock_sock() inside unix_state_lock() in the future would create an
ABBA deadlock, which is exactly why the order was settled this way. However,
the thread did not reach a conclusion on whether that constraint should be
documented in the code.
Since there is nothing enforcing this ordering mechanically, a brief comment
at sock_map_sk_acquire() would help future readers avoid introducing the
inverse ordering.
> static void sock_map_sk_release(struct sock *sk)
> - __releases(&sk->sk_lock.slock)
> {
> rcu_read_unlock();
> +
> + if (sk_is_unix(sk))
> + unix_state_unlock(sk);
> +
> release_sock(sk);
> }
>
> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
> +{
> + local_bh_disable();
> + bh_lock_sock(sk);
> +
> + if (sk_is_unix(sk))
> + unix_state_lock(sk);
> +}
> +
v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
sockets took only unix_state_lock() in the fast path, skipping
bh_lock_sock() entirely:
/* v2 */
if (sk_is_unix(sk))
unix_state_lock(sk);
else
bh_lock_sock(sk);
v3 takes both for AF_UNIX sockets.
bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
actually needs protection here — sk_state and unix_peer() — lives under
unix_sock::lock. Since unix_state_lock() is already sufficient to close
the race against unix_stream_connect(), is bh_lock_sock() still doing
anything useful for AF_UNIX sockets in this path?
The v3 changelog attributes "Keep lock_sock() along with unix_state_lock()"
to Kuniyuki, but that comment appears to concern sock_map_sk_acquire()
(the slow path); the fast path behaviour change was not explicitly discussed
in the v2 thread.
If bh_lock_sock() is redundant for AF_UNIX here, reverting to the v2
approach would reduce unnecessary lock contention. If it is still needed,
a comment explaining what it protects for AF_UNIX sockets would help.
> +static inline void sock_map_sk_release_fast(struct sock *sk)
> +{
> + if (sk_is_unix(sk))
> + unix_state_unlock(sk);
> +
> + bh_unlock_sock(sk);
> + local_bh_enable();
> +}
> +
> static void sock_map_add_link(struct sk_psock *psock,
> struct sk_psock_link *link,
> struct bpf_map *map, void *link_raw)
> @@ -604,16 +629,14 @@ static long sock_map_update_elem(struct bpf_map *map, void *key,
> if (!sock_map_sk_is_suitable(sk))
> return -EOPNOTSUPP;
>
> - local_bh_disable();
> - bh_lock_sock(sk);
> + sock_map_sk_acquire_fast(sk);
> if (!sock_map_sk_state_allowed(sk))
> ret = -EOPNOTSUPP;
> else if (map->map_type == BPF_MAP_TYPE_SOCKMAP)
> ret = sock_map_update_common(map, *(u32 *)key, sk, flags);
> else
> ret = sock_hash_update_common(map, key, sk, flags);
> - bh_unlock_sock(sk);
> - local_bh_enable();
> + sock_map_sk_release_fast(sk);
> return ret;
> }
>
>
On 3/6/26 06:01, Jiayuan Chen wrote:
>
> On 3/6/26 7:30 AM, Michal Luczaj wrote:
>> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
>> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
>> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
>> socket is properly set up, which would include having a defined peer. IOW,
>> there's a window when unix_stream_bpf_update_proto() can be called on
>> socket which still has unix_peer(sk) == NULL.
>>
>> T0 bpf T1 connect
>> ------ ----------
>>
>> WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
>> sock_map_sk_state_allowed(sk)
>> ...
>> sk_pair = unix_peer(sk)
>> sock_hold(sk_pair)
>> sock_hold(newsk)
>> smp_mb__after_atomic()
>> unix_peer(sk) = newsk
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000080
>> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
>> Call Trace:
>> sock_map_link+0x564/0x8b0
>> sock_map_update_common+0x6e/0x340
>> sock_map_update_elem_sys+0x17d/0x240
>> __sys_bpf+0x26db/0x3250
>> __x64_sys_bpf+0x21/0x30
>> do_syscall_64+0x6b/0x3a0
>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> Initial idea was to move peer assignment _before_ the sk_state update[1],
>> but that involved an additional memory barrier, and changing the hot path
>> was rejected. Then a check during proto update was considered[2], but a
>> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
>> lock. Or, more specifically, an insufficient lock[4].
>>
>> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
>> critical sections under unix_state_lock() operating on unix_sock::lock.
>>
>> [1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
>> [2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
>> [3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
>> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
>>
>> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
>> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
>> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>> ---
>> net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
>> 1 file changed, 29 insertions(+), 6 deletions(-)
>>
>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>> index 7ba6a7f24ccd..6109fbe6f99c 100644
>> --- a/net/core/sock_map.c
>> +++ b/net/core/sock_map.c
>> @@ -12,6 +12,7 @@
>> #include <linux/list.h>
>> #include <linux/jhash.h>
>> #include <linux/sock_diag.h>
>> +#include <net/af_unix.h>
>> #include <net/udp.h>
>>
>> struct bpf_stab {
>> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
>> }
>>
>> static void sock_map_sk_acquire(struct sock *sk)
>> - __acquires(&sk->sk_lock.slock)
>> {
>> lock_sock(sk);
>> +
>> + if (sk_is_unix(sk))
>> + unix_state_lock(sk);
>> +
>> rcu_read_lock();
>> }
>>
>
> This introduces a new ordering constraint: lock_sock() before
> unix_state_lock(). Kuniyuki flagged in the v2 review that taking
> lock_sock() inside unix_state_lock() in the future would create an
> ABBA deadlock, which is exactly why the order was settled this way. However,
> the thread did not reach a conclusion on whether that constraint should be
> documented in the code.
>
> Since there is nothing enforcing this ordering mechanically, a brief comment
> at sock_map_sk_acquire() would help future readers avoid introducing the
> inverse ordering.
Sure, will do.
>> static void sock_map_sk_release(struct sock *sk)
>> - __releases(&sk->sk_lock.slock)
>> {
>> rcu_read_unlock();
>> +
>> + if (sk_is_unix(sk))
>> + unix_state_unlock(sk);
>> +
>> release_sock(sk);
>> }
>>
>> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
>> +{
>> + local_bh_disable();
>> + bh_lock_sock(sk);
>> +
>> + if (sk_is_unix(sk))
>> + unix_state_lock(sk);
>> +}
>> +
>
>
> v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
> sockets took only unix_state_lock() in the fast path, skipping
> bh_lock_sock() entirely:
>
> /* v2 */
> if (sk_is_unix(sk))
> unix_state_lock(sk);
> else
> bh_lock_sock(sk);
>
> v3 takes both for AF_UNIX sockets.
>
> bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
> actually needs protection here — sk_state and unix_peer() — lives under
> unix_sock::lock. Since unix_state_lock() is already sufficient to close
> the race against unix_stream_connect(), is bh_lock_sock() still doing
> anything useful for AF_UNIX sockets in this path?
Yeah, good point. I think I was just trying to keep the _fast variant
aligned with the sleepable one, which I agree might be unnecessary.
In a parallel thread I've asked Kuniyuki if it might be better to
completely drop patch 2/5, which would change how we interact with
sock_map_close(). Let's see how it goes.
On 3/6/26 6:09 AM, Michal Luczaj wrote:
> On 3/6/26 06:01, Jiayuan Chen wrote:
>>
>> On 3/6/26 7:30 AM, Michal Luczaj wrote:
>>> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
>>> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
>>> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
>>> socket is properly set up, which would include having a defined peer. IOW,
>>> there's a window when unix_stream_bpf_update_proto() can be called on
>>> socket which still has unix_peer(sk) == NULL.
>>>
>>> T0 bpf T1 connect
>>> ------ ----------
>>>
>>> WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
>>> sock_map_sk_state_allowed(sk)
>>> ...
>>> sk_pair = unix_peer(sk)
>>> sock_hold(sk_pair)
>>> sock_hold(newsk)
>>> smp_mb__after_atomic()
>>> unix_peer(sk) = newsk
>>>
>>> BUG: kernel NULL pointer dereference, address: 0000000000000080
>>> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
>>> Call Trace:
>>> sock_map_link+0x564/0x8b0
>>> sock_map_update_common+0x6e/0x340
>>> sock_map_update_elem_sys+0x17d/0x240
>>> __sys_bpf+0x26db/0x3250
>>> __x64_sys_bpf+0x21/0x30
>>> do_syscall_64+0x6b/0x3a0
>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> Initial idea was to move peer assignment _before_ the sk_state update[1],
>>> but that involved an additional memory barrier, and changing the hot path
>>> was rejected. Then a check during proto update was considered[2], but a
>>> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
>>> lock. Or, more specifically, an insufficient lock[4].
>>>
>>> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
>>> critical sections under unix_state_lock() operating on unix_sock::lock.
>>>
>>> [1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
>>> [2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
>>> [3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
>>> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
>>>
>>> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
>>> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
>>> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
>>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>>> ---
>>> net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
>>> 1 file changed, 29 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>>> index 7ba6a7f24ccd..6109fbe6f99c 100644
>>> --- a/net/core/sock_map.c
>>> +++ b/net/core/sock_map.c
>>> @@ -12,6 +12,7 @@
>>> #include <linux/list.h>
>>> #include <linux/jhash.h>
>>> #include <linux/sock_diag.h>
>>> +#include <net/af_unix.h>
>>> #include <net/udp.h>
>>>
>>> struct bpf_stab {
>>> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
>>> }
>>>
>>> static void sock_map_sk_acquire(struct sock *sk)
>>> - __acquires(&sk->sk_lock.slock)
>>> {
>>> lock_sock(sk);
>>> +
>>> + if (sk_is_unix(sk))
>>> + unix_state_lock(sk);
>>> +
>>> rcu_read_lock();
>>> }
>>>
>>
>> This introduces a new ordering constraint: lock_sock() before
>> unix_state_lock(). Kuniyuki flagged in the v2 review that taking
>> lock_sock() inside unix_state_lock() in the future would create an
>> ABBA deadlock, which is exactly why the order was settled this way. However,
>> the thread did not reach a conclusion on whether that constraint should be
>> documented in the code.
>>
>> Since there is nothing enforcing this ordering mechanically, a brief comment
>> at sock_map_sk_acquire() would help future readers avoid introducing the
>> inverse ordering.
>
> Sure, will do.
>
>>> static void sock_map_sk_release(struct sock *sk)
>>> - __releases(&sk->sk_lock.slock)
>>> {
>>> rcu_read_unlock();
>>> +
>>> + if (sk_is_unix(sk))
>>> + unix_state_unlock(sk);
>>> +
>>> release_sock(sk);
>>> }
>>>
>>> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
>>> +{
>>> + local_bh_disable();
>>> + bh_lock_sock(sk);
>>> +
>>> + if (sk_is_unix(sk))
>>> + unix_state_lock(sk);
>>> +}
>>> +
>>
>>
>> v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
>> sockets took only unix_state_lock() in the fast path, skipping
>> bh_lock_sock() entirely:
>>
>> /* v2 */
>> if (sk_is_unix(sk))
>> unix_state_lock(sk);
>> else
>> bh_lock_sock(sk);
>>
>> v3 takes both for AF_UNIX sockets.
>>
>> bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
>> actually needs protection here — sk_state and unix_peer() — lives under
>> unix_sock::lock. Since unix_state_lock() is already sufficient to close
>> the race against unix_stream_connect(), is bh_lock_sock() still doing
>> anything useful for AF_UNIX sockets in this path?
>
> Yeah, good point. I think I was just trying to keep the _fast variant
> aligned with the sleepy one. Which I agree might be unnecessary.
I hope the common use case is not calling
bpf_map_update_elem(sockmap) only.
Besides, from looking at may_update_sockmap(), I don't know if it is
even doable (or useful) to bpf_map_update_elem(unix_sk) in
tc/flow_dissector/xdp. One possible path is SOCK_FILTER, when looking
at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
when bpf_map_update_elem(sockmap) support was added, iirc.
The only path left is bpf_iter, which I believe was the primary use case
when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
where lock_sock() has already been done. It is more for
reading-correctness though. This just came to my mind.
has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
been using it to conditionally take lock_sock() or not. [ I would still
keep patch 3 though. ]
[1]: https://lore.kernel.org/bpf/20200821102948.21918-1-lmb@cloudflare.com/
>
> In a parallel thread I've asked Kuniyuki if it might be better to
> completely drop patch 2/5, which would change how we interact with
> sock_map_close(). Lets see how it goes.
>
If patch 2 is dropped, lock_sock() is always needed for unix_sk?
On 3/10/26 23:20, Martin KaFai Lau wrote:
> On 3/6/26 6:09 AM, Michal Luczaj wrote:
>> On 3/6/26 06:01, Jiayuan Chen wrote:
>>>
>>> On 3/6/26 7:30 AM, Michal Luczaj wrote:
>>>> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
>>>> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
>>>> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
>>>> socket is properly set up, which would include having a defined peer. IOW,
>>>> there's a window when unix_stream_bpf_update_proto() can be called on
>>>> socket which still has unix_peer(sk) == NULL.
>>>>
>>>> T0 bpf T1 connect
>>>> ------ ----------
>>>>
>>>> WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
>>>> sock_map_sk_state_allowed(sk)
>>>> ...
>>>> sk_pair = unix_peer(sk)
>>>> sock_hold(sk_pair)
>>>> sock_hold(newsk)
>>>> smp_mb__after_atomic()
>>>> unix_peer(sk) = newsk
>>>>
>>>> BUG: kernel NULL pointer dereference, address: 0000000000000080
>>>> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
>>>> Call Trace:
>>>> sock_map_link+0x564/0x8b0
>>>> sock_map_update_common+0x6e/0x340
>>>> sock_map_update_elem_sys+0x17d/0x240
>>>> __sys_bpf+0x26db/0x3250
>>>> __x64_sys_bpf+0x21/0x30
>>>> do_syscall_64+0x6b/0x3a0
>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>
>>>> Initial idea was to move peer assignment _before_ the sk_state update[1],
>>>> but that involved an additional memory barrier, and changing the hot path
>>>> was rejected. Then a check during proto update was considered[2], but a
>>>> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
>>>> lock. Or, more specifically, an insufficient lock[4].
>>>>
>>>> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
>>>> critical sections under unix_state_lock() operating on unix_sock::lock.
>>>>
>>>> [1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
>>>> [2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
>>>> [3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
>>>> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
>>>>
>>>> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
>>>> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
>>>> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
>>>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>>>> ---
>>>> net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
>>>> 1 file changed, 29 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>>>> index 7ba6a7f24ccd..6109fbe6f99c 100644
>>>> --- a/net/core/sock_map.c
>>>> +++ b/net/core/sock_map.c
>>>> @@ -12,6 +12,7 @@
>>>> #include <linux/list.h>
>>>> #include <linux/jhash.h>
>>>> #include <linux/sock_diag.h>
>>>> +#include <net/af_unix.h>
>>>> #include <net/udp.h>
>>>>
>>>> struct bpf_stab {
>>>> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
>>>> }
>>>>
>>>> static void sock_map_sk_acquire(struct sock *sk)
>>>> - __acquires(&sk->sk_lock.slock)
>>>> {
>>>> lock_sock(sk);
>>>> +
>>>> + if (sk_is_unix(sk))
>>>> + unix_state_lock(sk);
>>>> +
>>>> rcu_read_lock();
>>>> }
>>>>
>>>
>>> This introduces a new ordering constraint: lock_sock() before
>>> unix_state_lock(). Kuniyuki flagged in the v2 review that taking
>>> lock_sock() inside unix_state_lock() in the future would create an
>>> ABBA deadlock, which is exactly why the order was settled this way. However,
>>> the thread did not reach a conclusion on whether that constraint should be
>>> documented in the code.
>>>
>>> Since there is nothing enforcing this ordering mechanically, a brief comment
>>> at sock_map_sk_acquire() would help future readers avoid introducing the
>>> inverse ordering.
>>
>> Sure, will do.
>>
>>>> static void sock_map_sk_release(struct sock *sk)
>>>> - __releases(&sk->sk_lock.slock)
>>>> {
>>>> rcu_read_unlock();
>>>> +
>>>> + if (sk_is_unix(sk))
>>>> + unix_state_unlock(sk);
>>>> +
>>>> release_sock(sk);
>>>> }
>>>>
>>>> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
>>>> +{
>>>> + local_bh_disable();
>>>> + bh_lock_sock(sk);
>>>> +
>>>> + if (sk_is_unix(sk))
>>>> + unix_state_lock(sk);
>>>> +}
>>>> +
>>>
>>>
>>> v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
>>> sockets took only unix_state_lock() in the fast path, skipping
>>> bh_lock_sock() entirely:
>>>
>>> /* v2 */
>>> if (sk_is_unix(sk))
>>> unix_state_lock(sk);
>>> else
>>> bh_lock_sock(sk);
>>>
>>> v3 takes both for AF_UNIX sockets.
>>>
>>> bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
>>> actually needs protection here — sk_state and unix_peer() — lives under
>>> unix_sock::lock. Since unix_state_lock() is already sufficient to close
>>> the race against unix_stream_connect(), is bh_lock_sock() still doing
>>> anything useful for AF_UNIX sockets in this path?
>>
>> Yeah, good point. I think I was just trying to keep the _fast variant
>> aligned with the sleepy one. Which I agree might be unnecessary.
>
> I hope the common use case should not be calling
> bpf_map_update_elem(sockmap) only.
>
> Beside, from looking at the may_update_sockmap(), I don't know if it is
> even doable (or useful) to bpf_map_update_elem(unix_sk) in
> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
> when the bpf_map_update_elem(sockmap) support was added iirc.
What about a situation when unix_sk is stored in a sockmap, then tc prog
looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
useful, but seems doable.
> The only path left is bpf_iter, which I believe was the primary use case
> when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
> to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
> where lock_sock() has already been done. It is more for
> reading-correctness though. This just came to my mind.
> has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
> been using it to conditionally take lock_sock() or not.
[ One clarification: bh_lock_sock() is a sock_map_update_elem() thing,
which can only be called by a bpf prog. IOW, has_current_bpf_ctx() is
always `true` in sock_map_update_elem(), right? ]
Let me know if I'm correctly rephrasing your idea: assume all bpf-context
callers hold the socket locked or keep it "stable" (meaning: "sk won't
surprise the sockmap update with some breaking state change coming from
another thread"). As you said, most bpf iters already take lock_sock(), and I
have a patch that fixes sock_{map,hash}_seq_show(). Then we could try
dropping that bh_lock_sock().
> [ I would still keep patch 3 though. ]
Right.
> [1]: https://lore.kernel.org/bpf/20200821102948.21918-1-lmb@cloudflare.com/
>
>>
>> In a parallel thread I've asked Kuniyuki if it might be better to
>> completely drop patch 2/5, which would change how we interact with
>> sock_map_close(). Lets see how it goes.
>>
>
> If patch 2 is dropped, lock_sock() is always needed for unix_sk?
For sock_map_update_elem_sys() I wanted to take lock_sock()+unix_state_lock(),
following Kuniyuki's suggestion to keep the locking pattern/order (which
repeats when a unix bpf iter prog invokes bpf_map_update_elem() ->
sock_map_update_elem()). For sock_map_update_elem(), no: we can't sleep there.
On 3/15/26 4:58 PM, Michal Luczaj wrote:
>> Beside, from looking at the may_update_sockmap(), I don't know if it is
>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
>> when the bpf_map_update_elem(sockmap) support was added iirc.
>
> What about a situation when unix_sk is stored in a sockmap, then tc prog
> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
> useful, but seems doable.
[ Sorry for the late reply ]
It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
from tc :(
Then unix_state_lock() in its current form cannot be safely acquired in
sock_map_update_elem(). It is currently a spin_lock() instead of
spin_lock_bh().
>
>> The only path left is bpf_iter, which I believe was the primary use case
>> when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
>> to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
>> where lock_sock() has already been done. It is more for
>> reading-correctness though. This just came to my mind.
>> has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
>> been using it to conditionally take lock_sock() or not.
>
> [ One clarification: bh_lock_sock() is a sock_map_update_elem() thing,
> which can only be called by a bpf prog. IOW, has_current_bpf_ctx() is
> always `true` in sock_map_update_elem(), right? ]
For all the bpf prog types allowed by may_update_sockmap() to do
bpf_map_update_elem(sockmap), only BPF_TRACE_ITER should have
has_current_bpf_ctx() == true. The tc prog (and others allowed in
may_update_sockmap()) will have has_current_bpf_ctx() == false when
calling sock_map_update_elem().
The tc case of bpf_map_update_elem(unix_sk) is unfortunate and requires
going back to the drawing board. I think checking unix_peer(sk) for NULL
without acquiring unix_state_lock() is needed for the
sock_map_update_elem() path, since changing unix_state_lock() for this
unknown use case seems overkill.
Whether sock_map_update_elem_"sys"() needs unix_state_lock() is up for
debate.
For bpf_iter_unix_seq_show(), one thought is to add unix_state_lock()
there before running the bpf iter prog. iiuc, it is what Kuniyuki has in
mind also to allow bpf iter prog having a stable view of unix_sock. This
could be a followup.
[fwiw, it was why I first thought of has_current_bpf_ctx() to avoid
sock_map_update_elem() taking unix_state_lock() again if
bpf_iter_unix_seq_show() acquires unix_state_lock() earlier. I later
concluded (but proved to be incorrect) that tc cannot call
bpf_map_update_elem(unix_sk).]
>
> Let me know if I'm correctly rephrasing your idea: assume all bpf-context
> callers hold the socket locked or keep it "stable" (meaning: "sk won't
> surprise sockmap update by some breaking state change coming from another
> thread"). As you said, most bpf iters already take the sock_lock(), and I
Right, all bpf iters (udp, tcp, unix) have acquired lock_sock() before
running the bpf iter prog. The only exception afaik is the netlink bpf
iter, but it cannot be added to a sock_map.
> have a patch that fixes sock_{map,hash}_seq_show(). Then we could try
> dropping that bh_lock_sock().
>
>> [ I would still keep patch 3 though. ]
>
> Right.
>
>> [1]: https://lore.kernel.org/bpf/20200821102948.21918-1-lmb@cloudflare.com/
>>
>>>
>>> In a parallel thread I've asked Kuniyuki if it might be better to
>>> completely drop patch 2/5, which would change how we interact with
>>> sock_map_close(). Lets see how it goes.
>>>
>>
>> If patch 2 is dropped, lock_sock() is always needed for unix_sk?
>
> For sock_map_update_elem_sys() I wanted to lock_sock()+unix_state_lock()
> following Kuniyuki's suggestion to keep locking pattern/order (that repeats
> when unix bpf iter prog invokes bpf_map_update_elem() ->
> sock_map_update_elem()). For sock_map_update_elem() not, we can't sleep there.
On 3/26/26 07:26, Martin KaFai Lau wrote:
> On 3/15/26 4:58 PM, Michal Luczaj wrote:
>>> Beside, from looking at the may_update_sockmap(), I don't know if it is
>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
>>> when the bpf_map_update_elem(sockmap) support was added iirc.
>>
>> What about a situation when unix_sk is stored in a sockmap, then tc prog
>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
>> useful, but seems doable.
>
> [ Sorry for the late reply ]
>
> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
> from tc :(
>
> Then unix_state_lock() in its current form cannot be safely acquired in
> sock_map_update_elem(). It is currently a spin_lock() instead of
> spin_lock_bh().
Is there a specific deadlock you have in your mind?
>>> The only path left is bpf_iter, which I believe was the primary use case
>>> when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
>>> to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
>>> where lock_sock() has already been done. It is more for
>>> reading-correctness though. This just came to my mind.
>>> has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
>>> been using it to conditionally take lock_sock() or not.
>>
>> [ One clarification: bh_lock_sock() is a sock_map_update_elem() thing,
>> which can only be called by a bpf prog. IOW, has_current_bpf_ctx() is
>> always `true` in sock_map_update_elem(), right? ]
>
> For all the bpf prog types allowed by may_update_sockmap() to do
> bpf_map_update_elem(sockmap), only BPF_TRACE_ITER should have
> has_current_bpf_ctx() == true. The tc prog (and others allowed in
> may_update_sockmap()) will have has_current_bpf_ctx() == false when
> calling sock_map_update_elem().
OK, so let's take test_sockmap_update.c:copy_sock_map(). It is a tc prog
and it calls bpf_map_update_elem() -> sock_map_update_elem(), right? But
running `test_progs -t "sockmap_basic/sockmap update"` shows (pr_warn() in
sock_map_update_elem()) that has_current_bpf_ctx() == true. That's expected
and has_current_bpf_ctx() would be false if sock_map_update_elem() was run
via a hook?
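For context, the conditional-locking pattern mentioned above is roughly
the following sketch, paraphrased from memory of sockopt_lock_sock() in
net/core/sock.c (not a verbatim kernel excerpt):

```c
/* Sketch of the sockopt_lock_sock() pattern: skip lock_sock() when a
 * BPF program is on the stack, since those callers (e.g. a bpf_iter
 * prog) are expected to already hold the socket lock.
 */
static void sockopt_lock_sock(struct sock *sk)
{
	if (has_current_bpf_ctx())
		return;		/* BPF caller already owns the lock */

	lock_sock(sk);
}
```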
> The tc case of bpf_map_update_elem(unix_sk) is unfortunate and requires
> going back to the drawing board. I think checking unix_peer(sk) for NULL
> without acquiring unix_state_lock() is needed for the
> sock_map_update_elem() path, since changing unix_state_lock() for this
> unknown use case seems overkill.
>
> Whether sock_map_update_elem_"sys"() needs unix_state_lock() is up for
> debate.
All right, I'll re-spin the series reverting back to v1.
> For bpf_iter_unix_seq_show(), one thought is to add unix_state_lock()
> there before running the bpf iter prog. iiuc, it is what Kuniyuki has in
> mind also to allow bpf iter prog having a stable view of unix_sock. This
> could be a followup.
> [fwiw, it was why I first thought of has_current_bpf_ctx() to avoid
> sock_map_update_elem() taking unix_state_lock() again if
> bpf_iter_unix_seq_show() acquires unix_state_lock() earlier. I later
> concluded (but proved to be incorrect) that tc cannot call
> bpf_map_update_elem(unix_sk).]
>
>>
>> Let me know if I'm correctly rephrasing your idea: assume all bpf-context
>> callers hold the socket locked or keep it "stable" (meaning: "sk won't
>> surprise sockmap update by some breaking state change coming from another
>> thread"). As you said, most bpf iters already take the sock_lock(), and I
>
> Right, all bpf iter (udp, tcp, unix) has acquired the lock_sock() before
> running the bpf iter prog. afaik, the only exception is netlink bpf iter
> but it cannot be added to sock_map afaik.
And sock_{map,hash}_seq_show() (being a part of bpf iter machinery) needs
to take lock_sock() just as well? Would that require a special-casing
(unix_state_lock()) for af_unix?
>> have a patch that fixes sock_{map,hash}_seq_show(). Then we could try
>> dropping that bh_lock_sock().
>>
>>> [ I would still keep patch 3 though. ]
>>
>> Right.
>>
>>> [1]: https://lore.kernel.org/bpf/20200821102948.21918-1-lmb@cloudflare.com/
>>>
>>>>
>>>> In a parallel thread I've asked Kuniyuki if it might be better to
>>>> completely drop patch 2/5, which would change how we interact with
>>>> sock_map_close(). Lets see how it goes.
>>>>
>>>
>>> If patch 2 is dropped, lock_sock() is always needed for unix_sk?
>>
>> For sock_map_update_elem_sys() I wanted to lock_sock()+unix_state_lock()
>> following Kuniyuki's suggestion to keep locking pattern/order (that repeats
>> when unix bpf iter prog invokes bpf_map_update_elem() ->
>> sock_map_update_elem()). For sock_map_update_elem() not, we can't sleep there.
>
On 3/30/26 4:03 PM, Michal Luczaj wrote:
> On 3/26/26 07:26, Martin KaFai Lau wrote:
>> On 3/15/26 4:58 PM, Michal Luczaj wrote:
>>>> Beside, from looking at the may_update_sockmap(), I don't know if it is
>>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
>>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
>>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
>>>> when the bpf_map_update_elem(sockmap) support was added iirc.
>>>
>>> What about a situation when unix_sk is stored in a sockmap, then tc prog
>>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
>>> useful, but seems doable.
>>
>> [ Sorry for the late reply ]
>>
>> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
>> from tc :(
>>
>> Then unix_state_lock() in its current form cannot be safely acquired in
>> sock_map_update_elem(). It is currently a spin_lock() instead of
>> spin_lock_bh().
>
> Is there a specific deadlock you have in your mind?
e.g. unix_stream_connect() is taking unix_state_lock(). Can a tc's
ingress bpf prog call unix_state_lock()?
>
>>>> The only path left is bpf_iter, which I believe was the primary use case
>>>> when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
>>>> to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
>>>> where lock_sock() has already been done. It is more for
>>>> reading-correctness though. This just came to my mind.
>>>> has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
>>>> been using it to conditionally take lock_sock() or not.
>>>
>>> [ One clarification: bh_lock_sock() is a sock_map_update_elem() thing,
>>> which can only be called by a bpf prog. IOW, has_current_bpf_ctx() is
>>> always `true` in sock_map_update_elem(), right? ]
>>
>> For all the bpf prog types allowed by may_update_sockmap() to do
>> bpf_map_update_elem(sockmap), only BPF_TRACE_ITER should have
>> has_current_bpf_ctx() == true. The tc prog (and others allowed in
>> may_update_sockmap()) will have has_current_bpf_ctx() == false when
>> calling sock_map_update_elem().
>
> OK, so let's take test_sockmap_update.c:copy_sock_map(). It is a tc prog
> and it calls bpf_map_update_elem() -> sock_map_update_elem(), right? But
> running `test_progs -t "sockmap_basic/sockmap update"` shows (pr_warn() in
> sock_map_update_elem()) that has_current_bpf_ctx() == true. That's expected
I think it is because of the bpf_prog_test_run_skb() code path used by
the test_sockmap_update() test. This would need to be addressed if
has_current_bpf_ctx() was used in sock_map_update_elem().
> and has_current_bpf_ctx() would be false if sock_map_update_elem() was ran
> via a hook?
It should be false when the bpf prog is run by tc instead of
bpf_prog_test_run_skb().
>>> Let me know if I'm correctly rephrasing your idea: assume all bpf-context
>>> callers hold the socket locked or keep it "stable" (meaning: "sk won't
>>> surprise sockmap update by some breaking state change coming from another
>>> thread"). As you said, most bpf iters already take the sock_lock(), and I
>>
>> Right, all bpf iter (udp, tcp, unix) has acquired the lock_sock() before
>> running the bpf iter prog. afaik, the only exception is netlink bpf iter
>> but it cannot be added to sock_map afaik.
>
> And sock_{map,hash}_seq_show() (being a part of bpf iter machinery) needs
> to take lock_sock() just as well? Would that require a special-casing
> (unix_state_lock()) for af_unix?
I would think so for lock_sock(), considering that the current
bh_lock_sock() usage without a !sock_owned_by_user() check is incorrect
in sock_map_update_elem(). [ this probably should be a separate issue
for another patch ]
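To illustrate the bh_lock_sock()/sock_owned_by_user() point: a rough
sketch of the usual receive-path pattern, not code taken from
sock_map_update_elem() (do_update() is a placeholder):

```c
/* bh_lock_sock() only takes sk->sk_lock.slock. A process-context
 * caller that did lock_sock() still "owns" the socket and may be
 * mid-update, so the spinlock alone does not serialize with it.
 * The classic pattern (cf. TCP's receive path) checks ownership:
 */
bh_lock_sock(sk);
if (!sock_owned_by_user(sk)) {
	/* no process-context owner: safe to touch socket state */
	do_update(sk);		/* placeholder */
} else {
	/* owner is active: defer the work or bail out */
}
bh_unlock_sock(sk);
```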
Some more side-tracking... from looking at the code, the bpf_iter of
sock_{map,hash} can do bpf_map_lookup_elem(&sock_map, ...). This
bpf_iter program will probably fail to load because bpf_sk_release()
is not available.
I still don't have a good idea what to do with a tc prog calling
sock_map_update_elem().
On 3/31/26 02:20, Martin KaFai Lau wrote:
> On 3/30/26 4:03 PM, Michal Luczaj wrote:
>> On 3/26/26 07:26, Martin KaFai Lau wrote:
>>> On 3/15/26 4:58 PM, Michal Luczaj wrote:
>>>>> Beside, from looking at the may_update_sockmap(), I don't know if it is
>>>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
>>>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
>>>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
>>>>> when the bpf_map_update_elem(sockmap) support was added iirc.
>>>>
>>>> What about a situation when unix_sk is stored in a sockmap, then tc prog
>>>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
>>>> useful, but seems doable.
>>>
>>> [ Sorry for the late reply ]
>>>
>>> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
>>> from tc :(
>>>
>>> Then unix_state_lock() in its current form cannot be safely acquired in
>>> sock_map_update_elem(). It is currently a spin_lock() instead of
>>> spin_lock_bh().
>>
>> Is there a specific deadlock you have in your mind?
>
> e.g. unix_stream_connect() is taking unix_state_lock(). Can a tc's
> ingress bpf prog call unix_state_lock()?
Ah, right, that's the problem, thanks for explaining.
But, as I've asked in the parallel thread, do we really need to take the
unix_state_lock() in sock_map_update_elem()? Taking it in
sock_map_update_elem_sys() fixes the null-ptr-deref and does not lead to a
deadlock. Taking unix_state_lock() in sock_map_update_elem() seems
unnecessary. Well, at least under the assumption progs can only access
unix_sk via the sockmap lookup.
>> ...
>> And sock_{map,hash}_seq_show() (being a part of bpf iter machinery) needs
>> to take lock_sock() just as well? Would that require a special-casing
>> (unix_state_lock()) for af_unix?
>
> I would think so for lock_sock() considering the current bh_lock_sock
> without !sock_owned_by_user() usage is incorrect in
> sock_map_update_elem(). [ this probably should be a separate issue for
> another patch ]
All right, leaving that for later.
> Some more side-tracking... from looking at the code, the bpf_iter of
> sock_{map,hash} can do bpf_map_lookup_elem(&sock_map, ...). This
> bpr_iter program probably will be failed to load because the
> bpf_sk_release() is not available.
I think the ability to bpf_map_lookup_elem(sockmap) was added way before bpf
iter in
https://lore.kernel.org/bpf/20200429181154.479310-2-jakub@cloudflare.com/
and iter "allows" it somewhat accidentally. Would you rather have it
explicitly dropped?
> I still don't have good idea what to do with the tc's prog calling
> sock_map_update_elem().
On Wed, 01 Apr 2026 00:43:58 +0200, Michal Luczaj wrote:
> On 3/31/26 02:20, Martin KaFai Lau wrote:
> > On 3/30/26 4:03 PM, Michal Luczaj wrote:
> >> On 3/26/26 07:26, Martin KaFai Lau wrote:
> >>> On 3/15/26 4:58 PM, Michal Luczaj wrote:
> >>>>> Beside, from looking at the may_update_sockmap(), I don't know if it is
> >>>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
> >>>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
> >>>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
> >>>>> when the bpf_map_update_elem(sockmap) support was added iirc.
> >>>>
> >>>> What about a situation when unix_sk is stored in a sockmap, then tc prog
> >>>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
> >>>> useful, but seems doable.
> >>>
> >>> [ Sorry for the late reply ]
> >>>
> >>> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
> >>> from tc :(
> >>>
> >>> Then unix_state_lock() in its current form cannot be safely acquired in
> >>> sock_map_update_elem(). It is currently a spin_lock() instead of
> >>> spin_lock_bh().
> >>
> >> Is there a specific deadlock you have in your mind?
> >
> > e.g. unix_stream_connect() is taking unix_state_lock(). Can a tc's
> > ingress bpf prog call unix_state_lock()?
>
> Ah, right, that's the problem, thanks for explaining.
>
> But, as I've asked in the parallel thread, do we really need to take the
> unix_state_lock() in sock_map_update_elem()? Taking it in
> sock_map_update_elem_sys() fixes the null-ptr-deref and does not lead to a
> deadlock. Taking unix_state_lock() in sock_map_update_elem() seems
> unnecessary. Well, at least under the assumption progs can only access
> unix_sk via the sockmap lookup.
right, sock_map_update_elem_sys() should be safe to take
unix_state_lock().
If it is fixed by testing unix_peer(), is the TCPF_ESTABLISHED test
in sock_map_sk_state_allowed() still useful and needed? Also,
please explain in detail in the commit message why testing for NULL
without unix_state_lock() is enough. For example, for the BPF iterator on
sock_map, my understanding is that unix_release_sock() can still happen
while the BPF iterator is iterating over a unix_sock. I guess a future
unix_state_lock() in the iterator's seq_show() should be useful.
It will also be useful to mention what was discovered about TC + lookup
+ update_elem(&sock_map, ...) and why it is not safe to take
unix_state_lock() in that path. Thanks.
>
> >> ...
> >> And sock_{map,hash}_seq_show() (being a part of bpf iter machinery) needs
> >> to take lock_sock() just as well? Would that require a special-casing
> >> (unix_state_lock()) for af_unix?
> >
> > I would think so for lock_sock() considering the current bh_lock_sock
> > without !sock_owned_by_user() usage is incorrect in
> > sock_map_update_elem(). [ this probably should be a separate issue for
> > another patch ]
>
> All right, leaving that for later.
>
> > Some more side-tracking... from looking at the code, the bpf_iter of
> > sock_{map,hash} can do bpf_map_lookup_elem(&sock_map, ...). This
> > bpr_iter program probably will be failed to load because the
> > bpf_sk_release() is not available.
>
> I think ability to bpf_map_lookup_elem(sockmap) was added way before bpf
> iter in
> https://lore.kernel.org/bpf/20200429181154.479310-2-jakub@cloudflare.com/
> and iter "allows" it somewhat accidentally. Would you rather have it
> explicitly dropped?
It is fine to leave it as is. It would be a bit nicer to reject map_lookup
on sockmap instead of giving the confusing error that bpf_sk_release
is not available. That is for another day. I was just wondering what else
could go wrong with map_lookup.
On Mon, Mar 30, 2026 at 4:04 PM Michal Luczaj <mhal@rbox.co> wrote:
>
> On 3/26/26 07:26, Martin KaFai Lau wrote:
> > On 3/15/26 4:58 PM, Michal Luczaj wrote:
> >>> Beside, from looking at the may_update_sockmap(), I don't know if it is
> >>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
> >>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
> >>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
> >>> when the bpf_map_update_elem(sockmap) support was added iirc.
> >>
> >> What about a situation when unix_sk is stored in a sockmap, then tc prog
> >> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
> >> useful, but seems doable.
> >
> > [ Sorry for the late reply ]
> >
> > It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
> > from tc :(
> >
> > Then unix_state_lock() in its current form cannot be safely acquired in
> > sock_map_update_elem(). It is currently a spin_lock() instead of
> > spin_lock_bh().
>
> Is there a specific deadlock you have in your mind?
lockdep would complain if we used the same lock from
different contexts.
e.g.)
Process context holding unix_state_lock() with the normal spin_lock()
-> BH interrupt
-> tc prog trying to hold the same lock with spin_lock(). (_bh())
-> deadlock
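Concretely, unix_state_lock() is a plain spin_lock() wrapper (roughly as
in include/net/af_unix.h), so the interleaving above is an AA deadlock on
a single CPU:

```c
/* Roughly as defined in include/net/af_unix.h: */
#define unix_state_lock(s)	spin_lock(&unix_sk(s)->lock)

/* The deadlock, annotated:
 *
 *   unix_state_lock(sk);      // process context, BHs still enabled
 *       <softirq fires on the same CPU>
 *       tc prog -> bpf_map_update_elem() -> sock_map_update_elem()
 *           unix_state_lock(sk);  // spins forever: lock held above
 *
 * Avoiding this would require spin_lock_bh() in every
 * unix_state_lock() user, i.e. touching the af_unix hot path.
 */
```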
>
> >>> The only path left is bpf_iter, which I believe was the primary use case
> >>> when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
> >>> to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
> >>> where lock_sock() has already been done. It is more for
> >>> reading-correctness though. This just came to my mind.
> >>> has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
> >>> been using it to conditionally take lock_sock() or not.
> >>
> >> [ One clarification: bh_lock_sock() is a sock_map_update_elem() thing,
> >> which can only be called by a bpf prog. IOW, has_current_bpf_ctx() is
> >> always `true` in sock_map_update_elem(), right? ]
> >
> > For all the bpf prog types allowed by may_update_sockmap() to do
> > bpf_map_update_elem(sockmap), only BPF_TRACE_ITER should have
> > has_current_bpf_ctx() == true. The tc prog (and others allowed in
> > may_update_sockmap()) will have has_current_bpf_ctx() == false when
> > calling sock_map_update_elem().
>
> OK, so let's take test_sockmap_update.c:copy_sock_map(). It is a tc prog
> and it calls bpf_map_update_elem() -> sock_map_update_elem(), right? But
> running `test_progs -t "sockmap_basic/sockmap update"` shows (pr_warn() in
> sock_map_update_elem()) that has_current_bpf_ctx() == true. That's expected
> and has_current_bpf_ctx() would be false if sock_map_update_elem() was ran
> via a hook?
>
> > The tc case of bpf_map_update_elem(unix_sk) is unfortunate and requires
> > going back to the drawing board. I think checking unix_peer(sk) for NULL
> > without acquiring unix_state_lock() is needed for the
> > sock_map_update_elem() path, since changing unix_state_lock() for this
> > unknown use case seems overkill.
> >
> > Whether sock_map_update_elem_"sys"() needs unix_state_lock() is up for
> > debate.
>
> All right, I'll re-spin the series reverting back to v1.
Sounds good.
>
> > For bpf_iter_unix_seq_show(), one thought is to add unix_state_lock()
> > there before running the bpf iter prog. iiuc, it is what Kuniyuki has in
> > mind also to allow bpf iter prog having a stable view of unix_sock. This
> > could be a followup.
> > [fwiw, it was why I first thought of has_current_bpf_ctx() to avoid
> > sock_map_update_elem() taking unix_state_lock() again if
> > bpf_iter_unix_seq_show() acquires unix_state_lock() earlier. I later
> > concluded (but proved to be incorrect) that tc cannot call
> > bpf_map_update_elem(unix_sk).]
> >
> >>
> >> Let me know if I'm correctly rephrasing your idea: assume all bpf-context
> >> callers hold the socket locked or keep it "stable" (meaning: "sk won't
> >> surprise sockmap update by some breaking state change coming from another
> >> thread"). As you said, most bpf iters already take the sock_lock(), and I
> >
> > Right, all bpf iter (udp, tcp, unix) has acquired the lock_sock() before
> > running the bpf iter prog. afaik, the only exception is netlink bpf iter
> > but it cannot be added to sock_map afaik.
>
> And sock_{map,hash}_seq_show() (being a part of bpf iter machinery) needs
> to take lock_sock() just as well? Would that require a special-casing
> (unix_state_lock()) for af_unix?
Right, lock_sock() + unix_state_lock() + SOCK_DEAD check
should be best.
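That combination could look like the following sketch for a unix
seq_show() path (placement and the bpf_iter_run_prog() call site are
assumptions; lock_sock() first keeps the ordering consistent with the
sockmap update path):

```c
/* Hypothetical bpf_iter unix seq_show() locking, per the suggestion
 * above. Skips sockets already torn down by unix_release_sock().
 */
lock_sock(sk);			/* sleepable iterator context */
unix_state_lock(sk);		/* stable unix_sock view for the prog */

if (!sock_flag(sk, SOCK_DEAD))
	ret = bpf_iter_run_prog(prog, &ctx);	/* call site assumed */

unix_state_unlock(sk);
release_sock(sk);
```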
>
> >> have a patch that fixes sock_{map,hash}_seq_show(). Then we could try
> >> dropping that bh_lock_sock().
> >>
> >>> [ I would still keep patch 3 though. ]
> >>
> >> Right.
> >>
> >>> [1]: https://lore.kernel.org/bpf/20200821102948.21918-1-lmb@cloudflare.com/
> >>>
> >>>>
> >>>> In a parallel thread I've asked Kuniyuki if it might be better to
> >>>> completely drop patch 2/5, which would change how we interact with
> >>>> sock_map_close(). Lets see how it goes.
> >>>>
> >>>
> >>> If patch 2 is dropped, lock_sock() is always needed for unix_sk?
> >>
> >> For sock_map_update_elem_sys() I wanted to lock_sock()+unix_state_lock()
> >> following Kuniyuki's suggestion to keep locking pattern/order (that repeats
> >> when unix bpf iter prog invokes bpf_map_update_elem() ->
> >> sock_map_update_elem()). For sock_map_update_elem() not, we can't sleep there.
> >
>
On 3/31/26 01:27, Kuniyuki Iwashima wrote:
> On Mon, Mar 30, 2026 at 4:04 PM Michal Luczaj <mhal@rbox.co> wrote:
>> On 3/26/26 07:26, Martin KaFai Lau wrote:
>>> On 3/15/26 4:58 PM, Michal Luczaj wrote:
>>>>> Beside, from looking at the may_update_sockmap(), I don't know if it is
>>>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
>>>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
>>>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
>>>>> when the bpf_map_update_elem(sockmap) support was added iirc.
>>>>
>>>> What about a situation when unix_sk is stored in a sockmap, then tc prog
>>>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
>>>> useful, but seems doable.
>>>
>>> [ Sorry for the late reply ]
>>>
>>> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
>>> from tc :(
>>>
>>> Then unix_state_lock() in its current form cannot be safely acquired in
>>> sock_map_update_elem(). It is currently a spin_lock() instead of
>>> spin_lock_bh().
>>
>> Is there a specific deadlock you have in your mind?
>
> lockdep would complain if we used the same lock from
> different contexts.
>
> e.g.)
> Process context holding unix_state_lock() with the normal spin_lock()
> -> BH interrupt
> -> tc prog trying to hold the same lock with spin_lock(). (_bh())
> -> deadlock
OK, I get it, thanks.
So here's one more idea: the null-ptr-deref issue is connect() racing
against sock_map_update_elem_*SYS*() coming from user-space, not the
can-be-called-from-BH sock_map_update_elem() variant. So can't we assume
that for any sock_map_update_elem(unix_sk) invoked by a tc prog, unix_sk
will always be "stable", i.e. in a state that cannot lead to that
null-ptr-deref?
IOW, if for a tc prog the only way to get hold of unix_sk is look it up in
a sockmap, then (by the fact that unix_sk _is_ in the sockmap) unix_sk will
be already safe to use by sock_map_update_elem() without taking the af_unix
state lock.
Long story short: take the unix state lock in sock_map_update_elem_sys(),
don't bother in sock_map_update_elem()?
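A hypothetical sketch of that split, loosely modeled on
net/core/sock_map.c (simplified: fd/key handling collapsed into u32s,
and the AF_UNIX branch is invented for illustration, not actual kernel
code):

```c
/* Only the syscall path takes the af_unix state lock; the in-BPF
 * update path relies on the socket already being in the map and
 * therefore "stable".
 */
static int sock_map_update_elem_sys(struct bpf_map *map, void *key,
				    void *value, u64 flags)
{
	u32 ufd = *(u32 *)value;
	struct socket *sock;
	struct sock *sk;
	int ret;

	sock = sockfd_lookup(ufd, &ret);
	if (!sock)
		return ret;
	sk = sock->sk;

	lock_sock(sk);			/* process context: may sleep */
	if (sk->sk_family == AF_UNIX)
		unix_state_lock(sk);	/* vs unix_stream_connect() */

	ret = sock_map_update_common(map, *(u32 *)key, sk, flags);

	if (sk->sk_family == AF_UNIX)
		unix_state_unlock(sk);
	release_sock(sk);
	sockfd_put(sock);
	return ret;
}
```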
>> ...
>> And sock_{map,hash}_seq_show() (being a part of bpf iter machinery) needs
>> to take lock_sock() just as well? Would that require a special-casing
>> (unix_state_lock()) for af_unix?
>
> Right, lock_sock() + unix_state_lock() + SOCK_DEAD check
> should be best.
OK, got it.
On Tue, Mar 31, 2026 at 3:44 PM Michal Luczaj <mhal@rbox.co> wrote:
>
> On 3/31/26 01:27, Kuniyuki Iwashima wrote:
> > On Mon, Mar 30, 2026 at 4:04 PM Michal Luczaj <mhal@rbox.co> wrote:
> >> On 3/26/26 07:26, Martin KaFai Lau wrote:
> >>> On 3/15/26 4:58 PM, Michal Luczaj wrote:
> >>>>> Beside, from looking at the may_update_sockmap(), I don't know if it is
> >>>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
> >>>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
> >>>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
> >>>>> when the bpf_map_update_elem(sockmap) support was added iirc.
> >>>>
> >>>> What about a situation when unix_sk is stored in a sockmap, then tc prog
> >>>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
> >>>> useful, but seems doable.
> >>>
> >>> [ Sorry for the late reply ]
> >>>
> >>> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
> >>> from tc :(
> >>>
> >>> Then unix_state_lock() in its current form cannot be safely acquired in
> >>> sock_map_update_elem(). It is currently a spin_lock() instead of
> >>> spin_lock_bh().
> >>
> >> Is there a specific deadlock you have in your mind?
> >
> > lockdep would complain if we used the same lock from
> > different contexts.
> >
> > e.g.)
> > Process context holding unix_state_lock() with the normal spin_lock()
> > -> BH interrupt
> > -> tc prog trying to hold the same lock with spin_lock(). (_bh())
> > -> deadlock
>
> OK, I get it, thanks.
>
> So here's one more idea: the null-ptr-deref issue is connect() racing
> against sock_map_update_elem_*SYS*() coming from user-space, not the
> can-be-called-from-BH sock_map_update_elem() variant. So can't we assume
> that for any sock_map_update_elem(unix_sk) invoked by a tc prog, unix_sk
> will always be "stable", i.e. in a state that cannot lead to that
> null-ptr-deref?
>
> IOW, if for a tc prog the only way to get hold of unix_sk is look it up in
> a sockmap, then (by the fact that unix_sk _is_ in the sockmap) unix_sk will
> be already safe to use by sock_map_update_elem() without taking the af_unix
> state lock.
>
> Long story short: take the unix state lock in sock_map_update_elem_sys(),
> don't bother in sock_map_update_elem()?
but it will prevent bpf iter from taking unix_state_lock().
I don't see a good reason to introduce a new locking rule by unnecessarily
wrapping the entire sockmap update with unix_state_lock() even though
the root bug can be avoided by a simple null check in a leaf function.
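The leaf-function check referred to here would sit in
unix_stream_bpf_update_proto(); a fragment sketch (the -ESTALE choice
and surrounding names are assumptions, this is not the final patch):

```c
/* Inside unix_stream_bpf_update_proto(), before using the peer:
 * unix_stream_connect() publishes TCP_ESTABLISHED before assigning
 * unix_peer(sk), so sockmap can observe an established socket with a
 * NULL peer. Bail out instead of dereferencing it.
 */
sk_pair = unix_peer(sk);
if (!sk_pair)
	return -ESTALE;	/* raced with connect(); errno is a guess */

sock_hold(sk_pair);
psock->sk_pair = sk_pair;
```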
>
> >> ...
> >> And sock_{map,hash}_seq_show() (being a part of bpf iter machinery) needs
> >> to take lock_sock() just as well? Would that require a special-casing
> >> (unix_state_lock()) for af_unix?
> >
> > Right, lock_sock() + unix_state_lock() + SOCK_DEAD check
> > should be best.
>
> OK, got it.
>
On 4/1/26 01:18, Kuniyuki Iwashima wrote:
> On Tue, Mar 31, 2026 at 3:44 PM Michal Luczaj <mhal@rbox.co> wrote:
>>
>> On 3/31/26 01:27, Kuniyuki Iwashima wrote:
>>> On Mon, Mar 30, 2026 at 4:04 PM Michal Luczaj <mhal@rbox.co> wrote:
>>>> On 3/26/26 07:26, Martin KaFai Lau wrote:
>>>>> On 3/15/26 4:58 PM, Michal Luczaj wrote:
>>>>>>> Beside, from looking at the may_update_sockmap(), I don't know if it is
>>>>>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
>>>>>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
>>>>>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
>>>>>>> when the bpf_map_update_elem(sockmap) support was added iirc.
>>>>>>
>>>>>> What about a situation when unix_sk is stored in a sockmap, then tc prog
>>>>>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
>>>>>> useful, but seems doable.
>>>>>
>>>>> [ Sorry for the late reply ]
>>>>>
>>>>> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
>>>>> from tc :(
>>>>>
>>>>> Then unix_state_lock() in its current form cannot be safely acquired in
>>>>> sock_map_update_elem(). It is currently a spin_lock() instead of
>>>>> spin_lock_bh().
>>>>
>>>> Is there a specific deadlock you have in your mind?
>>>
>>> lockdep would complain if we used the same lock from
>>> different contexts.
>>>
>>> e.g.)
>>> Process context holding unix_state_lock() with the normal spin_lock()
>>> -> BH interrupt
>>> -> tc prog trying to hold the same lock with spin_lock(). (_bh())
>>> -> deadlock
>>
>> OK, I get it, thanks.
>>
>> So here's one more idea: the null-ptr-deref issue is connect() racing
>> against sock_map_update_elem_*SYS*() coming from user-space, not the
>> can-be-called-from-BH sock_map_update_elem() variant. So can't we assume
>> that for any sock_map_update_elem(unix_sk) invoked by a tc prog, unix_sk
>> will always be "stable", i.e. in a state that cannot lead to that
>> null-ptr-deref?
>>
>> IOW, if for a tc prog the only way to get hold of unix_sk is look it up in
>> a sockmap, then (by the fact that unix_sk _is_ in the sockmap) unix_sk will
>> be already safe to use by sock_map_update_elem() without taking the af_unix
>> state lock.
>>
>> Long story short: take the unix state lock in sock_map_update_elem_sys(),
>> don't bother in sock_map_update_elem()?
>
> but it will prevents bpf iter from taking unix_state_lock().

How come? bpf iter can take unix_state_lock() and the prog is free to
invoke sock_map_update_elem(). It's the _sys variant that would be taking
unix_state_lock() by itself. Or is there some other locking complexity I'm
missing?

> I don't see a good reason to introduce a new locking rule by unnecessarily
> wrapping the entire sockmap update with unix_state_lock() even though
> the root bug can be avoided by a simple null check in a leaf function.

Yeah, I get that point. I just assumed the root bug was indeed sockmap
update missing a lock.