BPF_MAP_UPDATE_ELEM races unix_stream_connect(): when
sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED),
unix_peer(sk) in unix_stream_bpf_update_proto() may still return NULL.
T0 bpf                           T1 connect
------                           ----------
                                 WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
...
sk_pair = unix_peer(sk)
sock_hold(sk_pair)
                                 sock_hold(newsk)
                                 smp_mb__after_atomic()
                                 unix_peer(sk) = newsk
BUG: kernel NULL pointer dereference, address: 0000000000000080
RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
Call Trace:
sock_map_link+0x564/0x8b0
sock_map_update_common+0x6e/0x340
sock_map_update_elem_sys+0x17d/0x240
__sys_bpf+0x26db/0x3250
__x64_sys_bpf+0x21/0x30
do_syscall_64+0x6b/0x3a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Follow-up to discussion at
https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/.
Fixes: 8866730aed51 ("bpf, sockmap: af_unix stream sockets need to hold ref for pair sock")
Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
Re-triggered while working on an unrelated selftest:
https://lore.kernel.org/bpf/20260123-selftest-signal-on-connect-v1-0-b0256e7025b6@rbox.co/
---
net/unix/unix_bpf.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index e0d30d6d22ac..57f3124c9d8d 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -185,6 +185,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
  */
 	if (!psock->sk_pair) {
 		sk_pair = unix_peer(sk);
+		if (unlikely(!sk_pair))
+			return -EINVAL;
+
 		sock_hold(sk_pair);
 		psock->sk_pair = sk_pair;
 	}
---
base-commit: 63804fed149a6750ffd28610c5c1c98cce6bd377
change-id: 20260129-unix-proto-update-null-ptr-deref-6a2733bcbbf8
Best regards,
--
Michal Luczaj <mhal@rbox.co>
On 1/29/26 8:47 AM, Michal Luczaj wrote:
> BPF_MAP_UPDATE_ELEM races unix_stream_connect(): when
> sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED),
> unix_peer(sk) in unix_stream_bpf_update_proto() may still return NULL.
>
> T0 bpf                           T1 connect
> ------                           ----------
>
>                                  WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
> sock_map_sk_state_allowed(sk)
> ...
> sk_pair = unix_peer(sk)
> sock_hold(sk_pair)
>                                  sock_hold(newsk)
>                                  smp_mb__after_atomic()
>                                  unix_peer(sk) = newsk
>
> BUG: kernel NULL pointer dereference, address: 0000000000000080
> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
> Call Trace:
> sock_map_link+0x564/0x8b0
> sock_map_update_common+0x6e/0x340
> sock_map_update_elem_sys+0x17d/0x240
> __sys_bpf+0x26db/0x3250
> __x64_sys_bpf+0x21/0x30
> do_syscall_64+0x6b/0x3a0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Follow-up to discussion at
> https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/.
It is a long thread to dig. Please summarize the discussion in the
commit message.
From looking at this commit message, if the existing lock_sock held by
update_elem is not useful for af_unix, it is not clear why a new test
"!sk_pair" on top of the existing WRITE_ONCE(sk->sk_state...) is a fix.
A minor thing is sock_map_sk_state_allowed doesn't have
READ_ONCE(sk->sk_state) for sk_is_stream_unix also.
If unix_stream_connect does not hold lock_sock, can unix_state_lock be
used here? lock_sock has already been taken, update_elem should not be
the hot path.
>
> Fixes: 8866730aed51 ("bpf, sockmap: af_unix stream sockets need to hold ref for pair sock")
> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
> Re-triggered while working on an unrelated selftest:
> https://lore.kernel.org/bpf/20260123-selftest-signal-on-connect-v1-0-b0256e7025b6@rbox.co/
> ---
> net/unix/unix_bpf.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
> index e0d30d6d22ac..57f3124c9d8d 100644
> --- a/net/unix/unix_bpf.c
> +++ b/net/unix/unix_bpf.c
> @@ -185,6 +185,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
> */
> if (!psock->sk_pair) {
> sk_pair = unix_peer(sk);
> + if (unlikely(!sk_pair))
> + return -EINVAL;
> +
> sock_hold(sk_pair);
> psock->sk_pair = sk_pair;
> }
>
> ---
> base-commit: 63804fed149a6750ffd28610c5c1c98cce6bd377
> change-id: 20260129-unix-proto-update-null-ptr-deref-6a2733bcbbf8
>
> Best regards,
On 1/29/26 20:41, Martin KaFai Lau wrote:
> On 1/29/26 8:47 AM, Michal Luczaj wrote:
>> BPF_MAP_UPDATE_ELEM races unix_stream_connect(): when
>> sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED),
>> unix_peer(sk) in unix_stream_bpf_update_proto() may still return NULL.
>>
>> T0 bpf                           T1 connect
>> ------                           ----------
>>
>>                                  WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
>> sock_map_sk_state_allowed(sk)
>> ...
>> sk_pair = unix_peer(sk)
>> sock_hold(sk_pair)
>>                                  sock_hold(newsk)
>>                                  smp_mb__after_atomic()
>>                                  unix_peer(sk) = newsk
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000080
>> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
>> Call Trace:
>> sock_map_link+0x564/0x8b0
>> sock_map_update_common+0x6e/0x340
>> sock_map_update_elem_sys+0x17d/0x240
>> __sys_bpf+0x26db/0x3250
>> __x64_sys_bpf+0x21/0x30
>> do_syscall_64+0x6b/0x3a0
>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> Follow-up to discussion at
>> https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/.
>
> It is a long thread to dig. Please summarize the discussion in the
> commit message.
OK, there we go:
The root cause of the null-ptr-deref is that unix_stream_connect() sets
sk_state (`WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)`) _before_ it assigns
a peer (`unix_peer(sk) = newsk`). sk_state == TCP_ESTABLISHED makes
sock_map_sk_state_allowed() believe that socket is properly set up, which
would include having a defined peer.
In other words, there's a window when you can call
unix_stream_bpf_update_proto() on socket which still has unix_peer(sk) == NULL.
My initial idea was to simply move peer assignment _before_ the sk_state
update, but the maintainer wasn't interested in changing the
unix_stream_connect() hot path. He suggested taking care of it in the
sockmap code.
My understanding is that users are not supposed to put sockets in a sockmap
when said socket is only half-way through connect() call. Hence `return
-EINVAL` on a missing peer. Now, if users should be allowed to legally race
connect() vs. sockmap update, then I guess we can wait for connect() to
"finalize" e.g. by taking the unix_state_lock(), as discussed below.
> From looking at this commit message, if the existing lock_sock held by
> update_elem is not useful for af_unix,
Right, the existing lock_sock is not useful. update's lock_sock holds
sock::sk_lock, while unix_state_lock() holds unix_sock::lock.
> it is not clear why a new test
> "!sk_pair" on top of the existing WRITE_ONCE(sk->sk_state...) is a fix.
"On top"? Just to make sure we're looking at the same thing: above I was
trying to show two parallel flows with unix_peer() fetch in thread-0 and
WRITE_ONCE(sk->sk_state...) and `unix_peer(sk) = newsk` in thread-1.
It fixes the problem because now update_proto won't call sock_hold(NULL).
> A minor thing is sock_map_sk_state_allowed doesn't have
> READ_ONCE(sk->sk_state) for sk_is_stream_unix also.
Ok, I'll add this as a separate patch in v2. Along with the !tcp case of
sock_map_redirect_allowed()?
> If unix_stream_connect does not hold lock_sock, can unix_state_lock be
> used here? lock_sock has already been taken, update_elem should not be
> the hot path.
Yes, it can be used, it was proposed in the old thread. In fact, critical
section can be empty; only used to wait for unix_stream_connect() to
release the lock, which would guarantee unix_peer(sk) != NULL by then.
	if (!psock->sk_pair) {
+		unix_state_lock(sk);
+		unix_state_unlock(sk);
		sk_pair = unix_peer(sk);
		sock_hold(sk_pair);
>> Fixes: 8866730aed51 ("bpf, sockmap: af_unix stream sockets need to hold ref for pair sock")
>> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>> ---
>> Re-triggered while working on an unrelated selftest:
>> https://lore.kernel.org/bpf/20260123-selftest-signal-on-connect-v1-0-b0256e7025b6@rbox.co/
>> ---
>> net/unix/unix_bpf.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
>> index e0d30d6d22ac..57f3124c9d8d 100644
>> --- a/net/unix/unix_bpf.c
>> +++ b/net/unix/unix_bpf.c
>> @@ -185,6 +185,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
>> */
>> if (!psock->sk_pair) {
>> sk_pair = unix_peer(sk);
>> + if (unlikely(!sk_pair))
>> + return -EINVAL;
>> +
>> sock_hold(sk_pair);
>> psock->sk_pair = sk_pair;
>> }
>>
>> ---
>> base-commit: 63804fed149a6750ffd28610c5c1c98cce6bd377
>> change-id: 20260129-unix-proto-update-null-ptr-deref-6a2733bcbbf8
>>
>> Best regards,
>
On 1/30/26 3:00 AM, Michal Luczaj wrote:
>>> Follow-up to discussion at
>>> https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/.
>>
>> It is a long thread to dig. Please summarize the discussion in the
>> commit message.
>
> OK, there we go:
>
> The root cause of the null-ptr-deref is that unix_stream_connect() sets
> sk_state (`WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)`) _before_ it assigns
> a peer (`unix_peer(sk) = newsk`). sk_state == TCP_ESTABLISHED makes
> sock_map_sk_state_allowed() believe that socket is properly set up, which
> would include having a defined peer.
>
> In other words, there's a window when you can call
> unix_stream_bpf_update_proto() on socket which still has unix_peer(sk) == NULL.
>
> My initial idea was to simply move peer assignment _before_ the sk_state
> update, but the maintainer wasn't interested in changing the
> unix_stream_connect() hot path. He suggested taking care of it in the
> sockmap code.
>
> My understanding is that users are not supposed to put sockets in a sockmap
> when said socket is only half-way through connect() call. Hence `return
> -EINVAL` on a missing peer. Now, if users should be allowed to legally race
> connect() vs. sockmap update, then I guess we can wait for connect() to
> "finalize" e.g. by taking the unix_state_lock(), as discussed below.
>
>> From looking at this commit message, if the existing lock_sock held by
>> update_elem is not useful for af_unix,
>
> Right, the existing lock_sock is not useful. update's lock_sock holds
> sock::sk_lock, while unix_state_lock() holds unix_sock::lock.
It sounds like lock_sock is the incorrect lock to hold for af_unix. Is
taking lock_sock in sock_map doing anything useful for af_unix? Should
sock_map hold the unix_state_lock instead of lock_sock?
Other than update_elem, do other lock_sock() usages in sock_map have a
similar issue for af_unix?
>
>> it is not clear why a new test
>> "!sk_pair" on top of the existing WRITE_ONCE(sk->sk_state...) is a fix.
>
> "On top"? Just to make sure we're looking at the same thing: above I was
> trying to show two parallel flows with unix_peer() fetch in thread-0 and
> WRITE_ONCE(sk->sk_state...) and `unix_peer(sk) = newsk` in thread-1.
>
> It fixes the problem because now update_proto won't call sock_hold(NULL).
>
>> A minor thing is sock_map_sk_state_allowed doesn't have
>> READ_ONCE(sk->sk_state) for sk_is_stream_unix also.
>
> Ok, I'll add this as a separate patch in v2. Along with the !tcp case of
> sock_map_redirect_allowed()?
sgtm. thanks.
>
>> If unix_stream_connect does not hold lock_sock, can unix_state_lock be
>> used here? lock_sock has already been taken, update_elem should not be
>> the hot path.
>
> Yes, it can be used, it was proposed in the old thread. In fact, critical
> section can be empty; only used to wait for unix_stream_connect() to
> release the lock, which would guarantee unix_peer(sk) != NULL by then.
>
> if (!psock->sk_pair) {
> + unix_state_lock(sk);
> + unix_state_unlock(sk);
> sk_pair = unix_peer(sk);
> sock_hold(sk_pair);
I don't have a strong opinion on waiting or checking NULL. imo, both are
not easy to understand. One is sk_state had already been checked earlier
under a lock_sock but still needs to check NULL on unix_peer(). Another
one is an empty unix_state_[un]lock(). If taking unix_state_lock, may as
well just use the existing unix_peer_get(sk). If its return value cannot
(?) be NULL, WARN_ON_ONCE() instead of having a special empty
lock/unlock pattern here. If the correct lock (unix_state_lock) was held
earlier in update_elem, all these would go away.
Also, it is not immediately clear why a non-NULL unix_peer(sk) is safe
here. From looking around af_unix.c, is it because the sk refcnt is held
earlier in update_elem? For unix_stream, unix_peer(sk) will stay valid
until unix_release_sock(sk). Am I reading it correctly?
On Fri, Jan 30, 2026 at 1:30 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/30/26 3:00 AM, Michal Luczaj wrote:
> >>> Follow-up to discussion at
> >>> https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/.
> >>
> >> It is a long thread to dig. Please summarize the discussion in the
> >> commit message.
> >
> > OK, there we go:
> >
> > The root cause of the null-ptr-deref is that unix_stream_connect() sets
> > sk_state (`WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)`) _before_ it assigns
> > a peer (`unix_peer(sk) = newsk`). sk_state == TCP_ESTABLISHED makes
> > sock_map_sk_state_allowed() believe that socket is properly set up, which
> > would include having a defined peer.
> >
> > In other words, there's a window when you can call
> > unix_stream_bpf_update_proto() on socket which still has unix_peer(sk) == NULL.
> >
> > My initial idea was to simply move peer assignment _before_ the sk_state
> > update, but the maintainer wasn't interested in changing the
> > unix_stream_connect() hot path. He suggested taking care of it in the
> > sockmap code.
Yes, we already have a memory barrier for unix_peer(sk) there
(to save sock_hold()/sock_put() in sendmsg(), see 830a1e5c212fb)
and another one just for sk->sk_state is not worth the unlikely
case in sockmap by a buggy user.
> >
> > My understanding is that users are not supposed to put sockets in a sockmap
> > when said socket is only half-way through connect() call. Hence `return
> > -EINVAL` on a missing peer. Now, if users should be allowed to legally race
> > connect() vs. sockmap update, then I guess we can wait for connect() to
> > "finalize" e.g. by taking the unix_state_lock(), as discussed below.
If a user hit the issue, the user must have updated sockmap while the
user knows connect() had not returned. Such a user must prepare
for failures since it could occur before sock_map_sk_state_allowed() too.
This is a subtle timing issue and I don't think the kernel should be
friendly to such buggy users by waiting for connect() etc.
> >
> >> From looking at this commit message, if the existing lock_sock held by
> >> update_elem is not useful for af_unix,
> >
> > Right, the existing lock_sock is not useful. update's lock_sock holds
> > sock::sk_lock, while unix_state_lock() holds unix_sock::lock.
>
> It sounds like lock_sock is the incorrect lock to hold for af_unix. Is
> taking lock_sock in sock_map doing anything useful for af_unix? Should
> sock_map hold the unix_state_lock instead of lock_sock?
If sockmap code does not sleep, unix_state_lock can be used there.
>
> Other than update_elem, do other lock_sock() usages in sock_map have a
> similar issue for af_unix?
>
> >
> >> it is not clear why a new test
> >> "!sk_pair" on top of the existing WRITE_ONCE(sk->sk_state...) is a fix.
> >
> > "On top"? Just to make sure we're looking at the same thing: above I was
> > trying to show two parallel flows with unix_peer() fetch in thread-0 and
> > WRITE_ONCE(sk->sk_state...) and `unix_peer(sk) = newsk` in thread-1.
> >
> > It fixes the problem because now update_proto won't call sock_hold(NULL).
> >
> >> A minor thing is sock_map_sk_state_allowed doesn't have
> >> READ_ONCE(sk->sk_state) for sk_is_stream_unix also.
> >
> > Ok, I'll add this as a separate patch in v2. Along with the !tcp case of
> > sock_map_redirect_allowed()?
>
> sgtm. thanks.
>
> >
> >> If unix_stream_connect does not hold lock_sock, can unix_state_lock be
> >> used here? lock_sock has already been taken, update_elem should not be
> >> the hot path.
> >
> > Yes, it can be used, it was proposed in the old thread. In fact, critical
> > section can be empty; only used to wait for unix_stream_connect() to
> > release the lock, which would guarantee unix_peer(sk) != NULL by then.
> >
> > if (!psock->sk_pair) {
> > + unix_state_lock(sk);
> > + unix_state_unlock(sk);
I don't like this... we had a similar one in recvmsg(MSG_PEEK) path
for GC with a biiiiiig comment, which I removed in 118f457da9ed .
> > sk_pair = unix_peer(sk);
> > sock_hold(sk_pair);
>
> I don't have a strong opinion on waiting or checking NULL. imo, both are
> not easy to understand. One is sk_state had already been checked earlier
> under a lock_sock but still needs to check NULL on unix_peer(). Another
> one is an empty unix_state_[un]lock(). If taking unix_state_lock, may as
> well just use the existing unix_peer_get(sk).
Yes, unix_peer_get() can be safely used there (with extra sock_put()).
> If its return value cannot
> (?) be NULL, WARN_ON_ONCE() instead of having a special empty
I suggested WARN_ON_ONCE() because Michal reproduced it with
mdelay() and I did not think it could occur in real life, but considering
PREEMPT_RT, it could be real. So, the current form in this patch looks
good to me.
> lock/unlock pattern here. If the correct lock (unix_state_lock) was held
> earlier in update_elem, all these would go away.
>
> Also, it is not immediately clear why a non-NULL unix_peer(sk) is safe
> here. From looking around af_unix.c, is it because the sk refcnt is held
> earlier in update_elem? For unix_stream, unix_peer(sk) will stay valid
> until unix_release_sock(sk). Am I reading it correctly?
unix_stream_connect() holds the peer's refcnt, and once unix_peer(sk)
is set, it and refcnt are not cleared until close()d. So unix_peer_get() is
a bit redundant for sane users.
On 1/31/26 2:06 AM, Kuniyuki Iwashima wrote:
> On Fri, Jan 30, 2026 at 1:30 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 1/30/26 3:00 AM, Michal Luczaj wrote:
>>>>> Follow-up to discussion at
>>>>> https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/.
>>>>
>>>> It is a long thread to dig. Please summarize the discussion in the
>>>> commit message.
>>>
>>> OK, there we go:
>>>
>>> The root cause of the null-ptr-deref is that unix_stream_connect() sets
>>> sk_state (`WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)`) _before_ it assigns
>>> a peer (`unix_peer(sk) = newsk`). sk_state == TCP_ESTABLISHED makes
>>> sock_map_sk_state_allowed() believe that socket is properly set up, which
>>> would include having a defined peer.
>>>
>>> In other words, there's a window when you can call
>>> unix_stream_bpf_update_proto() on socket which still has unix_peer(sk) == NULL.
>>>
>>> My initial idea was to simply move peer assignment _before_ the sk_state
>>> update, but the maintainer wasn't interested in changing the
>>> unix_stream_connect() hot path. He suggested taking care of it in the
>>> sockmap code.
>
> Yes, we already have a memory barrier for unix_peer(sk) there
> (to save sock_hold()/sock_put() in sendmsg(), see 830a1e5c212fb)
> and another one just for sk->sk_state is not worth the unlikely
> case in sockmap by a buggy user.
>
>
>>>
>>> My understanding is that users are not supposed to put sockets in a sockmap
>>> when said socket is only half-way through connect() call. Hence `return
>>> -EINVAL` on a missing peer. Now, if users should be allowed to legally race
>>> connect() vs. sockmap update, then I guess we can wait for connect() to
>>> "finalize" e.g. by taking the unix_state_lock(), as discussed below.
>
> If a user hit the issue, the user must have updated sockmap while the
> user knows connect() had not returned. Such a user must prepare
> for failures since it could occur before sock_map_sk_state_allowed() too.
>
> This is a subtle timing issue and I don't think the kernel should be
> friendly to such buggy users by waiting for connect() etc.
I don't have a use case for parallel connect and map update either.
Also, I have no concern about returning -EINVAL in map_update for the
not-yet-completed unix_stream_connect().
However, TCP/UDP (and probably vsock also?) do not have this racing
issue because sock_map follows the lock usage in TCP/UDP as in other
parts of the kernel. Why is AF_UNIX an exception where unix_state_lock()
is not used in sock_map?
John and Jakub Sitnicki, could this be an oversight?
>>>
>>>> From looking at this commit message, if the existing lock_sock held by
>>>> update_elem is not useful for af_unix,
>>>
>>> Right, the existing lock_sock is not useful. update's lock_sock holds
>>> sock::sk_lock, while unix_state_lock() holds unix_sock::lock.
>>
>> It sounds like lock_sock is the incorrect lock to hold for af_unix. Is
>> taking lock_sock in sock_map doing anything useful for af_unix? Should
>> sock_map hold the unix_state_lock instead of lock_sock?
>
> If sockmap code does not sleep, unix_state_lock can be used there.
afaik, bpf prog using the sockmap cannot sleep. Search for "if
(prog->sleepable)" in "check_map_prog_compatibility()". The bpf prog
attached to sock_map cannot sleep either, there is "can_be_sleepable()"
check in the verifier.
>>>
>>> if (!psock->sk_pair) {
>>> + unix_state_lock(sk);
>>> + unix_state_unlock(sk);
>
> I don't like this... we had a similar one in recvmsg(MSG_PEEK) path
> for GC with a biiiiiig comment, which I removed in 118f457da9ed .
Me neither, for both empty critical section and big fat comment. This
usually suggests something else is incorrect to begin with. I believe
the wrong lock is taken in this case. Unless there is something
prohibiting taking unix_state_lock in sock_map, fix it properly instead.
>
>
>>> sk_pair = unix_peer(sk);
>>> sock_hold(sk_pair);
>>
>> I don't have a strong opinion on waiting or checking NULL. imo, both are
>> not easy to understand. One is sk_state had already been checked earlier
>> under a lock_sock but still needs to check NULL on unix_peer(). Another
>> one is an empty unix_state_[un]lock(). If taking unix_state_lock, may as
>> well just use the existing unix_peer_get(sk).
>
> Yes, unix_peer_get() can be safely used there (with extra sock_put()).
The sock_map needs to sock_hold(sk_pair) anyway and stores it in
psock->sk_pair, so I don't think it needs an extra sock_put().
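Something like this, keeping the -EINVAL from this patch (rough sketch,
untested):

	if (!psock->sk_pair) {
		/* unix_peer_get() takes unix_state_lock(sk), so it also
		 * waits for a concurrent unix_stream_connect() to finish,
		 * and it returns the peer with a reference already held.
		 */
		sk_pair = unix_peer_get(sk);
		if (unlikely(!sk_pair))
			return -EINVAL;

		/* Hand the acquired reference straight to the psock. */
		psock->sk_pair = sk_pair;
	}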
>
>
>> If its return value cannot
>> (?) be NULL, WARN_ON_ONCE() instead of having a special empty
>
> I suggested WARN_ON_ONCE() because Michal reproduced it with
> mdelay() and I did not think it could occur in real life, but considering
> PREEMPT_RT, it could be real. So, the current form in this patch looks
> good to me.
hmm... If unix_peer_get(sk) is used and it takes (and waits for) the
unix_state_lock, it shouldn't be NULL? The above empty [un]lock idea
does not check for NULL on unix_peer() either. Or am I missing something?
Regardless, if the proper lock is held, all this complication and
reasoning will go away.
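To be concrete, I am thinking along these lines for the update_elem path
(rough sketch only; sk_is_unix() and the exact placement are my
assumptions, and whether everything under the spinlock is non-sleeping
still needs to be audited):

	/* Where update_elem currently does lock_sock(sk): for af_unix,
	 * take the lock that unix_stream_connect() actually serializes
	 * against, so a half-connected socket can never be observed.
	 */
	if (sk_is_unix(sk))
		unix_state_lock(sk);
	else
		lock_sock(sk);

	ret = sock_map_update_common(map, idx, sk, flags);

	if (sk_is_unix(sk))
		unix_state_unlock(sk);
	else
		release_sock(sk);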
On 2/2/26 20:15, Martin KaFai Lau wrote:
> Regardless, if the proper lock is held, all this complication and
> reasoning will go away.

Here's my attempt:
https://lore.kernel.org/bpf/20260207-unix-proto-update-null-ptr-deref-v2-0-9f091330e7cd@rbox.co/
On 1/31/26 11:06, Kuniyuki Iwashima wrote:
> On Fri, Jan 30, 2026 at 1:30 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 1/30/26 3:00 AM, Michal Luczaj wrote:
>>>>> Follow-up to discussion at
>>>>> https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/.
>>>>
>>>> It is a long thread to dig. Please summarize the discussion in the
>>>> commit message.
>>>
>>> OK, there we go:
>>>
>>> The root cause of the null-ptr-deref is that unix_stream_connect() sets
>>> sk_state (`WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)`) _before_ it assigns
>>> a peer (`unix_peer(sk) = newsk`). sk_state == TCP_ESTABLISHED makes
>>> sock_map_sk_state_allowed() believe that socket is properly set up, which
>>> would include having a defined peer.
>>>
>>> In other words, there's a window when you can call
>>> unix_stream_bpf_update_proto() on socket which still has unix_peer(sk) == NULL.
>>>
>>> My initial idea was to simply move peer assignment _before_ the sk_state
>>> update, but the maintainer wasn't interested in changing the
>>> unix_stream_connect() hot path. He suggested taking care of it in the
>>> sockmap code.
>
> Yes, we already have a memory barrier for unix_peer(sk) there
> (to save sock_hold()/sock_put() in sendmsg(), see 830a1e5c212fb)
> and another one just for sk->sk_state is not worth the unlikely
> case in sockmap by a buggy user.
>
>
>>>
>>> My understanding is that users are not supposed to put sockets in a sockmap
>>> when said socket is only half-way through connect() call. Hence `return
>>> -EINVAL` on a missing peer. Now, if users should be allowed to legally race
>>> connect() vs. sockmap update, then I guess we can wait for connect() to
>>> "finalize" e.g. by taking the unix_state_lock(), as discussed below.
>
> If a user hit the issue, the user must have updated sockmap while the
> user knows connect() had not returned. Such a user must prepare
> for failures since it could occur before sock_map_sk_state_allowed() too.
>
> This is a subtle timing issue and I don't think the kernel should be
> friendly to such buggy users by waiting for connect() etc.
>
>
>>>
>>>> From looking at this commit message, if the existing lock_sock held by
>>>> update_elem is not useful for af_unix,
>>>
>>> Right, the existing lock_sock is not useful. update's lock_sock holds
>>> sock::sk_lock, while unix_state_lock() holds unix_sock::lock.
>>
>> It sounds like lock_sock is the incorrect lock to hold for af_unix. Is
>> taking lock_sock in sock_map doing anything useful for af_unix? Should
>> sock_map hold the unix_state_lock instead of lock_sock?
>
> If sockmap code does not sleep, unix_state_lock can be used there.
>
>
>>
>> Other than update_elem, do other lock_sock() usages in sock_map have a
>> similar issue for af_unix?
As for the sockmap, I think that would be it.
In related news, looks like bpf_iter_unix_seq_show() is missing
unix_state_lock(): lock_sock_fast() won't stop unix_release_sock(). E.g.
bpf iterator can grab unix_sock::peer as it is being released.
>>>> it is not clear why a new test
>>>> "!sk_pair" on top of the existing WRITE_ONCE(sk->sk_state...) is a fix.
>>>
>>> "On top"? Just to make sure we're looking at the same thing: above I was
>>> trying to show two parallel flows with unix_peer() fetch in thread-0 and
>>> WRITE_ONCE(sk->sk_state...) and `unix_peer(sk) = newsk` in thread-1.
>>>
>>> It fixes the problem because now update_proto won't call sock_hold(NULL).
>>>
>>>> A minor thing is sock_map_sk_state_allowed doesn't have
>>>> READ_ONCE(sk->sk_state) for sk_is_stream_unix also.
>>>
>>> Ok, I'll add this as a separate patch in v2. Along with the !tcp case of
>>> sock_map_redirect_allowed()?
>>
>> sgtm. thanks.
>>
>>>
>>>> If unix_stream_connect does not hold lock_sock, can unix_state_lock be
>>>> used here? lock_sock has already been taken, update_elem should not be
>>>> the hot path.
>>>
>>> Yes, it can be used, it was proposed in the old thread. In fact, critical
>>> section can be empty; only used to wait for unix_stream_connect() to
>>> release the lock, which would guarantee unix_peer(sk) != NULL by then.
>>>
>>> if (!psock->sk_pair) {
>>> + unix_state_lock(sk);
>>> + unix_state_unlock(sk);
>
> I don't like this... we had a similar one in recvmsg(MSG_PEEK) path
> for GC with a biiiiiig comment, which I removed in 118f457da9ed .
>
>
>>> sk_pair = unix_peer(sk);
>>> sock_hold(sk_pair);
>>
>> I don't have a strong opinion on waiting or checking NULL. imo, both are
>> not easy to understand. One is sk_state had already been checked earlier
>> under a lock_sock but still needs to check NULL on unix_peer(). Another
>> one is an empty unix_state_[un]lock(). If taking unix_state_lock, may as
>> well just use the existing unix_peer_get(sk).
>
> Yes, unix_peer_get() can be safely used there (with extra sock_put()).
>
>
>> If its return value cannot
>> (?) be NULL, WARN_ON_ONCE() instead of having a special empty
>
> I suggested WARN_ON_ONCE() because Michal reproduced it with
> mdelay() and I did not think it could occur in real life, but considering
> PREEMPT_RT, it could be real. So, the current form in this patch looks
> good to me.
FWIW, it reproduces on CONFIG_PREEMPT_RT=n without mdelay().
Please see
https://lore.kernel.org/bpf/20260123-selftest-signal-on-connect-v1-0-b0256e7025b6@rbox.co/
>> lock/unlock pattern here. If the correct lock (unix_state_lock) was held
>> earlier in update_elem, all these would go away.
>>
>> Also, it is not immediately clear why a non-NULL unix_peer(sk) is safe
>> here. From looking around af_unix.c, is it because the sk refcnt is held
>> earlier in update_elem? For unix_stream, unix_peer(sk) will stay valid
>> until unix_release_sock(sk). Am I reading it correctly?
>
> unix_stream_connect() holds the peer's refcnt, and once unix_peer(sk)
> is set, it and refcnt are not cleared until close()d. So unix_peer_get() is
> a bit redundant for sane users.
On 2/2/26 7:10 AM, Michal Luczaj wrote:
>>> Other than update_elem, do other lock_sock() usages in sock_map have a
>>> similar issue for af_unix?
> As for the sockmap, I think that would be it.

Thanks for checking.

> In related news, looks like bpf_iter_unix_seq_show() is missing
> unix_state_lock(): lock_sock_fast() won't stop unix_release_sock(). E.g.
> bpf iterator can grab unix_sock::peer as it is being released.

If the concern is the bpf iterator prog may use a released unix_peer(sk)
pointer, it should be fine. The unix_peer(sk) pointer is not a trusted
pointer to the bpf prog, so nothing bad will happen other than
potentially reading incorrect values.

However, yeah, the bpf_iter_(tcp|udp)_seq_show is better in the sense
that the correct lock is used. For tcp_sock that has many stats, I think
it will be particularly useful to read them in a consistent state. I
don't have a strong opinion on af_unix. Unlike the sock_map where the
lock_sock is not useful for af_unix. The bpf iterator can do
bpf_setsockopt, so a lock_sock_fast() is still needed in
bpf_iter_unix_seq_show and I think it is the reason lock_sock_fast() is
used here.
On 2/3/26 04:53, Martin KaFai Lau wrote:
> On 2/2/26 7:10 AM, Michal Luczaj wrote:
>> In related news, looks like bpf_iter_unix_seq_show() is missing
>> unix_state_lock(): lock_sock_fast() won't stop unix_release_sock(). E.g.
>> bpf iterator can grab unix_sock::peer as it is being released.
>
> If the concern is the bpf iterator prog may use a released unix_peer(sk)
> pointer, it should be fine. The unix_peer(sk) pointer is not a trusted
> pointer to the bpf prog, so nothing bad will happen other than
> potentially reading incorrect values.

But if the prog passes a released peer pointer to a bpf helper:

BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0x95/0xb0
Read of size 1 at addr ffff888110654c92 by task test_progs/1936

Call Trace:
 dump_stack_lvl+0x5d/0x80
 print_report+0x170/0x4f3
 kasan_report+0xe1/0x180
 bpf_skc_to_unix_sock+0x95/0xb0
 bpf_prog_382743e45576abd5_dump_unix+0x4a0/0x576
 bpf_iter_run_prog+0x5b9/0xb00
 bpf_iter_unix_seq_show+0x1f7/0x2e0
 bpf_seq_read+0x42c/0x10d0
 vfs_read+0x171/0xb20
 ksys_read+0xff/0x200
 do_syscall_64+0x6b/0x3a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Allocated by task 1941:
 kasan_save_stack+0x30/0x50
 kasan_save_track+0x14/0x30
 __kasan_slab_alloc+0x63/0x80
 kmem_cache_alloc_noprof+0x1f7/0x6f0
 sk_prot_alloc+0x59/0x210
 sk_alloc+0x34/0x470
 unix_create1+0x86/0x8a0
 unix_create+0xd4/0x2d0
 __sock_create+0x243/0x5a0
 __sys_socket+0x119/0x1d0
 __x64_sys_socket+0x72/0xd0
 do_syscall_64+0x6b/0x3a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Freed by task 1941:
 kasan_save_stack+0x30/0x50
 kasan_save_track+0x14/0x30
 kasan_save_free_info+0x3b/0x70
 __kasan_slab_free+0x47/0x70
 kmem_cache_free+0x12c/0x5d0
 __sk_destruct+0x432/0x6e0
 unix_release_sock+0x9b3/0xf60
 unix_release_sock+0x99f/0xf60
 unix_release+0x8a/0xf0
 __sock_release+0xb0/0x270
 sock_close+0x18/0x20
 __fput+0x36e/0xac0
 fput_close_sync+0xe5/0x1a0
 __x64_sys_close+0x7d/0xd0
 do_syscall_64+0x6b/0x3a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
From: Michal Luczaj <mhal@rbox.co>
Date: Tue, 3 Feb 2026 10:57:46 +0100
> On 2/3/26 04:53, Martin KaFai Lau wrote:
> > On 2/2/26 7:10 AM, Michal Luczaj wrote:
> >> In related news, looks like bpf_iter_unix_seq_show() is missing
> >> unix_state_lock(): lock_sock_fast() won't stop unix_release_sock(). E.g.
> >> bpf iterator can grab unix_sock::peer as it is being released.
> >
> > If the concern is the bpf iterator prog may use a released unix_peer(sk)
> > pointer, it should be fine. The unix_peer(sk) pointer is not a trusted
> > pointer to the bpf prog, so nothing bad will happen other than
> > potentially reading incorrect values.
>
> But if the prog passes a released peer pointer to a bpf helper:
>
> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0x95/0xb0
> Read of size 1 at addr ffff888110654c92 by task test_progs/1936
Can you cook a patch for this ? probably like below
---8<---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 02ebad6afac7..9c7e9fbde362 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -3740,8 +3740,9 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
return 0;
slow = lock_sock_fast(sk);
+ unix_state_lock(sk);
- if (unlikely(sk_unhashed(sk))) {
+ if (unlikely(sock_flag(sk, SOCK_DEAD))) {
ret = SEQ_SKIP;
goto unlock;
}
@@ -3751,6 +3752,7 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
prog = bpf_iter_get_info(&meta, false);
ret = unix_prog_seq_show(prog, &meta, v, uid);
unlock:
+ unix_state_unlock(sk);
unlock_sock_fast(sk, slow);
return ret;
}
---8<---
Thanks!
On 2/3/26 11:47 AM, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Tue, 3 Feb 2026 10:57:46 +0100
>> On 2/3/26 04:53, Martin KaFai Lau wrote:
>>> On 2/2/26 7:10 AM, Michal Luczaj wrote:
>>>> In related news, looks like bpf_iter_unix_seq_show() is missing
>>>> unix_state_lock(): lock_sock_fast() won't stop unix_release_sock(). E.g.
>>>> bpf iterator can grab unix_sock::peer as it is being released.
>>>
>>> If the concern is the bpf iterator prog may use a released unix_peer(sk)
>>> pointer, it should be fine. The unix_peer(sk) pointer is not a trusted
>>> pointer to the bpf prog, so nothing bad will happen other than
>>> potentially reading incorrect values.
>>
>> But if the prog passes a released peer pointer to a bpf helper:
>>
>> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0x95/0xb0
>> Read of size 1 at addr ffff888110654c92 by task test_progs/1936
hmm... bpf_skc_to_unix_sock is exposed to tracing. bpf_iter is a tracing
bpf prog.
>
> Can you cook a patch for this ? probably like below
This can help the bpf_iter but not the other tracing prog such as fentry.
On Tue, Feb 3, 2026 at 11:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/3/26 11:47 AM, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Tue, 3 Feb 2026 10:57:46 +0100
> >> On 2/3/26 04:53, Martin KaFai Lau wrote:
> >>> On 2/2/26 7:10 AM, Michal Luczaj wrote:
> >>>> In related news, looks like bpf_iter_unix_seq_show() is missing
> >>>> unix_state_lock(): lock_sock_fast() won't stop unix_release_sock(). E.g.
> >>>> bpf iterator can grab unix_sock::peer as it is being released.
> >>>
> >>> If the concern is the bpf iterator prog may use a released unix_peer(sk)
> >>> pointer, it should be fine. The unix_peer(sk) pointer is not a trusted
> >>> pointer to the bpf prog, so nothing bad will happen other than
> >>> potentially reading incorrect values.
> >>
> >> But if the prog passes a released peer pointer to a bpf helper:
> >>
> >> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0x95/0xb0
> >> Read of size 1 at addr ffff888110654c92 by task test_progs/1936
>
> hmm... bpf_skc_to_unix_sock is exposed to tracing. bpf_iter is a tracing
> bpf prog.
>
> >
> > Can you cook a patch for this ? probably like below
>
> This can help the bpf_iter but not the other tracing prog such as fentry.
Oh well ... then bpf_skc_to_unix_sock() can be used even
with SEQ_START_TOKEN at fentry of bpf_iter_unix_seq_show() ??
How about adding notrace to all af_unix bpf iterator functions ?
The procfs iterator holds a spinlock of the hashtable from
->start/next() to ->stop() to prevent the race with unix_release_sock().
I think other (non-iterator) functions cannot do such racy
access with tracing prog.
On 2/4/26 08:58, Kuniyuki Iwashima wrote:
> On Tue, Feb 3, 2026 at 11:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 2/3/26 11:47 AM, Kuniyuki Iwashima wrote:
>>> From: Michal Luczaj <mhal@rbox.co>
>>> Date: Tue, 3 Feb 2026 10:57:46 +0100
>>>> On 2/3/26 04:53, Martin KaFai Lau wrote:
>>>>> On 2/2/26 7:10 AM, Michal Luczaj wrote:
>>>>>> In related news, looks like bpf_iter_unix_seq_show() is missing
>>>>>> unix_state_lock(): lock_sock_fast() won't stop unix_release_sock(). E.g.
>>>>>> bpf iterator can grab unix_sock::peer as it is being released.
>>>>>
>>>>> If the concern is the bpf iterator prog may use a released unix_peer(sk)
>>>>> pointer, it should be fine. The unix_peer(sk) pointer is not a trusted
>>>>> pointer to the bpf prog, so nothing bad will happen other than
>>>>> potentially reading incorrect values.
>>>>
>>>> But if the prog passes a released peer pointer to a bpf helper:
>>>>
>>>> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0x95/0xb0
>>>> Read of size 1 at addr ffff888110654c92 by task test_progs/1936
>>
>> hmm... bpf_skc_to_unix_sock is exposed to tracing. bpf_iter is a tracing
>> bpf prog.
>>
>>>
>>> Can you cook a patch for this ? probably like below
>>
>> This can help the bpf_iter but not the other tracing prog such as fentry.
>
> Oh well ... then bpf_skc_to_unix_sock() can be used even
> with SEQ_START_TOKEN at fentry of bpf_iter_unix_seq_show() ??
>
> How about adding notrace to all af_unix bpf iterator functions ?
>
> The procfs iterator holds a spinlock of the hashtable from
> ->start/next() to ->stop() to prevent the race with unix_release_sock().
>
> I think other (non-iterator) functions cannot do such racy
> access with tracing prog.
But then there's SOCK_DGRAM where you can drop unix_peer(sk) without
releasing sk; see AF_UNSPEC in unix_dgram_connect(). I think Martin is
right, we can crash at many fentries.
BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0xa4/0xb0
Read of size 2 at addr ffff888147d38890 by task test_progs/2495
Call Trace:
dump_stack_lvl+0x5d/0x80
print_report+0x170/0x4f3
kasan_report+0xe1/0x180
bpf_skc_to_unix_sock+0xa4/0xb0
bpf_prog_564a1c39c35d86a2_unix_shutdown_entry+0x8a/0x8e
bpf_trampoline_6442564662+0x47/0xab
unix_shutdown+0x9/0x880
__sys_shutdown+0xe1/0x160
__x64_sys_shutdown+0x52/0x90
do_syscall_64+0x6b/0x3a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
On 2/4/26 7:41 AM, Michal Luczaj wrote:
>>>>>> If the concern is the bpf iterator prog may use a released unix_peer(sk)
>>>>>> pointer, it should be fine. The unix_peer(sk) pointer is not a trusted
>>>>>> pointer to the bpf prog, so nothing bad will happen other than
>>>>>> potentially reading incorrect values.

I misremembered that following unix->peer would be marked as
(PTR_TO_BTF_ID | PTR_UNTRUSTED). I forgot there is some legacy support
for PTR_TO_BTF_ID (i.e. without the PTR_UNTRUSTED marking).

>>>>>
>>>>> But if the prog passes a released peer pointer to a bpf helper:
>>>>>
>>>>> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0x95/0xb0
>>>>> Read of size 1 at addr ffff888110654c92 by task test_progs/1936
>>>
>>> hmm... bpf_skc_to_unix_sock is exposed to tracing. bpf_iter is a tracing
>>> bpf prog.
>>>
>>>>
>>>> Can you cook a patch for this ? probably like below
>>>
>>> This can help the bpf_iter but not the other tracing prog such as fentry.
>>
>> Oh well ... then bpf_skc_to_unix_sock() can be used even
>> with SEQ_START_TOKEN at fentry of bpf_iter_unix_seq_show() ??

It is fine. The type is void.

>>
>> How about adding notrace to all af_unix bpf iterator functions ?

But right, other functions taking a [unix_]sock pointer could be
audited. I don't know af_unix well enough to assess the blast radius or
whether some useful functions may become untraceable.

>>
>> The procfs iterator holds a spinlock of the hashtable from
>> ->start/next() to ->stop() to prevent the race with unix_release_sock().
>>
>> I think other (non-iterator) functions cannot do such racy
>> access with tracing prog.
>
> But then there's SOCK_DGRAM where you can drop unix_peer(sk) without
> releasing sk; see AF_UNSPEC in unix_dgram_connect(). I think Martin is
> right, we can crash at many fentries.
>
> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0xa4/0xb0
> Read of size 2 at addr ffff888147d38890 by task test_progs/2495
> Call Trace:
>  dump_stack_lvl+0x5d/0x80
>  print_report+0x170/0x4f3
>  kasan_report+0xe1/0x180
>  bpf_skc_to_unix_sock+0xa4/0xb0
>  bpf_prog_564a1c39c35d86a2_unix_shutdown_entry+0x8a/0x8e
>  bpf_trampoline_6442564662+0x47/0xab
>  unix_shutdown+0x9/0x880
>  __sys_shutdown+0xe1/0x160
>  __x64_sys_shutdown+0x52/0x90
>  do_syscall_64+0x6b/0x3a0
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e

This probably is the first case where reading a sk pointer requires a
lock. I think it will need to be marked as PTR_UNTRUSTED in the verifier
for the unix->peer access, so that it cannot be passed to a helper.
There is a BTF_TYPE_SAFE_TRUSTED list; afaik, there is no untrusted one
now.
On 2/4/26 20:34, Martin KaFai Lau wrote:
>> But then there's SOCK_DGRAM where you can drop unix_peer(sk) without
>> releasing sk; see AF_UNSPEC in unix_dgram_connect(). I think Martin is
>> right, we can crash at many fentries.
>>
>> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0xa4/0xb0
>> Read of size 2 at addr ffff888147d38890 by task test_progs/2495
>> Call Trace:
>>  dump_stack_lvl+0x5d/0x80
>>  print_report+0x170/0x4f3
>>  kasan_report+0xe1/0x180
>>  bpf_skc_to_unix_sock+0xa4/0xb0
>>  bpf_prog_564a1c39c35d86a2_unix_shutdown_entry+0x8a/0x8e
>>  bpf_trampoline_6442564662+0x47/0xab
>>  unix_shutdown+0x9/0x880
>>  __sys_shutdown+0xe1/0x160
>>  __x64_sys_shutdown+0x52/0x90
>>  do_syscall_64+0x6b/0x3a0
>>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> This probably is the first case where reading a sk pointer requires a
> lock. I think it will need to be marked as PTR_UNTRUSTED in the verifier
> for the unix->peer access, so that it cannot be passed to a helper.
> There is a BTF_TYPE_SAFE_TRUSTED list. afaik, there is no untrusted one now.

What about 'fexit/sk_common_release' that does 'bpf_skc_to_unix_sock(sk)'.
Is this something the verifier is supposed to handle?

$ python -c 'from socket import *; socket(AF_INET, SOCK_DGRAM)'

BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0x95/0xb0
Read of size 1 at addr ffff888128312112 by task python/4076
Call Trace:
 dump_stack_lvl+0x5d/0x80
 print_report+0x170/0x4f3
 kasan_report+0xe1/0x180
 bpf_skc_to_unix_sock+0x95/0xb0
 bpf_prog_70a8d38bd5dbf399_release_exit+0x87/0x8b
 bpf_trampoline_6442556594+0x64/0xd3
 inet_release+0x104/0x230
 __sock_release+0xb0/0x270
 sock_close+0x18/0x20
 __fput+0x36e/0xac0
 fput_close_sync+0xe5/0x1a0
 __x64_sys_close+0x7d/0xd0
 do_syscall_64+0x6b/0x3a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
On 2/4/26 3:25 PM, Michal Luczaj wrote:
> On 2/4/26 20:34, Martin KaFai Lau wrote:
>>> But then there's SOCK_DGRAM where you can drop unix_peer(sk) without
>>> releasing sk; see AF_UNSPEC in unix_dgram_connect(). I think Martin is
>>> right, we can crash at many fentries.
>>>
>>> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0xa4/0xb0
>>> Read of size 2 at addr ffff888147d38890 by task test_progs/2495
>>> Call Trace:
>>>  dump_stack_lvl+0x5d/0x80
>>>  print_report+0x170/0x4f3
>>>  kasan_report+0xe1/0x180
>>>  bpf_skc_to_unix_sock+0xa4/0xb0
>>>  bpf_prog_564a1c39c35d86a2_unix_shutdown_entry+0x8a/0x8e
>>>  bpf_trampoline_6442564662+0x47/0xab
>>>  unix_shutdown+0x9/0x880
>>>  __sys_shutdown+0xe1/0x160
>>>  __x64_sys_shutdown+0x52/0x90
>>>  do_syscall_64+0x6b/0x3a0
>>>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> This probably is the first case where reading a sk pointer requires a
>> lock. I think it will need to be marked as PTR_UNTRUSTED in the verifier
>> for the unix->peer access, so that it cannot be passed to a helper.
>> There is a BTF_TYPE_SAFE_TRUSTED list. afaik, there is no untrusted one now.
>
> What about 'fexit/sk_common_release' that does 'bpf_skc_to_unix_sock(sk)'.
> Is this something the verifier is supposed to handle?

fexit is an obvious one if the bpf prog wants to shoot itself in the
foot by using things at the fexit of a free/release function. There are
other things that can break as well if a tracing-capable user wants to
exploit fexit. I don't have a good idea how to solve it.

Going back to the bpf_skc_to_* helpers: the bigger hammer may be to
properly deprecate the bpf_skc_to_* usage from tracing instead of fixing
it, and ask the user to stay with bpf_core_cast.
On Wed, Feb 4, 2026 at 3:25 PM Michal Luczaj <mhal@rbox.co> wrote:
>
> On 2/4/26 20:34, Martin KaFai Lau wrote:
> >> But then there's SOCK_DGRAM where you can drop unix_peer(sk) without
> >> releasing sk; see AF_UNSPEC in unix_dgram_connect(). I think Martin is
> >> right, we can crash at many fentries.
> >>
> >> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0xa4/0xb0
> >> Read of size 2 at addr ffff888147d38890 by task test_progs/2495
> >> Call Trace:
> >>  dump_stack_lvl+0x5d/0x80
> >>  print_report+0x170/0x4f3
> >>  kasan_report+0xe1/0x180
> >>  bpf_skc_to_unix_sock+0xa4/0xb0
> >>  bpf_prog_564a1c39c35d86a2_unix_shutdown_entry+0x8a/0x8e
> >>  bpf_trampoline_6442564662+0x47/0xab
> >>  unix_shutdown+0x9/0x880
> >>  __sys_shutdown+0xe1/0x160
> >>  __x64_sys_shutdown+0x52/0x90
> >>  do_syscall_64+0x6b/0x3a0
> >>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > This probably is the first case where reading a sk pointer requires a
> > lock. I think it will need to be marked as PTR_UNTRUSTED in the verifier
> > for the unix->peer access, so that it cannot be passed to a helper.
> > There is a BTF_TYPE_SAFE_TRUSTED list. afaik, there is no untrusted one now.
>
> What about 'fexit/sk_common_release' that does 'bpf_skc_to_unix_sock(sk)'.
> Is this something the verifier is supposed to handle?

sk_common_release() is not called from AF_UNIX, and even other helpers
accessing sk->sk_xxx should trigger the same splat. Moreover, any read
would trigger the splat at other functions like sk_prot_free(). We could
sprinkle notrace over the kernel or have a blocked list for each helper
in the verifier, but then we would need to inspect all bpf helpers.

In the first place, it makes no sense that someone who knows where a
pointer is freed tries to read it after it has been freed and then
complains about the UAF :/

> $ python -c 'from socket import *; socket(AF_INET, SOCK_DGRAM)'
>
> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0x95/0xb0
> Read of size 1 at addr ffff888128312112 by task python/4076
> Call Trace:
>  dump_stack_lvl+0x5d/0x80
>  print_report+0x170/0x4f3
>  kasan_report+0xe1/0x180
>  bpf_skc_to_unix_sock+0x95/0xb0
>  bpf_prog_70a8d38bd5dbf399_release_exit+0x87/0x8b
>  bpf_trampoline_6442556594+0x64/0xd3
>  inet_release+0x104/0x230
>  __sock_release+0xb0/0x270
>  sock_close+0x18/0x20
>  __fput+0x36e/0xac0
>  fput_close_sync+0xe5/0x1a0
>  __x64_sys_close+0x7d/0xd0
>  do_syscall_64+0x6b/0x3a0
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Wed, 4 Feb 2026 11:34:55 -0800
> On 2/4/26 7:41 AM, Michal Luczaj wrote:
> >>>>>> If the concern is the bpf iterator prog may use a released unix_peer(sk)
> >>>>>> pointer, it should be fine. The unix_peer(sk) pointer is not a trusted
> >>>>>> pointer to the bpf prog, so nothing bad will happen other than
> >>>>>> potentially reading incorrect values.
>
> I misremembered that following unix->peer would be marked as
> (PTR_TO_BTF_ID | PTR_UNTRUSTED). I forgot there are some legacy supports
> on the PTR_TO_BTF_ID (i.e. without PTR_UNTRUSTED marking).
>
> >>>>>
> >>>>> But if the prog passes a released peer pointer to a bpf helper:
> >>>>>
> >>>>> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0x95/0xb0
> >>>>> Read of size 1 at addr ffff888110654c92 by task test_progs/1936
> >>>
> >>> hmm... bpf_skc_to_unix_sock is exposed to tracing. bpf_iter is a tracing
> >>> bpf prog.
> >>>
> >>>>
> >>>> Can you cook a patch for this ? probably like below
> >>>
> >>> This can help the bpf_iter but not the other tracing prog such as fentry.
> >>
> >> Oh well ... then bpf_skc_to_unix_sock() can be used even
> >> with SEQ_START_TOKEN at fentry of bpf_iter_unix_seq_show() ??
>
> It is fine. The type is void.
>
> >>
> >> How about adding notrace to all af_unix bpf iterator functions ?
>
> but right, other functions taking [unix_]sock pointer could be audited.
> I don't know af_unix well enough to assess the blast radius or whether
> some useful functions may become untraceable.
Considering SOCK_DGRAM, the blast radius is much bigger than
I thought, so I'd avoid this way if possible by modifying
the verifier.
>
> >>
> >> The procfs iterator holds a spinlock of the hashtable from
> >> ->start/next() to ->stop() to prevent the race with unix_release_sock().
> >>
> >> I think other (non-iterator) functions cannot do such racy
> >> access with tracing prog.
> >
> > But then there's SOCK_DGRAM where you can drop unix_peer(sk) without
> > releasing sk; see AF_UNSPEC in unix_dgram_connect(). I think Martin is
> > right, we can crash at many fentries.
> >
> > BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0xa4/0xb0
> > Read of size 2 at addr ffff888147d38890 by task test_progs/2495
> > Call Trace:
> > dump_stack_lvl+0x5d/0x80
> > print_report+0x170/0x4f3
> > kasan_report+0xe1/0x180
> > bpf_skc_to_unix_sock+0xa4/0xb0
> > bpf_prog_564a1c39c35d86a2_unix_shutdown_entry+0x8a/0x8e
> > bpf_trampoline_6442564662+0x47/0xab
> > unix_shutdown+0x9/0x880
> > __sys_shutdown+0xe1/0x160
> > __x64_sys_shutdown+0x52/0x90
> > do_syscall_64+0x6b/0x3a0
> > entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> This probably is the first case where reading a sk pointer requires a
> lock. I think it will need to be marked as PTR_UNTRUSTED in the verifier
> for the unix->peer access, so that it cannot be passed to a helper.
> There is a BTF_TYPE_SAFE_TRUSTED list. afaik, there is no untrusted one now.
Just skimmed the code, and I guess something like below would
do that ? and if needed, we could add another helper to fetch
peer with a proper release function ?
---8<---
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3135643d5695..ef8b4dd21923 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7177,6 +7177,14 @@ static bool type_is_rcu_or_null(struct bpf_verifier_env *env,
return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id, "__safe_rcu_or_null");
}
+static bool type_is_untrusted(struct bpf_verifier_env *env,
+ struct bpf_reg_state *reg,
+ const char *field_name, u32 btf_id)
+{
+ /* TODO: return true if field_name and btf_id is unix_sock.peer. */
+ return false;
+}
+
static bool type_is_trusted(struct bpf_verifier_env *env,
struct bpf_reg_state *reg,
const char *field_name, u32 btf_id)
@@ -7307,7 +7315,9 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
* A regular RCU-protected pointer with __rcu tag can also be deemed
* trusted if we are in an RCU CS. Such pointer can be NULL.
*/
- if (type_is_trusted(env, reg, field_name, btf_id)) {
+ if (type_is_untrusted(env, reg, field_name, btf_id)) {
+ flag |= PTR_UNTRUSTED;
+ } else if (type_is_trusted(env, reg, field_name, btf_id)) {
flag |= PTR_TRUSTED;
} else if (type_is_trusted_or_null(env, reg, field_name, btf_id)) {
flag |= PTR_TRUSTED | PTR_MAYBE_NULL;
---8<---
On 2/4/26 1:09 PM, Kuniyuki Iwashima wrote:
> From: Martin KaFai Lau <martin.lau@linux.dev>
> Date: Wed, 4 Feb 2026 11:34:55 -0800
>> On 2/4/26 7:41 AM, Michal Luczaj wrote:
>>>>>>>> If the concern is the bpf iterator prog may use a released unix_peer(sk)
>>>>>>>> pointer, it should be fine. The unix_peer(sk) pointer is not a trusted
>>>>>>>> pointer to the bpf prog, so nothing bad will happen other than
>>>>>>>> potentially reading incorrect values.
>>
>> I misremembered that following unix->peer would be marked as
>> (PTR_TO_BTF_ID | PTR_UNTRUSTED). I forgot there are some legacy supports
>> on the PTR_TO_BTF_ID (i.e. without PTR_UNTRUSTED marking).
>>
>>>>>>>
>>>>>>> But if the prog passes a released peer pointer to a bpf helper:
>>>>>>>
>>>>>>> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0x95/0xb0
>>>>>>> Read of size 1 at addr ffff888110654c92 by task test_progs/1936
>>>>>
>>>>> hmm... bpf_skc_to_unix_sock is exposed to tracing. bpf_iter is a tracing
>>>>> bpf prog.
>>>>>
>>>>>>
>>>>>> Can you cook a patch for this ? probably like below
>>>>>
>>>>> This can help the bpf_iter but not the other tracing prog such as fentry.
>>>>
>>>> Oh well ... then bpf_skc_to_unix_sock() can be used even
>>>> with SEQ_START_TOKEN at fentry of bpf_iter_unix_seq_show() ??
>>
>> It is fine. The type is void.
>>
>>>>
>>>> How about adding notrace to all af_unix bpf iterator functions ?
>>
>> but right, other functions taking [unix_]sock pointer could be audited.
>> I don't know af_unix well enough to assess the blast radius or whether
>> some useful functions may become untraceable.
>
> Considering SOCK_DGRAM, the blast radius is much bigger than
> I thought, so I'd avoid this way if possible by modifying
> the verifier.
>
>
>>
>>>>
>>>> The procfs iterator holds a spinlock of the hashtable from
>>>> ->start/next() to ->stop() to prevent the race with unix_release_sock().
>>>>
>>>> I think other (non-iterator) functions cannot do such racy
>>>> access with tracing prog.
>>>
>>> But then there's SOCK_DGRAM where you can drop unix_peer(sk) without
>>> releasing sk; see AF_UNSPEC in unix_dgram_connect(). I think Martin is
>>> right, we can crash at many fentries.
>>>
>>> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0xa4/0xb0
>>> Read of size 2 at addr ffff888147d38890 by task test_progs/2495
>>> Call Trace:
>>> dump_stack_lvl+0x5d/0x80
>>> print_report+0x170/0x4f3
>>> kasan_report+0xe1/0x180
>>> bpf_skc_to_unix_sock+0xa4/0xb0
>>> bpf_prog_564a1c39c35d86a2_unix_shutdown_entry+0x8a/0x8e
>>> bpf_trampoline_6442564662+0x47/0xab
>>> unix_shutdown+0x9/0x880
>>> __sys_shutdown+0xe1/0x160
>>> __x64_sys_shutdown+0x52/0x90
>>> do_syscall_64+0x6b/0x3a0
>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> This probably is the first case where reading a sk pointer requires a
>> lock. I think it will need to be marked as PTR_UNTRUSTED in the verifier
>> for the unix->peer access, so that it cannot be passed to a helper.
>> There is a BTF_TYPE_SAFE_TRUSTED list. afaik, there is no untrusted one now.
>
> Just skimmed the code, and I guess something like below would
> do that ? and if needed, we could add another helper to fetch
> peer with a proper release function ?
>
> ---8<---
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 3135643d5695..ef8b4dd21923 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -7177,6 +7177,14 @@ static bool type_is_rcu_or_null(struct bpf_verifier_env *env,
> return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id, "__safe_rcu_or_null");
> }
>
> +static bool type_is_untrusted(struct bpf_verifier_env *env,
> + struct bpf_reg_state *reg,
> + const char *field_name, u32 btf_id)
> +{
> + /* TODO: return true if field_name and btf_id is unix_sock.peer. */
> + return false;
> +}
> +
> static bool type_is_trusted(struct bpf_verifier_env *env,
> struct bpf_reg_state *reg,
> const char *field_name, u32 btf_id)
> @@ -7307,7 +7315,9 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
> * A regular RCU-protected pointer with __rcu tag can also be deemed
> * trusted if we are in an RCU CS. Such pointer can be NULL.
> */
> - if (type_is_trusted(env, reg, field_name, btf_id)) {
> + if (type_is_untrusted(env, reg, field_name, btf_id)) {
> + flag |= PTR_UNTRUSTED;
Something like this but I think the PTR_UNTRUSTED marking should be done
right after the clear_trusted_flags() where it is for supporting the
deprecated PTR_TO_BTF_ID. Before that ...
Alexei, can you advise if we should change the verifier to mark
PTR_UNTRUSTED on unix_sock->peer or we can deprecate the bpf_skc_to_*
helper support from tracing and ask the user to switch to bpf_core_cast
(i.e. bpf_rdonly_cast) by using a WARN_ON_ONCE message?
The problem is that the unix_sock->peer pointer is not always valid when
passing to the bpf_skc_to_* helpers, so it is a UAF.
On 2/4/26 4:55 PM, Martin KaFai Lau wrote:
>
>
> On 2/4/26 1:09 PM, Kuniyuki Iwashima wrote:
>> From: Martin KaFai Lau <martin.lau@linux.dev>
>> Date: Wed, 4 Feb 2026 11:34:55 -0800
>>> On 2/4/26 7:41 AM, Michal Luczaj wrote:
>>>>>>>>> If the concern is the bpf iterator prog may use a released
>>>>>>>>> unix_peer(sk)
>>>>>>>>> pointer, it should be fine. The unix_peer(sk) pointer is not a
>>>>>>>>> trusted
>>>>>>>>> pointer to the bpf prog, so nothing bad will happen other than
>>>>>>>>> potentially reading incorrect values.
>>>
>>> I misremembered that following unix->peer would be marked as
>>> (PTR_TO_BTF_ID | PTR_UNTRUSTED). I forgot there are some legacy supports
>>> on the PTR_TO_BTF_ID (i.e. without PTR_UNTRUSTED marking).
>>>
>>>>>>>>
>>>>>>>> But if the prog passes a released peer pointer to a bpf helper:
>>>>>>>>
>>>>>>>> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0x95/0xb0
>>>>>>>> Read of size 1 at addr ffff888110654c92 by task test_progs/1936
>>>>>>
>>>>>> hmm... bpf_skc_to_unix_sock is exposed to tracing. bpf_iter is a
>>>>>> tracing
>>>>>> bpf prog.
>>>>>>
>>>>>>>
>>>>>>> Can you cook a patch for this ? probably like below
>>>>>>
>>>>>> This can help the bpf_iter but not the other tracing prog such as
>>>>>> fentry.
>>>>>
>>>>> Oh well ... then bpf_skc_to_unix_sock() can be used even
>>>>> with SEQ_START_TOKEN at fentry of bpf_iter_unix_seq_show() ??
>>>
>>> It is fine. The type is void.
>>>
>>>>>
>>>>> How about adding notrace to all af_unix bpf iterator functions ?
>>>
>>> but right, other functions taking [unix_]sock pointer could be audited.
>>> I don't know af_unix well enough to assess the blast radius or whether
>>> some useful functions may become untraceable.
>>
>> Considering SOCK_DGRAM, the blast radius is much bigger than
>> I thought, so I'd avoid this way if possible by modifying
>> the verifier.
>>
>>
>>>
>>>>>
>>>>> The procfs iterator holds a spinlock of the hashtable from
>>>>> ->start/next() to ->stop() to prevent the race with
>>>>> unix_release_sock().
>>>>>
>>>>> I think other (non-iterator) functions cannot do such racy
>>>>> access with tracing prog.
>>>>
>>>> But then there's SOCK_DGRAM where you can drop unix_peer(sk) without
>>>> releasing sk; see AF_UNSPEC in unix_dgram_connect(). I think Martin is
>>>> right, we can crash at many fentries.
>>>>
>>>> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0xa4/0xb0
>>>> Read of size 2 at addr ffff888147d38890 by task test_progs/2495
>>>> Call Trace:
>>>> dump_stack_lvl+0x5d/0x80
>>>> print_report+0x170/0x4f3
>>>> kasan_report+0xe1/0x180
>>>> bpf_skc_to_unix_sock+0xa4/0xb0
>>>> bpf_prog_564a1c39c35d86a2_unix_shutdown_entry+0x8a/0x8e
>>>> bpf_trampoline_6442564662+0x47/0xab
>>>> unix_shutdown+0x9/0x880
>>>> __sys_shutdown+0xe1/0x160
>>>> __x64_sys_shutdown+0x52/0x90
>>>> do_syscall_64+0x6b/0x3a0
>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> This probably is the first case where reading a sk pointer requires a
>>> lock. I think it will need to be marked as PTR_UNTRUSTED in the verifier
>>> for the unix->peer access, so that it cannot be passed to a helper.
>>> There is a BTF_TYPE_SAFE_TRUSTED list. afaik, there is no untrusted
>>> one now.
>>
>> Just skimmed the code, and I guess something like below would
>> do that ? and if needed, we could add another helper to fetch
>> peer with a proper release function ?
>>
>> ---8<---
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 3135643d5695..ef8b4dd21923 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -7177,6 +7177,14 @@ static bool type_is_rcu_or_null(struct
>> bpf_verifier_env *env,
>> return btf_nested_type_is_trusted(&env->log, reg, field_name,
>> btf_id, "__safe_rcu_or_null");
>> }
>> +static bool type_is_untrusted(struct bpf_verifier_env *env,
>> + struct bpf_reg_state *reg,
>> + const char *field_name, u32 btf_id)
>> +{
>> + /* TODO: return true if field_name and btf_id is unix_sock.peer. */
>> + return false;
>> +}
>> +
>> static bool type_is_trusted(struct bpf_verifier_env *env,
>> struct bpf_reg_state *reg,
>> const char *field_name, u32 btf_id)
>> @@ -7307,7 +7315,9 @@ static int check_ptr_to_btf_access(struct
>> bpf_verifier_env *env,
>> * A regular RCU-protected pointer with __rcu tag can also
>> be deemed
>> * trusted if we are in an RCU CS. Such pointer can be NULL.
>> */
>> - if (type_is_trusted(env, reg, field_name, btf_id)) {
>> + if (type_is_untrusted(env, reg, field_name, btf_id)) {
>> + flag |= PTR_UNTRUSTED;
>
> Something like this but I think the PTR_UNTRUSTED marking should be done
> right after the clear_trusted_flags() where it is for supporting the
> deprecated PTR_TO_BTF_ID. Before that ...
>
> Alexei, can you advise if we should change the verifier to mark
> PTR_UNTRUSTED on unix_sock->peer or we can deprecate the bpf_skc_to_*
> helper support from tracing and ask the user to switch to bpf_core_cast
> (i.e. bpf_rdonly_cast) by using a WARN_ON_ONCE message?
After trying more, taking out bpf_skc_to_* is not enough. The verifier
still needs to reject passing unix->peer to bpf_setsockopt for bpf_iter,
so the PTR_UNTRUSTED mark is needed.
>
> The problem is that the unix_sock->peer pointer is not always valid when
> passed to the bpf_skc_to_* helpers, so it is a UAF.
>
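As an illustration only (not from the thread): the TODO in the sketch above could be filled by matching the base type and the field name, reusing the verifier's existing BTF helpers (`btf_type_by_id()`, `btf_name_by_offset()`). A hedged, untested sketch, assuming `reg->btf`/`reg->btf_id` describe the struct being walked:

```c
/* Hypothetical completion of the TODO above: mark loads of
 * unix_sock.peer as untrusted, since the field is only stable under
 * unix_state_lock(), which tracing progs do not hold.
 */
static bool type_is_untrusted(struct bpf_verifier_env *env,
			      struct bpf_reg_state *reg,
			      const char *field_name, u32 btf_id)
{
	const struct btf_type *t = btf_type_by_id(reg->btf, reg->btf_id);
	const char *tname = btf_name_by_offset(reg->btf, t->name_off);

	return tname && !strcmp(tname, "unix_sock") &&
	       field_name && !strcmp(field_name, "peer");
}
```

A real patch would more likely follow the existing BTF_TYPE_SAFE_* annotation plumbing rather than open-coding the match; this only shows the shape of the check.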
On Wed, Feb 4, 2026 at 6:00 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
>
>
> On 2/4/26 4:55 PM, Martin KaFai Lau wrote:
> >
> >
> > On 2/4/26 1:09 PM, Kuniyuki Iwashima wrote:
> >> From: Martin KaFai Lau <martin.lau@linux.dev>
> >> Date: Wed, 4 Feb 2026 11:34:55 -0800
> >>> On 2/4/26 7:41 AM, Michal Luczaj wrote:
> >>>>>>>>> If the concern is the bpf iterator prog may use a released
> >>>>>>>>> unix_peer(sk)
> >>>>>>>>> pointer, it should be fine. The unix_peer(sk) pointer is not a
> >>>>>>>>> trusted
> >>>>>>>>> pointer to the bpf prog, so nothing bad will happen other than
> >>>>>>>>> potentially reading incorrect values.
> >>>
> >>> I misremembered that following unix->peer would be marked as
> >>> (PTR_TO_BTF_ID | PTR_UNTRUSTED). I forgot there is some legacy support
> >>> on the PTR_TO_BTF_ID (i.e. without PTR_UNTRUSTED marking).
> >>>
> >>>>>>>>
> >>>>>>>> But if the prog passes a released peer pointer to a bpf helper:
> >>>>>>>>
> >>>>>>>> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0x95/0xb0
> >>>>>>>> Read of size 1 at addr ffff888110654c92 by task test_progs/1936
> >>>>>>
> >>>>>> hmm... bpf_skc_to_unix_sock is exposed to tracing. bpf_iter is a
> >>>>>> tracing
> >>>>>> bpf prog.
> >>>>>>
> >>>>>>>
> >>>>>>> Can you cook a patch for this ? probably like below
> >>>>>>
> >>>>>> This can help the bpf_iter but not the other tracing prog such as
> >>>>>> fentry.
> >>>>>
> >>>>> Oh well ... then bpf_skc_to_unix_sock() can be used even
> >>>>> with SEQ_START_TOKEN at fentry of bpf_iter_unix_seq_show() ??
> >>>
> >>> It is fine. The type is void.
> >>>
> >>>>>
> >>>>> How about adding notrace to all af_unix bpf iterator functions ?
> >>>
> >>> but right, other functions taking [unix_]sock pointer could be audited.
> >>> I don't know af_unix well enough to assess the blast radius or whether
> >>> some useful functions may become untraceable.
> >>
> >> Considering SOCK_DGRAM, the blast radius is much bigger than
> >> I thought, so I'd avoid this way if possible by modifying
> >> the verifier.
> >>
> >>
> >>>
> >>>>>
> >>>>> The procfs iterator holds a spinlock of the hashtable from
> >>>>> ->start/next() to ->stop() to prevent the race with
> >>>>> unix_release_sock().
> >>>>>
> >>>>> I think other (non-iterator) functions cannot do such racy
> >>>>> access with tracing prog.
> >>>>
> >>>> But then there's SOCK_DGRAM where you can drop unix_peer(sk) without
> >>>> releasing sk; see AF_UNSPEC in unix_dgram_connect(). I think Martin is
> >>>> right, we can crash at many fentries.
> >>>>
> >>>> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0xa4/0xb0
> >>>> Read of size 2 at addr ffff888147d38890 by task test_progs/2495
> >>>> Call Trace:
> >>>> dump_stack_lvl+0x5d/0x80
> >>>> print_report+0x170/0x4f3
> >>>> kasan_report+0xe1/0x180
> >>>> bpf_skc_to_unix_sock+0xa4/0xb0
> >>>> bpf_prog_564a1c39c35d86a2_unix_shutdown_entry+0x8a/0x8e
> >>>> bpf_trampoline_6442564662+0x47/0xab
> >>>> unix_shutdown+0x9/0x880
> >>>> __sys_shutdown+0xe1/0x160
> >>>> __x64_sys_shutdown+0x52/0x90
> >>>> do_syscall_64+0x6b/0x3a0
> >>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >>>
> >>> This probably is the first case where reading a sk pointer requires a
> >>> lock. I think it will need to be marked as PTR_UNTRUSTED in the verifier
> >>> for the unix->peer access, so that it cannot be passed to a helper.
> >>> There is a BTF_TYPE_SAFE_TRUSTED list. afaik, there is no untrusted
> >>> one now.
> >>
> >> Just skimmed the code, and I guess something like below would
> >> do that ? and if needed, we could add another helper to fetch
> >> peer with a proper release function ?
> >>
> >> ---8<---
> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> index 3135643d5695..ef8b4dd21923 100644
> >> --- a/kernel/bpf/verifier.c
> >> +++ b/kernel/bpf/verifier.c
> >> @@ -7177,6 +7177,14 @@ static bool type_is_rcu_or_null(struct bpf_verifier_env *env,
> >> return btf_nested_type_is_trusted(&env->log, reg, field_name,
> >> btf_id, "__safe_rcu_or_null");
> >> }
> >> +static bool type_is_untrusted(struct bpf_verifier_env *env,
> >> + struct bpf_reg_state *reg,
> >> + const char *field_name, u32 btf_id)
> >> +{
> >> + /* TODO: return true if field_name and btf_id is unix_sock.peer. */
> >> + return false;
> >> +}
> >> +
> >> static bool type_is_trusted(struct bpf_verifier_env *env,
> >> struct bpf_reg_state *reg,
> >> const char *field_name, u32 btf_id)
> >> @@ -7307,7 +7315,9 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
> >> * A regular RCU-protected pointer with __rcu tag can also be deemed
> >> * trusted if we are in an RCU CS. Such pointer can be NULL.
> >> */
> >> - if (type_is_trusted(env, reg, field_name, btf_id)) {
> >> + if (type_is_untrusted(env, reg, field_name, btf_id)) {
> >> + flag |= PTR_UNTRUSTED;
> >
> > Something like this but I think the PTR_UNTRUSTED marking should be done
> > right after the clear_trusted_flags() where it is for supporting the
> > deprecated PTR_TO_BTF_ID. Before that ...
> >
> > Alexei, can you advise if we should change the verifier to mark
> > PTR_UNTRUSTED on unix_sock->peer or we can deprecate the bpf_skc_to_*
> > helper support from tracing and ask the user to switch to bpf_core_cast
> > (i.e. bpf_rdonly_cast) by using a WARN_ON_ONCE message?
>
> After trying more, taking out bpf_skc_to_* is not enough. We still need
> to reject passing unix->peer to bpf_setsockopt for bpf_iter, so the
> PTR_UNTRUSTED mark is needed.
Ah exactly, I'll cook a patch in that way.
>
> >
> > The problem is that the unix_sock->peer pointer is not always valid when
> > passed to the bpf_skc_to_* helpers, so it is a UAF.
> >
>
On Wed, Feb 4, 2026 at 7:41 AM Michal Luczaj <mhal@rbox.co> wrote:
>
> On 2/4/26 08:58, Kuniyuki Iwashima wrote:
> > On Tue, Feb 3, 2026 at 11:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 2/3/26 11:47 AM, Kuniyuki Iwashima wrote:
> >>> From: Michal Luczaj <mhal@rbox.co>
> >>> Date: Tue, 3 Feb 2026 10:57:46 +0100
> >>>> On 2/3/26 04:53, Martin KaFai Lau wrote:
> >>>>> On 2/2/26 7:10 AM, Michal Luczaj wrote:
> >>>>>> In related news, looks like bpf_iter_unix_seq_show() is missing
> >>>>>> unix_state_lock(): lock_sock_fast() won't stop unix_release_sock(). E.g.
> >>>>>> bpf iterator can grab unix_sock::peer as it is being released.
> >>>>>
> >>>>> If the concern is the bpf iterator prog may use a released unix_peer(sk)
> >>>>> pointer, it should be fine. The unix_peer(sk) pointer is not a trusted
> >>>>> pointer to the bpf prog, so nothing bad will happen other than
> >>>>> potentially reading incorrect values.
> >>>>
> >>>> But if the prog passes a released peer pointer to a bpf helper:
> >>>>
> >>>> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0x95/0xb0
> >>>> Read of size 1 at addr ffff888110654c92 by task test_progs/1936
> >>
> >> hmm... bpf_skc_to_unix_sock is exposed to tracing. bpf_iter is a tracing
> >> bpf prog.
> >>
> >>>
> >>> Can you cook a patch for this ? probably like below
> >>
> >> This can help the bpf_iter but not the other tracing prog such as fentry.
> >
> > Oh well ... then bpf_skc_to_unix_sock() can be used even
> > with SEQ_START_TOKEN at fentry of bpf_iter_unix_seq_show() ??
> >
> > How about adding notrace to all af_unix bpf iterator functions ?
> >
> > The procfs iterator holds a spinlock of the hashtable from
> > ->start/next() to ->stop() to prevent the race with unix_release_sock().
> >
> > I think other (non-iterator) functions cannot do such racy
> > access with tracing prog.
>
> But then there's SOCK_DGRAM where you can drop unix_peer(sk) without
> releasing sk; see AF_UNSPEC in unix_dgram_connect(). I think Martin is
> right, we can crash at many fentries.
Ah, I was only considering SOCK_STREAM :p

The only solution that came to mind is a breaking change enforcing a
release function for bpf_skc_to_unix_sock(), so that it could hold the
peer's refcnt and stash the pointer somewhere else until the release
function runs, but I think this is unacceptable.

Also, I guess this type of issue could be triggered with any objects
that are not refcounted by the bpf tracing prog?

For example, inet_sock(sk)->inet_opt could be freed by
setsockopt(IP_OPTIONS) even after the fentry prog verifies
that it's not NULL.

I'm not sure if bpf_core_cast() etc. allow such access, but if it's
allowed, I think there is no general solution.

Fortunately that's neither a null-deref nor an oob-write; it just reads
stale info, as Martin mentioned... so probably this is WAI for a
tracing prog?
>
> BUG: KASAN: slab-use-after-free in bpf_skc_to_unix_sock+0xa4/0xb0
> Read of size 2 at addr ffff888147d38890 by task test_progs/2495
> Call Trace:
> dump_stack_lvl+0x5d/0x80
> print_report+0x170/0x4f3
> kasan_report+0xe1/0x180
> bpf_skc_to_unix_sock+0xa4/0xb0
> bpf_prog_564a1c39c35d86a2_unix_shutdown_entry+0x8a/0x8e
> bpf_trampoline_6442564662+0x47/0xab
> unix_shutdown+0x9/0x880
> __sys_shutdown+0xe1/0x160
> __x64_sys_shutdown+0x52/0x90
> do_syscall_64+0x6b/0x3a0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> >
> >>
> >>>
> >>> ---8<---
> >>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> >>> index 02ebad6afac7..9c7e9fbde362 100644
> >>> --- a/net/unix/af_unix.c
> >>> +++ b/net/unix/af_unix.c
> >>> @@ -3740,8 +3740,9 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
> >>> return 0;
> >>>
> >>> slow = lock_sock_fast(sk);
> >>> + unix_state_lock(sk);
> >>>
> >>> - if (unlikely(sk_unhashed(sk))) {
> >>> + if (unlikely(sock_flag(sk, SOCK_DEAD))) {
> >>> ret = SEQ_SKIP;
> >>> goto unlock;
> >>> }
> >>> @@ -3751,6 +3752,7 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
> >>> prog = bpf_iter_get_info(&meta, false);
> >>> ret = unix_prog_seq_show(prog, &meta, v, uid);
> >>> unlock:
> >>> + unix_state_unlock(sk);
> >>> unlock_sock_fast(sk, slow);
> >>> return ret;
> >>> }
> >>> ---8<---
> >>>
> >>> Thanks!
> >>
>
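As context for the SOCK_DGRAM point above: the AF_UNSPEC disconnect in unix_dgram_connect() is reachable from plain userspace, which is why unix_peer(sk) can vanish under a tracing prog without the socket itself being released. A minimal demo (Linux-only; `dgram_disconnect_demo` is an illustrative name, returning 0 when the peer is dropped as described):

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 iff connect(AF_UNSPEC) drops the dgram peer: the first
 * send() (connected) succeeds, the second (peer-less) fails. */
static int dgram_disconnect_demo(void)
{
	struct sockaddr sa;
	int fds[2];
	char c = 'x';
	int ret = -1;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds))
		return -1;

	/* Connected pair: send() without an address works. */
	if (send(fds[0], &c, 1, 0) != 1)
		goto out;

	/* unix_dgram_connect() treats AF_UNSPEC as "disconnect":
	 * unix_peer(sk) is cleared while sk itself stays alive. */
	memset(&sa, 0, sizeof(sa));
	sa.sa_family = AF_UNSPEC;
	if (connect(fds[0], &sa, sizeof(sa)))
		goto out;

	/* No peer anymore: send() without an address must fail now. */
	if (send(fds[0], &c, 1, 0) == -1)
		ret = 0;
out:
	close(fds[0]);
	close(fds[1]);
	return ret;
}
```

This is the window Michal describes: an fentry prog that sampled unix_peer(sk) before the disconnect can hand a stale pointer to a helper afterwards.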
On 2/4/26 11:16 AM, Kuniyuki Iwashima wrote:
> For example, inet_sock(sk)->inet_opt could be freed by
> setsockopt(IP_OPTIONS) even after fentry prog verifies
> that it's not NULL.

This one should be fine because of rcu.

> I'm not sure if bpf_core_cast() etc allows such access, but
> if it's allowed, I think there is no general solution.

bpf_core_cast (i.e. the "kfunc" bpf_rdonly_cast) does not use the
pointer argument, so it should be fine. Its return value is marked as
PTR_UNTRUSTED. iirc, PTR_UNTRUSTED cannot be passed to a helper, so
bpf_core_cast should be fine overall.

> Fortunately that's not null-deref nor oob-write, and it just reads
> stale info as Martin mentioned... so probably this is WAI for
> tracing prog ?

afaik, the tracing radius is large, so the prog cannot expect much
guarantee. Reading in the bpf prog is fine; the exception is handled.
The problem here is passing it to a helper (not kfunc) that depends on
the arg being valid.
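To make the distinction above concrete, a hedged BPF-C sketch (not from the thread; assumes vmlinux.h and libbpf's CO-RE macros, and `unix_shutdown_entry` is an illustrative name) contrasting the read-only cast with the problematic helper:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

char LICENSE[] SEC("license") = "GPL";

SEC("fentry/unix_shutdown")
int BPF_PROG(unix_shutdown_entry, struct socket *sock, int mode)
{
	struct unix_sock *u = (struct unix_sock *)BPF_CORE_READ(sock, sk);
	struct sock *peer = BPF_CORE_READ(u, peer);

	/* OK: bpf_rdonly_cast() never dereferences its argument in the
	 * kernel, and its result is PTR_UNTRUSTED, so loads through it
	 * are exception-handled; a stale peer only yields garbage. */
	struct unix_sock *up = bpf_core_cast(peer, struct unix_sock);

	bpf_printk("peer state %d",
		   BPF_CORE_READ(up, sk.__sk_common.skc_state));

	/* Not OK (the KASAN splats above): bpf_skc_to_unix_sock() is a
	 * helper that reads through its argument, which may already be
	 * freed by unix_release_sock() or an AF_UNSPEC disconnect:
	 *
	 *	bpf_skc_to_unix_sock(peer);
	 */
	return 0;
}
```

The PTR_UNTRUSTED marking discussed in the thread would make the verifier reject the commented-out helper call at load time rather than letting it UAF at runtime.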