Jann reported a possible UAF scenario where a task in the mutex_unlock()
path had released the mutex and was about to acquire the wait_lock
to check for waiters. In the interim, another task could come in,
acquire and release the mutex, and then free the memory object holding
the mutex.
Thread A                                Thread B
========                                ========
eventpoll_release_file
  mutex_lock
    [success on trylock fastpath]
  __ep_remove
    ep_refcount_dec_and_test
      [drop refcount from 2 to 1]

                                        ep_eventpoll_release
                                          ep_clear_and_put
                                            mutex_lock
                                              __mutex_lock_slowpath
                                                __mutex_lock
                                                  __mutex_lock_common
                                                    __mutex_add_waiter
                                                      [enqueue waiter]
                                                      [set MUTEX_FLAG_WAITERS]

  mutex_unlock
    __mutex_unlock_slowpath
      atomic_long_try_cmpxchg_release
        [reads MUTEX_FLAG_WAITERS]
        [drops lock ownership]

                                                    __mutex_trylock
                                                      [success]
                                                    __mutex_remove_waiter
                                            ep_refcount_dec_and_test
                                              [drop refcount from 1 to 0]
                                            mutex_unlock
                                          ep_free
                                            kfree(ep)

      raw_spin_lock_irqsave(&lock->wait_lock)
        *** UAF WRITE ***
This race condition is especially likely if a preemption happens right
after releasing the lock but before acquiring the wait_lock. Rwsem's
__up_write() and __up_read() helpers already disable preemption to
minimize this vulnerable time window; do the same for
__mutex_unlock_slowpath() to minimize the chance of this race condition.
Also add a note in Documentation/locking/mutex-design.rst to suggest
that callers can use kfree_rcu() to delay the actual freeing of the
memory and so eliminate this UAF scenario.
Signed-off-by: Waiman Long <longman@redhat.com>
---
Documentation/locking/mutex-design.rst | 6 ++++--
kernel/locking/mutex.c | 6 ++++++
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/Documentation/locking/mutex-design.rst b/Documentation/locking/mutex-design.rst
index 7c30b4aa5e28..51a3a28ca830 100644
--- a/Documentation/locking/mutex-design.rst
+++ b/Documentation/locking/mutex-design.rst
@@ -117,8 +117,10 @@ the structure anymore.
The mutex user must ensure that the mutex is not destroyed while a
release operation is still in progress - in other words, callers of
-mutex_unlock() must ensure that the mutex stays alive until mutex_unlock()
-has returned.
+mutex_unlock() must ensure that the mutex stays alive until
+mutex_unlock() has returned. One possible way to do that is to use
+kfree_rcu() or its variants to delay the actual freeing of the memory
+object containing the mutex.
Interfaces
----------
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a39ecccbd106..d33f36d305fb 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -912,9 +912,15 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
* Release the lock before (potentially) taking the spinlock such that
* other contenders can get on with things ASAP.
*
+ * Preemption is disabled to minimize the time gap between releasing
+ * the lock and acquiring the wait_lock. Callers may consider using
+ * kfree_rcu() if the memory holding the mutex may be freed after
+ * another mutex_unlock() call to ensure that UAF will not happen.
+ *
* Except when HANDOFF, in that case we must not clear the owner field,
* but instead set it to the top waiter.
*/
+ guard(preempt)();
owner = atomic_long_read(&lock->owner);
for (;;) {
MUTEX_WARN_ON(__owner_task(owner) != current);
--
2.50.0
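
As an aside on the guard(preempt)() used in the hunk above: it is the
scope-based preemption guard from the kernel's cleanup.h infrastructure,
roughly equivalent to a preempt_disable()/preempt_enable() pair tied to
the enclosing scope. A minimal sketch (the function name is made up):

#include <linux/cleanup.h>
#include <linux/preempt.h>

static void example_nonpreemptible_section(void)
{
        /* Disables preemption here ... */
        guard(preempt)();

        /* ... critical work runs non-preemptible ... */

        /* ... and preemption is re-enabled automatically at scope exit. */
}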
On 7/9/25 2:05 PM, Waiman Long wrote:
> Jann reported a possible UAF scenario where a task in the mutex_unlock()
> path had released the mutex and was about to acquire the wait_lock
> to check for waiters. In the interim, another task could come in,
> acquire and release the mutex, and then free the memory object holding
> the mutex.
> [...]

Note that Linus' patch should fix this particular eventpoll UAF race
condition, but there may be other code paths where similar situations
can happen.

Cheers,
Longman
On Wed, 9 Jul 2025 at 11:06, Waiman Long <longman@redhat.com> wrote:
>
> This race condition is especially likely if a preemption happens right
> after releasing the lock but before acquiring the wait_lock. Rwsem's
> __up_write() and __up_read() helpers already disable preemption to
> minimize this vulnerable time window; do the same for
> __mutex_unlock_slowpath() to minimize the chance of this race condition.

I think this patch is actively detrimental, in that it only helps hide
the bug. The bug still exists, it's just harder to hit.

Maybe that is worth it as a "hardening" thing, but I feel this makes
people believe even *more* that they can use mutexes for object
lifetimes. And that's a fundamentally buggy assumption.

Locking is about mutual exclusion, not lifetimes. There are very
specific things where "release the lock" also releases an object, but
they should be considered very very special.

All objects that aren't purely thread-local should have lifetimes that
depend *solely* on refcounts. Anything else is typically a serious bug,
or needs to be actively discouraged, not encouraged like this.

Even with things like RCU, the lifetime of the object should be about
refcounts, and the RCU thing should be purely about "I'm doing
asynchronous lookups that aren't protected by any locks outside this
object, so I may see objects that are being torn down".

I absolutely detest the notion of "let's make locking be tied to
object lifetimes".

Note that locks *outside* the object are obviously very very normal,
but then the *lock* has a totally different lifetime entirely, and the
lifetime of the lock has nothing to do with the lifetime of the object.
Please don't confuse the two.

This was eventpoll being completely broken. As usual.

We've had less eventpoll breakage lately than we historically used to
have, but that's hopefully because that horrid pile is not actively
developed any more, and is slowly - oh so very slowly - getting fixed.

I'm very sad that io_uring ended up with eventpoll support, but that
was sold on the premise that it makes it easier to incrementally turn
some eventpoll user into an io_uring user. I certainly hope it leads to
less epoll use in the long run, rather than more.

Linus
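
To make the refcount rule above concrete, here is a rough sketch of the
preferred shape; struct obj and its helpers are hypothetical and not
taken from any patch in this thread:

#include <linux/mutex.h>
#include <linux/refcount.h>
#include <linux/slab.h>

struct obj {
        refcount_t      ref;    /* lifetime is governed only by this */
        struct mutex    lock;   /* protects the payload, not the lifetime */
        /* ... payload ... */
};

static void obj_put(struct obj *o)
{
        if (refcount_dec_and_test(&o->ref))
                kfree(o);
}

/* The caller is assumed to already hold a reference on @o. */
static void obj_do_work_and_put(struct obj *o)
{
        mutex_lock(&o->lock);
        /* ... operate on the payload ... */
        mutex_unlock(&o->lock);

        /*
         * The caller's reference is dropped only after mutex_unlock()
         * has returned, so the mutex memory stays valid for the whole
         * unlock path even if another task drops its own reference
         * concurrently.
         */
        obj_put(o);
}

Holding the reference across the whole lock/unlock section is exactly
the rule stated in the mutex-design.rst hunk above.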
On Wed, 9 Jul 2025 at 11:19, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I absolutely detest the notion of "let's make locking be tied to
> object lifetimes".

Side note: I wonder if there's any way to detect this kind of race in
general.

And I suspect it would involve the exact *opposite* of your patch: make
mutex_unlock() actively cause preemption after it has released the lock
but before it has done the final accesses.

Linus
On 7/9/25 2:21 PM, Linus Torvalds wrote:
> On Wed, 9 Jul 2025 at 11:19, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> I absolutely detest the notion of "let's make locking be tied to
>> object lifetimes".
> Side note: I wonder if there's any way to detect this kind of race in
> general.
>
> And I suspect it would involve the exact *opposite* of your patch:
> make mutex_unlock() actively cause preemption after it has released
> the lock but before it has done the final accesses.

I think we can do a cond_resched() under CONFIG_DEBUG_MUTEXES and
CONFIG_KASAN. We certainly don't want to do that with a production
kernel.

Cheers,
Longman
On Wed, 9 Jul 2025 at 11:21, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And I suspect it would involve the exact *opposite* of your patch:
> make mutex_unlock() actively cause preemption after it has released
> the lock but before it has done the final accesses.

.. sadly, I suspect we have a ton of mutex_unlock() users in atomic
contexts, so we probably can't do that. It's not like you *should* do
it, but I don't think we've ever disallowed it.

You can't use mutex_unlock from interrupts etc, but you can use it
while holding a spinlock.

Linus
On 7/9/25 2:28 PM, Linus Torvalds wrote:
> On Wed, 9 Jul 2025 at 11:21, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> And I suspect it would involve the exact *opposite* of your patch:
>> make mutex_unlock() actively cause preemption after it has released
>> the lock but before it has done the final accesses.
> .. sadly, I suspect we have a ton of mutex_unlock() users in atomic
> contexts, so we probably can't do that. It's not like you *should* do
> it, but I don't think we've ever disallowed it.
>
> You can't use mutex_unlock from interrupts etc, but you can use it
> while holding a spinlock.

I have just sent out another mutex patch to enforce a context switch in
__mutex_unlock_slowpath() under the right context.

As for this one, you are right that hiding it may not be the best idea.
So I am going to drop it.

Cheers,
Longman
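
Purely as a sketch of the debug-only idea discussed in this subthread
(this is not the follow-up patch Longman refers to; the helper name and
the exact config gating are assumptions), such a hook might look like:

#include <linux/kernel.h>
#include <linux/preempt.h>
#include <linux/sched.h>

/*
 * Hypothetical debug-only hook: deliberately widen the window between
 * releasing the mutex and touching wait_lock so that KASAN is more
 * likely to catch lifetime bugs like the eventpoll one.  Gated on
 * preemptible() because mutex_unlock() may legally be called while
 * holding a spinlock, as noted above.
 */
static inline void mutex_unlock_debug_delay(void)
{
        if (IS_ENABLED(CONFIG_DEBUG_MUTEXES) &&
            IS_ENABLED(CONFIG_KASAN) &&
            preemptible())
                cond_resched();
}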