[PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath()
Posted by Waiman Long 2 months, 4 weeks ago
Jann reported a possible UAF scenario where a task in the mutex_unlock()
path had released the mutex and was about to acquire the wait_lock
to check out the waiters. In the interim, another task could come in,
acquire and release the mutex and then free the memory object holding
the mutex after that.

Thread A                      Thread B
========                      ========
                              eventpoll_release_file
                                mutex_lock
                                  [success on trylock fastpath]
                                __ep_remove
                                  ep_refcount_dec_and_test
                                    [drop refcount from 2 to 1]

ep_eventpoll_release
  ep_clear_and_put
    mutex_lock
      __mutex_lock_slowpath
        __mutex_lock
          __mutex_lock_common
            __mutex_add_waiter
              [enqueue waiter]
              [set MUTEX_FLAG_WAITERS]

                                mutex_unlock
                                  __mutex_unlock_slowpath
                                    atomic_long_try_cmpxchg_release
                                      [reads MUTEX_FLAG_WAITERS]
                                      [drops lock ownership]

            __mutex_trylock
              [success]
            __mutex_remove_waiter
    ep_refcount_dec_and_test
      [drop refcount from 1 to 0]
    mutex_unlock
    ep_free
      kfree(ep)

                                    raw_spin_lock_irqsave(&lock->wait_lock)
                                      *** UAF WRITE ***

This race condition is especially likely if a preemption happens right
after releasing the lock but before acquiring the wait_lock. Rwsem's
__up_write() and __up_read() helpers already disable preemption to
minimize this vulnerable time period; do the same for
__mutex_unlock_slowpath() to reduce the chance of hitting this race.

Also add a note in Documentation/locking/mutex-design.rst to suggest
that callers can use kfree_rcu() to delay the actual memory free and
eliminate this UAF scenario.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/locking/mutex-design.rst | 6 ++++--
 kernel/locking/mutex.c                 | 6 ++++++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/locking/mutex-design.rst b/Documentation/locking/mutex-design.rst
index 7c30b4aa5e28..51a3a28ca830 100644
--- a/Documentation/locking/mutex-design.rst
+++ b/Documentation/locking/mutex-design.rst
@@ -117,8 +117,10 @@ the structure anymore.
 
 The mutex user must ensure that the mutex is not destroyed while a
 release operation is still in progress - in other words, callers of
-mutex_unlock() must ensure that the mutex stays alive until mutex_unlock()
-has returned.
+mutex_unlock() must ensure that the mutex stays alive until
+mutex_unlock() has returned. One possible way to do that is to use
+kfree_rcu() or its variants to delay the actual freeing of the memory
+object containing the mutex.
 
 Interfaces
 ----------
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a39ecccbd106..d33f36d305fb 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -912,9 +912,15 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 	 * Release the lock before (potentially) taking the spinlock such that
 	 * other contenders can get on with things ASAP.
 	 *
+	 * Preemption is disabled to minimize the time gap between releasing
+	 * the lock and acquiring the wait_lock. Callers may consider using
+	 * kfree_rcu() if the memory holding the mutex may be freed after
+	 * another mutex_unlock() call to ensure that UAF will not happen.
+	 *
 	 * Except when HANDOFF, in that case we must not clear the owner field,
 	 * but instead set it to the top waiter.
 	 */
+	guard(preempt)();
 	owner = atomic_long_read(&lock->owner);
 	for (;;) {
 		MUTEX_WARN_ON(__owner_task(owner) != current);
-- 
2.50.0
Re: [PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath()
Posted by Waiman Long 2 months, 4 weeks ago
On 7/9/25 2:05 PM, Waiman Long wrote:
> Jann reported a possible UAF scenario where a task in the mutex_unlock()
> path had released the mutex and was about to acquire the wait_lock
> to check out the waiters. In the interim, another task could come in,
> acquire and release the mutex and then free the memory object holding
> the mutex after that.
>
> Thread A                      Thread B
> ========                      ========
>                                eventpoll_release_file
>                                  mutex_lock
>                                    [success on trylock fastpath]
>                                  __ep_remove
>                                    ep_refcount_dec_and_test
>                                      [drop refcount from 2 to 1]
>
> ep_eventpoll_release
>    ep_clear_and_put
>      mutex_lock
>        __mutex_lock_slowpath
>          __mutex_lock
>            __mutex_lock_common
>              __mutex_add_waiter
>                [enqueue waiter]
>                [set MUTEX_FLAG_WAITERS]
>
>                                  mutex_unlock
>                                    __mutex_unlock_slowpath
>                                      atomic_long_try_cmpxchg_release
>                                        [reads MUTEX_FLAG_WAITERS]
>                                        [drops lock ownership]
>
>              __mutex_trylock
>                [success]
>              __mutex_remove_waiter
>      ep_refcount_dec_and_test
>        [drop refcount from 1 to 0]
>      mutex_unlock
>      ep_free
>        kfree(ep)
>
>                                      raw_spin_lock_irqsave(&lock->wait_lock)
>                                        *** UAF WRITE ***
>
> This race condition is especially likely if a preemption happens right
> after releasing the lock but before acquiring the wait_lock. Rwsem's
> __up_write() and __up_read() helpers already disable preemption to
> minimize this vulnerable time period; do the same for
> __mutex_unlock_slowpath() to reduce the chance of hitting this race.
>
> Also add a note in Documentation/locking/mutex-design.rst to suggest
> that callers can use kfree_rcu() to delay the actual memory free and
> eliminate this UAF scenario.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>   Documentation/locking/mutex-design.rst | 6 ++++--
>   kernel/locking/mutex.c                 | 6 ++++++
>   2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/locking/mutex-design.rst b/Documentation/locking/mutex-design.rst
> index 7c30b4aa5e28..51a3a28ca830 100644
> --- a/Documentation/locking/mutex-design.rst
> +++ b/Documentation/locking/mutex-design.rst
> @@ -117,8 +117,10 @@ the structure anymore.
>   
>   The mutex user must ensure that the mutex is not destroyed while a
>   release operation is still in progress - in other words, callers of
> -mutex_unlock() must ensure that the mutex stays alive until mutex_unlock()
> -has returned.
> +mutex_unlock() must ensure that the mutex stays alive until
> +mutex_unlock() has returned. One possible way to do that is to use
> +kfree_rcu() or its variants to delay the actual freeing of the memory
> +object containing the mutex.
>   
>   Interfaces
>   ----------
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index a39ecccbd106..d33f36d305fb 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -912,9 +912,15 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>   	 * Release the lock before (potentially) taking the spinlock such that
>   	 * other contenders can get on with things ASAP.
>   	 *
> +	 * Preemption is disabled to minimize the time gap between releasing
> +	 * the lock and acquiring the wait_lock. Callers may consider using
> +	 * kfree_rcu() if the memory holding the mutex may be freed after
> +	 * another mutex_unlock() call to ensure that UAF will not happen.
> +	 *
>   	 * Except when HANDOFF, in that case we must not clear the owner field,
>   	 * but instead set it to the top waiter.
>   	 */
> +	guard(preempt)();
>   	owner = atomic_long_read(&lock->owner);
>   	for (;;) {
>   		MUTEX_WARN_ON(__owner_task(owner) != current);

Note that Linus' patch should fix this particular eventpoll UAF race 
condition, but there may be other code paths where similar situations 
can happen.

Cheers,
Longman
Re: [PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath()
Posted by Linus Torvalds 2 months, 4 weeks ago
On Wed, 9 Jul 2025 at 11:06, Waiman Long <longman@redhat.com> wrote:
>
> This race condition is possible especially if a preemption happens right
> after releasing the lock but before acquiring the wait_lock. Rwsem's
> __up_write() and __up_read() helpers have already disabled
> preemption to minimize this vulnernable time period, do the same for
> __mutex_unlock_slowpath() to minimize the chance of this race condition.

I think this patch is actively detrimental, in that it only helps hide
the bug. The bug still exists, it's just harder to hit.

Maybe that is worth it as a "hardening" thing, but I feel this makes
people believe even *more* that they can use mutexes for object
lifetimes.

And that's a fundamentally buggy assumption. Locking is about mutual
exclusion, not lifetimes. There are very specific things where
"release the lock" also releases an object, but they should be
considered very very special.

All objects that aren't purely thread-local should have lifetimes that
depend *solely* on refcounts. Anything else is typically a serious
bug, or needs to be actively discouraged, not encouraged like this.
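
To make the distinction concrete, a minimal sketch of the refcount-based
lifetime pattern described above (hypothetical names, not from any
particular subsystem):

	struct obj {
		refcount_t	ref;
		struct mutex	lock;
		/* ... */
	};

	static struct obj *obj_get(struct obj *o)
	{
		refcount_inc(&o->ref);
		return o;
	}

	static void obj_put(struct obj *o)
	{
		/* Free only when nobody can still be inside o->lock. */
		if (refcount_dec_and_test(&o->ref))
			kfree(o);
	}

	static void obj_do_work(struct obj *o)
	{
		/*
		 * The caller holds a reference across the whole locked
		 * region and drops it only after mutex_unlock() returns,
		 * so the unlock path can never race with the final free.
		 */
		mutex_lock(&o->lock);
		/* ... operate on o ... */
		mutex_unlock(&o->lock);
	}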

Even with things like RCU, the lifetime of the object should be about
refcounts, and the RCU thing should be purely about "I'm doing
asynchronous lookups that aren't protected by any locks outside this
object, so I may see objects that are being torn down".

I absolutely detest the notion of "let's make locking be tied to
object lifetimes".

Note that locks *outside* the object is obviously very very normal,
but then the *lock* has a totally different lifetime entirely, and the
lifetime of the lock has nothing to do with the lifetime of the
object.

Please don't confuse the two. This was eventpoll being completely
broken. As usual. We've had less eventpoll breakage lately than we
historically used to have, but that's hopefully because that horrid
pile is not actively developed any more, and is slowly - oh so very
slowly - getting fixed.

I'm very sad that io_uring ended up with eventpoll support, but that
was sold on the premise that it makes it easier to incrementally turn
some eventpoll user into an io_uring user. I certainly hope it leads
to less epoll use in the long run, rather than more.

             Linus
Re: [PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath()
Posted by Linus Torvalds 2 months, 4 weeks ago
On Wed, 9 Jul 2025 at 11:19, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I absolutely detest the notion of "let's make locking be tied to
> object lifetimes".

Side note: I wonder if there's any way to detect this kind of race in general.

And I suspect it would involve the exact *opposite* of your patch:
make mutex_unlock() actively cause preemption after it has released
the lock but before it has done the final accesses.

                 Linus
Re: [PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath()
Posted by Waiman Long 2 months, 4 weeks ago
On 7/9/25 2:21 PM, Linus Torvalds wrote:
> On Wed, 9 Jul 2025 at 11:19, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> I absolutely detest the notion of "let's make locking be tied to
>> object lifetimes".
> Side note: I wonder if there's any way to detect this kind of race in general.
>
> And I suspect it would involve the exact *opposite* of your patch:
> make mutex_unlock() actively cause preemption after it has released
> the lock but before it has done the final accesses.

I think we can do a cond_resched() under CONFIG_DEBUG_MUTEXES and
CONFIG_KASAN. We certainly don't want to do that in a production kernel.
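
Something along these lines, perhaps (just a sketch of the idea, not a
tested patch), placed in __mutex_unlock_slowpath() after ownership has
been released and before the wait_lock is taken:

	#if defined(CONFIG_DEBUG_MUTEXES) && defined(CONFIG_KASAN)
		/*
		 * Widen the race window on debug kernels so that KASAN has
		 * a better chance of catching callers that free the memory
		 * holding the mutex too early. Skip it if the caller is in
		 * an atomic context (e.g. holding a spinlock).
		 */
		if (preemptible())
			cond_resched();
	#endif
		raw_spin_lock_irqsave(&lock->wait_lock, flags);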

Cheers,
Longman
Re: [PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath()
Posted by Linus Torvalds 2 months, 4 weeks ago
On Wed, 9 Jul 2025 at 11:21, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And I suspect it would involve the exact *opposite* of your patch:
> make mutex_unlock() actively cause preemption after it has released
> the lock but before it has done the final accesses.

.. sadly, I suspect we have a ton of mutex_unlock() users in atomic
contexts, so we probably can't do that. It's not like you *should* do
it, but I don't think we've ever disallowed it.

You can't use mutex_unlock from interrupts etc, but you can use it
while holding a spinlock.

           Linus
Re: [PATCH] locking/mutex: Disable preemption in __mutex_unlock_slowpath()
Posted by Waiman Long 2 months, 4 weeks ago
On 7/9/25 2:28 PM, Linus Torvalds wrote:
> On Wed, 9 Jul 2025 at 11:21, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> And I suspect it would involve the exact *opposite* of your patch:
>> make mutex_unlock() actively cause preemption after it has released
>> the lock but before it has done the final accesses.
> .. sadly, I suspect we have a ton of mutex_unlock() users in atomic
> contexts, so we probably can't do that. It's not like you *should* do
> it, but I don't think we've ever disallowed it.
>
> You can't use mutex_unlock from interrupts etc, but you can use it
> while holding a spinlock.

I have just sent out another mutex patch to force a context switch in
__mutex_unlock_slowpath() when the calling context allows it.

As for this one, you are right that hiding it may not be the best idea. 
So I am going to drop it.

Cheers,
Longman