[PATCH] sched: Fix race in rt_mutex_pre_schedule by removing non-atomic fetch_and_set

cuiguoqi posted 1 patch 1 month, 1 week ago
During Wound/Wait testing on PREEMPT_RT, a WARNING was hit:

  WARNING: CPU: 0 PID: 0 at kernel/sched/core.c:7085 rt_mutex_pre_schedule+0xa8/0x108
  Call trace:
   rt_mutex_pre_schedule+0xa8/0x108
   __ww_rt_mutex_lock+0x1d4/0x300
   ww_mutex_lock+0x1c/0x30

The issue stems from the non-atomic `fetch_and_set` macro:
    #define fetch_and_set(x, v) ({ int _x = (x); (x) = (v); _x; })

The macro provides neither atomicity nor memory ordering, so under
preemption or interrupts its read-modify-write can race and leave
`current->sched_rt_mutex` corrupted.

Since this flag is only used for lockdep assertions and accessed per-task,
replace the unsafe macro with direct assignment and explicit state checks:

- In rt_mutex_pre_schedule(): assert is 0 before setting to 1.
- In rt_mutex_post_schedule(): assert is 1 before clearing to 0.

This fixes the false-positive warning without needing atomic operations.

Signed-off-by: cuiguoqi <cuiguoqi@kylinos.cn>
---
 kernel/sched/core.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e89a6eeadba..fb4c446e46f7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7078,11 +7078,11 @@ const struct sched_class *__setscheduler_class(struct task_struct *p, int prio)
  * name such that if someone were to implement this function we get to compare
  * notes.
  */
-#define fetch_and_set(x, v) ({ int _x = (x); (x) = (v); _x; })
 
 void rt_mutex_pre_schedule(void)
 {
-	lockdep_assert(!fetch_and_set(current->sched_rt_mutex, 1));
+	lockdep_assert(!current->sched_rt_mutex);
+	current->sched_rt_mutex = 1;
 	sched_submit_work(current);
 }
 
@@ -7095,7 +7095,9 @@ void rt_mutex_schedule(void)
 void rt_mutex_post_schedule(void)
 {
 	sched_update_worker(current);
-	lockdep_assert(fetch_and_set(current->sched_rt_mutex, 0));
+	lockdep_assert(current->sched_rt_mutex);
+	current->sched_rt_mutex = 0;
+
 }
 
 /*
-- 
2.25.1
Re: [PATCH] sched: Fix race in rt_mutex_pre_schedule by removing non-atomic fetch_and_set
Posted by Steven Rostedt 1 month, 1 week ago
On Tue, 26 Aug 2025 19:08:33 +0800
cuiguoqi <cuiguoqi@kylinos.cn> wrote:

> @@ -7078,11 +7078,11 @@ const struct sched_class *__setscheduler_class(struct task_struct *p, int prio)
>   * name such that if someone were to implement this function we get to compare
>   * notes.
>   */
> -#define fetch_and_set(x, v) ({ int _x = (x); (x) = (v); _x; })
>  
>  void rt_mutex_pre_schedule(void)
>  {
> -	lockdep_assert(!fetch_and_set(current->sched_rt_mutex, 1));
> +	lockdep_assert(!current->sched_rt_mutex);
> +	current->sched_rt_mutex = 1;
>  	sched_submit_work(current);
>  }
>  
> @@ -7095,7 +7095,9 @@ void rt_mutex_schedule(void)
>  void rt_mutex_post_schedule(void)
>  {
>  	sched_update_worker(current);
> -	lockdep_assert(fetch_and_set(current->sched_rt_mutex, 0));
> +	lockdep_assert(current->sched_rt_mutex);
> +	current->sched_rt_mutex = 0;
> +
>  }

Honestly, without adding a READ_ONCE() or barrier() I don't see how your
change would prevent the compiler from making the code any different?

It may have "fixed" your issue because the compiler may have done things
differently, but this change doesn't guarantee anything.

Also, could you show an example of how current->sched_rt_mutex could be
corrupted?

-- Steve
Re: [PATCH] sched: Fix race in rt_mutex_pre_schedule by removing non-atomic fetch_and_set
Posted by cuiguoqi 1 month, 1 week ago
The issue arises during EDEADLK testing in `lib/locking-selftest.c` when `is_wait_die=1`.

In this mode, the global `debug_locks` flag is cleared via `__debug_locks_off()` (which calls `xchg(&debug_locks, 0)`) on the blocking path of `rt_mutex_slowlock()`, specifically in `rt_mutex_slowlock_block()`:

  rt_mutex_slowlock()
    rt_mutex_pre_schedule()
    rt_mutex_slowlock_block()
      DEBUG_LOCKS_WARN_ON(ww_ctx->contending_lock)
        __debug_locks_off();  // xchg(&debug_locks, 0)

However, `rt_mutex_post_schedule()` still performs:

  lockdep_assert(fetch_and_set(current->sched_rt_mutex, 0));

Which expands to:

  do {
      WARN_ON(debug_locks && !( ({ int _x = current->sched_rt_mutex; current->sched_rt_mutex = 0; _x; }) ));
  } while (0)

The generated assembly shows that the entire assertion is conditional on `debug_locks`:

  adrp    x0, debug_locks
  ldr     w0, [x0]
  cbz     w0, .label_skip_warn   // Skip WARN if debug_locks == 0

This means: if `debug_locks` was cleared earlier, the check on `current->sched_rt_mutex` is effectively skipped, and the flag may remain set.

Later, when the same task re-enters `rt_mutex_slowlock`, it calls `lockdep_reset()` to re-enable `debug_locks`, but the stale `current->sched_rt_mutex` state (left over from the previous lock attempt) causes a false-positive warning in `rt_mutex_pre_schedule()`:

  WARNING: CPU: 0 PID: 0 at kernel/sched/core.c:7085 rt_mutex_pre_schedule+0xa8/0x108

Because:
  - `rt_mutex_pre_schedule()` asserts `!current->sched_rt_mutex`
  - But the flag was never properly cleared due to the skipped post-schedule check.

This is not a data race on the flag itself, but a **state inconsistency caused by conditional debugging logic** — the `fetch_and_set` macro is not atomic, but more importantly its side effect lives inside the assertion, which is bypassed entirely when `debug_locks` is off, breaking the expected state transition.
Re: [PATCH] sched: Fix race in rt_mutex_pre_schedule by removing non-atomic fetch_and_set
Posted by Sebastian Andrzej Siewior 1 month, 1 week ago
On 2025-08-26 09:56:15 [-0400], Steven Rostedt wrote:
> Honestly, without adding a READ_ONCE() or barrier() I don't see how your
> change would prevent the compiler from making the code any different?

Other than that, that flag is supposed to be only set/cleared by the
thread itself. There should be no need for it to be atomic.

> It may have "fixed" your issue because the compiler may have done things
> differently, but this change doesn't guarantee anything.
> 
> Also, could you show an example of how current->sched_rt_mutex could be
> corrupted?

That would be important.

> -- Steve

Sebastian