sched/core: Fix proxy-exec/core-sched interactions

[PATCH v2 0/2] sched/core: Fix proxy-exec/core-sched interactions

Posted by Vasily Gorbik 1 month, 1 week ago

v1 [1] consisted of a fix for a scheduler corruption where
try_steal_cookie() could migrate a proxy-exec donor away from the source
rq while that rq still used it as the active scheduling context.
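
As a rough illustration of this first issue, the toy model below
(user-space C, not kernel code) shows the kind of guard that keeps a
steal path from taking a task the source rq still uses as its
scheduling context. The struct layouts, field names and toy_can_steal()
are hypothetical stand-ins; the actual check in patch 1 may look
different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel's task and runqueue; not the real layouts. */
struct toy_task {
	unsigned long core_cookie;
};

struct toy_rq {
	struct toy_task *curr;   /* execution context: what actually runs */
	struct toy_task *donor;  /* scheduling context: whose turn it is */
};

/*
 * Hypothetical guard: a task may only be pulled to another rq if the
 * source rq uses it neither as execution nor as scheduling context,
 * and its cookie matches the one the destination needs.
 */
static bool toy_can_steal(const struct toy_rq *src, const struct toy_task *p,
			  unsigned long wanted_cookie)
{
	assert(src != NULL && p != NULL);

	if (p == src->curr || p == src->donor)
		return false;
	return p->core_cookie == wanted_cookie;
}
```

In this model a task that is only the donor (and not curr) is still
refused, which is the case the paragraph above describes.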

Prateek pointed out [2] a separate proxy-exec/core-sched issue: after
pick_next_task() selects a core cookie compatible donor, find_proxy_task()
can replace the execution context with a mutex owner with a different
cookie.

This v2 keeps the donor steal fix as patch 1 and adds patch 2 to reject
mismatched final proxy owners.
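
To make the second issue concrete, the sketch below models following a
chain of blocked tasks to the final lock owner and rejecting that owner
when its core cookie differs from the donor's. This is toy user-space C
under assumed names (px_task, blocked_on_owner, px_find_proxy()); the
real find_proxy_task() handles far more cases, and what patch 2 does
after a mismatch is not modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* Toy task: a cookie plus the owner of the mutex it is blocked on, if any. */
struct px_task {
	unsigned long core_cookie;
	struct px_task *blocked_on_owner;  /* NULL when runnable */
};

/*
 * Follow the blocked-on chain from the donor to the task that can
 * actually run. Return it only if its cookie matches the donor's;
 * return NULL to tell the caller to pick something else.
 * Assumes the chain has no cycles.
 */
static struct px_task *px_find_proxy(struct px_task *donor)
{
	struct px_task *owner = donor;

	assert(donor != NULL);
	while (owner->blocked_on_owner)
		owner = owner->blocked_on_owner;

	if (owner->core_cookie != donor->core_cookie)
		return NULL;  /* would break core-sched cookie matching */
	return owner;
}
```

Only the final owner's cookie is compared here; intermediate tasks in
the chain never become the execution context in this model.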

v1 reported the issue as reproduced on an s390 LPAR, but it seems to be
easily reproducible with the strace test suite ("make -j$(nproc) check")
on any system with SMT, CONFIG_SCHED_CORE=y and CONFIG_SCHED_PROXY_EXEC=y
enabled, e.g. on x86 KVM with -smp cpus=16,sockets=1,cores=8,threads=2:

[  283.181298] WARNING: kernel/sched/fair.c:5788 at put_prev_entity+0x4f/0x90, CPU#2: unshare-report-/27895
[  283.185230] Modules linked in:
[  283.186480] CPU: 2 UID: 0 PID: 27895 Comm: unshare-report- Not tainted 7.1.0-rc2-00076-g74fe02ce122a #26 PREEMPT(full)
[  283.190699] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-9.fc43 06/10/2025
[  283.194482] RIP: 0010:put_prev_entity+0x4f/0x90
[  283.196591] Code: fd ff ff 80 7b 58 00 74 e0 66 90 48 89 de 48 89 ef e8 85 a9 ff ff 31 d2 48 89 de 48 89 ef e8 d8 d6 ff ff 48 39 5d 58 74 c6 90 <0f> 0b 90 48 c7 45 58 00 00 00 00 5b 5d e9 7f cb 31 01 48 83 bb b8
[  283.205157] RSP: 0018:ffffc90009177af0 EFLAGS: 00010006
[  283.207443] RAX: 0000000000000000 RBX: ffff888102de8080 RCX: 000000000004f800
[  283.210442] RDX: 0000000000000000 RSI: 0000000000027c00 RDI: 00000041dd7d5860
[  283.213528] RBP: ffff888116cb2200 R08: ffff888116fe8080 R09: 0000000000000002
[  283.216766] R10: 0000000005bf08d6 R11: 00000000000002b7 R12: ffff8881192da4a0
[  283.219872] R13: ffff88813a3ec801 R14: 0000000000000001 R15: ffff88813a3ec800
[  283.222777] FS:  00007f6b5ca21780(0000) GS:ffff8881b628c000(0000) knlGS:0000000000000000
[  283.226171] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  283.228493] CR2: 000000001319e358 CR3: 000000001f322000 CR4: 00000000000006f0
[  283.231951] Call Trace:
[  283.233137]  <TASK>
[  283.234066]  put_prev_task_fair+0x1d/0x40
[  283.235943]  __schedule+0x1165/0x28d0
[  283.237599]  ? __resched_curr+0x372/0x3a0
[  283.239413]  ? detach_task+0xc1/0xd0
[  283.241015]  ? lockdep_hardirqs_on_prepare+0xd7/0x190
[  283.243170]  ? trace_hardirqs_on+0x18/0x100
[  283.244852]  preempt_schedule+0x2e/0x50
[  283.246707]  preempt_schedule_thunk+0x16/0x30
[  283.248680]  ? _raw_spin_unlock_irqrestore+0x3f/0x50
[  283.251012]  __mutex_unlock_slowpath+0x2d9/0x3d0
[  283.253196]  pcpu_alloc_noprof+0x3e6/0xbd0
[  283.255187]  alloc_vfsmnt+0xd7/0x1e0
[  283.256651]  clone_mnt+0x1e/0x280
[  283.258061]  copy_tree+0x127/0x420
[  283.259449]  copy_mnt_ns+0x13f/0x520
[  283.260926]  create_new_namespaces+0x54/0x2e0
[  283.262974]  unshare_nsproxy_namespaces+0x7e/0xb0
[  283.265317]  ksys_unshare+0x196/0x550
[  283.267097]  __x64_sys_unshare+0xd/0x20
[  283.268876]  do_syscall_64+0xf3/0x6a0
[  283.270611]  ? exc_page_fault+0xfa/0x240
[  283.272329]  ? __irq_exit_rcu+0x3c/0x100
[  283.274006]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  283.276085] RIP: 0033:0x7f6b5cb1730d
[  283.277509] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 5a 0f 00 f7 d8 64 89 01 48
[  283.285484] RSP: 002b:00007ffef4e305d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
[  283.288741] RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007f6b5cb1730d
[  283.291711] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000020000
[  283.294664] RBP: 0000000000000000 R08: 0000000000000000 R09: 3230345b3a737475
[  283.298149] R10: 000000000000eefe R11: 0000000000000246 R12: 00000000000000f0
[  283.301268] R13: 0000000000000001 R14: 00007f6b5cd53000 R15: 0000000000404df0
[  283.304570]  </TASK>
[  283.305583] irq event stamp: 2018
[  283.307085] hardirqs last  enabled at (2017): [<ffffffff8269777f>] _raw_spin_unlock_irqrestore+0x3f/0x50
[  283.311026] hardirqs last disabled at (2018): [<ffffffff82689aff>] __schedule+0x13df/0x28d0
[  283.314726] softirqs last  enabled at (2008): [<ffffffff81324f40>] __irq_exit_rcu+0xc0/0x100
[  283.318427] softirqs last disabled at (2001): [<ffffffff81324f40>] __irq_exit_rcu+0xc0/0x100
[  283.321920] ---[ end trace 0000000000000000 ]---
[  283.323878] BUG: kernel NULL pointer dereference, address: 0000000000000059
[  283.326033] #PF: supervisor read access in kernel mode
[  283.327357] #PF: error_code(0x0000) - not-present page
[  283.328698] PGD 800000000a8c5067 P4D 800000000a8c5067 PUD 12879067 PMD 0
[  283.329491] Oops: Oops: 0000 [#1] SMP PTI
[  283.329796] CPU: 2 UID: 0 PID: 0 Comm: swapper/2 Tainted: G        W           7.1.0-rc2-00076-g74fe02ce122a #26 PREEMPT(full)
[  283.331183] Tainted: [W]=WARN
[  283.331468] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-9.fc43 06/10/2025
[  283.332346] RIP: 0010:pick_task_fair+0x2d/0xb0
[  283.332735] Code: fa 8b 97 10 01 00 00 85 d2 0f 84 92 00 00 00 53 48 89 fb 48 83 ec 08 48 8d bb 00 01 00 00 eb 21 be 01 00 00 00 e8 13 8b ff ff <80> 78 59 00 75 3d 48 85 c0 74 48 48 8b b8 b8 00 00 00 48 85 ff 74
[  283.334364] RSP: 0018:ffffc900000b7e20 EFLAGS: 00010086
[  283.334992] RAX: 0000000000000000 RBX: ffff88813a5ec800 RCX: 041dd83271100000
[  283.335731] RDX: 0000000000000000 RSI: 0000000000200000 RDI: ffff888116cb2200
[  283.336404] RBP: ffffc900000b7f20 R08: 041dd83271100000 R09: 0000000000200000
[  283.337852] R10: 00000005252d41b2 R11: 0000000000000001 R12: 0000000000000002
[  283.338802] R13: ffff888025c18000 R14: 0000000000000003 R15: ffffffff84160800
[  283.339827] FS:  0000000000000000(0000) GS:ffff8881b628c000(0000) knlGS:0000000000000000
[  283.341158] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  283.342502] CR2: 0000000000000059 CR3: 000000001f322000 CR4: 00000000000006f0
[  283.345455] Call Trace:
[  283.347033]  <TASK>
[  283.348350]  __schedule+0xc65/0x28d0
[  283.349703]  ? tick_nohz_idle_exit+0x66/0x160
[  283.350882]  ? do_idle+0x17c/0x2b0
[  283.351454]  schedule_idle+0x1d/0x40
[  283.352017]  cpu_startup_entry+0x24/0x30
[  283.352594]  start_secondary+0xf8/0x100
[  283.353272]  common_startup_64+0x13e/0x148
[  283.353840]  </TASK>

Tested with the strace test suite as well as hackbench and stress-ng on
s390 and x86.

v1 -> v2:
- added a fix to prevent proxy-exec of unmatched cookie lock owners

[1] https://lore.kernel.org/all/c00-01.ttedd70@ub.hpns/
[2] https://lore.kernel.org/all/10282ce9-f4ae-498f-9b57-f4e1e61fffbc@amd.com/

Vasily Gorbik (2):
  sched/core: Don't steal a proxy-exec donor
  sched/core: Don't proxy-exec unmatched cookie lock owners

 kernel/sched/core.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

-- 
2.53.0

Re: [PATCH v2 0/2] sched/core: Fix proxy-exec/core-sched interactions

Posted by John Stultz 1 month ago

On Thu, May 7, 2026 at 3:42 AM Vasily Gorbik <gor@linux.ibm.com> wrote:
>
> v1 [1] consisted of a fix for a scheduler corruption where
> try_steal_cookie() could migrate a proxy-exec donor away from the source
> rq while that rq still used it as the active scheduling context.
>
> Prateek pointed out [2] a separate proxy-exec/core-sched issue: after
> pick_next_task() selects a core cookie compatible donor, find_proxy_task()
> can replace the execution context with a mutex owner with a different
> cookie.
>
> This v2 keeps the donor steal fix as patch 1 and adds patch 2 to reject
> mismatched final proxy owners.
>
> The v1 reported the issue reproduced on s390 LPAR, but it seems to be
> easily reproducible with strace test suite "make -j$(nproc) check" on
> any system with SMT, CONFIG_SCHED_CORE=y and CONFIG_SCHED_PROXY_EXEC=y
> enabled, e.g. on x86 KVM with -smp cpus=16,sockets=1,cores=8,threads=2:
>

Vasily! Thank you so much for reporting this and working out fixes
(along with K Prateek!)

Apologies for being slow to reply, I've been under the weather.

I really appreciate this reproducer detail, but I've so far not been
able to trip this issue up (SCHED_CORE=y, SCHED_PROXY_EXEC=y and using
the qemu arguments you included above). Could you mail me your .config
in case something else is needed?

thanks
-john

Re: [PATCH v2 0/2] sched/core: Fix proxy-exec/core-sched interactions

Posted by John Stultz 1 month ago

On Tue, May 12, 2026 at 2:17 PM John Stultz <jstultz@google.com> wrote:
> On Thu, May 7, 2026 at 3:42 AM Vasily Gorbik <gor@linux.ibm.com> wrote:
> >
> > v1 [1] consisted of a fix for a scheduler corruption where
> > try_steal_cookie() could migrate a proxy-exec donor away from the source
> > rq while that rq still used it as the active scheduling context.
> >
> > Prateek pointed out [2] a separate proxy-exec/core-sched issue: after
> > pick_next_task() selects a core cookie compatible donor, find_proxy_task()
> > can replace the execution context with a mutex owner with a different
> > cookie.
> >
> > This v2 keeps the donor steal fix as patch 1 and adds patch 2 to reject
> > mismatched final proxy owners.
> >
> > The v1 reported the issue reproduced on s390 LPAR, but it seems to be
> > easily reproducible with strace test suite "make -j$(nproc) check" on
> > any system with SMT, CONFIG_SCHED_CORE=y and CONFIG_SCHED_PROXY_EXEC=y
> > enabled, e.g. on x86 KVM with -smp cpus=16,sockets=1,cores=8,threads=2:
> >
>
> Vasily! Thank you so much for reporting this and working out fixes
> (along with K Prateek!)
>
> Apologies for being slow to reply, I've been under the weather.
>
> I really appreciate this reproducer detail, but I've so far not been
> able to trip this issue up (SCHED_CORE=y, SCHED_PROXY_EXEC=y and using
> the qemu arguments you included above). Could you mail me your .config
> in case something else is needed?

Ok, I think I was able to force it using my priority-inversion-demo: I
took the spots in the run.sh script where we kick off the rename-test
and prefixed them with `coresched new -t pid --`
  https://github.com/johnstultz-work/priority-inversion-demo/blob/main/run.sh#L89

That way the foreground/background tasks run with separate cookies and
that forces proxying across cookies, and with that I've tripped over
the issues you highlight.
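
For readers without the coresched tool at hand, a minimal sketch of
giving the calling thread its own cookie through the PR_SCHED_CORE
prctl() interface follows. The helper name new_core_cookie_for_self()
is made up for illustration, and the assumption that this mirrors what
`coresched new -t pid` does before exec'ing the command has not been
checked against the tool's source.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Values from include/uapi/linux/prctl.h, for older userspace headers. */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE			62
#define PR_SCHED_CORE_CREATE		1
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
#define PR_SCHED_CORE_SCOPE_THREAD	0
#endif

/*
 * Give the calling thread (pid 0 == self) a fresh core-sched cookie.
 * Returns 0 on success or a negative errno if the kernel refuses,
 * for instance when core scheduling is not available.
 */
static int new_core_cookie_for_self(void)
{
	long ret = prctl(PR_SCHED_CORE,
			 (unsigned long)PR_SCHED_CORE_CREATE,
			 0UL,
			 (unsigned long)PR_SCHED_CORE_SCOPE_THREAD,
			 0UL);

	if (ret == -1) {
		assert(errno > 0);
		return -errno;
	}
	return 0;
}
```

Calling this in a child before exec gives that task a cookie distinct
from its parent's, so a lock shared between the two can only be proxied
across cookies.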

That said, I'm still curious to learn more about your x86 environment
and why it tripped so much more easily there, so let me know.

Your patches do seem to resolve things, but I'm also hoping
to find better ways to more thoroughly stress the
proxy+core-sched logic.

thanks again!
-john

Re: [PATCH v2 0/2] sched/core: Fix proxy-exec/core-sched interactions

Posted by Vasily Gorbik 1 month ago

On Tue, May 12, 2026 at 05:48:19PM -0700, John Stultz wrote:
> On Tue, May 12, 2026 at 2:17 PM John Stultz <jstultz@google.com> wrote:
> > On Thu, May 7, 2026 at 3:42 AM Vasily Gorbik <gor@linux.ibm.com> wrote:
> > > The v1 reported the issue reproduced on s390 LPAR, but it seems to be
> > > easily reproducible with strace test suite "make -j$(nproc) check" on
> > > any system with SMT, CONFIG_SCHED_CORE=y and CONFIG_SCHED_PROXY_EXEC=y
> > > enabled, e.g. on x86 KVM with -smp cpus=16,sockets=1,cores=8,threads=2:
> > >
> > I really appreciate this reproducer detail, but I've so far not been
> > able to trip this issue up (SCHED_CORE=y, SCHED_PROXY_EXEC=y and using
> > the qemu arguments you included above). Could you mail me your .config
> > in case something else is needed?
> 
> Ok, I think I was able to force it using my priority-inversion-demo by
> taking the spots in the run.sh script where we kick off the
> rename-test and prefixing it with `coresched new -t pid --`
>   https://github.com/johnstultz-work/priority-inversion-demo/blob/main/run.sh#L89
> 
> That way the foreground/background tasks run with separate cookies and
> that forces proxying across cookies, and with that I've tripped over
> the issues you highlight.
> 
> That said, I'm still curious to learn more about your x86 environment
> and why it tripped so much more easily there, so let me know.

I retried the repro on commit 66182ca873a4 (yesterday's Linus master)
with the same "make -j$(nproc) check".
The claim of "easily reproducible" on x86 KVM with
-smp cpus=16,sockets=1,cores=8,threads=2 and "JUST" CONFIG_SCHED_CORE=y and
CONFIG_SCHED_PROXY_EXEC=y was an overstatement for x86.

But it triggers with at least 50% probability in KVM on my machine with the
config attached. I don't have any large x86 machine available to me,
so my setup is a laptop with an i7-1360P, Fedora 43 on host and guest,
plus the latest strace git.

Compared with the x86 defconfig + CONFIG_SCHED_CORE=y and
CONFIG_SCHED_PROXY_EXEC=y, my best guess is that CONFIG_PREEMPT=y and
CONFIG_PROVE_LOCKING=y might cause the issue to trigger more often.

With just
CONFIG_SCHED_CORE=y
CONFIG_SCHED_PROXY_EXEC=y
CONFIG_PREEMPT=y
I got only a 2/10 repro success rate.
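
With hit rates this low, it may help to estimate how many clean runs
are needed before a fix can be called verified. The sketch below
computes the smallest n with (1 - p)^n <= alpha, treating runs as
independent, which is an assumption and not something measured here;
clean_runs_needed() is an illustrative helper, not part of any test
suite.

```c
#include <assert.h>

/*
 * Smallest number of consecutive clean runs n such that a bug with
 * per-run hit rate p would have shown up with probability >= 1 - alpha,
 * i.e. (1 - p)^n <= alpha. Assumes independent runs.
 */
static unsigned int clean_runs_needed(double p, double alpha)
{
	double miss = 1.0;  /* chance that n runs in a row all miss the bug */
	unsigned int n = 0;

	assert(p > 0.0 && p < 1.0 && alpha > 0.0 && alpha < 1.0);
	while (miss > alpha) {
		miss *= 1.0 - p;
		n++;
	}
	return n;
}
```

For the 2/10 rate above, clean_runs_needed(0.2, 0.05) gives 14; for a
50% rate, clean_runs_needed(0.5, 0.05) gives 5.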

On s390 with 64 SMT-2 cores I've just triggered the problem 3 out of 3
times, even with arch/s390/configs/defconfig, which has:
CONFIG_PREEMPT_LAZY=y
CONFIG_SCHED_CORE=y
CONFIG_SCHED_PROXY_EXEC=y
and no debug options. I wouldn't expect anything particularly special
about s390; it's just the number of cores.