sched/fair: Kernel panics in pick_next_entity

Vishal Chourasia posted 1 patch 2 months ago
sched/fair: Kernel panics in pick_next_entity
Posted by Vishal Chourasia 2 months ago
I've noticed a kernel panic consistently occurring on the mainline v6.11
kernel (see attached dmesg log below). 

The panic occurs almost every time I build the Linux kernel from source.

Steps to Reproduce:

make clean
./scripts/config -e LOCALVERSION_AUTO
./scripts/config --set-str LOCALVERSION -master-with-print
make localmodconfig
make -j8 -s vmlinux modules

>From my investigation, it seems that the function pick_eevdf() can return NULL.
Commit f12e1488 ("sched/fair: Prepare pick_next_task() for delayed dequeue") 
introduces an access on the return value of pick_eevdf(). If 'se' was NULL, 
it can lead to a null pointer dereference. 

# objdump -S vmlinux | grep -C 5 c0000000001cfebc
c0000000001cfeb0:       00 00 00 60     nop
        struct sched_entity *se = pick_eevdf(cfs_rq);
c0000000001cfeb4:       78 f3 c3 7f     mr      r3,r30
c0000000001cfeb8:       01 46 ff 4b     bl      c0000000001c44b8 <pick_eevdf+0x8>
        if (se->sched_delayed) {
c0000000001cfebc:       51 00 23 89     lbz     r9,81(r3)  <<<<<<
        struct sched_entity *se = pick_eevdf(cfs_rq);
c0000000001cfec0:       78 1b 7f 7c     mr      r31,r3
        if (se->sched_delayed) {
c0000000001cfec4:       00 00 09 2c     cmpwi   r9,0
c0000000001cfec8:       98 00 82 40     bne     c0000000001cff60 <pick_next_entity+0xe0>

r3 is NULL which can be verified from the register context shared in the
dmesg logs

Here is the state of my git repository:
# git log --oneline
684a64bf32b6 (HEAD -> master, origin/master, origin/HEAD) Merge tag 'nfs-for-6.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
f7fccaa77271 Merge tag 'fuse-update-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
4165cee7ecb1 Merge tag 'exfat-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
79952bdcbcea Merge tag 'f2fs-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
fa8380a06bd0 Merge tag 'bpf-next-6.12-struct-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
68e5c7d4cefb Merge tag 'kbuild-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
7f8de2bf0725 Merge tag 'linux-cpupower-6.12-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux
cd3d64772981 Merge tag 'i3c/for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux


To determine why pick_eevdf() would return NULL, I added a few printk statements
Based on one of the printk logs in the shared dmesg log, it appears that if
pick_eevdf() is called for a 'cfs_rq' whose 'cfs_rq->curr' is NULL and there
are no eligible entities on that 'cfs_rq', it will return NULL. 

# git diff
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 225b31aaee55..8c5b96f1cd49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -48,6 +48,7 @@
 #include <linux/ratelimit.h>
 #include <linux/task_work.h>
 #include <linux/rbtree_augmented.h>
+#include <linux/delay.h>

 #include <asm/switch_to.h>

@@ -907,16 +908,25 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
 static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 {
        struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+       struct rb_node *tmpnode = node;
        struct sched_entity *se = __pick_first_entity(cfs_rq);
+       struct sched_entity *tmpse = se;
        struct sched_entity *curr = cfs_rq->curr;
+       struct sched_entity *tmpcurr = curr;
        struct sched_entity *best = NULL;
-
+       struct sched_entity *tmp = NULL;
        /*
         * We can safely skip eligibility check if there is only one entity
         * in this cfs_rq, saving some cycles.
         */
-       if (cfs_rq->nr_running == 1)
-               return curr && curr->on_rq ? curr : se;
+       if (cfs_rq->nr_running == 1) {
+               tmp = curr && curr->on_rq ? curr : se;
+               if (!tmp) {
+                       printk(KERN_INFO "pick_eevdf curr: %p, se: %p\n", curr, se);
+                       mdelay(10);
+               }
+               return tmp;
+       }

        if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
                curr = NULL;
@@ -966,6 +976,11 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
        if (!best || (curr && entity_before(curr, best)))
                best = curr;

+       if (!best) {
+               printk(KERN_INFO "best=%p, curr=%p, se=%p, node=%p ocrr=%p, ose=%p, onode=%p\n", best, curr, se, node, tmpcurr, tmpse, tmpnode);
+               mdelay(10);
+       }
+
        return best;
 }

From the logs below:
[ 1355.763494] best=0000000000000000, curr=0000000000000000, se=00000000be02c573, node=0000000000000000 ocrr=0000000000000000, ose=00000000b1d4c4d5, onode=0000000023eb8c00

I have not been able to think of a quick reproducer to trigger a panic
for this case. Hoping if someone can guide me on this.

Note: The following dmesg log also contains a warning reported too. Panic
happens later.

------------[ cut here ]------------
!se->on_rq
WARNING: CPU: 1 PID: 92333 at kernel/sched/fair.c:705 update_entity_lag+0xcc/0xf0
Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 1 UID: 0 PID: 92333 Comm: genksyms Tainted: G        W          6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64
Tainted: [W]=WARN
Hardware name: IBM,9080-HEX POWER10 (architected) hv:phyp pSeries
NIP:  c0000000001cdfcc LR: c0000000001cdfc8 CTR: 0000000000000000
REGS: c00000005c62ee50 TRAP: 0700   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002222  XER: 00000005
CFAR: c000000000156a10 IRQMASK: 1
GPR00: c0000000001cdfc8 c00000005c62f0f0 c000000001b57400 000000000000000a
GPR04: 00000000ffff7fff c00000005c62eee0 c00000005c62eed8 00000007fb050000
GPR08: 0000000000000027 0000000000000000 0000000000000000 c000000002758de0
GPR12: c000000002a18d88 c0000007fffef480 0000000000000000 0000000000000000
GPR16: c000000002c56d40 0000000000000000 c00000005c62f5b4 0000000000000000
GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c000000003cd7300
GPR24: 0000000000000000 0000000000000008 c0000007fd1d3f80 0000000000000000
GPR28: 0000000000000001 0000000000000009 c0000007fd1d4080 c0000000656a0000
NIP [c0000000001cdfcc] update_entity_lag+0xcc/0xf0
LR [c0000000001cdfc8] update_entity_lag+0xc8/0xf0
Call Trace:
[c00000005c62f0f0] [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 (unreliable)
[c00000005c62f160] [c0000000001cea80] dequeue_entity+0xb0/0x6d0
[c00000005c62f1f0] [c0000000001cf8b0] dequeue_entities+0x150/0x600
[c00000005c62f2c0] [c0000000001d02a8] dequeue_task_fair+0x158/0x2e0
[c00000005c62f300] [c0000000001b5ea4] dequeue_task+0x64/0x200
[c00000005c62f380] [c0000000001cc950] detach_tasks+0x140/0x420
[c00000005c62f3f0] [c0000000001d6044] sched_balance_rq+0x214/0x7c0
[c00000005c62f550] [c0000000001d6830] sched_balance_newidle+0x240/0x630
[c00000005c62f640] [c0000000001d6d0c] pick_next_task_fair+0x7c/0x4a0
[c00000005c62f6d0] [c0000000001afc50] __pick_next_task+0x60/0x2d0
[c00000005c62f730] [c0000000010e8ce8] __schedule+0x198/0x840
[c00000005c62f810] [c0000000010e93d0] schedule+0x40/0x110
[c00000005c62f880] [c00000000064c574] pipe_read+0x424/0x6a0
[c00000005c62f960] [c00000000063a0fc] vfs_read+0x30c/0x3d0
[c00000005c62fa10] [c00000000063adf4] ksys_read+0x104/0x160
[c00000005c62fa60] [c000000000031678] system_call_exception+0x138/0x2d0
[c00000005c62fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
--- interrupt: 3000 at 0x7fffb8f4a0c4
NIP:  00007fffb8f4a0c4 LR: 00007fffb8f4a0c4 CTR: 0000000000000000
REGS: c00000005c62fe80 TRAP: 3000   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 48004222  XER: 00000000
IRQMASK: 0
GPR00: 0000000000000003 00007fffe27ca2d0 00007fffb9067100 0000000000000000
GPR04: 000000003076051b 0000000000002000 00007fffb9060b50 0000000000000000
GPR08: 000000000000006f 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fffb917a560 000000000000000b 0000000000000000
GPR16: 00007fffb9173570 0000000000002000 000000000000000c 000000000000000b
GPR20: 00000000307604c0 0000000030762509 00000000100484e8 0000000030762515
GPR24: 0000000010022b80 0000000000000000 0000000000000005 0000000000002000
GPR28: 000000003076051b 0000000000002000 00007fffb905e508 00007fffb9060b50
NIP [00007fffb8f4a0c4] 0x7fffb8f4a0c4
LR [00007fffb8f4a0c4] 0x7fffb8f4a0c4
--- interrupt: 3000
Code: 4e800020 3d220104 89297c19 2c090000 4082ff8c 3c62ff99 39200001 3d420104 38632d90 992a7c19 4bf88965 60000000 <0fe00000> 4bffff68 60000000 60000000
---[ end trace 0000000000000000 ]---
best=0000000000000000, curr=0000000000000000, se=00000000be02c573, node=0000000000000000 ocrr=0000000000000000, ose=00000000b1d4c4d5, onode=0000000023eb8c00
Kernel attempted to read user page (51) - exploit attempt? (uid: 0)
BUG: Kernel NULL pointer dereference on read at 0x00000051
Faulting instruction address: 0xc0000000001cfebc
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G        W          6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64
Tainted: [W]=WARN
Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
NIP:  c0000000001cfebc LR: c0000000001cfebc CTR: 0000000000000000
REGS: c000000002c13950 TRAP: 0300   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44022282  XER: 0000000d
CFAR: c0000000001c4758 DAR: 0000000000000051 DSISR: 40000000 IRQMASK: 1
GPR00: c0000000001cfebc c000000002c13bf0 c000000001b57400 0000000000000000
GPR04: c0000007fd147108 c0000007fd1cd600 c000000002c13968 00000007fafb0000
GPR08: 0000000000000027 00000000004e2002 0000896cdac1f9a0 0000000000002000
GPR12: c000000002a18d88 c000000002f10000 0000000000000000 00000007fffe0000
GPR16: 00000007fffd0000 0000000000000000 00000007fffe0114 0000000000000000
GPR20: 0000000000000000 0000000000000000 c0000000010e95ec c000000002bd6380
GPR24: c000000002bd6da8 c000000002c524e8 0000000000000000 c000000002bd6380
GPR28: c000000002bd6380 c0000007fd1d3f80 c0000007fd1d4080 c0000007fd1d3f80
NIP [c0000000001cfebc] pick_next_entity+0x3c/0x180
LR [c0000000001cfebc] pick_next_entity+0x3c/0x180
Call Trace:
[c000000002c13bf0] [c0000000001cfebc] pick_next_entity+0x3c/0x180 (unreliable)
[c000000002c13c70] [c0000000001d0064] pick_task_fair+0x64/0x130
[c000000002c13cb0] [c0000000001d6cd8] pick_next_task_fair+0x48/0x4a0
[c000000002c13d40] [c0000000001afc50] __pick_next_task+0x60/0x2d0
[c000000002c13da0] [c0000000010e8ce8] __schedule+0x198/0x840
[c000000002c13e80] [c0000000010e95ec] schedule_idle+0x3c/0x70
[c000000002c13eb0] [c0000000001eb1d0] do_idle+0x160/0x1b0
[c000000002c13f00] [c0000000001eb4d0] cpu_startup_entry+0x50/0x60
[c000000002c13f30] [c0000000000110e8] rest_init+0xf0/0xf4
[c000000002c13f60] [c0000000020053a4] do_initcalls+0x0/0x190
[c000000002c13fe0] [c00000000000e788] start_here_common+0x1c/0x20
Code: 60000000 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7c7d1b78 7c9e2378 f8010010 f821ff81 60000000 7fc3f378 4bff4601 <89230051> 7c7f1b78 2c090000 40820098
---[ end trace 0000000000000000 ]---
pstore: backend (nvram) writing error (-1)

Kernel panic - not syncing: Fatal exception
Rebooting in 10 seconds..
Re: sched/fair: Kernel panics in pick_next_entity
Posted by Peter Zijlstra 1 month, 4 weeks ago
On Thu, Sep 26, 2024 at 06:12:19PM +0530, Vishal Chourasia wrote:
> I've noticed a kernel panic consistently occurring on the mainline v6.11
> kernel (see attached dmesg log below). 
> 
> The panic occurs almost every time I build the Linux kernel from source.
> 
> Steps to Reproduce:
> 
> make clean
> ./scripts/config -e LOCALVERSION_AUTO
> ./scripts/config --set-str LOCALVERSION -master-with-print
> make localmodconfig
> make -j8 -s vmlinux modules
> 
> >From my investigation, it seems that the function pick_eevdf() can return NULL.
> Commit f12e1488 ("sched/fair: Prepare pick_next_task() for delayed dequeue") 
> introduces an access on the return value of pick_eevdf(). If 'se' was NULL, 
> it can lead to a null pointer dereference. 

Even before that commit we relied on that thing not being NULL, notably
f12e1488^1 has:

                se = pick_next_entity(cfs_rq);
                cfs_rq = group_cfs_rq(se);

Which will similarly explode when pick_eevdf() goes wobbly.

> To determine why pick_eevdf() would return NULL, I added a few printk statements
> Based on one of the printk logs in the shared dmesg log, it appears that if
> pick_eevdf() is called for a 'cfs_rq' whose 'cfs_rq->curr' is NULL and there
> are no eligible entities on that 'cfs_rq', it will return NULL. 

Right, that is not a valid state. Which seems to suggest something went
sideways with the eligibility thing -- as Luis suggested.

> I have not been able to think of a quick reproducer to trigger a panic
> for this case. Hoping if someone can guide me on this.
> 
> Note: The following dmesg log also contains a warning reported too. Panic
> happens later.
> 
> ------------[ cut here ]------------
> !se->on_rq
> WARNING: CPU: 1 PID: 92333 at kernel/sched/fair.c:705 update_entity_lag+0xcc/0xf0
> Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
> CPU: 1 UID: 0 PID: 92333 Comm: genksyms Tainted: G        W          6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64
> Tainted: [W]=WARN
> Hardware name: IBM,9080-HEX POWER10 (architected) hv:phyp pSeries
> NIP:  c0000000001cdfcc LR: c0000000001cdfc8 CTR: 0000000000000000
> REGS: c00000005c62ee50 TRAP: 0700   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
> MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002222  XER: 00000005
> CFAR: c000000000156a10 IRQMASK: 1
> GPR00: c0000000001cdfc8 c00000005c62f0f0 c000000001b57400 000000000000000a
> GPR04: 00000000ffff7fff c00000005c62eee0 c00000005c62eed8 00000007fb050000
> GPR08: 0000000000000027 0000000000000000 0000000000000000 c000000002758de0
> GPR12: c000000002a18d88 c0000007fffef480 0000000000000000 0000000000000000
> GPR16: c000000002c56d40 0000000000000000 c00000005c62f5b4 0000000000000000
> GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c000000003cd7300
> GPR24: 0000000000000000 0000000000000008 c0000007fd1d3f80 0000000000000000
> GPR28: 0000000000000001 0000000000000009 c0000007fd1d4080 c0000000656a0000
> NIP [c0000000001cdfcc] update_entity_lag+0xcc/0xf0
> LR [c0000000001cdfc8] update_entity_lag+0xc8/0xf0
> Call Trace:
> [c00000005c62f0f0] [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 (unreliable)
> [c00000005c62f160] [c0000000001cea80] dequeue_entity+0xb0/0x6d0
> [c00000005c62f1f0] [c0000000001cf8b0] dequeue_entities+0x150/0x600
> [c00000005c62f2c0] [c0000000001d02a8] dequeue_task_fair+0x158/0x2e0
> [c00000005c62f300] [c0000000001b5ea4] dequeue_task+0x64/0x200
> [c00000005c62f380] [c0000000001cc950] detach_tasks+0x140/0x420
> [c00000005c62f3f0] [c0000000001d6044] sched_balance_rq+0x214/0x7c0
> [c00000005c62f550] [c0000000001d6830] sched_balance_newidle+0x240/0x630
> [c00000005c62f640] [c0000000001d6d0c] pick_next_task_fair+0x7c/0x4a0
> [c00000005c62f6d0] [c0000000001afc50] __pick_next_task+0x60/0x2d0
> [c00000005c62f730] [c0000000010e8ce8] __schedule+0x198/0x840
> [c00000005c62f810] [c0000000010e93d0] schedule+0x40/0x110
> [c00000005c62f880] [c00000000064c574] pipe_read+0x424/0x6a0
> [c00000005c62f960] [c00000000063a0fc] vfs_read+0x30c/0x3d0
> [c00000005c62fa10] [c00000000063adf4] ksys_read+0x104/0x160
> [c00000005c62fa60] [c000000000031678] system_call_exception+0x138/0x2d0
> [c00000005c62fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec

So that is a 'fun' one, I don't remember seeing that before. It says
we're trying to dequeue a task that is not on the runqueue.

The big new thing this merge window -- I'm assuming v6.11 is good -- is
DEQUEUE_DELAYED. Does this error go away if you flip that in
kernel/sched/features.h ?
Re: sched/fair: Kernel panics in pick_next_entity
Posted by Vishal Chourasia 1 month, 4 weeks ago
On Mon, Sep 30, 2024 at 04:41:57PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 26, 2024 at 06:12:19PM +0530, Vishal Chourasia wrote:
> > I've noticed a kernel panic consistently occurring on the mainline v6.11
> > kernel (see attached dmesg log below). 
> > 
> > The panic occurs almost every time I build the Linux kernel from source.
> > 
> > Steps to Reproduce:
> > 
> > make clean
> > ./scripts/config -e LOCALVERSION_AUTO
> > ./scripts/config --set-str LOCALVERSION -master-with-print
> > make localmodconfig
> > make -j8 -s vmlinux modules
> > 
> > >From my investigation, it seems that the function pick_eevdf() can return NULL.
> > Commit f12e1488 ("sched/fair: Prepare pick_next_task() for delayed dequeue") 
> > introduces an access on the return value of pick_eevdf(). If 'se' was NULL, 
> > it can lead to a null pointer dereference. 
> 
> Even before that commit we relied on that thing not being NULL, notably
> f12e1488^1 has:
> 
>                 se = pick_next_entity(cfs_rq);
>                 cfs_rq = group_cfs_rq(se);
> 
> Which will similarly explode when pick_eevdf() goes wobbly.
> 
> > To determine why pick_eevdf() would return NULL, I added a few printk statements
> > Based on one of the printk logs in the shared dmesg log, it appears that if
> > pick_eevdf() is called for a 'cfs_rq' whose 'cfs_rq->curr' is NULL and there
> > are no eligible entities on that 'cfs_rq', it will return NULL. 
> 
> Right, that is not a valid state. Which seems to suggest something went
> sideways with the eligibility thing -- as Luis suggested.
> 
> > I have not been able to think of a quick reproducer to trigger a panic
> > for this case. Hoping if someone can guide me on this.
> > 
> > Note: The following dmesg log also contains a warning reported too. Panic
> > happens later.
> > 
> > ------------[ cut here ]------------
> > !se->on_rq
> > WARNING: CPU: 1 PID: 92333 at kernel/sched/fair.c:705 update_entity_lag+0xcc/0xf0
> > Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
> > CPU: 1 UID: 0 PID: 92333 Comm: genksyms Tainted: G        W          6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64
> > Tainted: [W]=WARN
> > Hardware name: IBM,9080-HEX POWER10 (architected) hv:phyp pSeries
> > NIP:  c0000000001cdfcc LR: c0000000001cdfc8 CTR: 0000000000000000
> > REGS: c00000005c62ee50 TRAP: 0700   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
> > MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002222  XER: 00000005
> > CFAR: c000000000156a10 IRQMASK: 1
> > GPR00: c0000000001cdfc8 c00000005c62f0f0 c000000001b57400 000000000000000a
> > GPR04: 00000000ffff7fff c00000005c62eee0 c00000005c62eed8 00000007fb050000
> > GPR08: 0000000000000027 0000000000000000 0000000000000000 c000000002758de0
> > GPR12: c000000002a18d88 c0000007fffef480 0000000000000000 0000000000000000
> > GPR16: c000000002c56d40 0000000000000000 c00000005c62f5b4 0000000000000000
> > GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c000000003cd7300
> > GPR24: 0000000000000000 0000000000000008 c0000007fd1d3f80 0000000000000000
> > GPR28: 0000000000000001 0000000000000009 c0000007fd1d4080 c0000000656a0000
> > NIP [c0000000001cdfcc] update_entity_lag+0xcc/0xf0
> > LR [c0000000001cdfc8] update_entity_lag+0xc8/0xf0
> > Call Trace:
> > [c00000005c62f0f0] [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 (unreliable)
> > [c00000005c62f160] [c0000000001cea80] dequeue_entity+0xb0/0x6d0
> > [c00000005c62f1f0] [c0000000001cf8b0] dequeue_entities+0x150/0x600
> > [c00000005c62f2c0] [c0000000001d02a8] dequeue_task_fair+0x158/0x2e0
> > [c00000005c62f300] [c0000000001b5ea4] dequeue_task+0x64/0x200
> > [c00000005c62f380] [c0000000001cc950] detach_tasks+0x140/0x420
> > [c00000005c62f3f0] [c0000000001d6044] sched_balance_rq+0x214/0x7c0
> > [c00000005c62f550] [c0000000001d6830] sched_balance_newidle+0x240/0x630
> > [c00000005c62f640] [c0000000001d6d0c] pick_next_task_fair+0x7c/0x4a0
> > [c00000005c62f6d0] [c0000000001afc50] __pick_next_task+0x60/0x2d0
> > [c00000005c62f730] [c0000000010e8ce8] __schedule+0x198/0x840
> > [c00000005c62f810] [c0000000010e93d0] schedule+0x40/0x110
> > [c00000005c62f880] [c00000000064c574] pipe_read+0x424/0x6a0
> > [c00000005c62f960] [c00000000063a0fc] vfs_read+0x30c/0x3d0
> > [c00000005c62fa10] [c00000000063adf4] ksys_read+0x104/0x160
> > [c00000005c62fa60] [c000000000031678] system_call_exception+0x138/0x2d0
> > [c00000005c62fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
> 
> So that is a 'fun' one, I don't remember seeing that before. It says
> we're trying to dequeue a task that is not on the runqueue.
> 
> The big new thing this merge window -- I'm assuming v6.11 is good -- is
> DEQUEUE_DELAYED. Does this error go away if you flip that in
> kernel/sched/features.h ?
Yes, with the below diff. I didn't see any warnings or kernel panic
while running the workload

# git diff
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 290874079f60..38bf8df813d1 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -46,7 +46,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
  *
  * DELAY_ZERO clips the lag on dequeue (or wakeup) to 0.
  */
-SCHED_FEAT(DELAY_DEQUEUE, true)
+SCHED_FEAT(DELAY_DEQUEUE, false)
 SCHED_FEAT(DELAY_ZERO, true)

 /*
Re: sched/fair: Kernel panics in pick_next_entity
Posted by Vishal Chourasia 1 month, 4 weeks ago
On Tue, Oct 01, 2024 at 12:35:16AM +0530, Vishal Chourasia wrote:
> On Mon, Sep 30, 2024 at 04:41:57PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 26, 2024 at 06:12:19PM +0530, Vishal Chourasia wrote:
> > > I've noticed a kernel panic consistently occurring on the mainline v6.11
> > > kernel (see attached dmesg log below). 
> > > 
> > > The panic occurs almost every time I build the Linux kernel from source.
> > > 
> > > Steps to Reproduce:
> > > 
> > > make clean
> > > ./scripts/config -e LOCALVERSION_AUTO
> > > ./scripts/config --set-str LOCALVERSION -master-with-print
> > > make localmodconfig
> > > make -j8 -s vmlinux modules
> > > 
> > > >From my investigation, it seems that the function pick_eevdf() can return NULL.
> > > Commit f12e1488 ("sched/fair: Prepare pick_next_task() for delayed dequeue") 
> > > introduces an access on the return value of pick_eevdf(). If 'se' was NULL, 
> > > it can lead to a null pointer dereference. 
> > 
> > Even before that commit we relied on that thing not being NULL, notably
> > f12e1488^1 has:
> > 
> >                 se = pick_next_entity(cfs_rq);
> >                 cfs_rq = group_cfs_rq(se);
> > 
> > Which will similarly explode when pick_eevdf() goes wobbly.
> > 
> > > To determine why pick_eevdf() would return NULL, I added a few printk statements
> > > Based on one of the printk logs in the shared dmesg log, it appears that if
> > > pick_eevdf() is called for a 'cfs_rq' whose 'cfs_rq->curr' is NULL and there
> > > are no eligible entities on that 'cfs_rq', it will return NULL. 
> > 
> > Right, that is not a valid state. Which seems to suggest something went
> > sideways with the eligibility thing -- as Luis suggested.
> > 
> > > I have not been able to think of a quick reproducer to trigger a panic
> > > for this case. Hoping if someone can guide me on this.
> > > 
> > > Note: The following dmesg log also contains a warning reported too. Panic
> > > happens later.
> > > 
> > > ------------[ cut here ]------------
> > > !se->on_rq
> > > WARNING: CPU: 1 PID: 92333 at kernel/sched/fair.c:705 update_entity_lag+0xcc/0xf0
> > > Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
> > > CPU: 1 UID: 0 PID: 92333 Comm: genksyms Tainted: G        W          6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64
> > > Tainted: [W]=WARN
> > > Hardware name: IBM,9080-HEX POWER10 (architected) hv:phyp pSeries
> > > NIP:  c0000000001cdfcc LR: c0000000001cdfc8 CTR: 0000000000000000
> > > REGS: c00000005c62ee50 TRAP: 0700   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
> > > MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002222  XER: 00000005
> > > CFAR: c000000000156a10 IRQMASK: 1
> > > GPR00: c0000000001cdfc8 c00000005c62f0f0 c000000001b57400 000000000000000a
> > > GPR04: 00000000ffff7fff c00000005c62eee0 c00000005c62eed8 00000007fb050000
> > > GPR08: 0000000000000027 0000000000000000 0000000000000000 c000000002758de0
> > > GPR12: c000000002a18d88 c0000007fffef480 0000000000000000 0000000000000000
> > > GPR16: c000000002c56d40 0000000000000000 c00000005c62f5b4 0000000000000000
> > > GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c000000003cd7300
> > > GPR24: 0000000000000000 0000000000000008 c0000007fd1d3f80 0000000000000000
> > > GPR28: 0000000000000001 0000000000000009 c0000007fd1d4080 c0000000656a0000
> > > NIP [c0000000001cdfcc] update_entity_lag+0xcc/0xf0
> > > LR [c0000000001cdfc8] update_entity_lag+0xc8/0xf0
> > > Call Trace:
> > > [c00000005c62f0f0] [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 (unreliable)
> > > [c00000005c62f160] [c0000000001cea80] dequeue_entity+0xb0/0x6d0
> > > [c00000005c62f1f0] [c0000000001cf8b0] dequeue_entities+0x150/0x600
> > > [c00000005c62f2c0] [c0000000001d02a8] dequeue_task_fair+0x158/0x2e0
> > > [c00000005c62f300] [c0000000001b5ea4] dequeue_task+0x64/0x200
> > > [c00000005c62f380] [c0000000001cc950] detach_tasks+0x140/0x420
> > > [c00000005c62f3f0] [c0000000001d6044] sched_balance_rq+0x214/0x7c0
> > > [c00000005c62f550] [c0000000001d6830] sched_balance_newidle+0x240/0x630
> > > [c00000005c62f640] [c0000000001d6d0c] pick_next_task_fair+0x7c/0x4a0
> > > [c00000005c62f6d0] [c0000000001afc50] __pick_next_task+0x60/0x2d0
> > > [c00000005c62f730] [c0000000010e8ce8] __schedule+0x198/0x840
> > > [c00000005c62f810] [c0000000010e93d0] schedule+0x40/0x110
> > > [c00000005c62f880] [c00000000064c574] pipe_read+0x424/0x6a0
> > > [c00000005c62f960] [c00000000063a0fc] vfs_read+0x30c/0x3d0
> > > [c00000005c62fa10] [c00000000063adf4] ksys_read+0x104/0x160
> > > [c00000005c62fa60] [c000000000031678] system_call_exception+0x138/0x2d0
> > > [c00000005c62fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
> > 
> > So that is a 'fun' one, I don't remember seeing that before. It says
> > we're trying to dequeue a task that is not on the runqueue.
> > 
> > The big new thing this merge window -- I'm assuming v6.11 is good -- is
> > DEQUEUE_DELAYED. Does this error go away if you flip that in
> > kernel/sched/features.h ?
> Yes, with the below diff. I didn't see any warnings or kernel panic
> while running the workload
> 
> # git diff
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 290874079f60..38bf8df813d1 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -46,7 +46,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
>   *
>   * DELAY_ZERO clips the lag on dequeue (or wakeup) to 0.
>   */
> -SCHED_FEAT(DELAY_DEQUEUE, true)
> +SCHED_FEAT(DELAY_DEQUEUE, false)
>  SCHED_FEAT(DELAY_ZERO, true)
> 
>  /*
> 
> 
for sanity, I ran the workload (kernel compilation) on the base commit
where the kernel panic was initially observed, which resulted in a
kernel panic, along with it couple of warnings where also printed on the
console, and a circular locking dependency warning with it.

Kernel 6.11.0-kp-base-10547-g684a64bf32b6 on an ppc64le

------------[ cut here ]------------

======================================================
WARNING: possible circular locking dependency detected
6.11.0-kp-base-10547-g684a64bf32b6 #69 Not tainted
------------------------------------------------------
cc1/27867 is trying to acquire lock:
c000000002950270 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0x2c/0x70

but task is already holding lock:
c0000007fb0e6218 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x44/0x130

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (&rq->__lock){-.-.}-{2:2}:
       __lock_acquire+0x730/0xf00
       lock_acquire.part.0+0xf0/0x2a0
       _raw_spin_lock_nested+0x64/0x100
       raw_spin_rq_lock_nested+0x44/0x130
       __task_rq_lock+0x7c/0x1e0
       wake_up_new_task+0x1d0/0x560
       kernel_clone+0x1e0/0x5c0
       user_mode_thread+0x84/0xc0
       rest_init+0x40/0x208
       do_initcalls+0x0/0x190
       start_here_common+0x1c/0x20

-> #1 (&p->pi_lock){-.-.}-{2:2}:
       __lock_acquire+0x730/0xf00
       lock_acquire.part.0+0xf0/0x2a0
       _raw_spin_lock_irqsave+0x7c/0x130
       try_to_wake_up+0x70/0xa40
       __up.isra.0+0x74/0x90
       up+0x7c/0xa0
       __up_console_sem+0xb8/0x110
       console_unlock+0x140/0x220
       vprintk_emit+0x1e4/0x300
       devkmsg_emit.constprop.0+0x44/0x54
       devkmsg_write+0x11c/0x230
       do_iter_readv_writev+0x194/0x280
       vfs_writev+0x18c/0x4b0
       do_writev+0xa8/0x1f0
       system_call_exception+0x148/0x310
       system_call_vectored_common+0x15c/0x2ec

-> #0 ((console_sem).lock){-.-.}-{2:2}:
       check_prev_add+0x158/0xe70
       validate_chain+0x6b4/0x7e0
       __lock_acquire+0x730/0xf00
       lock_acquire.part.0+0xf0/0x2a0
       _raw_spin_lock_irqsave+0x7c/0x130
       down_trylock+0x2c/0x70
       __down_trylock_console_sem+0x60/0x170
       console_trylock_spinning+0x3c/0x300
       vprintk_emit+0x1a4/0x300
       vprintk+0x58/0xe0
       _printk+0x40/0x54
       __warn_printk+0xd8/0x120
       unthrottle_cfs_rq+0x640/0x660
       distribute_cfs_runtime+0x454/0x8e0
       sched_cfs_period_timer+0x130/0x310
       __run_hrtimer+0x28c/0x4d0
       __hrtimer_run_queues+0xb4/0x140
       hrtimer_interrupt+0x124/0x2f0
       timer_interrupt+0x15c/0x460
       decrementer_common_virt+0x28c/0x290

other info that might help us debug this:

Chain exists of:
  (console_sem).lock --> &p->pi_lock --> &rq->__lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&rq->__lock);
                               lock(&p->pi_lock);
                               lock(&rq->__lock);
  lock((console_sem).lock);

 *** DEADLOCK ***

2 locks held by cc1/27867:
 #0: c000000002cd1670 (rcu_read_lock){....}-{1:3}, at: distribute_cfs_runtime+0x10/0x8e0
 #1: c0000007fb0e6218 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x44/0x130

stack backtrace:
CPU: 1 UID: 0 PID: 27867 Comm: cc1 Not tainted 6.11.0-kp-base-10547-g684a64bf32b6 #69
Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
Call Trace:
[c0000000f994f210] [c00000000126f0a4] dump_stack_lvl+0xc8/0x130 (unreliable)
[c0000000f994f250] [c00000000024de78] print_circular_bug+0x248/0x2b0
[c0000000f994f2f0] [c00000000024e154] check_noncircular+0x274/0x2a0
[c0000000f994f3c0] [c00000000024f8d8] check_prev_add+0x158/0xe70
[c0000000f994f470] [c000000000250ca4] validate_chain+0x6b4/0x7e0
[c0000000f994f550] [c0000000002542a0] __lock_acquire+0x730/0xf00
[c0000000f994f660] [c000000000254b60] lock_acquire.part.0+0xf0/0x2a0
[c0000000f994f780] [c0000000012bcb2c] _raw_spin_lock_irqsave+0x7c/0x130
[c0000000f994f7c0] [c0000000012b422c] down_trylock+0x2c/0x70
[c0000000f994f7f0] [c000000000267920] __down_trylock_console_sem+0x60/0x170
[c0000000f994f830] [c0000000002698cc] console_trylock_spinning+0x3c/0x300
[c0000000f994f880] [c00000000026ccf4] vprintk_emit+0x1a4/0x300
[c0000000f994f8f0] [c00000000026f318] vprintk+0x58/0xe0
[c0000000f994f920] [c00000000026ee30] _printk+0x40/0x54
[c0000000f994f940] [c000000000160ac8] __warn_printk+0xd8/0x120
[c0000000f994f9b0] [c0000000001f07c0] unthrottle_cfs_rq+0x640/0x660
[c0000000f994fa50] [c0000000001f1314] distribute_cfs_runtime+0x454/0x8e0
[c0000000f994fb60] [c0000000001f1a40] sched_cfs_period_timer+0x130/0x310
[c0000000f994fc00] [c0000000002db5dc] __run_hrtimer+0x28c/0x4d0
[c0000000f994fca0] [c0000000002db8d4] __hrtimer_run_queues+0xb4/0x140
[c0000000f994fd00] [c0000000002dc834] hrtimer_interrupt+0x124/0x2f0
[c0000000f994fdb0] [c000000000029f1c] timer_interrupt+0x15c/0x460
[c0000000f994fe50] [c000000000009f8c] decrementer_common_virt+0x28c/0x290
--- interrupt: 900 at 0x13b4b0108
NIP:  000000013b4b0108 LR: 000000013b4af88c CTR: 000000013b4b00d0
REGS: c0000000f994fe80 TRAP: 0900   Not tainted  (6.11.0-kp-base-10547-g684a64bf32b6)
MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 88822804  XER: 20040000
CFAR: 0000000000000000 IRQMASK: 0
GPR00: 000000013b4af88c 00007ffff4260530 000000013c9f7a00 00007fff93a70738
GPR04: 00007fff94866708 0000000000000000 0000000000000000 00007fff93a70738
GPR08: 0000000000000008 0000000000000008 0000000000000000 0000000088822804
GPR12: 000000013b4b00d0 00007fff9645cc80 00007fff94c44ad8 0000000000000000
GPR16: 0000000040000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 000000013c948900 000000013c50e818
GPR24: 00007fff94c44ad8 0000000000000000 0000000000000001 000000013ca33ec0
GPR28: 000000013c322130 00007fff94c23408 00007fff93a70738 00007fff94866708
NIP [000000013b4b0108] 0x13b4b0108
LR [000000013b4af88c] 0x13b4af88c
--- interrupt: 900
se->sched_delayed
WARNING: CPU: 1 PID: 27867 at kernel/sched/fair.c:6062 unthrottle_cfs_rq+0x644/0x660
Modules linked in: binfmt_misc bonding tls rfkill pseries_rng ibmveth nd_pmem vmx_crypto nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 1 UID: 0 PID: 27867 Comm: cc1 Not tainted 6.11.0-kp-base-10547-g684a64bf32b6 #69
Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
NIP:  c0000000001f07c4 LR: c0000000001f07c0 CTR: 0000000000000000
REGS: c0000000f994f710 TRAP: 0700   Not tainted  (6.11.0-kp-base-10547-g684a64bf32b6)
MSR:  8000000000021033 <SF,ME,IR,DR,RI,LE>  CR: 24822202  XER: 00000003
CFAR: c000000000160b00 IRQMASK: 3
GPR00: c0000000001f07c0 c0000000f994f9b0 c000000001d57400 0000000000000011
GPR04: 00000000ffff7fff c0000000f994f7b0 c0000000f994f7a8 00000007f8cf0000
GPR08: 0000000000000027 0000000000010002 c00000005bd32000 c000000002950510
GPR12: c000000002c104b8 c0000007fffef480 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000001 0000000000000001 c000000002dd4928 c000000002cd1670
GPR24: 0000000000000000 0000000000000001 0000000000000001 c0000007fb0e6200
GPR28: 0000000000000000 c00000015326d800 c0000007fb0e6380 c0000000572f2400
NIP [c0000000001f07c4] unthrottle_cfs_rq+0x644/0x660
LR [c0000000001f07c0] unthrottle_cfs_rq+0x640/0x660
Call Trace:
[c0000000f994f9b0] [c0000000001f07c0] unthrottle_cfs_rq+0x640/0x660 (unreliable)
[c0000000f994fa50] [c0000000001f1314] distribute_cfs_runtime+0x454/0x8e0
[c0000000f994fb60] [c0000000001f1a40] sched_cfs_period_timer+0x130/0x310
[c0000000f994fc00] [c0000000002db5dc] __run_hrtimer+0x28c/0x4d0
[c0000000f994fca0] [c0000000002db8d4] __hrtimer_run_queues+0xb4/0x140
[c0000000f994fd00] [c0000000002dc834] hrtimer_interrupt+0x124/0x2f0
[c0000000f994fdb0] [c000000000029f1c] timer_interrupt+0x15c/0x460
[c0000000f994fe50] [c000000000009f8c] decrementer_common_virt+0x28c/0x290
--- interrupt: 900 at 0x13b4b0108
NIP:  000000013b4b0108 LR: 000000013b4af88c CTR: 000000013b4b00d0
REGS: c0000000f994fe80 TRAP: 0900   Not tainted  (6.11.0-kp-base-10547-g684a64bf32b6)
MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 88822804  XER: 20040000
CFAR: 0000000000000000 IRQMASK: 0
GPR00: 000000013b4af88c 00007ffff4260530 000000013c9f7a00 00007fff93a70738
GPR04: 00007fff94866708 0000000000000000 0000000000000000 00007fff93a70738
GPR08: 0000000000000008 0000000000000008 0000000000000000 0000000088822804
GPR12: 000000013b4b00d0 00007fff9645cc80 00007fff94c44ad8 0000000000000000
GPR16: 0000000040000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 000000013c948900 000000013c50e818
GPR24: 00007fff94c44ad8 0000000000000000 0000000000000001 000000013ca33ec0
GPR28: 000000013c322130 00007fff94c23408 00007fff93a70738 00007fff94866708
NIP [000000013b4b0108] 0x13b4b0108
LR [000000013b4af88c] 0x13b4af88c
--- interrupt: 900
Code: 4bfffc40 3d220108 8929d54b 2c090000 4082fccc 3c62ff97 39200001 3d420108 38634718 992ad54b 4bf7023d 60000000 <0fe00000> 4bfffca8 eb010060 4bfffb80
irq event stamp: 42054
hardirqs last  enabled at (42053): [<c0000000000318ac>] interrupt_exit_user_prepare_main+0x7c/0x290
hardirqs last disabled at (42054): [<c000000000029c70>] interrupt_enter_prepare+0x90/0x1e0
softirqs last  enabled at (41856): [<c000000000170710>] handle_softirqs+0x590/0x5f0
softirqs last disabled at (41851): [<c0000000000186e0>] do_softirq_own_stack+0x40/0x60
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
delay && se->sched_delayed
WARNING: CPU: 7 PID: 869 at kernel/sched/fair.c:5493 dequeue_entity+0x598/0x6c0
Modules linked in: binfmt_misc bonding tls rfkill pseries_rng ibmveth nd_pmem vmx_crypto nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 7 UID: 0 PID: 869 Comm: auditd Tainted: G        W          6.11.0-kp-base-10547-g684a64bf32b6 #69
Tainted: [W]=WARN
Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
NIP:  c0000000001edae8 LR: c0000000001edae4 CTR: 0000000000000000
REGS: c000000007d76c20 TRAP: 0700   Tainted: G        W           (6.11.0-kp-base-10547-g684a64bf32b6)
MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002202  XER: 00000005
CFAR: c000000000160b00 IRQMASK: 1
GPR00: c0000000001edae4 c000000007d76ec0 c000000001d57400 000000000000001a
GPR04: 00000000ffff7fff c000000007d76cc0 c000000007d76cb8 00000007f9bf0000
GPR08: 0000000000000027 0000000000000003 c00000005d18f000 c000000002950510
GPR12: c000000002c104b8 c0000007fffe7080 0000000000000000 c000000004088600
GPR16: 0000000000000000 0000000000000007 0000000000000002 0000000000000000
GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c00000000a234800
GPR24: 0000000000000000 0000000000000008 c0000007fb0e6200 0000000000000000
GPR28: 0000000000000001 0000000000000009 c0000007fb0e6380 c0000000572f2400
NIP [c0000000001edae8] dequeue_entity+0x598/0x6c0
LR [c0000000001edae4] dequeue_entity+0x594/0x6c0
Call Trace:
[c000000007d76ec0] [c0000000001edae4] dequeue_entity+0x594/0x6c0 (unreliable)
[c000000007d76f60] [c0000000001edd60] dequeue_entities+0x150/0x600
[c000000007d77030] [c0000000001ee54c] dequeue_task_fair+0x6c/0x310
[c000000007d770c0] [c0000000001cda54] dequeue_task+0x64/0x240
[c000000007d77140] [c0000000001e6e80] detach_tasks+0x1c0/0x4d0
[c000000007d771b0] [c0000000001f52e0] sched_balance_rq+0x250/0x910
[c000000007d77320] [c0000000001f5d50] sched_balance_newidle+0x3b0/0xa90
[c000000007d77410] [c0000000001f651c] pick_next_task_fair+0x7c/0x560
[c000000007d774a0] [c0000000001c3fa0] __pick_next_task+0x60/0x2d0
[c000000007d77500] [c0000000012ae314] __schedule+0x184/0x940
[c000000007d775d0] [c0000000012aeb3c] schedule+0x6c/0x190
[c000000007d77640] [c0000000012aeed4] schedule_preempt_disabled+0x34/0x60
[c000000007d77670] [c0000000012b4de4] rwsem_down_write_slowpath+0x234/0x820
[c000000007d77770] [c00000000024584c] down_write_nested+0x1ec/0x1f0
[c000000007d777b0] [c008000004d65688] xfs_ilock+0x1c0/0x240 [xfs]
[c000000007d77830] [c008000004d60e4c] xfs_buffered_write_iomap_begin+0x184/0xd50 [xfs]
[c000000007d77990] [c0000000007da1a8] iomap_iter+0x118/0x270
[c000000007d77a20] [c0000000007dd63c] iomap_file_buffered_write+0xbc/0x120
[c000000007d77b40] [c008000004d4994c] xfs_file_buffered_write+0xf4/0x3e0 [xfs]
[c000000007d77c10] [c0000000006fa450] new_sync_write+0x160/0x1e0
[c000000007d77ca0] [c0000000006fe524] vfs_write+0x2f4/0x3e0
[c000000007d77d00] [c0000000006fe820] ksys_write+0x90/0x160
[c000000007d77d50] [c000000000032f98] system_call_exception+0x148/0x310
[c000000007d77e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
--- interrupt: 3000 at 0x7fff8894b344
NIP:  00007fff8894b344 LR: 00007fff8894b344 CTR: 0000000000000000
REGS: c000000007d77e80 TRAP: 3000   Tainted: G        W           (6.11.0-kp-base-10547-g684a64bf32b6)
MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 4a004402  XER: 00000000
IRQMASK: 0
GPR00: 0000000000000004 00007ffff84de490 00007fff88a67100 0000000000000004
GPR04: 0000000122fe9090 0000000000000041 0100000000000000 00007fff88db0180
GPR08: 00007fff88da8ac8 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fff88db0180 0000000000000001 000000011de63d40
GPR16: 00007ffff84e0f00 00007ffff84e0ed0 00007ffff84e0ea0 00007ffff84e0f30
GPR20: 00007ffff84e0e40 00007ffff84e0f60 0000000000000000 00007ffff84e0ff0
GPR24: 0000000000000000 00007ffff84e1118 00007ffff84de828 0000000122fe9090
GPR28: 0000000000000041 0000000122fe9090 0000000000000000 0000000000000004
NIP [00007fff8894b344] 0x7fff8894b344
LR [00007fff8894b344] 0x7fff8894b344
--- interrupt: 3000
Code: 4bff5571 39200001 38600000 993f0051 4bfffc10 3c62ff97 39200001 3d420108 38634818 992ad541 4bf72f19 60000000 <0fe00000> 60000000 4bfffe18 60000000
irq event stamp: 64870392
hardirqs last  enabled at (64870391): [<c000000000031a34>] interrupt_exit_user_prepare_main+0x204/0x290
hardirqs last disabled at (64870392): [<c0000000012ae55c>] __schedule+0x3cc/0x940
softirqs last  enabled at (64867724): [<c000000000170710>] handle_softirqs+0x590/0x5f0
softirqs last disabled at (64867713): [<c0000000000186e0>] do_softirq_own_stack+0x40/0x60
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
!se->on_rq
WARNING: CPU: 0 PID: 870 at kernel/sched/fair.c:704 update_entity_lag+0xcc/0xf0
Modules linked in: binfmt_misc bonding tls rfkill pseries_rng ibmveth nd_pmem vmx_crypto nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 0 UID: 0 PID: 870 Comm: auditd Tainted: G        W          6.11.0-kp-base-10547-g684a64bf32b6 #69
Tainted: [W]=WARN
Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
NIP:  c0000000001eacdc LR: c0000000001eacd8 CTR: 0000000000000000
REGS: c00000005a89ebd0 TRAP: 0700   Tainted: G        W           (6.11.0-kp-base-10547-g684a64bf32b6)
MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24282282  XER: 00000005
CFAR: c000000000160b00 IRQMASK: 1
GPR00: c0000000001eacd8 c00000005a89ee70 c000000001d57400 000000000000000a
GPR04: 00000000ffff7fff c00000005a89ec70 c00000005a89ec68 00000007f8a70000
GPR08: 0000000000000027 0000000000000003 c00000005308f000 c000000002950510
GPR12: c000000002c104b8 c000000003cd0000 0000000000000000 c000000004087800
GPR16: c00000005308f000 0000000000000000 0000000000000002 0000000000000000
GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c00000000a20a000
GPR24: 0000000000000000 0000000000000008 c0000007fb866200 0000000000000000
GPR28: 0000000000000001 0000000000000009 c0000007fb866380 c0000000572ff400
NIP [c0000000001eacdc] update_entity_lag+0xcc/0xf0
LR [c0000000001eacd8] update_entity_lag+0xc8/0xf0
Call Trace:
[c00000005a89ee70] [c0000000001eacd8] update_entity_lag+0xc8/0xf0 (unreliable)
[c00000005a89eee0] [c0000000001ed600] dequeue_entity+0xb0/0x6c0
[c00000005a89ef80] [c0000000001edd60] dequeue_entities+0x150/0x600
[c00000005a89f050] [c0000000001ee54c] dequeue_task_fair+0x6c/0x310
[c00000005a89f0e0] [c0000000001cda54] dequeue_task+0x64/0x240
[c00000005a89f160] [c0000000001e6e80] detach_tasks+0x1c0/0x4d0
[c00000005a89f1d0] [c0000000001f52e0] sched_balance_rq+0x250/0x910
[c00000005a89f340] [c0000000001f5d50] sched_balance_newidle+0x3b0/0xa90
[c00000005a89f430] [c0000000001f651c] pick_next_task_fair+0x7c/0x560
[c00000005a89f4c0] [c0000000001c3fa0] __pick_next_task+0x60/0x2d0
[c00000005a89f520] [c0000000012ae314] __schedule+0x184/0x940
[c00000005a89f5f0] [c0000000012aeb3c] schedule+0x6c/0x190
[c00000005a89f660] [c0000000012aecb4] io_schedule+0x54/0x80
[c00000005a89f690] [c0000000005489b4] folio_wait_bit_common+0x1e4/0x4c0
[c00000005a89f790] [c00000000055d01c] folio_wait_writeback+0x5c/0x150
[c00000005a89f810] [c00000000054811c] __filemap_fdatawait_range+0xcc/0x1c0
[c00000005a89f970] [c00000000054da90] file_write_and_wait_range+0xf0/0x110
[c00000005a89f9c0] [c008000004d475d4] xfs_file_fsync+0x9c/0x380 [xfs]
[c00000005a89fa80] [c000000000774eec] vfs_fsync_range+0x6c/0xf0
[c00000005a89fac0] [c00000000077503c] do_fsync+0x5c/0xd0
[c00000005a89fb00] [c0000000007754c8] sys_fsync+0x28/0x40
[c00000005a89fb20] [c000000000032f98] system_call_exception+0x148/0x310
[c00000005a89fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
--- interrupt: 3000 at 0x7fff8894f6e4
NIP:  00007fff8894f6e4 LR: 00007fff8894f6e4 CTR: 0000000000000000
REGS: c00000005a89fe80 TRAP: 3000   Tainted: G        W           (6.11.0-kp-base-10547-g684a64bf32b6)
MSR:  800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE>  CR: 42004482  XER: 00000000
IRQMASK: 0
GPR00: 0000000000000076 00007fff87c0e400 00007fff88a67100 0000000000000004
GPR04: 0000000000000081 0000000000000001 0000000000000000 00007fff87c16780
GPR08: 00007fff87c0f0c8 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fff87c16780 0000000000000000 0000000000000000
GPR16: 00007fff87c0f080 00007fff87400000 00007fff87c0efc0 0000000000000000
GPR20: 0000000000000000 00007fff88da0000 000000000080e880 00007fff87400000
GPR24: 0000000000000001 0000000000000000 0000000000000002 00007fff87c0e798
GPR28: 00007fff87c0f3dc 000000011de80d58 0000000000000000 0000000000000004
NIP [00007fff8894f6e4] 0x7fff8894f6e4
LR [00007fff8894f6e4] 0x7fff8894f6e4
--- interrupt: 3000
Code: 4e800020 3d220108 8929d53d 2c090000 4082ff8c 3c62ff97 39200001 3d420108 38634708 992ad53d 4bf75d25 60000000 <0fe00000> 4bffff68 60000000 60000000
irq event stamp: 11088690
hardirqs last  enabled at (11088689): [<c000000000973e74>] blk_account_io_start+0x204/0x260
hardirqs last disabled at (11088690): [<c0000000012bca54>] _raw_spin_lock_irq+0xc4/0x120
softirqs last  enabled at (11087028): [<c000000000170710>] handle_softirqs+0x590/0x5f0
softirqs last disabled at (11087021): [<c0000000000186e0>] do_softirq_own_stack+0x40/0x60
---[ end trace 0000000000000000 ]---

Thanks
Re: sched/fair: Kernel panics in pick_next_entity
Posted by Mike Galbraith 1 month, 4 weeks ago
On Tue, 2024-10-01 at 00:45 +0530, Vishal Chourasia wrote:
> >
> for sanity, I ran the workload (kernel compilation) on the base commit
> where the kernel panic was initially observed, which resulted in a
> kernel panic, along with it couple of warnings where also printed on the
> console, and a circular locking dependency warning with it.
>
> Kernel 6.11.0-kp-base-10547-g684a64bf32b6 on an ppc64le
>
> ------------[ cut here ]------------
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.11.0-kp-base-10547-g684a64bf32b6 #69 Not tainted
> ------------------------------------------------------

...

> --- interrupt: 900
> se->sched_delayed
> WARNING: CPU: 1 PID: 27867 at kernel/sched/fair.c:6062 unthrottle_cfs_rq+0x644/0x660

...that warning also spells eventual doom for the box, here it does
anyway, running LTPs cfs_bandwidth01 testcase and hackbench together,
box grinds to a halt in pretty short order.

With the patchlet below (submitted), I can beat on box to my hearts
content without meeting throttle/unthrottle woes.

sched: Fix sched_delayed vs cfs_bandwidth

Meeting an unfinished DELAY_DEQUEUE treated entity in unthrottle_cfs_rq()
leads to a couple terminal scenarios.  Finish it first, so ENQUEUE_WAKEUP
can proceed as it would have sans DELAY_DEQUEUE treatment.

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
---
 kernel/sched/fair.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6058,10 +6058,13 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);

-		if (se->on_rq) {
-			SCHED_WARN_ON(se->sched_delayed);
+		/* Handle any unfinished DELAY_DEQUEUE business first. */
+		if (se->sched_delayed) {
+			int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED;
+
+			dequeue_entity(qcfs_rq, se, flags);
+		} else if (se->on_rq)
 			break;
-		}
 		enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);

 		if (cfs_rq_is_idle(group_cfs_rq(se)))
Re: sched/fair: Kernel panics in pick_next_entity
Posted by Benjamin Segall 1 month, 3 weeks ago
Mike Galbraith <efault@gmx.de> writes:

> On Tue, 2024-10-01 at 00:45 +0530, Vishal Chourasia wrote:
>> >
>> for sanity, I ran the workload (kernel compilation) on the base commit
>> where the kernel panic was initially observed, which resulted in a
>> kernel panic, along with it couple of warnings where also printed on the
>> console, and a circular locking dependency warning with it.
>>
>> Kernel 6.11.0-kp-base-10547-g684a64bf32b6 on an ppc64le
>>
>> ------------[ cut here ]------------
>>
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 6.11.0-kp-base-10547-g684a64bf32b6 #69 Not tainted
>> ------------------------------------------------------
>
> ...
>
>> --- interrupt: 900
>> se->sched_delayed
>> WARNING: CPU: 1 PID: 27867 at kernel/sched/fair.c:6062 unthrottle_cfs_rq+0x644/0x660
>
> ...that warning also spells eventual doom for the box, here it does
> anyway, running LTPs cfs_bandwidth01 testcase and hackbench together,
> box grinds to a halt in pretty short order.
>
> With the patchlet below (submitted), I can beat on box to my hearts
> content without meeting throttle/unthrottle woes.
>
> sched: Fix sched_delayed vs cfs_bandwidth
>
> Meeting an unfinished DELAY_DEQUEUE treated entity in unthrottle_cfs_rq()
> leads to a couple terminal scenarios.  Finish it first, so ENQUEUE_WAKEUP
> can proceed as it would have sans DELAY_DEQUEUE treatment.
>
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
> Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> ---
>  kernel/sched/fair.c |    9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6058,10 +6058,13 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
>  	for_each_sched_entity(se) {
>  		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>
> -		if (se->on_rq) {
> -			SCHED_WARN_ON(se->sched_delayed);
> +		/* Handle any unfinished DELAY_DEQUEUE business first. */
> +		if (se->sched_delayed) {
> +			int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED;
> +
> +			dequeue_entity(qcfs_rq, se, flags);
> +		} else if (se->on_rq)
>  			break;
> -		}
>  		enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
>
>  		if (cfs_rq_is_idle(group_cfs_rq(se)))

Yeah, if we can wind up here to hit that warning, then we need to get it
out of delay state, not just leave it. Whether dequeue_entity +
enqueue_entity is better or worse than requeue_delayed_entity (+ break), I really
don't know.

This did prompt me to try and figure out if we could limit delay to the
EEVDF parts. I think ideally, we'd only have delayed tasks accounted for
in min_vruntime/avg_vruntime/avg_load, but trying to keep them out of
nr_running/load.weight seems like it runs into too many issues, while
having them still counted there means skipping dequeue_entity like the
current version, which means it wouldn't have saved us this.

I think we could still probably remove the current core.c touchpoints in
favor of just having migrate_task_rq_fair take a parameter for "do we
already hold rq_lock rather than just p->pi_lock" (and it and
switched_from/changed_group handling things)? We'd probably need to do
something very careful to avoid migrate_task_rq_fair unconditionally
needing rq_lock though, which would be a pain, so it might not be worth
it even if I'm not forgetting some other reason for pushing the delay
knowledge all the way out to p->on_rq.
Re: sched/fair: Kernel panics in pick_next_entity
Posted by Mike Galbraith 1 month, 3 weeks ago
On Wed, 2024-10-02 at 15:31 -0700, Benjamin Segall wrote:
> Mike Galbraith <efault@gmx.de> writes:
> > 
> > ---
> >  kernel/sched/fair.c |    9 ++++++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> > 
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6058,10 +6058,13 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
> >         for_each_sched_entity(se) {
> >                 struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> > 
> > -               if (se->on_rq) {
> > -                       SCHED_WARN_ON(se->sched_delayed);
> > +               /* Handle any unfinished DELAY_DEQUEUE business first. */
> > +               if (se->sched_delayed) {
> > +                       int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED;
> > +
> > +                       dequeue_entity(qcfs_rq, se, flags);
> > +               } else if (se->on_rq)
> >                         break;
> > -               }
> >                 enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
> > 
> >                 if (cfs_rq_is_idle(group_cfs_rq(se)))
> 
> Yeah, if we can wind up here to hit that warning, then we need to get it
> out of delay state, not just leave it. Whether dequeue_entity +
> enqueue_entity is better or worse than requeue_delayed_entity (+ break), I really
> don't know.

Hm, I'd say requeue_delayed_entity() not only fits better, it using
less lines gives it an extra brownie point.

	-Mike
Re: sched/fair: Kernel panics in pick_next_entity
Posted by Mike Galbraith 1 month, 3 weeks ago
On Thu, 2024-10-03 at 06:41 +0200, Mike Galbraith wrote:
> On Wed, 2024-10-02 at 15:31 -0700, Benjamin Segall wrote:

> > Whether dequeue_entity + enqueue_entity is better or worse than
> > requeue_delayed_entity (+break), I really don't know.
>
> Hm, I'd say requeue_delayed_entity() not only fits better, it using
> less lines gives it an extra brownie point.

Probable not worth any churn or effort, but it is an option.

sched: Clean up sched_delayed handling in unthrottle_cfs_rq()

requeue_delayed_entity() achieves and documents in one line what a less
clear preparatory dequeue facilitates over several, so use it instead,
and remove the superfluous comment.

Signed-off-by: Mike Galbraith <efault@gmx.de>
---
 kernel/sched/fair.c |    9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6058,12 +6058,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);

-		/* Handle any unfinished DELAY_DEQUEUE business first. */
-		if (se->sched_delayed) {
-			int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED;
-
-			dequeue_entity(qcfs_rq, se, flags);
-		} else if (se->on_rq)
+		if (se->sched_delayed)
+			requeue_delayed_entity(se);
+		if (se->on_rq)
 			break;
 		enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
Re: sched/fair: Kernel panics in pick_next_entity
Posted by Vishal Chourasia 1 month, 3 weeks ago
On Thu, Oct 03, 2024 at 11:31:06AM +0200, Mike Galbraith wrote:
> On Thu, 2024-10-03 at 06:41 +0200, Mike Galbraith wrote:
> > On Wed, 2024-10-02 at 15:31 -0700, Benjamin Segall wrote:
> 
> > > Whether dequeue_entity + enqueue_entity is better or worse than
> > > requeue_delayed_entity (+break), I really don't know.
> >
> > Hm, I'd say requeue_delayed_entity() not only fits better, it using
> > less lines gives it an extra brownie point.
> 
> Probable not worth any churn or effort, but it is an option.
> 
> sched: Clean up sched_delayed handling in unthrottle_cfs_rq()
> 
> requeue_delayed_entity() achieves and documents in one line what a less
> clear preparatory dequeue facilitates over several, so use it instead,
> and remove the superfluous comment.
> 
> Signed-off-by: Mike Galbraith <efault@gmx.de>
Hi Mike, thanks for the patch!

I didn't see any warnings or kernel panics in my setup. 

Tested-by: Vishal Chourasia <vishalc@linux.ibm.com>
> ---
>  kernel/sched/fair.c |    9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6058,12 +6058,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
>  	for_each_sched_entity(se) {
>  		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> 
> -		/* Handle any unfinished DELAY_DEQUEUE business first. */
> -		if (se->sched_delayed) {
> -			int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED;
> -
> -			dequeue_entity(qcfs_rq, se, flags);
> -		} else if (se->on_rq)
> +		if (se->sched_delayed)
> +			requeue_delayed_entity(se);
> +		if (se->on_rq)
>  			break;
>  		enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
> 
>
Re: sched/fair: Kernel panics in pick_next_entity
Posted by Peter Zijlstra 1 month, 3 weeks ago
On Tue, Oct 01, 2024 at 10:30:26AM +0200, Mike Galbraith wrote:
> On Tue, 2024-10-01 at 00:45 +0530, Vishal Chourasia wrote:
> > >
> > for sanity, I ran the workload (kernel compilation) on the base commit
> > where the kernel panic was initially observed, which resulted in a
> > kernel panic, along with it couple of warnings where also printed on the
> > console, and a circular locking dependency warning with it.
> >
> > Kernel 6.11.0-kp-base-10547-g684a64bf32b6 on an ppc64le
> >
> > ------------[ cut here ]------------
> >
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 6.11.0-kp-base-10547-g684a64bf32b6 #69 Not tainted
> > ------------------------------------------------------
> 
> ...
> 
> > --- interrupt: 900
> > se->sched_delayed
> > WARNING: CPU: 1 PID: 27867 at kernel/sched/fair.c:6062 unthrottle_cfs_rq+0x644/0x660
> 
> ...that warning also spells eventual doom for the box, here it does
> anyway, running LTPs cfs_bandwidth01 testcase and hackbench together,
> box grinds to a halt in pretty short order.
> 

Right, I've picked up your patch for sched/urgent. But this does make me
question Vishal's setup.

He said all he does is compile a kernel, but afaik no regular setup uses
CFS bandwidth by default. So something is 'special' at his end that he's
not been telling us about.

Vishal, could you expand upon your configuration? How come you're using
CFS bandwidth, what else is special?
Re: sched/fair: Kernel panics in pick_next_entity
Posted by Vishal Chourasia 1 month, 3 weeks ago
On Wed, Oct 02, 2024 at 10:49:32AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 01, 2024 at 10:30:26AM +0200, Mike Galbraith wrote:
> > On Tue, 2024-10-01 at 00:45 +0530, Vishal Chourasia wrote:
> > > >
> > > for sanity, I ran the workload (kernel compilation) on the base commit
> > > where the kernel panic was initially observed, which resulted in a
> > > kernel panic, along with it couple of warnings where also printed on the
> > > console, and a circular locking dependency warning with it.
> > >
> > > Kernel 6.11.0-kp-base-10547-g684a64bf32b6 on an ppc64le
> > >
> > > ------------[ cut here ]------------
> > >
> > > ======================================================
> > > WARNING: possible circular locking dependency detected
> > > 6.11.0-kp-base-10547-g684a64bf32b6 #69 Not tainted
> > > ------------------------------------------------------
> > 
> > ...
> > 
> > > --- interrupt: 900
> > > se->sched_delayed
> > > WARNING: CPU: 1 PID: 27867 at kernel/sched/fair.c:6062 unthrottle_cfs_rq+0x644/0x660
> > 
> > ...that warning also spells eventual doom for the box, here it does
> > anyway, running LTPs cfs_bandwidth01 testcase and hackbench together,
> > box grinds to a halt in pretty short order.
> > 
> 
> Right, I've picked up your patch for sched/urgent. But this does make me
> question Vishal's setup.
> 
> He said all he does is compile a kernel, but afaik no regular setup uses
> CFS bandwidth by default. So something is 'special' at his end that he's
> not been telling us about.
Yes Peter, I'm compiling the kernel from source. While I'm not running the 
compilation within a cgroup that has bandwidth limits set, there are some 
system services running in the background that do have bandwidth 
limitations applied.


# find . -name cpu.max -exec cat {} +
max 100000
max 100000
max 100000
max 100000
max 100000
max 100000
5000 100000
34000 100000
10000 100000
31000 100000
max 100000
max 100000
max 100000
max 100000
max 100000
max 100000

> 
> Vishal, could you expand upon your configuration? How come you're using
> CFS bandwidth, what else is special?
config cfs_bandwidth is enabled by default in both the
pseries_le_defconfig and the distro kernel config I'm using for the
compilation.

Let me know if you need any more info. I hope I have answered your
queries.

Thanks!
Re: sched/fair: Kernel panics in pick_next_entity
Posted by Vishal Chourasia 1 month, 4 weeks ago
On Tue, Oct 01, 2024 at 10:30:26AM +0200, Mike Galbraith wrote:
> On Tue, 2024-10-01 at 00:45 +0530, Vishal Chourasia wrote:
> > >
> > for sanity, I ran the workload (kernel compilation) on the base commit
> > where the kernel panic was initially observed, which resulted in a
> > kernel panic, along with it couple of warnings where also printed on the
> > console, and a circular locking dependency warning with it.
> >
> > Kernel 6.11.0-kp-base-10547-g684a64bf32b6 on an ppc64le
> >
> > ------------[ cut here ]------------
> >
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 6.11.0-kp-base-10547-g684a64bf32b6 #69 Not tainted
> > ------------------------------------------------------
> 
> ...
> 
> > --- interrupt: 900
> > se->sched_delayed
> > WARNING: CPU: 1 PID: 27867 at kernel/sched/fair.c:6062 unthrottle_cfs_rq+0x644/0x660
> 
> ...that warning also spells eventual doom for the box, here it does
> anyway, running LTPs cfs_bandwidth01 testcase and hackbench together,
> box grinds to a halt in pretty short order.
> 
> With the patchlet below (submitted), I can beat on box to my hearts
> content without meeting throttle/unthrottle woes.
> 
> sched: Fix sched_delayed vs cfs_bandwidth
> 
> Meeting an unfinished DELAY_DEQUEUE treated entity in unthrottle_cfs_rq()
> leads to a couple terminal scenarios.  Finish it first, so ENQUEUE_WAKEUP
> can proceed as it would have sans DELAY_DEQUEUE treatment.
> 
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
> Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
Hello Mike, Thank you the patch!

With the below changes, I don't see any warnings been printed on the
console, along with it, there were no kernel panics

Tested-by: Vishal Chourasia <vishalc@linux.ibm.com>
> ---
>  kernel/sched/fair.c |    9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6058,10 +6058,13 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
>  	for_each_sched_entity(se) {
>  		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> 
> -		if (se->on_rq) {
> -			SCHED_WARN_ON(se->sched_delayed);
> +		/* Handle any unfinished DELAY_DEQUEUE business first. */
> +		if (se->sched_delayed) {
> +			int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED;
> +
> +			dequeue_entity(qcfs_rq, se, flags);
> +		} else if (se->on_rq)
>  			break;
> -		}
>  		enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
> 
>  		if (cfs_rq_is_idle(group_cfs_rq(se)))
>
Re: sched/fair: Kernel panics in pick_next_entity
Posted by Mike Galbraith 1 month, 4 weeks ago
On Tue, 2024-10-01 at 19:38 +0530, Vishal Chourasia wrote:
>
> With the below changes, I don't see any warnings been printed on the
> console, along with it, there were no kernel panics

Darn.  I was hoping to quash the easily reproducible noisy avalanche of
doom to let elusive $subject through, that being way more interesting.

When I hit $subject, LTPs cfs_bandwidth01 was running, but there was no
warning prelude, box went straight to panic.  Trying to reproduce using
that testcase plus hackbench as efficacy booster produced lots of dying
box noise, but zero sneaky $subject instances before or after quash.

	-Mike
Re: sched/fair: Kernel panics in pick_next_entity
Posted by Mike Galbraith 1 month, 3 weeks ago
On Tue, 2024-10-01 at 18:41 +0200, Mike Galbraith wrote:
>
> When I hit $subject, LTPs cfs_bandwidth01 was running, but there was no
> warning prelude, box went straight to panic.  Trying to reproduce using
> that testcase plus hackbench as efficacy booster produced lots of dying
> box noise, but zero sneaky $subject instances before or after quash.

Hohum, this morning I did hit..

1. WARNING: CPU: 5 PID: 931 at kernel/sched/fair.c:6062 unthrottle_cfs_rq+0x4c3/0x4d0
2. WARNING: CPU: 0 PID: 786 at kernel/sched/fair.c:704 update_entity_lag+0x79/0x90
3. NULL dereference in pick_next_entity()

..instead of brick, workqueue stall etc. Twice.  Not that it matters.
I was only mucking about with it because I was curious whether telling
LB to stop moving sched_delayed tasks about would matter. (nope)

	-Mike
Re: sched/fair: Kernel panics in pick_next_entity
Posted by Luis Machado 2 months ago
On 9/26/24 13:42, Vishal Chourasia wrote:
> I've noticed a kernel panic consistently occurring on the mainline v6.11
> kernel (see attached dmesg log below). 
> 
> The panic occurs almost every time I build the Linux kernel from source.
> 
> Steps to Reproduce:
> 
> make clean
> ./scripts/config -e LOCALVERSION_AUTO
> ./scripts/config --set-str LOCALVERSION -master-with-print
> make localmodconfig
> make -j8 -s vmlinux modules
> 
>>From my investigation, it seems that the function pick_eevdf() can return NULL.
> Commit f12e1488 ("sched/fair: Prepare pick_next_task() for delayed dequeue") 
> introduces an access on the return value of pick_eevdf(). If 'se' was NULL, 
> it can lead to a null pointer dereference. 
> 
> # objdump -S vmlinux | grep -C 5 c0000000001cfebc
> c0000000001cfeb0:       00 00 00 60     nop
>         struct sched_entity *se = pick_eevdf(cfs_rq);
> c0000000001cfeb4:       78 f3 c3 7f     mr      r3,r30
> c0000000001cfeb8:       01 46 ff 4b     bl      c0000000001c44b8 <pick_eevdf+0x8>
>         if (se->sched_delayed) {
> c0000000001cfebc:       51 00 23 89     lbz     r9,81(r3)  <<<<<<
>         struct sched_entity *se = pick_eevdf(cfs_rq);
> c0000000001cfec0:       78 1b 7f 7c     mr      r31,r3
>         if (se->sched_delayed) {
> c0000000001cfec4:       00 00 09 2c     cmpwi   r9,0
> c0000000001cfec8:       98 00 82 40     bne     c0000000001cff60 <pick_next_entity+0xe0>
> 
> r3 is NULL which can be verified from the register context shared in the
> dmesg logs
> 
> Here is the state of my git repository:
> # git log --oneline
> 684a64bf32b6 (HEAD -> master, origin/master, origin/HEAD) Merge tag 'nfs-for-6.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
> f7fccaa77271 Merge tag 'fuse-update-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
> 4165cee7ecb1 Merge tag 'exfat-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
> 79952bdcbcea Merge tag 'f2fs-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
> fa8380a06bd0 Merge tag 'bpf-next-6.12-struct-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
> 68e5c7d4cefb Merge tag 'kbuild-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
> 7f8de2bf0725 Merge tag 'linux-cpupower-6.12-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux
> cd3d64772981 Merge tag 'i3c/for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux
> 
> 
> To determine why pick_eevdf() would return NULL, I added a few printk statements
> Based on one of the printk logs in the shared dmesg log, it appears that if
> pick_eevdf() is called for a 'cfs_rq' whose 'cfs_rq->curr' is NULL and there
> are no eligible entities on that 'cfs_rq', it will return NULL. 
> 
> # git diff
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 225b31aaee55..8c5b96f1cd49 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -48,6 +48,7 @@
>  #include <linux/ratelimit.h>
>  #include <linux/task_work.h>
>  #include <linux/rbtree_augmented.h>
> +#include <linux/delay.h>
> 
>  #include <asm/switch_to.h>
> 
> @@ -907,16 +908,25 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
>  static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>  {
>         struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> +       struct rb_node *tmpnode = node;
>         struct sched_entity *se = __pick_first_entity(cfs_rq);
> +       struct sched_entity *tmpse = se;
>         struct sched_entity *curr = cfs_rq->curr;
> +       struct sched_entity *tmpcurr = curr;
>         struct sched_entity *best = NULL;
> -
> +       struct sched_entity *tmp = NULL;
>         /*
>          * We can safely skip eligibility check if there is only one entity
>          * in this cfs_rq, saving some cycles.
>          */
> -       if (cfs_rq->nr_running == 1)
> -               return curr && curr->on_rq ? curr : se;
> +       if (cfs_rq->nr_running == 1) {
> +               tmp = curr && curr->on_rq ? curr : se;
> +               if (!tmp) {
> +                       printk(KERN_INFO "pick_eevdf curr: %p, se: %p\n", curr, se);
> +                       mdelay(10);
> +               }
> +               return tmp;
> +       }
> 
>         if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>                 curr = NULL;
> @@ -966,6 +976,11 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>         if (!best || (curr && entity_before(curr, best)))
>                 best = curr;
> 
> +       if (!best) {
> +               printk(KERN_INFO "best=%p, curr=%p, se=%p, node=%p ocrr=%p, ose=%p, onode=%p\n", best, curr, se, node, tmpcurr, tmpse, tmpnode);
> +               mdelay(10);
> +       }
> +
>         return best;
>  }
> 
>>From the logs below:
> [ 1355.763494] best=0000000000000000, curr=0000000000000000, se=00000000be02c573, node=0000000000000000 ocrr=0000000000000000, ose=00000000b1d4c4d5, onode=0000000023eb8c00
> 
> I have not been able to think of a quick reproducer to trigger a panic
> https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/for this case. Hoping if someone can guide me on this.
> 
> Note: The following dmesg log also contains a warning reported too. Panic
> happens later.
> 
> ------------[ cut here ]------------
> !se->on_rq
> WARNING: CPU: 1 PID: 92333 at kernel/sched/fair.c:705 update_entity_lag+0xcc/0xf0
> Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
> CPU: 1 UID: 0 PID: 92333 Comm: genksyms Tainted: G        W          6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64
> Tainted: [W]=WARN
> Hardware name: IBM,9080-HEX POWER10 (architected) hv:phyp pSeries
> NIP:  c0000000001cdfcc LR: c0000000001cdfc8 CTR: 0000000000000000
> REGS: c00000005c62ee50 TRAP: 0700   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
> MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002222  XER: 00000005
> CFAR: c000000000156a10 IRQMASK: 1
> GPR00: c0000000001cdfc8 c00000005c62f0f0 c000000001b57400 000000000000000a
> GPR04: 00000000ffff7fff c00000005c62eee0 c00000005c62eed8 00000007fb050000
> GPR08: 0000000000000027 0000000000000000 0000000000000000 c000000002758de0
> GPR12: c000000002a18d88 c0000007fffef480 0000000000000000 0000000000000000
> GPR16: c000000002c56d40 0000000000000000 c00000005c62f5b4 0000000000000000
> GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c000000003cd7300
> GPR24: 0000000000000000 0000000000000008 c0000007fd1d3f80 0000000000000000
> GPR28: 0000000000000001 0000000000000009 c0000007fd1d4080 c0000000656a0000
> NIP [c0000000001cdfcc] update_entity_lag+0xcc/0xf0
> LR [c0000000001cdfc8] update_entity_lag+0xc8/0xf0
> Call Trace:
> [c00000005c62f0f0] [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 (unreliable)
> [c00000005c62f160] [c0000000001cea80] dequeue_entity+0xb0/0x6d0
> [c00000005c62f1f0] [c0000000001cf8b0] dequeue_entities+0x150/0x600
> [c00000005c62f2c0] [c0000000001d02a8] dequeue_task_fair+0x158/0x2e0
> [c00000005c62f300] [c0000000001b5ea4] dequeue_task+0x64/0x200
> [c00000005c62f380] [c0000000001cc950] detach_tasks+0x140/0x420
> [c00000005c62f3f0] [c0000000001d6044] sched_balance_rq+0x214/0x7c0
> [c00000005c62f550] [c0000000001d6830] sched_balance_newidle+0x240/0x630
> [c00000005c62f640] [c0000000001d6d0c] pick_next_task_fair+0x7c/0x4a0
> [c00000005c62f6d0] [c0000000001afc50] __pick_next_task+0x60/0x2d0
> [c00000005c62f730] [c0000000010e8ce8] __schedule+0x198/0x840
> [c00000005c62f810] [c0000000010e93d0] schedule+0x40/0x110
> [c00000005c62f880] [c00000000064c574] pipe_read+0x424/0x6a0
> [c00000005c62f960] [c00000000063a0fc] vfs_read+0x30c/0x3d0
> [c00000005c62fa10] [c00000000063adf4] ksys_read+0x104/0x160
> [c00000005c62fa60] [c000000000031678] system_call_exception+0x138/0x2d0
> [c00000005c62fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
> --- interrupt: 3000 at 0x7fffb8f4a0c4
> NIP:  00007fffb8f4a0c4 LR: 00007fffb8f4a0c4 CTR: 0000000000000000
> REGS: c00000005c62fe80 TRAP: 3000   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
> MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 48004222  XER: 00000000
> IRQMASK: 0
> GPR00: 0000000000000003 00007fffe27ca2d0 00007fffb9067100 0000000000000000
> GPR04: 000000003076051b 0000000000002000 00007fffb9060b50 0000000000000000
> GPR08: 000000000000006f 0000000000000000 0000000000000000 0000000000000000
> GPR12: 0000000000000000 00007fffb917a560 000000000000000b 0000000000000000
> GPR16: 00007fffb9173570 0000000000002000 000000000000000c 000000000000000b
> GPR20: 00000000307604c0 0000000030762509 00000000100484e8 0000000030762515
> GPR24: 0000000010022b80 0000000000000000 0000000000000005 0000000000002000
> GPR28: 000000003076051b 0000000000002000 00007fffb905e508 00007fffb9060b50
> NIP [00007fffb8f4a0c4] 0x7fffb8f4a0c4
> LR [00007fffb8f4a0c4] 0x7fffb8f4a0c4
> --- interrupt: 3000
> Code: 4e800020 3d220104 89297c19 2c090000 4082ff8c 3c62ff99 39200001 3d420104 38632d90 992a7c19 4bf88965 60000000 <0fe00000> 4bffff68 60000000 60000000
> ---[ end trace 0000000000000000 ]---
> best=0000000000000000, curr=0000000000000000, se=00000000be02c573, node=0000000000000000 ocrr=0000000000000000, ose=00000000b1d4c4d5, onode=0000000023eb8c00
> Kernel attempted to read user page (51) - exploit attempt? (uid: 0)
> BUG: Kernel NULL pointer dereference on read at 0x00000051
> Faulting instruction address: 0xc0000000001cfebc
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G        W          6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64
> Tainted: [W]=WARN
> Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
> NIP:  c0000000001cfebc LR: c0000000001cfebc CTR: 0000000000000000
> REGS: c000000002c13950 TRAP: 0300   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
> MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44022282  XER: 0000000d
> CFAR: c0000000001c4758 DAR: 0000000000000051 DSISR: 40000000 IRQMASK: 1
> GPR00: c0000000001cfebc c000000002c13bf0 c000000001b57400 0000000000000000
> GPR04: c0000007fd147108 c0000007fd1cd600 c000000002c13968 00000007fafb0000
> GPR08: 0000000000000027 00000000004e2002 0000896cdac1f9a0 0000000000002000
> GPR12: c000000002a18d88 c000000002f10000 0000000000000000 00000007fffe0000
> GPR16: 00000007fffd0000 0000000000000000 00000007fffe0114 0000000000000000
> GPR20: 0000000000000000 0000000000000000 c0000000010e95ec c000000002bd6380
> GPR24: c000000002bd6da8 c000000002c524e8 0000000000000000 c000000002bd6380
> GPR28: c000000002bd6380 c0000007fd1d3f80 c0000007fd1d4080 c0000007fd1d3f80
> NIP [c0000000001cfebc] pick_next_entity+0x3c/0x180
> LR [c0000000001cfebc] pick_next_entity+0x3c/0x180
> Call Trace:
> [c000000002c13bf0] [c0000000001cfebc] pick_next_entity+0x3c/0x180 (unreliable)
> [c000000002c13c70] [c0000000001d0064] pick_task_fair+0x64/0x130
> [c000000002c13cb0] [c0000000001d6cd8] pick_next_task_fair+0x48/0x4a0
> [c000000002c13d40] [c0000000001afc50] __pick_next_task+0x60/0x2d0
> [c000000002c13da0] [c0000000010e8ce8] __schedule+0x198/0x840
> [c000000002c13e80] [c0000000010e95ec] schedule_idle+0x3c/0x70
> [c000000002c13eb0] [c0000000001eb1d0] do_idle+0x160/0x1b0
> [c000000002c13f00] [c0000000001eb4d0] cpu_startup_entry+0x50/0x60
> [c000000002c13f30] [c0000000000110e8] rest_init+0xf0/0xf4
> [c000000002c13f60] [c0000000020053a4] do_initcalls+0x0/0x190
> [c000000002c13fe0] [c00000000000e788] start_here_common+0x1c/0x20
> Code: 60000000 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7c7d1b78 7c9e2378 f8010010 f821ff81 60000000 7fc3f378 4bff4601 <89230051> 7c7f1b78 2c090000 40820098
> ---[ end trace 0000000000000000 ]---
> pstore: backend (nvram) writing error (-1)
> 
> Kernel panic - not syncing: Fatal exception
> Rebooting in 10 seconds..
> 
> 

I have a feeling this might be what is described in this thread: https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/

Likely a result of an overflow for vruntime. And we come out of pick_eevdf without a valid entity.