I've noticed a kernel panic consistently occurring on the mainline v6.11
kernel (see attached dmesg log below).
The panic occurs almost every time I build the Linux kernel from source.
Steps to Reproduce:
make clean
./scripts/config -e LOCALVERSION_AUTO
./scripts/config --set-str LOCALVERSION -master-with-print
make localmodconfig
make -j8 -s vmlinux modules
>From my investigation, it seems that the function pick_eevdf() can return NULL.
Commit f12e1488 ("sched/fair: Prepare pick_next_task() for delayed dequeue")
introduces an access on the return value of pick_eevdf(). If 'se' was NULL,
it can lead to a null pointer dereference.
# objdump -S vmlinux | grep -C 5 c0000000001cfebc
c0000000001cfeb0: 00 00 00 60 nop
struct sched_entity *se = pick_eevdf(cfs_rq);
c0000000001cfeb4: 78 f3 c3 7f mr r3,r30
c0000000001cfeb8: 01 46 ff 4b bl c0000000001c44b8 <pick_eevdf+0x8>
if (se->sched_delayed) {
c0000000001cfebc: 51 00 23 89 lbz r9,81(r3) <<<<<<
struct sched_entity *se = pick_eevdf(cfs_rq);
c0000000001cfec0: 78 1b 7f 7c mr r31,r3
if (se->sched_delayed) {
c0000000001cfec4: 00 00 09 2c cmpwi r9,0
c0000000001cfec8: 98 00 82 40 bne c0000000001cff60 <pick_next_entity+0xe0>
r3 is NULL which can be verified from the register context shared in the
dmesg logs
Here is the state of my git repository:
# git log --oneline
684a64bf32b6 (HEAD -> master, origin/master, origin/HEAD) Merge tag 'nfs-for-6.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
f7fccaa77271 Merge tag 'fuse-update-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
4165cee7ecb1 Merge tag 'exfat-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
79952bdcbcea Merge tag 'f2fs-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
fa8380a06bd0 Merge tag 'bpf-next-6.12-struct-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
68e5c7d4cefb Merge tag 'kbuild-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
7f8de2bf0725 Merge tag 'linux-cpupower-6.12-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux
cd3d64772981 Merge tag 'i3c/for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux
To determine why pick_eevdf() would return NULL, I added a few printk statements
Based on one of the printk logs in the shared dmesg log, it appears that if
pick_eevdf() is called for a 'cfs_rq' whose 'cfs_rq->curr' is NULL and there
are no eligible entities on that 'cfs_rq', it will return NULL.
# git diff
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 225b31aaee55..8c5b96f1cd49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -48,6 +48,7 @@
#include <linux/ratelimit.h>
#include <linux/task_work.h>
#include <linux/rbtree_augmented.h>
+#include <linux/delay.h>
#include <asm/switch_to.h>
@@ -907,16 +908,25 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
{
struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+ struct rb_node *tmpnode = node;
struct sched_entity *se = __pick_first_entity(cfs_rq);
+ struct sched_entity *tmpse = se;
struct sched_entity *curr = cfs_rq->curr;
+ struct sched_entity *tmpcurr = curr;
struct sched_entity *best = NULL;
-
+ struct sched_entity *tmp = NULL;
/*
* We can safely skip eligibility check if there is only one entity
* in this cfs_rq, saving some cycles.
*/
- if (cfs_rq->nr_running == 1)
- return curr && curr->on_rq ? curr : se;
+ if (cfs_rq->nr_running == 1) {
+ tmp = curr && curr->on_rq ? curr : se;
+ if (!tmp) {
+ printk(KERN_INFO "pick_eevdf curr: %p, se: %p\n", curr, se);
+ mdelay(10);
+ }
+ return tmp;
+ }
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL;
@@ -966,6 +976,11 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
if (!best || (curr && entity_before(curr, best)))
best = curr;
+ if (!best) {
+ printk(KERN_INFO "best=%p, curr=%p, se=%p, node=%p ocrr=%p, ose=%p, onode=%p\n", best, curr, se, node, tmpcurr, tmpse, tmpnode);
+ mdelay(10);
+ }
+
return best;
}
From the logs below:
[ 1355.763494] best=0000000000000000, curr=0000000000000000, se=00000000be02c573, node=0000000000000000 ocrr=0000000000000000, ose=00000000b1d4c4d5, onode=0000000023eb8c00
I have not been able to think of a quick reproducer to trigger a panic
for this case. Hoping if someone can guide me on this.
Note: The following dmesg log also contains a warning reported too. Panic
happens later.
------------[ cut here ]------------
!se->on_rq
WARNING: CPU: 1 PID: 92333 at kernel/sched/fair.c:705 update_entity_lag+0xcc/0xf0
Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 1 UID: 0 PID: 92333 Comm: genksyms Tainted: G W 6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64
Tainted: [W]=WARN
Hardware name: IBM,9080-HEX POWER10 (architected) hv:phyp pSeries
NIP: c0000000001cdfcc LR: c0000000001cdfc8 CTR: 0000000000000000
REGS: c00000005c62ee50 TRAP: 0700 Tainted: G W (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 24002222 XER: 00000005
CFAR: c000000000156a10 IRQMASK: 1
GPR00: c0000000001cdfc8 c00000005c62f0f0 c000000001b57400 000000000000000a
GPR04: 00000000ffff7fff c00000005c62eee0 c00000005c62eed8 00000007fb050000
GPR08: 0000000000000027 0000000000000000 0000000000000000 c000000002758de0
GPR12: c000000002a18d88 c0000007fffef480 0000000000000000 0000000000000000
GPR16: c000000002c56d40 0000000000000000 c00000005c62f5b4 0000000000000000
GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c000000003cd7300
GPR24: 0000000000000000 0000000000000008 c0000007fd1d3f80 0000000000000000
GPR28: 0000000000000001 0000000000000009 c0000007fd1d4080 c0000000656a0000
NIP [c0000000001cdfcc] update_entity_lag+0xcc/0xf0
LR [c0000000001cdfc8] update_entity_lag+0xc8/0xf0
Call Trace:
[c00000005c62f0f0] [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 (unreliable)
[c00000005c62f160] [c0000000001cea80] dequeue_entity+0xb0/0x6d0
[c00000005c62f1f0] [c0000000001cf8b0] dequeue_entities+0x150/0x600
[c00000005c62f2c0] [c0000000001d02a8] dequeue_task_fair+0x158/0x2e0
[c00000005c62f300] [c0000000001b5ea4] dequeue_task+0x64/0x200
[c00000005c62f380] [c0000000001cc950] detach_tasks+0x140/0x420
[c00000005c62f3f0] [c0000000001d6044] sched_balance_rq+0x214/0x7c0
[c00000005c62f550] [c0000000001d6830] sched_balance_newidle+0x240/0x630
[c00000005c62f640] [c0000000001d6d0c] pick_next_task_fair+0x7c/0x4a0
[c00000005c62f6d0] [c0000000001afc50] __pick_next_task+0x60/0x2d0
[c00000005c62f730] [c0000000010e8ce8] __schedule+0x198/0x840
[c00000005c62f810] [c0000000010e93d0] schedule+0x40/0x110
[c00000005c62f880] [c00000000064c574] pipe_read+0x424/0x6a0
[c00000005c62f960] [c00000000063a0fc] vfs_read+0x30c/0x3d0
[c00000005c62fa10] [c00000000063adf4] ksys_read+0x104/0x160
[c00000005c62fa60] [c000000000031678] system_call_exception+0x138/0x2d0
[c00000005c62fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
--- interrupt: 3000 at 0x7fffb8f4a0c4
NIP: 00007fffb8f4a0c4 LR: 00007fffb8f4a0c4 CTR: 0000000000000000
REGS: c00000005c62fe80 TRAP: 3000 Tainted: G W (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 48004222 XER: 00000000
IRQMASK: 0
GPR00: 0000000000000003 00007fffe27ca2d0 00007fffb9067100 0000000000000000
GPR04: 000000003076051b 0000000000002000 00007fffb9060b50 0000000000000000
GPR08: 000000000000006f 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fffb917a560 000000000000000b 0000000000000000
GPR16: 00007fffb9173570 0000000000002000 000000000000000c 000000000000000b
GPR20: 00000000307604c0 0000000030762509 00000000100484e8 0000000030762515
GPR24: 0000000010022b80 0000000000000000 0000000000000005 0000000000002000
GPR28: 000000003076051b 0000000000002000 00007fffb905e508 00007fffb9060b50
NIP [00007fffb8f4a0c4] 0x7fffb8f4a0c4
LR [00007fffb8f4a0c4] 0x7fffb8f4a0c4
--- interrupt: 3000
Code: 4e800020 3d220104 89297c19 2c090000 4082ff8c 3c62ff99 39200001 3d420104 38632d90 992a7c19 4bf88965 60000000 <0fe00000> 4bffff68 60000000 60000000
---[ end trace 0000000000000000 ]---
best=0000000000000000, curr=0000000000000000, se=00000000be02c573, node=0000000000000000 ocrr=0000000000000000, ose=00000000b1d4c4d5, onode=0000000023eb8c00
Kernel attempted to read user page (51) - exploit attempt? (uid: 0)
BUG: Kernel NULL pointer dereference on read at 0x00000051
Faulting instruction address: 0xc0000000001cfebc
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G W 6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64
Tainted: [W]=WARN
Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
NIP: c0000000001cfebc LR: c0000000001cfebc CTR: 0000000000000000
REGS: c000000002c13950 TRAP: 0300 Tainted: G W (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 44022282 XER: 0000000d
CFAR: c0000000001c4758 DAR: 0000000000000051 DSISR: 40000000 IRQMASK: 1
GPR00: c0000000001cfebc c000000002c13bf0 c000000001b57400 0000000000000000
GPR04: c0000007fd147108 c0000007fd1cd600 c000000002c13968 00000007fafb0000
GPR08: 0000000000000027 00000000004e2002 0000896cdac1f9a0 0000000000002000
GPR12: c000000002a18d88 c000000002f10000 0000000000000000 00000007fffe0000
GPR16: 00000007fffd0000 0000000000000000 00000007fffe0114 0000000000000000
GPR20: 0000000000000000 0000000000000000 c0000000010e95ec c000000002bd6380
GPR24: c000000002bd6da8 c000000002c524e8 0000000000000000 c000000002bd6380
GPR28: c000000002bd6380 c0000007fd1d3f80 c0000007fd1d4080 c0000007fd1d3f80
NIP [c0000000001cfebc] pick_next_entity+0x3c/0x180
LR [c0000000001cfebc] pick_next_entity+0x3c/0x180
Call Trace:
[c000000002c13bf0] [c0000000001cfebc] pick_next_entity+0x3c/0x180 (unreliable)
[c000000002c13c70] [c0000000001d0064] pick_task_fair+0x64/0x130
[c000000002c13cb0] [c0000000001d6cd8] pick_next_task_fair+0x48/0x4a0
[c000000002c13d40] [c0000000001afc50] __pick_next_task+0x60/0x2d0
[c000000002c13da0] [c0000000010e8ce8] __schedule+0x198/0x840
[c000000002c13e80] [c0000000010e95ec] schedule_idle+0x3c/0x70
[c000000002c13eb0] [c0000000001eb1d0] do_idle+0x160/0x1b0
[c000000002c13f00] [c0000000001eb4d0] cpu_startup_entry+0x50/0x60
[c000000002c13f30] [c0000000000110e8] rest_init+0xf0/0xf4
[c000000002c13f60] [c0000000020053a4] do_initcalls+0x0/0x190
[c000000002c13fe0] [c00000000000e788] start_here_common+0x1c/0x20
Code: 60000000 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7c7d1b78 7c9e2378 f8010010 f821ff81 60000000 7fc3f378 4bff4601 <89230051> 7c7f1b78 2c090000 40820098
---[ end trace 0000000000000000 ]---
pstore: backend (nvram) writing error (-1)
Kernel panic - not syncing: Fatal exception
Rebooting in 10 seconds..
On Thu, Sep 26, 2024 at 06:12:19PM +0530, Vishal Chourasia wrote: > I've noticed a kernel panic consistently occurring on the mainline v6.11 > kernel (see attached dmesg log below). > > The panic occurs almost every time I build the Linux kernel from source. > > Steps to Reproduce: > > make clean > ./scripts/config -e LOCALVERSION_AUTO > ./scripts/config --set-str LOCALVERSION -master-with-print > make localmodconfig > make -j8 -s vmlinux modules > > >From my investigation, it seems that the function pick_eevdf() can return NULL. > Commit f12e1488 ("sched/fair: Prepare pick_next_task() for delayed dequeue") > introduces an access on the return value of pick_eevdf(). If 'se' was NULL, > it can lead to a null pointer dereference. Even before that commit we relied on that thing not being NULL, notably f12e1488^1 has: se = pick_next_entity(cfs_rq); cfs_rq = group_cfs_rq(se); Which will similarly explode when pick_eevdf() goes wobbly. > To determine why pick_eevdf() would return NULL, I added a few printk statements > Based on one of the printk logs in the shared dmesg log, it appears that if > pick_eevdf() is called for a 'cfs_rq' whose 'cfs_rq->curr' is NULL and there > are no eligible entities on that 'cfs_rq', it will return NULL. Right, that is not a valid state. Which seems to suggest something went sideways with the eligibility thing -- as Luis suggested. > I have not been able to think of a quick reproducer to trigger a panic > for this case. Hoping if someone can guide me on this. > > Note: The following dmesg log also contains a warning reported too. Panic > happens later. > > ------------[ cut here ]------------ > !se->on_rq > WARNING: CPU: 1 PID: 92333 at kernel/sched/fair.c:705 update_entity_lag+0xcc/0xf0 > Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse > CPU: 1 UID: 0 PID: 92333 Comm: genksyms Tainted: G W 6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64 > Tainted: [W]=WARN > Hardware name: IBM,9080-HEX POWER10 (architected) hv:phyp pSeries > NIP: c0000000001cdfcc LR: c0000000001cdfc8 CTR: 0000000000000000 > REGS: c00000005c62ee50 TRAP: 0700 Tainted: G W (6.11.0-master-with-print-10547-g684a64bf32b6-dirty) > MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 24002222 XER: 00000005 > CFAR: c000000000156a10 IRQMASK: 1 > GPR00: c0000000001cdfc8 c00000005c62f0f0 c000000001b57400 000000000000000a > GPR04: 00000000ffff7fff c00000005c62eee0 c00000005c62eed8 00000007fb050000 > GPR08: 0000000000000027 0000000000000000 0000000000000000 c000000002758de0 > GPR12: c000000002a18d88 c0000007fffef480 0000000000000000 0000000000000000 > GPR16: c000000002c56d40 0000000000000000 c00000005c62f5b4 0000000000000000 > GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c000000003cd7300 > GPR24: 0000000000000000 0000000000000008 c0000007fd1d3f80 0000000000000000 > GPR28: 0000000000000001 0000000000000009 c0000007fd1d4080 c0000000656a0000 > NIP [c0000000001cdfcc] update_entity_lag+0xcc/0xf0 > LR [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 > Call Trace: > [c00000005c62f0f0] [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 (unreliable) > [c00000005c62f160] [c0000000001cea80] dequeue_entity+0xb0/0x6d0 > [c00000005c62f1f0] [c0000000001cf8b0] dequeue_entities+0x150/0x600 > [c00000005c62f2c0] [c0000000001d02a8] dequeue_task_fair+0x158/0x2e0 > [c00000005c62f300] [c0000000001b5ea4] dequeue_task+0x64/0x200 > [c00000005c62f380] [c0000000001cc950] detach_tasks+0x140/0x420 > [c00000005c62f3f0] [c0000000001d6044] sched_balance_rq+0x214/0x7c0 > [c00000005c62f550] [c0000000001d6830] sched_balance_newidle+0x240/0x630 > [c00000005c62f640] [c0000000001d6d0c] pick_next_task_fair+0x7c/0x4a0 > [c00000005c62f6d0] [c0000000001afc50] __pick_next_task+0x60/0x2d0 > [c00000005c62f730] [c0000000010e8ce8] __schedule+0x198/0x840 > [c00000005c62f810] [c0000000010e93d0] schedule+0x40/0x110 > [c00000005c62f880] [c00000000064c574] pipe_read+0x424/0x6a0 > [c00000005c62f960] [c00000000063a0fc] vfs_read+0x30c/0x3d0 > [c00000005c62fa10] [c00000000063adf4] ksys_read+0x104/0x160 > [c00000005c62fa60] [c000000000031678] system_call_exception+0x138/0x2d0 > [c00000005c62fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec So that is a 'fun' one, I don't remember seeing that before. It says we're trying to dequeue a task that is not on the runqueue. The big new thing this merge window -- I'm assuming v6.11 is good -- is DEQUEUE_DELAYED. Does this error go away if you flip that in kernel/sched/features.h ?
On Mon, Sep 30, 2024 at 04:41:57PM +0200, Peter Zijlstra wrote: > On Thu, Sep 26, 2024 at 06:12:19PM +0530, Vishal Chourasia wrote: > > I've noticed a kernel panic consistently occurring on the mainline v6.11 > > kernel (see attached dmesg log below). > > > > The panic occurs almost every time I build the Linux kernel from source. > > > > Steps to Reproduce: > > > > make clean > > ./scripts/config -e LOCALVERSION_AUTO > > ./scripts/config --set-str LOCALVERSION -master-with-print > > make localmodconfig > > make -j8 -s vmlinux modules > > > > >From my investigation, it seems that the function pick_eevdf() can return NULL. > > Commit f12e1488 ("sched/fair: Prepare pick_next_task() for delayed dequeue") > > introduces an access on the return value of pick_eevdf(). If 'se' was NULL, > > it can lead to a null pointer dereference. > > Even before that commit we relied on that thing not being NULL, notably > f12e1488^1 has: > > se = pick_next_entity(cfs_rq); > cfs_rq = group_cfs_rq(se); > > Which will similarly explode when pick_eevdf() goes wobbly. > > > To determine why pick_eevdf() would return NULL, I added a few printk statements > > Based on one of the printk logs in the shared dmesg log, it appears that if > > pick_eevdf() is called for a 'cfs_rq' whose 'cfs_rq->curr' is NULL and there > > are no eligible entities on that 'cfs_rq', it will return NULL. > > Right, that is not a valid state. Which seems to suggest something went > sideways with the eligibility thing -- as Luis suggested. > > > I have not been able to think of a quick reproducer to trigger a panic > > for this case. Hoping if someone can guide me on this. > > > > Note: The following dmesg log also contains a warning reported too. Panic > > happens later. > > > > ------------[ cut here ]------------ > > !se->on_rq > > WARNING: CPU: 1 PID: 92333 at kernel/sched/fair.c:705 update_entity_lag+0xcc/0xf0 > > Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse > > CPU: 1 UID: 0 PID: 92333 Comm: genksyms Tainted: G W 6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64 > > Tainted: [W]=WARN > > Hardware name: IBM,9080-HEX POWER10 (architected) hv:phyp pSeries > > NIP: c0000000001cdfcc LR: c0000000001cdfc8 CTR: 0000000000000000 > > REGS: c00000005c62ee50 TRAP: 0700 Tainted: G W (6.11.0-master-with-print-10547-g684a64bf32b6-dirty) > > MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 24002222 XER: 00000005 > > CFAR: c000000000156a10 IRQMASK: 1 > > GPR00: c0000000001cdfc8 c00000005c62f0f0 c000000001b57400 000000000000000a > > GPR04: 00000000ffff7fff c00000005c62eee0 c00000005c62eed8 00000007fb050000 > > GPR08: 0000000000000027 0000000000000000 0000000000000000 c000000002758de0 > > GPR12: c000000002a18d88 c0000007fffef480 0000000000000000 0000000000000000 > > GPR16: c000000002c56d40 0000000000000000 c00000005c62f5b4 0000000000000000 > > GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c000000003cd7300 > > GPR24: 0000000000000000 0000000000000008 c0000007fd1d3f80 0000000000000000 > > GPR28: 0000000000000001 0000000000000009 c0000007fd1d4080 c0000000656a0000 > > NIP [c0000000001cdfcc] update_entity_lag+0xcc/0xf0 > > LR [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 > > Call Trace: > > [c00000005c62f0f0] [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 (unreliable) > > [c00000005c62f160] [c0000000001cea80] dequeue_entity+0xb0/0x6d0 > > [c00000005c62f1f0] [c0000000001cf8b0] dequeue_entities+0x150/0x600 > > [c00000005c62f2c0] [c0000000001d02a8] dequeue_task_fair+0x158/0x2e0 > > [c00000005c62f300] [c0000000001b5ea4] dequeue_task+0x64/0x200 > > [c00000005c62f380] [c0000000001cc950] detach_tasks+0x140/0x420 > > [c00000005c62f3f0] [c0000000001d6044] sched_balance_rq+0x214/0x7c0 > > [c00000005c62f550] [c0000000001d6830] sched_balance_newidle+0x240/0x630 > > [c00000005c62f640] [c0000000001d6d0c] pick_next_task_fair+0x7c/0x4a0 > > [c00000005c62f6d0] [c0000000001afc50] __pick_next_task+0x60/0x2d0 > > [c00000005c62f730] [c0000000010e8ce8] __schedule+0x198/0x840 > > [c00000005c62f810] [c0000000010e93d0] schedule+0x40/0x110 > > [c00000005c62f880] [c00000000064c574] pipe_read+0x424/0x6a0 > > [c00000005c62f960] [c00000000063a0fc] vfs_read+0x30c/0x3d0 > > [c00000005c62fa10] [c00000000063adf4] ksys_read+0x104/0x160 > > [c00000005c62fa60] [c000000000031678] system_call_exception+0x138/0x2d0 > > [c00000005c62fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec > > So that is a 'fun' one, I don't remember seeing that before. It says > we're trying to dequeue a task that is not on the runqueue. > > The big new thing this merge window -- I'm assuming v6.11 is good -- is > DEQUEUE_DELAYED. Does this error go away if you flip that in > kernel/sched/features.h ? Yes, with the below diff. I didn't see any warnings or kernel panic while running the workload # git diff diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 290874079f60..38bf8df813d1 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -46,7 +46,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true) * * DELAY_ZERO clips the lag on dequeue (or wakeup) to 0. */ -SCHED_FEAT(DELAY_DEQUEUE, true) +SCHED_FEAT(DELAY_DEQUEUE, false) SCHED_FEAT(DELAY_ZERO, true) /*
On Tue, Oct 01, 2024 at 12:35:16AM +0530, Vishal Chourasia wrote: > On Mon, Sep 30, 2024 at 04:41:57PM +0200, Peter Zijlstra wrote: > > On Thu, Sep 26, 2024 at 06:12:19PM +0530, Vishal Chourasia wrote: > > > I've noticed a kernel panic consistently occurring on the mainline v6.11 > > > kernel (see attached dmesg log below). > > > > > > The panic occurs almost every time I build the Linux kernel from source. > > > > > > Steps to Reproduce: > > > > > > make clean > > > ./scripts/config -e LOCALVERSION_AUTO > > > ./scripts/config --set-str LOCALVERSION -master-with-print > > > make localmodconfig > > > make -j8 -s vmlinux modules > > > > > > >From my investigation, it seems that the function pick_eevdf() can return NULL. > > > Commit f12e1488 ("sched/fair: Prepare pick_next_task() for delayed dequeue") > > > introduces an access on the return value of pick_eevdf(). If 'se' was NULL, > > > it can lead to a null pointer dereference. > > > > Even before that commit we relied on that thing not being NULL, notably > > f12e1488^1 has: > > > > se = pick_next_entity(cfs_rq); > > cfs_rq = group_cfs_rq(se); > > > > Which will similarly explode when pick_eevdf() goes wobbly. > > > > > To determine why pick_eevdf() would return NULL, I added a few printk statements > > > Based on one of the printk logs in the shared dmesg log, it appears that if > > > pick_eevdf() is called for a 'cfs_rq' whose 'cfs_rq->curr' is NULL and there > > > are no eligible entities on that 'cfs_rq', it will return NULL. > > > > Right, that is not a valid state. Which seems to suggest something went > > sideways with the eligibility thing -- as Luis suggested. > > > > > I have not been able to think of a quick reproducer to trigger a panic > > > for this case. Hoping if someone can guide me on this. > > > > > > Note: The following dmesg log also contains a warning reported too. Panic > > > happens later. > > > > > > ------------[ cut here ]------------ > > > !se->on_rq > > > WARNING: CPU: 1 PID: 92333 at kernel/sched/fair.c:705 update_entity_lag+0xcc/0xf0 > > > Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse > > > CPU: 1 UID: 0 PID: 92333 Comm: genksyms Tainted: G W 6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64 > > > Tainted: [W]=WARN > > > Hardware name: IBM,9080-HEX POWER10 (architected) hv:phyp pSeries > > > NIP: c0000000001cdfcc LR: c0000000001cdfc8 CTR: 0000000000000000 > > > REGS: c00000005c62ee50 TRAP: 0700 Tainted: G W (6.11.0-master-with-print-10547-g684a64bf32b6-dirty) > > > MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 24002222 XER: 00000005 > > > CFAR: c000000000156a10 IRQMASK: 1 > > > GPR00: c0000000001cdfc8 c00000005c62f0f0 c000000001b57400 000000000000000a > > > GPR04: 00000000ffff7fff c00000005c62eee0 c00000005c62eed8 00000007fb050000 > > > GPR08: 0000000000000027 0000000000000000 0000000000000000 c000000002758de0 > > > GPR12: c000000002a18d88 c0000007fffef480 0000000000000000 0000000000000000 > > > GPR16: c000000002c56d40 0000000000000000 c00000005c62f5b4 0000000000000000 > > > GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c000000003cd7300 > > > GPR24: 0000000000000000 0000000000000008 c0000007fd1d3f80 0000000000000000 > > > GPR28: 0000000000000001 0000000000000009 c0000007fd1d4080 c0000000656a0000 > > > NIP [c0000000001cdfcc] update_entity_lag+0xcc/0xf0 > > > LR [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 > > > Call Trace: > > > [c00000005c62f0f0] [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 (unreliable) > > > [c00000005c62f160] [c0000000001cea80] dequeue_entity+0xb0/0x6d0 > > > [c00000005c62f1f0] [c0000000001cf8b0] dequeue_entities+0x150/0x600 > > > [c00000005c62f2c0] [c0000000001d02a8] dequeue_task_fair+0x158/0x2e0 > > > [c00000005c62f300] [c0000000001b5ea4] dequeue_task+0x64/0x200 > > > [c00000005c62f380] [c0000000001cc950] detach_tasks+0x140/0x420 > > > [c00000005c62f3f0] [c0000000001d6044] sched_balance_rq+0x214/0x7c0 > > > [c00000005c62f550] [c0000000001d6830] sched_balance_newidle+0x240/0x630 > > > [c00000005c62f640] [c0000000001d6d0c] pick_next_task_fair+0x7c/0x4a0 > > > [c00000005c62f6d0] [c0000000001afc50] __pick_next_task+0x60/0x2d0 > > > [c00000005c62f730] [c0000000010e8ce8] __schedule+0x198/0x840 > > > [c00000005c62f810] [c0000000010e93d0] schedule+0x40/0x110 > > > [c00000005c62f880] [c00000000064c574] pipe_read+0x424/0x6a0 > > > [c00000005c62f960] [c00000000063a0fc] vfs_read+0x30c/0x3d0 > > > [c00000005c62fa10] [c00000000063adf4] ksys_read+0x104/0x160 > > > [c00000005c62fa60] [c000000000031678] system_call_exception+0x138/0x2d0 > > > [c00000005c62fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec > > > > So that is a 'fun' one, I don't remember seeing that before. It says > > we're trying to dequeue a task that is not on the runqueue. > > > > The big new thing this merge window -- I'm assuming v6.11 is good -- is > > DEQUEUE_DELAYED. Does this error go away if you flip that in > > kernel/sched/features.h ? > Yes, with the below diff. I didn't see any warnings or kernel panic > while running the workload > > # git diff > diff --git a/kernel/sched/features.h b/kernel/sched/features.h > index 290874079f60..38bf8df813d1 100644 > --- a/kernel/sched/features.h > +++ b/kernel/sched/features.h > @@ -46,7 +46,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true) > * > * DELAY_ZERO clips the lag on dequeue (or wakeup) to 0. > */ > -SCHED_FEAT(DELAY_DEQUEUE, true) > +SCHED_FEAT(DELAY_DEQUEUE, false) > SCHED_FEAT(DELAY_ZERO, true) > > /* > > for sanity, I ran the workload (kernel compilation) on the base commit where the kernel panic was initially observed, which resulted in a kernel panic, along with it couple of warnings where also printed on the console, and a circular locking dependency warning with it. Kernel 6.11.0-kp-base-10547-g684a64bf32b6 on an ppc64le ------------[ cut here ]------------ ====================================================== WARNING: possible circular locking dependency detected 6.11.0-kp-base-10547-g684a64bf32b6 #69 Not tainted ------------------------------------------------------ cc1/27867 is trying to acquire lock: c000000002950270 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0x2c/0x70 but task is already holding lock: c0000007fb0e6218 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x44/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&rq->__lock){-.-.}-{2:2}: __lock_acquire+0x730/0xf00 lock_acquire.part.0+0xf0/0x2a0 _raw_spin_lock_nested+0x64/0x100 raw_spin_rq_lock_nested+0x44/0x130 __task_rq_lock+0x7c/0x1e0 wake_up_new_task+0x1d0/0x560 kernel_clone+0x1e0/0x5c0 user_mode_thread+0x84/0xc0 rest_init+0x40/0x208 do_initcalls+0x0/0x190 start_here_common+0x1c/0x20 -> #1 (&p->pi_lock){-.-.}-{2:2}: __lock_acquire+0x730/0xf00 lock_acquire.part.0+0xf0/0x2a0 _raw_spin_lock_irqsave+0x7c/0x130 try_to_wake_up+0x70/0xa40 __up.isra.0+0x74/0x90 up+0x7c/0xa0 __up_console_sem+0xb8/0x110 console_unlock+0x140/0x220 vprintk_emit+0x1e4/0x300 devkmsg_emit.constprop.0+0x44/0x54 devkmsg_write+0x11c/0x230 do_iter_readv_writev+0x194/0x280 vfs_writev+0x18c/0x4b0 do_writev+0xa8/0x1f0 system_call_exception+0x148/0x310 system_call_vectored_common+0x15c/0x2ec -> #0 ((console_sem).lock){-.-.}-{2:2}: check_prev_add+0x158/0xe70 validate_chain+0x6b4/0x7e0 __lock_acquire+0x730/0xf00 lock_acquire.part.0+0xf0/0x2a0 _raw_spin_lock_irqsave+0x7c/0x130 down_trylock+0x2c/0x70 __down_trylock_console_sem+0x60/0x170 console_trylock_spinning+0x3c/0x300 vprintk_emit+0x1a4/0x300 vprintk+0x58/0xe0 _printk+0x40/0x54 __warn_printk+0xd8/0x120 unthrottle_cfs_rq+0x640/0x660 distribute_cfs_runtime+0x454/0x8e0 sched_cfs_period_timer+0x130/0x310 __run_hrtimer+0x28c/0x4d0 __hrtimer_run_queues+0xb4/0x140 hrtimer_interrupt+0x124/0x2f0 timer_interrupt+0x15c/0x460 decrementer_common_virt+0x28c/0x290 other info that might help us debug this: Chain exists of: (console_sem).lock --> &p->pi_lock --> &rq->__lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&rq->__lock); lock(&p->pi_lock); lock(&rq->__lock); lock((console_sem).lock); *** DEADLOCK *** 2 locks held by cc1/27867: #0: c000000002cd1670 (rcu_read_lock){....}-{1:3}, at: distribute_cfs_runtime+0x10/0x8e0 #1: c0000007fb0e6218 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x44/0x130 stack backtrace: CPU: 1 UID: 0 PID: 27867 Comm: cc1 Not tainted 6.11.0-kp-base-10547-g684a64bf32b6 #69 Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries Call Trace: [c0000000f994f210] [c00000000126f0a4] dump_stack_lvl+0xc8/0x130 (unreliable) [c0000000f994f250] [c00000000024de78] print_circular_bug+0x248/0x2b0 [c0000000f994f2f0] [c00000000024e154] check_noncircular+0x274/0x2a0 [c0000000f994f3c0] [c00000000024f8d8] check_prev_add+0x158/0xe70 [c0000000f994f470] [c000000000250ca4] validate_chain+0x6b4/0x7e0 [c0000000f994f550] [c0000000002542a0] __lock_acquire+0x730/0xf00 [c0000000f994f660] [c000000000254b60] lock_acquire.part.0+0xf0/0x2a0 [c0000000f994f780] [c0000000012bcb2c] _raw_spin_lock_irqsave+0x7c/0x130 [c0000000f994f7c0] [c0000000012b422c] down_trylock+0x2c/0x70 [c0000000f994f7f0] [c000000000267920] __down_trylock_console_sem+0x60/0x170 [c0000000f994f830] [c0000000002698cc] console_trylock_spinning+0x3c/0x300 [c0000000f994f880] [c00000000026ccf4] vprintk_emit+0x1a4/0x300 [c0000000f994f8f0] [c00000000026f318] vprintk+0x58/0xe0 [c0000000f994f920] [c00000000026ee30] _printk+0x40/0x54 [c0000000f994f940] [c000000000160ac8] __warn_printk+0xd8/0x120 [c0000000f994f9b0] [c0000000001f07c0] unthrottle_cfs_rq+0x640/0x660 [c0000000f994fa50] [c0000000001f1314] distribute_cfs_runtime+0x454/0x8e0 [c0000000f994fb60] [c0000000001f1a40] sched_cfs_period_timer+0x130/0x310 [c0000000f994fc00] [c0000000002db5dc] __run_hrtimer+0x28c/0x4d0 [c0000000f994fca0] [c0000000002db8d4] __hrtimer_run_queues+0xb4/0x140 [c0000000f994fd00] [c0000000002dc834] hrtimer_interrupt+0x124/0x2f0 [c0000000f994fdb0] [c000000000029f1c] timer_interrupt+0x15c/0x460 [c0000000f994fe50] [c000000000009f8c] decrementer_common_virt+0x28c/0x290 --- interrupt: 900 at 0x13b4b0108 NIP: 000000013b4b0108 LR: 000000013b4af88c CTR: 000000013b4b00d0 REGS: c0000000f994fe80 TRAP: 0900 Not tainted (6.11.0-kp-base-10547-g684a64bf32b6) MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 88822804 XER: 20040000 CFAR: 0000000000000000 IRQMASK: 0 GPR00: 000000013b4af88c 00007ffff4260530 000000013c9f7a00 00007fff93a70738 GPR04: 00007fff94866708 0000000000000000 0000000000000000 00007fff93a70738 GPR08: 0000000000000008 0000000000000008 0000000000000000 0000000088822804 GPR12: 000000013b4b00d0 00007fff9645cc80 00007fff94c44ad8 0000000000000000 GPR16: 0000000040000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 0000000000000000 0000000000000000 000000013c948900 000000013c50e818 GPR24: 00007fff94c44ad8 0000000000000000 0000000000000001 000000013ca33ec0 GPR28: 000000013c322130 00007fff94c23408 00007fff93a70738 00007fff94866708 NIP [000000013b4b0108] 0x13b4b0108 LR [000000013b4af88c] 0x13b4af88c --- interrupt: 900 se->sched_delayed WARNING: CPU: 1 PID: 27867 at kernel/sched/fair.c:6062 unthrottle_cfs_rq+0x644/0x660 Modules linked in: binfmt_misc bonding tls rfkill pseries_rng ibmveth nd_pmem vmx_crypto nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse CPU: 1 UID: 0 PID: 27867 Comm: cc1 Not tainted 6.11.0-kp-base-10547-g684a64bf32b6 #69 Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries NIP: c0000000001f07c4 LR: c0000000001f07c0 CTR: 0000000000000000 REGS: c0000000f994f710 TRAP: 0700 Not tainted (6.11.0-kp-base-10547-g684a64bf32b6) MSR: 8000000000021033 <SF,ME,IR,DR,RI,LE> CR: 24822202 XER: 00000003 CFAR: c000000000160b00 IRQMASK: 3 GPR00: c0000000001f07c0 c0000000f994f9b0 c000000001d57400 0000000000000011 GPR04: 00000000ffff7fff c0000000f994f7b0 c0000000f994f7a8 00000007f8cf0000 GPR08: 0000000000000027 0000000000010002 c00000005bd32000 c000000002950510 GPR12: c000000002c104b8 c0000007fffef480 0000000000000000 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 0000000000000001 0000000000000001 c000000002dd4928 c000000002cd1670 GPR24: 0000000000000000 0000000000000001 0000000000000001 c0000007fb0e6200 GPR28: 0000000000000000 c00000015326d800 c0000007fb0e6380 c0000000572f2400 NIP [c0000000001f07c4] unthrottle_cfs_rq+0x644/0x660 LR [c0000000001f07c0] unthrottle_cfs_rq+0x640/0x660 Call Trace: [c0000000f994f9b0] [c0000000001f07c0] unthrottle_cfs_rq+0x640/0x660 (unreliable) [c0000000f994fa50] [c0000000001f1314] distribute_cfs_runtime+0x454/0x8e0 [c0000000f994fb60] [c0000000001f1a40] sched_cfs_period_timer+0x130/0x310 [c0000000f994fc00] [c0000000002db5dc] __run_hrtimer+0x28c/0x4d0 [c0000000f994fca0] [c0000000002db8d4] __hrtimer_run_queues+0xb4/0x140 [c0000000f994fd00] [c0000000002dc834] hrtimer_interrupt+0x124/0x2f0 [c0000000f994fdb0] [c000000000029f1c] timer_interrupt+0x15c/0x460 [c0000000f994fe50] [c000000000009f8c] decrementer_common_virt+0x28c/0x290 --- interrupt: 900 at 0x13b4b0108 NIP: 000000013b4b0108 LR: 000000013b4af88c CTR: 000000013b4b00d0 REGS: c0000000f994fe80 TRAP: 0900 Not tainted (6.11.0-kp-base-10547-g684a64bf32b6) MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 88822804 XER: 20040000 CFAR: 0000000000000000 IRQMASK: 0 GPR00: 000000013b4af88c 00007ffff4260530 000000013c9f7a00 00007fff93a70738 GPR04: 00007fff94866708 0000000000000000 0000000000000000 00007fff93a70738 GPR08: 0000000000000008 0000000000000008 0000000000000000 0000000088822804 GPR12: 000000013b4b00d0 00007fff9645cc80 00007fff94c44ad8 0000000000000000 GPR16: 0000000040000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 0000000000000000 0000000000000000 000000013c948900 000000013c50e818 GPR24: 00007fff94c44ad8 0000000000000000 0000000000000001 000000013ca33ec0 GPR28: 000000013c322130 00007fff94c23408 00007fff93a70738 00007fff94866708 NIP [000000013b4b0108] 0x13b4b0108 LR [000000013b4af88c] 0x13b4af88c --- interrupt: 900 Code: 4bfffc40 3d220108 8929d54b 2c090000 4082fccc 3c62ff97 39200001 3d420108 38634718 992ad54b 4bf7023d 60000000 <0fe00000> 4bfffca8 eb010060 4bfffb80 irq event stamp: 42054 hardirqs last enabled at (42053): [<c0000000000318ac>] interrupt_exit_user_prepare_main+0x7c/0x290 hardirqs last disabled at (42054): [<c000000000029c70>] interrupt_enter_prepare+0x90/0x1e0 softirqs last enabled at (41856): [<c000000000170710>] handle_softirqs+0x590/0x5f0 softirqs last disabled at (41851): [<c0000000000186e0>] do_softirq_own_stack+0x40/0x60 ---[ end trace 0000000000000000 ]--- ------------[ cut here ]------------ delay && se->sched_delayed WARNING: CPU: 7 PID: 869 at kernel/sched/fair.c:5493 dequeue_entity+0x598/0x6c0 Modules linked in: binfmt_misc bonding tls rfkill pseries_rng ibmveth nd_pmem vmx_crypto nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse CPU: 7 UID: 0 PID: 869 Comm: auditd Tainted: G W 6.11.0-kp-base-10547-g684a64bf32b6 #69 Tainted: [W]=WARN Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries NIP: c0000000001edae8 LR: c0000000001edae4 CTR: 0000000000000000 REGS: c000000007d76c20 TRAP: 0700 Tainted: G W (6.11.0-kp-base-10547-g684a64bf32b6) MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 24002202 XER: 00000005 CFAR: c000000000160b00 IRQMASK: 1 GPR00: c0000000001edae4 c000000007d76ec0 c000000001d57400 000000000000001a GPR04: 00000000ffff7fff c000000007d76cc0 c000000007d76cb8 00000007f9bf0000 GPR08: 0000000000000027 0000000000000003 c00000005d18f000 c000000002950510 GPR12: c000000002c104b8 c0000007fffe7080 0000000000000000 c000000004088600 GPR16: 0000000000000000 0000000000000007 0000000000000002 0000000000000000 GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c00000000a234800 GPR24: 0000000000000000 0000000000000008 c0000007fb0e6200 0000000000000000 GPR28: 0000000000000001 0000000000000009 c0000007fb0e6380 c0000000572f2400 NIP [c0000000001edae8] dequeue_entity+0x598/0x6c0 LR [c0000000001edae4] dequeue_entity+0x594/0x6c0 Call Trace: [c000000007d76ec0] [c0000000001edae4] dequeue_entity+0x594/0x6c0 (unreliable) [c000000007d76f60] [c0000000001edd60] dequeue_entities+0x150/0x600 [c000000007d77030] [c0000000001ee54c] dequeue_task_fair+0x6c/0x310 [c000000007d770c0] [c0000000001cda54] dequeue_task+0x64/0x240 [c000000007d77140] [c0000000001e6e80] detach_tasks+0x1c0/0x4d0 [c000000007d771b0] [c0000000001f52e0] sched_balance_rq+0x250/0x910 [c000000007d77320] [c0000000001f5d50] sched_balance_newidle+0x3b0/0xa90 [c000000007d77410] [c0000000001f651c] pick_next_task_fair+0x7c/0x560 [c000000007d774a0] [c0000000001c3fa0] __pick_next_task+0x60/0x2d0 [c000000007d77500] [c0000000012ae314] __schedule+0x184/0x940 [c000000007d775d0] [c0000000012aeb3c] schedule+0x6c/0x190 [c000000007d77640] [c0000000012aeed4] schedule_preempt_disabled+0x34/0x60 [c000000007d77670] [c0000000012b4de4] rwsem_down_write_slowpath+0x234/0x820 [c000000007d77770] [c00000000024584c] down_write_nested+0x1ec/0x1f0 [c000000007d777b0] [c008000004d65688] xfs_ilock+0x1c0/0x240 [xfs] [c000000007d77830] [c008000004d60e4c] xfs_buffered_write_iomap_begin+0x184/0xd50 [xfs] [c000000007d77990] [c0000000007da1a8] iomap_iter+0x118/0x270 [c000000007d77a20] [c0000000007dd63c] iomap_file_buffered_write+0xbc/0x120 [c000000007d77b40] [c008000004d4994c] xfs_file_buffered_write+0xf4/0x3e0 [xfs] [c000000007d77c10] [c0000000006fa450] new_sync_write+0x160/0x1e0 [c000000007d77ca0] [c0000000006fe524] vfs_write+0x2f4/0x3e0 [c000000007d77d00] [c0000000006fe820] ksys_write+0x90/0x160 [c000000007d77d50] [c000000000032f98] system_call_exception+0x148/0x310 [c000000007d77e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec --- interrupt: 3000 at 0x7fff8894b344 NIP: 00007fff8894b344 LR: 00007fff8894b344 CTR: 0000000000000000 REGS: c000000007d77e80 TRAP: 3000 Tainted: G W (6.11.0-kp-base-10547-g684a64bf32b6) MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 4a004402 XER: 00000000 IRQMASK: 0 GPR00: 0000000000000004 00007ffff84de490 00007fff88a67100 0000000000000004 GPR04: 0000000122fe9090 0000000000000041 0100000000000000 00007fff88db0180 GPR08: 00007fff88da8ac8 0000000000000000 0000000000000000 0000000000000000 GPR12: 0000000000000000 00007fff88db0180 0000000000000001 000000011de63d40 GPR16: 00007ffff84e0f00 00007ffff84e0ed0 00007ffff84e0ea0 00007ffff84e0f30 GPR20: 00007ffff84e0e40 00007ffff84e0f60 0000000000000000 00007ffff84e0ff0 GPR24: 0000000000000000 00007ffff84e1118 00007ffff84de828 0000000122fe9090 GPR28: 0000000000000041 0000000122fe9090 0000000000000000 0000000000000004 NIP [00007fff8894b344] 0x7fff8894b344 LR [00007fff8894b344] 0x7fff8894b344 --- interrupt: 3000 Code: 4bff5571 39200001 38600000 993f0051 4bfffc10 3c62ff97 39200001 3d420108 38634818 992ad541 4bf72f19 60000000 <0fe00000> 60000000 4bfffe18 60000000 irq event stamp: 64870392 hardirqs last enabled at (64870391): [<c000000000031a34>] interrupt_exit_user_prepare_main+0x204/0x290 hardirqs last disabled at (64870392): [<c0000000012ae55c>] __schedule+0x3cc/0x940 softirqs last enabled at (64867724): [<c000000000170710>] handle_softirqs+0x590/0x5f0 softirqs last disabled at (64867713): [<c0000000000186e0>] do_softirq_own_stack+0x40/0x60 ---[ end trace 0000000000000000 ]--- ------------[ cut here ]------------ !se->on_rq WARNING: CPU: 0 PID: 870 at kernel/sched/fair.c:704 update_entity_lag+0xcc/0xf0 Modules linked in: binfmt_misc bonding tls rfkill pseries_rng ibmveth nd_pmem vmx_crypto nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse CPU: 0 UID: 0 PID: 870 Comm: auditd Tainted: G W 6.11.0-kp-base-10547-g684a64bf32b6 #69 Tainted: [W]=WARN Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries NIP: c0000000001eacdc LR: c0000000001eacd8 CTR: 0000000000000000 REGS: c00000005a89ebd0 TRAP: 0700 Tainted: G W (6.11.0-kp-base-10547-g684a64bf32b6) MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 24282282 XER: 00000005 CFAR: c000000000160b00 IRQMASK: 1 GPR00: c0000000001eacd8 c00000005a89ee70 c000000001d57400 000000000000000a GPR04: 00000000ffff7fff c00000005a89ec70 c00000005a89ec68 00000007f8a70000 GPR08: 0000000000000027 0000000000000003 c00000005308f000 c000000002950510 GPR12: c000000002c104b8 c000000003cd0000 0000000000000000 c000000004087800 GPR16: c00000005308f000 0000000000000000 0000000000000002 0000000000000000 GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c00000000a20a000 GPR24: 0000000000000000 0000000000000008 c0000007fb866200 0000000000000000 GPR28: 0000000000000001 0000000000000009 c0000007fb866380 c0000000572ff400 NIP [c0000000001eacdc] update_entity_lag+0xcc/0xf0 LR [c0000000001eacd8] update_entity_lag+0xc8/0xf0 Call Trace: [c00000005a89ee70] [c0000000001eacd8] update_entity_lag+0xc8/0xf0 (unreliable) [c00000005a89eee0] [c0000000001ed600] dequeue_entity+0xb0/0x6c0 [c00000005a89ef80] [c0000000001edd60] dequeue_entities+0x150/0x600 [c00000005a89f050] [c0000000001ee54c] dequeue_task_fair+0x6c/0x310 [c00000005a89f0e0] [c0000000001cda54] dequeue_task+0x64/0x240 [c00000005a89f160] [c0000000001e6e80] detach_tasks+0x1c0/0x4d0 [c00000005a89f1d0] [c0000000001f52e0] sched_balance_rq+0x250/0x910 [c00000005a89f340] [c0000000001f5d50] sched_balance_newidle+0x3b0/0xa90 [c00000005a89f430] [c0000000001f651c] pick_next_task_fair+0x7c/0x560 [c00000005a89f4c0] [c0000000001c3fa0] __pick_next_task+0x60/0x2d0 [c00000005a89f520] [c0000000012ae314] __schedule+0x184/0x940 [c00000005a89f5f0] [c0000000012aeb3c] schedule+0x6c/0x190 [c00000005a89f660] [c0000000012aecb4] io_schedule+0x54/0x80 [c00000005a89f690] [c0000000005489b4] folio_wait_bit_common+0x1e4/0x4c0 [c00000005a89f790] [c00000000055d01c] folio_wait_writeback+0x5c/0x150 [c00000005a89f810] [c00000000054811c] __filemap_fdatawait_range+0xcc/0x1c0 [c00000005a89f970] [c00000000054da90] file_write_and_wait_range+0xf0/0x110 [c00000005a89f9c0] [c008000004d475d4] xfs_file_fsync+0x9c/0x380 [xfs] [c00000005a89fa80] [c000000000774eec] vfs_fsync_range+0x6c/0xf0 [c00000005a89fac0] [c00000000077503c] do_fsync+0x5c/0xd0 [c00000005a89fb00] [c0000000007754c8] sys_fsync+0x28/0x40 [c00000005a89fb20] [c000000000032f98] system_call_exception+0x148/0x310 [c00000005a89fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec --- interrupt: 3000 at 0x7fff8894f6e4 NIP: 00007fff8894f6e4 LR: 00007fff8894f6e4 CTR: 0000000000000000 REGS: c00000005a89fe80 TRAP: 3000 Tainted: G W (6.11.0-kp-base-10547-g684a64bf32b6) MSR: 800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE> CR: 42004482 XER: 00000000 IRQMASK: 0 GPR00: 0000000000000076 00007fff87c0e400 00007fff88a67100 0000000000000004 GPR04: 0000000000000081 0000000000000001 0000000000000000 00007fff87c16780 GPR08: 00007fff87c0f0c8 0000000000000000 0000000000000000 0000000000000000 GPR12: 0000000000000000 00007fff87c16780 0000000000000000 0000000000000000 GPR16: 00007fff87c0f080 00007fff87400000 00007fff87c0efc0 0000000000000000 GPR20: 0000000000000000 00007fff88da0000 000000000080e880 00007fff87400000 GPR24: 0000000000000001 0000000000000000 0000000000000002 00007fff87c0e798 GPR28: 00007fff87c0f3dc 000000011de80d58 0000000000000000 0000000000000004 NIP [00007fff8894f6e4] 0x7fff8894f6e4 LR [00007fff8894f6e4] 0x7fff8894f6e4 --- interrupt: 3000 Code: 4e800020 3d220108 8929d53d 2c090000 4082ff8c 3c62ff97 39200001 3d420108 38634708 992ad53d 4bf75d25 60000000 <0fe00000> 4bffff68 60000000 60000000 irq event stamp: 11088690 hardirqs last enabled at (11088689): [<c000000000973e74>] blk_account_io_start+0x204/0x260 hardirqs last disabled at (11088690): [<c0000000012bca54>] _raw_spin_lock_irq+0xc4/0x120 softirqs last enabled at (11087028): [<c000000000170710>] handle_softirqs+0x590/0x5f0 softirqs last disabled at (11087021): [<c0000000000186e0>] do_softirq_own_stack+0x40/0x60 ---[ end trace 0000000000000000 ]--- Thanks
On Tue, 2024-10-01 at 00:45 +0530, Vishal Chourasia wrote: > > > for sanity, I ran the workload (kernel compilation) on the base commit > where the kernel panic was initially observed, which resulted in a > kernel panic, along with it couple of warnings where also printed on the > console, and a circular locking dependency warning with it. > > Kernel 6.11.0-kp-base-10547-g684a64bf32b6 on an ppc64le > > ------------[ cut here ]------------ > > ====================================================== > WARNING: possible circular locking dependency detected > 6.11.0-kp-base-10547-g684a64bf32b6 #69 Not tainted > ------------------------------------------------------ ... > --- interrupt: 900 > se->sched_delayed > WARNING: CPU: 1 PID: 27867 at kernel/sched/fair.c:6062 unthrottle_cfs_rq+0x644/0x660 ...that warning also spells eventual doom for the box, here it does anyway, running LTPs cfs_bandwidth01 testcase and hackbench together, box grinds to a halt in pretty short order. With the patchlet below (submitted), I can beat on box to my hearts content without meeting throttle/unthrottle woes. sched: Fix sched_delayed vs cfs_bandwidth Meeting an unfinished DELAY_DEQUEUE treated entity in unthrottle_cfs_rq() leads to a couple terminal scenarios. Finish it first, so ENQUEUE_WAKEUP can proceed as it would have sans DELAY_DEQUEUE treatment. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Signed-off-by: Mike Galbraith <efault@gmx.de> --- kernel/sched/fair.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6058,10 +6058,13 @@ void unthrottle_cfs_rq(struct cfs_rq *cf for_each_sched_entity(se) { struct cfs_rq *qcfs_rq = cfs_rq_of(se); - if (se->on_rq) { - SCHED_WARN_ON(se->sched_delayed); + /* Handle any unfinished DELAY_DEQUEUE business first. */ + if (se->sched_delayed) { + int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED; + + dequeue_entity(qcfs_rq, se, flags); + } else if (se->on_rq) break; - } enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP); if (cfs_rq_is_idle(group_cfs_rq(se)))
Mike Galbraith <efault@gmx.de> writes: > On Tue, 2024-10-01 at 00:45 +0530, Vishal Chourasia wrote: >> > >> for sanity, I ran the workload (kernel compilation) on the base commit >> where the kernel panic was initially observed, which resulted in a >> kernel panic, along with it couple of warnings where also printed on the >> console, and a circular locking dependency warning with it. >> >> Kernel 6.11.0-kp-base-10547-g684a64bf32b6 on an ppc64le >> >> ------------[ cut here ]------------ >> >> ====================================================== >> WARNING: possible circular locking dependency detected >> 6.11.0-kp-base-10547-g684a64bf32b6 #69 Not tainted >> ------------------------------------------------------ > > ... > >> --- interrupt: 900 >> se->sched_delayed >> WARNING: CPU: 1 PID: 27867 at kernel/sched/fair.c:6062 unthrottle_cfs_rq+0x644/0x660 > > ...that warning also spells eventual doom for the box, here it does > anyway, running LTPs cfs_bandwidth01 testcase and hackbench together, > box grinds to a halt in pretty short order. > > With the patchlet below (submitted), I can beat on box to my hearts > content without meeting throttle/unthrottle woes. > > sched: Fix sched_delayed vs cfs_bandwidth > > Meeting an unfinished DELAY_DEQUEUE treated entity in unthrottle_cfs_rq() > leads to a couple terminal scenarios. Finish it first, so ENQUEUE_WAKEUP > can proceed as it would have sans DELAY_DEQUEUE treatment. > > Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") > Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> > Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> > Signed-off-by: Mike Galbraith <efault@gmx.de> > --- > kernel/sched/fair.c | 9 ++++++--- > 1 file changed, 6 insertions(+), 3 deletions(-) > > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6058,10 +6058,13 @@ void unthrottle_cfs_rq(struct cfs_rq *cf > for_each_sched_entity(se) { > struct cfs_rq *qcfs_rq = cfs_rq_of(se); > > - if (se->on_rq) { > - SCHED_WARN_ON(se->sched_delayed); > + /* Handle any unfinished DELAY_DEQUEUE business first. */ > + if (se->sched_delayed) { > + int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED; > + > + dequeue_entity(qcfs_rq, se, flags); > + } else if (se->on_rq) > break; > - } > enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP); > > if (cfs_rq_is_idle(group_cfs_rq(se))) Yeah, if we can wind up here to hit that warning, then we need to get it out of delay state, not just leave it. Whether dequeue_entity + enqueue_entity is better or worse than requeue_delayed_entity (+ break), I really don't know. This did prompt me to try and figure out if we could limit delay to the EEVDF parts. I think ideally, we'd only have delayed tasks accounted for in min_vruntime/avg_vruntime/avg_load, but trying to keep them out of nr_running/load.weight seems like it runs into too many issues, while having them still counted there means skipping dequeue_entity like the current version, which means it wouldn't have saved us this. I think we could still probably remove the current core.c touchpoints in favor of just having migrate_task_rq_fair take a parameter for "do we already hold rq_lock rather than just p->pi_lock" (and it and switched_from/changed_group handling things)? We'd probably need to do something very careful to avoid migrate_task_rq_fair unconditionally needing rq_lock though, which would be a pain, so it might not be worth it even if I'm not forgetting some other reason for pushing the delay knowledge all the way out to p->on_rq.
On Wed, 2024-10-02 at 15:31 -0700, Benjamin Segall wrote: > Mike Galbraith <efault@gmx.de> writes: > > > > --- > > kernel/sched/fair.c | 9 ++++++--- > > 1 file changed, 6 insertions(+), 3 deletions(-) > > > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -6058,10 +6058,13 @@ void unthrottle_cfs_rq(struct cfs_rq *cf > > for_each_sched_entity(se) { > > struct cfs_rq *qcfs_rq = cfs_rq_of(se); > > > > - if (se->on_rq) { > > - SCHED_WARN_ON(se->sched_delayed); > > + /* Handle any unfinished DELAY_DEQUEUE business first. */ > > + if (se->sched_delayed) { > > + int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED; > > + > > + dequeue_entity(qcfs_rq, se, flags); > > + } else if (se->on_rq) > > break; > > - } > > enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP); > > > > if (cfs_rq_is_idle(group_cfs_rq(se))) > > Yeah, if we can wind up here to hit that warning, then we need to get it > out of delay state, not just leave it. Whether dequeue_entity + > enqueue_entity is better or worse than requeue_delayed_entity (+ break), I really > don't know. Hm, I'd say requeue_delayed_entity() not only fits better, it using less lines gives it an extra brownie point. -Mike
On Thu, 2024-10-03 at 06:41 +0200, Mike Galbraith wrote: > On Wed, 2024-10-02 at 15:31 -0700, Benjamin Segall wrote: > > Whether dequeue_entity + enqueue_entity is better or worse than > > requeue_delayed_entity (+break), I really don't know. > > Hm, I'd say requeue_delayed_entity() not only fits better, it using > less lines gives it an extra brownie point. Probable not worth any churn or effort, but it is an option. sched: Clean up sched_delayed handling in unthrottle_cfs_rq() requeue_delayed_entity() achieves and documents in one line what a less clear preparatory dequeue facilitates over several, so use it instead, and remove the superfluous comment. Signed-off-by: Mike Galbraith <efault@gmx.de> --- kernel/sched/fair.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6058,12 +6058,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cf for_each_sched_entity(se) { struct cfs_rq *qcfs_rq = cfs_rq_of(se); - /* Handle any unfinished DELAY_DEQUEUE business first. */ - if (se->sched_delayed) { - int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED; - - dequeue_entity(qcfs_rq, se, flags); - } else if (se->on_rq) + if (se->sched_delayed) + requeue_delayed_entity(se); + if (se->on_rq) break; enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
On Thu, Oct 03, 2024 at 11:31:06AM +0200, Mike Galbraith wrote: > On Thu, 2024-10-03 at 06:41 +0200, Mike Galbraith wrote: > > On Wed, 2024-10-02 at 15:31 -0700, Benjamin Segall wrote: > > > > Whether dequeue_entity + enqueue_entity is better or worse than > > > requeue_delayed_entity (+break), I really don't know. > > > > Hm, I'd say requeue_delayed_entity() not only fits better, it using > > less lines gives it an extra brownie point. > > Probable not worth any churn or effort, but it is an option. > > sched: Clean up sched_delayed handling in unthrottle_cfs_rq() > > requeue_delayed_entity() achieves and documents in one line what a less > clear preparatory dequeue facilitates over several, so use it instead, > and remove the superfluous comment. > > Signed-off-by: Mike Galbraith <efault@gmx.de> Hi Mike, thanks for the patch! I didn't see any warnings or kernel panics in my setup. Tested-by: Vishal Chourasia <vishalc@linux.ibm.com> > --- > kernel/sched/fair.c | 9 +++------ > 1 file changed, 3 insertions(+), 6 deletions(-) > > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6058,12 +6058,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cf > for_each_sched_entity(se) { > struct cfs_rq *qcfs_rq = cfs_rq_of(se); > > - /* Handle any unfinished DELAY_DEQUEUE business first. */ > - if (se->sched_delayed) { > - int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED; > - > - dequeue_entity(qcfs_rq, se, flags); > - } else if (se->on_rq) > + if (se->sched_delayed) > + requeue_delayed_entity(se); > + if (se->on_rq) > break; > enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP); > >
On Tue, Oct 01, 2024 at 10:30:26AM +0200, Mike Galbraith wrote: > On Tue, 2024-10-01 at 00:45 +0530, Vishal Chourasia wrote: > > > > > for sanity, I ran the workload (kernel compilation) on the base commit > > where the kernel panic was initially observed, which resulted in a > > kernel panic, along with it couple of warnings where also printed on the > > console, and a circular locking dependency warning with it. > > > > Kernel 6.11.0-kp-base-10547-g684a64bf32b6 on an ppc64le > > > > ------------[ cut here ]------------ > > > > ====================================================== > > WARNING: possible circular locking dependency detected > > 6.11.0-kp-base-10547-g684a64bf32b6 #69 Not tainted > > ------------------------------------------------------ > > ... > > > --- interrupt: 900 > > se->sched_delayed > > WARNING: CPU: 1 PID: 27867 at kernel/sched/fair.c:6062 unthrottle_cfs_rq+0x644/0x660 > > ...that warning also spells eventual doom for the box, here it does > anyway, running LTPs cfs_bandwidth01 testcase and hackbench together, > box grinds to a halt in pretty short order. > Right, I've picked up your patch for sched/urgent. But this does make me question Vishal's setup. He said all he does is compile a kernel, but afaik no regular setup uses CFS bandwidth by default. So something is 'special' at his end that he's not been telling us about. Vishal, could you expand upon your configuration? How come you're using CFS bandwidth, what else is special?
On Wed, Oct 02, 2024 at 10:49:32AM +0200, Peter Zijlstra wrote: > On Tue, Oct 01, 2024 at 10:30:26AM +0200, Mike Galbraith wrote: > > On Tue, 2024-10-01 at 00:45 +0530, Vishal Chourasia wrote: > > > > > > > for sanity, I ran the workload (kernel compilation) on the base commit > > > where the kernel panic was initially observed, which resulted in a > > > kernel panic, along with it couple of warnings where also printed on the > > > console, and a circular locking dependency warning with it. > > > > > > Kernel 6.11.0-kp-base-10547-g684a64bf32b6 on an ppc64le > > > > > > ------------[ cut here ]------------ > > > > > > ====================================================== > > > WARNING: possible circular locking dependency detected > > > 6.11.0-kp-base-10547-g684a64bf32b6 #69 Not tainted > > > ------------------------------------------------------ > > > > ... > > > > > --- interrupt: 900 > > > se->sched_delayed > > > WARNING: CPU: 1 PID: 27867 at kernel/sched/fair.c:6062 unthrottle_cfs_rq+0x644/0x660 > > > > ...that warning also spells eventual doom for the box, here it does > > anyway, running LTPs cfs_bandwidth01 testcase and hackbench together, > > box grinds to a halt in pretty short order. > > > > Right, I've picked up your patch for sched/urgent. But this does make me > question Vishal's setup. > > He said all he does is compile a kernel, but afaik no regular setup uses > CFS bandwidth by default. So something is 'special' at his end that he's > not been telling us about. Yes Peter, I'm compiling the kernel from source. While I'm not running the compilation within a cgroup that has bandwidth limits set, there are some system services running in the background that do have bandwidth limitations applied. # find . -name cpu.max -exec cat {} + max 100000 max 100000 max 100000 max 100000 max 100000 max 100000 5000 100000 34000 100000 10000 100000 31000 100000 max 100000 max 100000 max 100000 max 100000 max 100000 max 100000 > > Vishal, could you expand upon your configuration? How come you're using > CFS bandwidth, what else is special? config cfs_bandwidth is enabled by default in both the pseries_le_defconfig and the distro kernel config I'm using for the compilation. Let me know if you need any more info. I hope I have answered your queries. Thanks!
On Tue, Oct 01, 2024 at 10:30:26AM +0200, Mike Galbraith wrote: > On Tue, 2024-10-01 at 00:45 +0530, Vishal Chourasia wrote: > > > > > for sanity, I ran the workload (kernel compilation) on the base commit > > where the kernel panic was initially observed, which resulted in a > > kernel panic, along with it couple of warnings where also printed on the > > console, and a circular locking dependency warning with it. > > > > Kernel 6.11.0-kp-base-10547-g684a64bf32b6 on an ppc64le > > > > ------------[ cut here ]------------ > > > > ====================================================== > > WARNING: possible circular locking dependency detected > > 6.11.0-kp-base-10547-g684a64bf32b6 #69 Not tainted > > ------------------------------------------------------ > > ... > > > --- interrupt: 900 > > se->sched_delayed > > WARNING: CPU: 1 PID: 27867 at kernel/sched/fair.c:6062 unthrottle_cfs_rq+0x644/0x660 > > ...that warning also spells eventual doom for the box, here it does > anyway, running LTPs cfs_bandwidth01 testcase and hackbench together, > box grinds to a halt in pretty short order. > > With the patchlet below (submitted), I can beat on box to my hearts > content without meeting throttle/unthrottle woes. > > sched: Fix sched_delayed vs cfs_bandwidth > > Meeting an unfinished DELAY_DEQUEUE treated entity in unthrottle_cfs_rq() > leads to a couple terminal scenarios. Finish it first, so ENQUEUE_WAKEUP > can proceed as it would have sans DELAY_DEQUEUE treatment. > > Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") > Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> > Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> > Signed-off-by: Mike Galbraith <efault@gmx.de> Hello Mike, Thank you the patch! With the below changes, I don't see any warnings been printed on the console, along with it, there were no kernel panics Tested-by: Vishal Chourasia <vishalc@linux.ibm.com> > --- > kernel/sched/fair.c | 9 ++++++--- > 1 file changed, 6 insertions(+), 3 deletions(-) > > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6058,10 +6058,13 @@ void unthrottle_cfs_rq(struct cfs_rq *cf > for_each_sched_entity(se) { > struct cfs_rq *qcfs_rq = cfs_rq_of(se); > > - if (se->on_rq) { > - SCHED_WARN_ON(se->sched_delayed); > + /* Handle any unfinished DELAY_DEQUEUE business first. */ > + if (se->sched_delayed) { > + int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED; > + > + dequeue_entity(qcfs_rq, se, flags); > + } else if (se->on_rq) > break; > - } > enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP); > > if (cfs_rq_is_idle(group_cfs_rq(se))) >
On Tue, 2024-10-01 at 19:38 +0530, Vishal Chourasia wrote: > > With the below changes, I don't see any warnings been printed on the > console, along with it, there were no kernel panics Darn. I was hoping to quash the easily reproducible noisy avalanche of doom to let elusive $subject through, that being way more interesting. When I hit $subject, LTPs cfs_bandwidth01 was running, but there was no warning prelude, box went straight to panic. Trying to reproduce using that testcase plus hackbench as efficacy booster produced lots of dying box noise, but zero sneaky $subject instances before or after quash. -Mike
On Tue, 2024-10-01 at 18:41 +0200, Mike Galbraith wrote: > > When I hit $subject, LTPs cfs_bandwidth01 was running, but there was no > warning prelude, box went straight to panic. Trying to reproduce using > that testcase plus hackbench as efficacy booster produced lots of dying > box noise, but zero sneaky $subject instances before or after quash. Hohum, this morning I did hit.. 1. WARNING: CPU: 5 PID: 931 at kernel/sched/fair.c:6062 unthrottle_cfs_rq+0x4c3/0x4d0 2. WARNING: CPU: 0 PID: 786 at kernel/sched/fair.c:704 update_entity_lag+0x79/0x90 3. NULL dereference in pick_next_entity() ..instead of brick, workqueue stall etc. Twice. Not that it matters. I was only mucking about with it because I was curious whether telling LB to stop moving sched_delayed tasks about would matter. (nope) -Mike
On 9/26/24 13:42, Vishal Chourasia wrote: > I've noticed a kernel panic consistently occurring on the mainline v6.11 > kernel (see attached dmesg log below). > > The panic occurs almost every time I build the Linux kernel from source. > > Steps to Reproduce: > > make clean > ./scripts/config -e LOCALVERSION_AUTO > ./scripts/config --set-str LOCALVERSION -master-with-print > make localmodconfig > make -j8 -s vmlinux modules > >>From my investigation, it seems that the function pick_eevdf() can return NULL. > Commit f12e1488 ("sched/fair: Prepare pick_next_task() for delayed dequeue") > introduces an access on the return value of pick_eevdf(). If 'se' was NULL, > it can lead to a null pointer dereference. > > # objdump -S vmlinux | grep -C 5 c0000000001cfebc > c0000000001cfeb0: 00 00 00 60 nop > struct sched_entity *se = pick_eevdf(cfs_rq); > c0000000001cfeb4: 78 f3 c3 7f mr r3,r30 > c0000000001cfeb8: 01 46 ff 4b bl c0000000001c44b8 <pick_eevdf+0x8> > if (se->sched_delayed) { > c0000000001cfebc: 51 00 23 89 lbz r9,81(r3) <<<<<< > struct sched_entity *se = pick_eevdf(cfs_rq); > c0000000001cfec0: 78 1b 7f 7c mr r31,r3 > if (se->sched_delayed) { > c0000000001cfec4: 00 00 09 2c cmpwi r9,0 > c0000000001cfec8: 98 00 82 40 bne c0000000001cff60 <pick_next_entity+0xe0> > > r3 is NULL which can be verified from the register context shared in the > dmesg logs > > Here is the state of my git repository: > # git log --oneline > 684a64bf32b6 (HEAD -> master, origin/master, origin/HEAD) Merge tag 'nfs-for-6.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs > f7fccaa77271 Merge tag 'fuse-update-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse > 4165cee7ecb1 Merge tag 'exfat-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat > 79952bdcbcea Merge tag 'f2fs-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs > fa8380a06bd0 Merge tag 'bpf-next-6.12-struct-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next > 68e5c7d4cefb Merge tag 'kbuild-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild > 7f8de2bf0725 Merge tag 'linux-cpupower-6.12-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux > cd3d64772981 Merge tag 'i3c/for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux > > > To determine why pick_eevdf() would return NULL, I added a few printk statements > Based on one of the printk logs in the shared dmesg log, it appears that if > pick_eevdf() is called for a 'cfs_rq' whose 'cfs_rq->curr' is NULL and there > are no eligible entities on that 'cfs_rq', it will return NULL. > > # git diff > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 225b31aaee55..8c5b96f1cd49 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -48,6 +48,7 @@ > #include <linux/ratelimit.h> > #include <linux/task_work.h> > #include <linux/rbtree_augmented.h> > +#include <linux/delay.h> > > #include <asm/switch_to.h> > > @@ -907,16 +908,25 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq) > static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq) > { > struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node; > + struct rb_node *tmpnode = node; > struct sched_entity *se = __pick_first_entity(cfs_rq); > + struct sched_entity *tmpse = se; > struct sched_entity *curr = cfs_rq->curr; > + struct sched_entity *tmpcurr = curr; > struct sched_entity *best = NULL; > - > + struct sched_entity *tmp = NULL; > /* > * We can safely skip eligibility check if there is only one entity > * in this cfs_rq, saving some cycles. > */ > - if (cfs_rq->nr_running == 1) > - return curr && curr->on_rq ? curr : se; > + if (cfs_rq->nr_running == 1) { > + tmp = curr && curr->on_rq ? curr : se; > + if (!tmp) { > + printk(KERN_INFO "pick_eevdf curr: %p, se: %p\n", curr, se); > + mdelay(10); > + } > + return tmp; > + } > > if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr))) > curr = NULL; > @@ -966,6 +976,11 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq) > if (!best || (curr && entity_before(curr, best))) > best = curr; > > + if (!best) { > + printk(KERN_INFO "best=%p, curr=%p, se=%p, node=%p ocrr=%p, ose=%p, onode=%p\n", best, curr, se, node, tmpcurr, tmpse, tmpnode); > + mdelay(10); > + } > + > return best; > } > >>From the logs below: > [ 1355.763494] best=0000000000000000, curr=0000000000000000, se=00000000be02c573, node=0000000000000000 ocrr=0000000000000000, ose=00000000b1d4c4d5, onode=0000000023eb8c00 > > I have not been able to think of a quick reproducer to trigger a panic > https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/for this case. Hoping if someone can guide me on this. > > Note: The following dmesg log also contains a warning reported too. Panic > happens later. > > ------------[ cut here ]------------ > !se->on_rq > WARNING: CPU: 1 PID: 92333 at kernel/sched/fair.c:705 update_entity_lag+0xcc/0xf0 > Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse > CPU: 1 UID: 0 PID: 92333 Comm: genksyms Tainted: G W 6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64 > Tainted: [W]=WARN > Hardware name: IBM,9080-HEX POWER10 (architected) hv:phyp pSeries > NIP: c0000000001cdfcc LR: c0000000001cdfc8 CTR: 0000000000000000 > REGS: c00000005c62ee50 TRAP: 0700 Tainted: G W (6.11.0-master-with-print-10547-g684a64bf32b6-dirty) > MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 24002222 XER: 00000005 > CFAR: c000000000156a10 IRQMASK: 1 > GPR00: c0000000001cdfc8 c00000005c62f0f0 c000000001b57400 000000000000000a > GPR04: 00000000ffff7fff c00000005c62eee0 c00000005c62eed8 00000007fb050000 > GPR08: 0000000000000027 0000000000000000 0000000000000000 c000000002758de0 > GPR12: c000000002a18d88 c0000007fffef480 0000000000000000 0000000000000000 > GPR16: c000000002c56d40 0000000000000000 c00000005c62f5b4 0000000000000000 > GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c000000003cd7300 > GPR24: 0000000000000000 0000000000000008 c0000007fd1d3f80 0000000000000000 > GPR28: 0000000000000001 0000000000000009 c0000007fd1d4080 c0000000656a0000 > NIP [c0000000001cdfcc] update_entity_lag+0xcc/0xf0 > LR [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 > Call Trace: > [c00000005c62f0f0] [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 (unreliable) > [c00000005c62f160] [c0000000001cea80] dequeue_entity+0xb0/0x6d0 > [c00000005c62f1f0] [c0000000001cf8b0] dequeue_entities+0x150/0x600 > [c00000005c62f2c0] [c0000000001d02a8] dequeue_task_fair+0x158/0x2e0 > [c00000005c62f300] [c0000000001b5ea4] dequeue_task+0x64/0x200 > [c00000005c62f380] [c0000000001cc950] detach_tasks+0x140/0x420 > [c00000005c62f3f0] [c0000000001d6044] sched_balance_rq+0x214/0x7c0 > [c00000005c62f550] [c0000000001d6830] sched_balance_newidle+0x240/0x630 > [c00000005c62f640] [c0000000001d6d0c] pick_next_task_fair+0x7c/0x4a0 > [c00000005c62f6d0] [c0000000001afc50] __pick_next_task+0x60/0x2d0 > [c00000005c62f730] [c0000000010e8ce8] __schedule+0x198/0x840 > [c00000005c62f810] [c0000000010e93d0] schedule+0x40/0x110 > [c00000005c62f880] [c00000000064c574] pipe_read+0x424/0x6a0 > [c00000005c62f960] [c00000000063a0fc] vfs_read+0x30c/0x3d0 > [c00000005c62fa10] [c00000000063adf4] ksys_read+0x104/0x160 > [c00000005c62fa60] [c000000000031678] system_call_exception+0x138/0x2d0 > [c00000005c62fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec > --- interrupt: 3000 at 0x7fffb8f4a0c4 > NIP: 00007fffb8f4a0c4 LR: 00007fffb8f4a0c4 CTR: 0000000000000000 > REGS: c00000005c62fe80 TRAP: 3000 Tainted: G W (6.11.0-master-with-print-10547-g684a64bf32b6-dirty) > MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 48004222 XER: 00000000 > IRQMASK: 0 > GPR00: 0000000000000003 00007fffe27ca2d0 00007fffb9067100 0000000000000000 > GPR04: 000000003076051b 0000000000002000 00007fffb9060b50 0000000000000000 > GPR08: 000000000000006f 0000000000000000 0000000000000000 0000000000000000 > GPR12: 0000000000000000 00007fffb917a560 000000000000000b 0000000000000000 > GPR16: 00007fffb9173570 0000000000002000 000000000000000c 000000000000000b > GPR20: 00000000307604c0 0000000030762509 00000000100484e8 0000000030762515 > GPR24: 0000000010022b80 0000000000000000 0000000000000005 0000000000002000 > GPR28: 000000003076051b 0000000000002000 00007fffb905e508 00007fffb9060b50 > NIP [00007fffb8f4a0c4] 0x7fffb8f4a0c4 > LR [00007fffb8f4a0c4] 0x7fffb8f4a0c4 > --- interrupt: 3000 > Code: 4e800020 3d220104 89297c19 2c090000 4082ff8c 3c62ff99 39200001 3d420104 38632d90 992a7c19 4bf88965 60000000 <0fe00000> 4bffff68 60000000 60000000 > ---[ end trace 0000000000000000 ]--- > best=0000000000000000, curr=0000000000000000, se=00000000be02c573, node=0000000000000000 ocrr=0000000000000000, ose=00000000b1d4c4d5, onode=0000000023eb8c00 > Kernel attempted to read user page (51) - exploit attempt? (uid: 0) > BUG: Kernel NULL pointer dereference on read at 0x00000051 > Faulting instruction address: 0xc0000000001cfebc > Oops: Kernel access of bad area, sig: 11 [#1] > LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries > Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse > CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G W 6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64 > Tainted: [W]=WARN > Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries > NIP: c0000000001cfebc LR: c0000000001cfebc CTR: 0000000000000000 > REGS: c000000002c13950 TRAP: 0300 Tainted: G W (6.11.0-master-with-print-10547-g684a64bf32b6-dirty) > MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 44022282 XER: 0000000d > CFAR: c0000000001c4758 DAR: 0000000000000051 DSISR: 40000000 IRQMASK: 1 > GPR00: c0000000001cfebc c000000002c13bf0 c000000001b57400 0000000000000000 > GPR04: c0000007fd147108 c0000007fd1cd600 c000000002c13968 00000007fafb0000 > GPR08: 0000000000000027 00000000004e2002 0000896cdac1f9a0 0000000000002000 > GPR12: c000000002a18d88 c000000002f10000 0000000000000000 00000007fffe0000 > GPR16: 00000007fffd0000 0000000000000000 00000007fffe0114 0000000000000000 > GPR20: 0000000000000000 0000000000000000 c0000000010e95ec c000000002bd6380 > GPR24: c000000002bd6da8 c000000002c524e8 0000000000000000 c000000002bd6380 > GPR28: c000000002bd6380 c0000007fd1d3f80 c0000007fd1d4080 c0000007fd1d3f80 > NIP [c0000000001cfebc] pick_next_entity+0x3c/0x180 > LR [c0000000001cfebc] pick_next_entity+0x3c/0x180 > Call Trace: > [c000000002c13bf0] [c0000000001cfebc] pick_next_entity+0x3c/0x180 (unreliable) > [c000000002c13c70] [c0000000001d0064] pick_task_fair+0x64/0x130 > [c000000002c13cb0] [c0000000001d6cd8] pick_next_task_fair+0x48/0x4a0 > [c000000002c13d40] [c0000000001afc50] __pick_next_task+0x60/0x2d0 > [c000000002c13da0] [c0000000010e8ce8] __schedule+0x198/0x840 > [c000000002c13e80] [c0000000010e95ec] schedule_idle+0x3c/0x70 > [c000000002c13eb0] [c0000000001eb1d0] do_idle+0x160/0x1b0 > [c000000002c13f00] [c0000000001eb4d0] cpu_startup_entry+0x50/0x60 > [c000000002c13f30] [c0000000000110e8] rest_init+0xf0/0xf4 > [c000000002c13f60] [c0000000020053a4] do_initcalls+0x0/0x190 > [c000000002c13fe0] [c00000000000e788] start_here_common+0x1c/0x20 > Code: 60000000 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7c7d1b78 7c9e2378 f8010010 f821ff81 60000000 7fc3f378 4bff4601 <89230051> 7c7f1b78 2c090000 40820098 > ---[ end trace 0000000000000000 ]--- > pstore: backend (nvram) writing error (-1) > > Kernel panic - not syncing: Fatal exception > Rebooting in 10 seconds.. > > I have a feeling this might be what is described in this thread: https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/ Likely a result of an overflow for vruntime. And we come out of pick_eevdf without a valid entity.
© 2016 - 2024 Red Hat, Inc.