kernel/sched/deadline.c | 1 - 1 file changed, 1 deletion(-)
When we enable the schedstat, we will meet an OS hang like this:
--------------------------------------------------------
[ 134.104253] kernel BUG at kernel/sched/deadline.c:63!
[ 134.132013] ------------[ cut here ]------------
[ 134.133441] x27: 0000000000000001
[ 134.138048] kernel BUG at kernel/sched/deadline.c:63!
[ 134.146478] x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000001
[ 134.153607] x23: 0000000000000001 x22: 0000000000000000 x21: 0000000000000001
[ 134.160734] x20: ffff007dbf1b6d00 x19: ffff007dbf1b7610 x18: 0000000000000014
[ 134.162027] ------------[ cut here ]------------
[ 134.167861] x17: 000000009deab6cd x16: 00000000527c9a1c x15: 00000000000000dc
[ 134.172473] kernel BUG at kernel/sched/deadline.c:63!
[ 134.179595] x14: 0000000001200011 x13: 0000000040001000 x12: 0000ffffb6df05bc
[ 134.191760] x11: ffff007dbf1b6d00 x10: ffff0001062dd2e8 x9 : ffff8000801215ac
[ 134.192036] ------------[ cut here ]------------
[ 134.198888] x8 : 0000000000000000 x7 : 0000000000000021 x6 : ffff0001764ed280
[ 134.203498] kernel BUG at kernel/sched/deadline.c:63!
[ 134.210622] x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffff807d3dd24000
[ 134.222787] x2 : 000000028b77a140 x1 : 0000003400000000 x0 : ffff007dbf1b6c80
[ 134.229915] Call trace:
[ 134.232353] dl_task_of.part.0+0x0/0x10
[ 134.236182] dl_server_start+0x54/0x158
[ 134.240013] enqueue_task_fair+0x138/0x420
[ 134.244100] enqueue_task+0x44/0xb0
[ 134.247584] wake_up_new_task+0x1c0/0x3a0
[ 134.251584] kernel_clone+0xe8/0x3e8
[ 134.252022] ------------[ cut here ]------------
[ 134.255156] __do_sys_clone+0x70/0xa8
[ 134.259764] kernel BUG at kernel/sched/deadline.c:63!
[ 134.263412] __arm64_sys_clone+0x28/0x40
[ 134.272360] invoke_syscall+0x50/0x120
[ 134.276101] el0_svc_common+0x44/0xf8
[ 134.279753] do_el0_svc+0x28/0x40
[ 134.283058] el0_svc+0x40/0x150
[ 134.286195] el0t_64_sync_handler+0x100/0x130
[ 134.290546] el0t_64_sync+0x1a4/0x1a8
[ 134.294200] Code: 35ffffa2 17ffffe3 d4210000 17ffffb4 (d4210000)
[ 134.300283] ---[ end trace 0000000000000000 ]---
[ 134.304890] Kernel panic - not syncing: Oops - BUG: Fatal exception
[ 134.311147] SMP: stopping secondary CPUs
[ 135.365096] SMP: failed to stop secondary CPUs 8-9,16,30,43,86,88,121,149
[ 135.371884] Kernel Offset: disabled
[ 135.375361] CPU features: 0x00,00100003,80153d29,d75ffea7
[ 135.380749] Memory Limit: none
[ 135.383793] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]
--------------------------------------------------------
In dl_server_start(), we set the dl_se->dl_server with 1. When schedstats
is enabled, in the following:
dl_server_start() --> enqueue_dl_entity() --> update_stats_enqueue_dl()
__schedstats_from_dl_se() -->dl_task_of()
we will meet the BUG_ON.
So just remove the BUG_ON in dl_task_of to fix the hang.
Tested this patch with memcached in Altra 2P.
Fixes: 5f6bd380c7bd ("sched/rt: Remove default bandwidth control")
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
kernel/sched/deadline.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0f2df67f710b..76cce7adcb7d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -60,7 +60,6 @@ static bool dl_server(struct sched_dl_entity *dl_se)
static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
{
- BUG_ON(dl_server(dl_se));
return container_of(dl_se, struct task_struct, dl);
}
--
2.40.1
Hi,
On 26/08/24 10:11, Huang Shijie wrote:
> When we enable the schedstat, we will meet an OS hang like this:
> --------------------------------------------------------
> [ 134.104253] kernel BUG at kernel/sched/deadline.c:63!
> [ 134.132013] ------------[ cut here ]------------
> [ 134.133441] x27: 0000000000000001
> [ 134.138048] kernel BUG at kernel/sched/deadline.c:63!
> [ 134.146478] x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000001
> [ 134.153607] x23: 0000000000000001 x22: 0000000000000000 x21: 0000000000000001
> [ 134.160734] x20: ffff007dbf1b6d00 x19: ffff007dbf1b7610 x18: 0000000000000014
> [ 134.162027] ------------[ cut here ]------------
> [ 134.167861] x17: 000000009deab6cd x16: 00000000527c9a1c x15: 00000000000000dc
> [ 134.172473] kernel BUG at kernel/sched/deadline.c:63!
> [ 134.179595] x14: 0000000001200011 x13: 0000000040001000 x12: 0000ffffb6df05bc
> [ 134.191760] x11: ffff007dbf1b6d00 x10: ffff0001062dd2e8 x9 : ffff8000801215ac
> [ 134.192036] ------------[ cut here ]------------
> [ 134.198888] x8 : 0000000000000000 x7 : 0000000000000021 x6 : ffff0001764ed280
> [ 134.203498] kernel BUG at kernel/sched/deadline.c:63!
> [ 134.210622] x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffff807d3dd24000
> [ 134.222787] x2 : 000000028b77a140 x1 : 0000003400000000 x0 : ffff007dbf1b6c80
> [ 134.229915] Call trace:
> [ 134.232353] dl_task_of.part.0+0x0/0x10
> [ 134.236182] dl_server_start+0x54/0x158
> [ 134.240013] enqueue_task_fair+0x138/0x420
> [ 134.244100] enqueue_task+0x44/0xb0
> [ 134.247584] wake_up_new_task+0x1c0/0x3a0
> [ 134.251584] kernel_clone+0xe8/0x3e8
> [ 134.252022] ------------[ cut here ]------------
> [ 134.255156] __do_sys_clone+0x70/0xa8
> [ 134.259764] kernel BUG at kernel/sched/deadline.c:63!
> [ 134.263412] __arm64_sys_clone+0x28/0x40
> [ 134.272360] invoke_syscall+0x50/0x120
> [ 134.276101] el0_svc_common+0x44/0xf8
> [ 134.279753] do_el0_svc+0x28/0x40
> [ 134.283058] el0_svc+0x40/0x150
> [ 134.286195] el0t_64_sync_handler+0x100/0x130
> [ 134.290546] el0t_64_sync+0x1a4/0x1a8
> [ 134.294200] Code: 35ffffa2 17ffffe3 d4210000 17ffffb4 (d4210000)
> [ 134.300283] ---[ end trace 0000000000000000 ]---
> [ 134.304890] Kernel panic - not syncing: Oops - BUG: Fatal exception
> [ 134.311147] SMP: stopping secondary CPUs
> [ 135.365096] SMP: failed to stop secondary CPUs 8-9,16,30,43,86,88,121,149
> [ 135.371884] Kernel Offset: disabled
> [ 135.375361] CPU features: 0x00,00100003,80153d29,d75ffea7
> [ 135.380749] Memory Limit: none
> [ 135.383793] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]
> --------------------------------------------------------
>
> In dl_server_start(), we set the dl_se->dl_server with 1. When schedstats
> is enabled, in the following:
> dl_server_start() --> enqueue_dl_entity() --> update_stats_enqueue_dl()
> __schedstats_from_dl_se() -->dl_task_of()
>
> we will meet the BUG_ON.
>
> So just remove the BUG_ON in dl_task_of to fix the hang.
AFAIU the fair task enqueue (the "real" one) eventually leading to this
BUG has already had its own schedstats tracking dealt with and I don't
think we need/want to track anything for the associated dl_server.
Shouldn't we rather return early if update_stats_{wait_start,wait_end,
enqueue_sleeper}_dl are called for a dl_server dl_se?
Thanks,
Juri
On 2024/8/27 20:57, Juri Lelli wrote:
> AFAIU the fair task enqueue (the "real" one) eventually leading to this
> BUG has already had its own schedstats tracking dealt with and I don't
> think we need/want to track anything for the associated dl_server.
okay.
>
> Shouldn't we rather return early if update_stats_{wait_start,wait_end,
> enqueue_sleeper}_dl are called for a dl_server dl_se?
okay, I can add the "early return" in next version.
Thanks
Huang Shijie
© 2016 - 2025 Red Hat, Inc.