* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Thu, Aug 11, 2022 at 12:13 AM Ingo Molnar <mingo@kernel.org> wrote:
> >
> > By using a WARN_ON() we at least give the user a chance to report
> > any bugs triggered here - instead of getting silent hangs.
> >
> > None of these WARN_ON()s are supposed to trigger, ever - so we ignore
> > cases where a NULL check is done via a BUG_ON() and we let a NULL
> > pointer through after a WARN_ON().
>
> May I suggest going one step further, and making these WARN_ON_ONCE() instead.
Ok, agreed.
> From personal experience, once some scheduler bug (or task struct
> corruption) happens, it often *keeps* happening, and the logs just fill
> up with more and more data, to the point where you lose sight of the
> original report (and the machine can even get unusable just from the
> logging).
>
> WARN_ON_ONCE() can help that situation.
Yeah, true.
> Now, obviously
>
> (a) WARN_ON_ONCE *can* also result in less information, and maybe there
> are situations where having more - possibly different - cases of the same
> thing triggering could be useful.
None of these warnings are supposed to happen absent serious random data
corruption, ever™, so if, against expectations, one does trigger somewhere
and the brevity of a once-only warning makes it hard to figure out, we can
still reconsider on a case-by-case basis & increase verbosity.
> (b) WARN_ON_ONCE historically generated a bit bigger code than
> WARN_ON simply due to the extra "did this already trigger" check.
>
> I *think* (b) is no longer true, and it's just a flag these days, but I
> didn't actually check.
Yeah, so on architectures that implement a smart __WARN_FLAGS() primitive,
such as x86, the overhead is pretty low:
#define __WARN_FLAGS(flags)					\
do {								\
	__auto_type __flags = BUGFLAG_WARNING|(flags);		\
	instrumentation_begin();				\
	_BUG_FLAGS(ASM_UD2, __flags, ASM_REACHABLE);		\
	instrumentation_end();					\
} while (0)
For them the runtime overhead of a WARN_ON_ONCE() is basically just the
check itself:
# no checks:
ffffffff810a3ae0 <check_preempt_curr>:
ffffffff810a3ae0: 48 8b 87 20 09 00 00 mov 0x920(%rdi),%rax
ffffffff810a3ae7: 48 8b 8e 90 02 00 00 mov 0x290(%rsi),%rcx
ffffffff810a3aee: 53 push %rbx
ffffffff810a3aef: 48 89 fb mov %rdi,%rbx
ffffffff810a3af2: 48 3b 88 90 02 00 00 cmp 0x290(%rax),%rcx
# Single-branch WARN_ON_ONCE():
#
# void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
# {
# + WARN_ON_ONCE(!rq);
# +
# if (p->sched_class == rq->curr->sched_class)
# rq->curr->sched_class->check_preempt_curr(rq, p, flags);
# else if (sched_class_above(p->sched_class, rq->curr->sched_class))
ffffffff810a3ae0 <check_preempt_curr>:
ffffffff810a3ae0: 53 push %rbx
ffffffff810a3ae1: 48 89 fb mov %rdi,%rbx
ffffffff810a3ae4: | 48 85 ff test %rdi,%rdi
ffffffff810a3ae7: | 74 69 je ffffffff810a3b52 <check_preempt_curr+0x72>
ffffffff810a3ae9: 48 8b 83 20 09 00 00 mov 0x920(%rbx),%rax
ffffffff810a3af0: 48 8b 8e 90 02 00 00 mov 0x290(%rsi),%rcx
ffffffff810a3af7: 48 3b 88 90 02 00 00 cmp 0x290(%rax),%rcx
...
ffffffff810a3b50: eb d1 jmp ffffffff810a3b23 <check_preempt_curr+0x43> # tail-call
ffffffff810a3b52: | 0f 0b ud2
So it's a test instruction and an unlikely-forward-branch, plus a UD2 trap
squeezed into the function epilogue, often hidden by alignment holes.
As lightweight as it gets, and no in-line penalty from being a 'once'
warning.
> so it's not like there aren't potential downsides, but in general I think
> the sanest and most natural thing is to have BUG_ON() translate to
> WARN_ON_ONCE().
>
> For the "reboot-on-warn" people, it ends up being the same thing. And for
> the rest of us, the "give me *one* warning" can end up making the
> reporting a lot easier.
Agreed - updated v2 patch attached.
> Obviously, with the "this never actually happens", the whole "once or
> many times" is kind of moot. But if it never happens at all, to the point
> where it doesn't even add a chance of helping debugging, maybe the whole
> test should be removed entirely...
Yeah.
Thanks,
Ingo
========================================
From: Ingo Molnar <mingo@kernel.org>
Date: Thu, 11 Aug 2022 08:54:52 +0200
Subject: [PATCH] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
There's no good reason to crash a user's system with a BUG_ON(),
chances are high that they'll never even see the crash message on
Xorg, and it won't make it into the syslog either.
By using a WARN_ON_ONCE() we at least give the user a chance to report
any bugs triggered here - instead of getting silent hangs.
None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore
cases where a NULL check is done via a BUG_ON() and we let a NULL
pointer through after a WARN_ON_ONCE().
There's one exception: WARN_ON_ONCE() arguments with side-effects,
such as locking - in this case we use the return value of the
WARN_ON_ONCE(), such as in:
- BUG_ON(!lock_task_sighand(p, &flags));
+ if (WARN_ON_ONCE(!lock_task_sighand(p, &flags)))
+ return;
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com
---
kernel/sched/autogroup.c | 3 ++-
kernel/sched/core.c | 2 +-
kernel/sched/cpupri.c | 2 +-
kernel/sched/deadline.c | 26 +++++++++++++-------------
kernel/sched/fair.c | 10 +++++-----
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 6 +++---
7 files changed, 26 insertions(+), 25 deletions(-)
diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index 4ebaf97f7bd8..991fc9002535 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -161,7 +161,8 @@ autogroup_move_group(struct task_struct *p, struct autogroup *ag)
struct task_struct *t;
unsigned long flags;
- BUG_ON(!lock_task_sighand(p, &flags));
+ if (WARN_ON_ONCE(!lock_task_sighand(p, &flags)))
+ return;
prev = p->signal->autogroup;
if (prev == ag) {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee28253c9ac0..813687a5f5cf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2328,7 +2328,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
rq = cpu_rq(new_cpu);
rq_lock(rq, rf);
- BUG_ON(task_cpu(p) != new_cpu);
+ WARN_ON_ONCE(task_cpu(p) != new_cpu);
activate_task(rq, p, 0);
check_preempt_curr(rq, p, 0);
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index fa9ce9d83683..a286e726eb4b 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -147,7 +147,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
int task_pri = convert_prio(p->prio);
int idx, cpu;
- BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES);
+ WARN_ON_ONCE(task_pri >= CPUPRI_NR_PRIORITIES);
for (idx = 0; idx < task_pri; idx++) {
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0ab79d819a0d..962b169b05cf 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -310,7 +310,7 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
{
struct rq *rq;
- BUG_ON(p->dl.flags & SCHED_FLAG_SUGOV);
+ WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
if (task_on_rq_queued(p))
return;
@@ -607,7 +607,7 @@ static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
{
struct rb_node *leftmost;
- BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
+ WARN_ON_ONCE(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
leftmost = rb_add_cached(&p->pushable_dl_tasks,
&rq->dl.pushable_dl_tasks_root,
@@ -684,7 +684,7 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
* Failed to find any suitable CPU.
* The task will never come back!
*/
- BUG_ON(dl_bandwidth_enabled());
+ WARN_ON_ONCE(dl_bandwidth_enabled());
/*
* If admission control is disabled we
@@ -830,7 +830,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
struct rq *rq = rq_of_dl_rq(dl_rq);
- BUG_ON(pi_of(dl_se)->dl_runtime <= 0);
+ WARN_ON_ONCE(pi_of(dl_se)->dl_runtime <= 0);
/*
* This could be the case for a !-dl task that is boosted.
@@ -1616,7 +1616,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
- BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
+ WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
rb_add_cached(&dl_se->rb_node, &dl_rq->root, __dl_less);
@@ -1640,7 +1640,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
static void
enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
{
- BUG_ON(on_dl_rq(dl_se));
+ WARN_ON_ONCE(on_dl_rq(dl_se));
update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
@@ -2017,7 +2017,7 @@ static struct task_struct *pick_task_dl(struct rq *rq)
return NULL;
dl_se = pick_next_dl_entity(dl_rq);
- BUG_ON(!dl_se);
+ WARN_ON_ONCE(!dl_se);
p = dl_task_of(dl_se);
return p;
@@ -2277,12 +2277,12 @@ static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
- BUG_ON(rq->cpu != task_cpu(p));
- BUG_ON(task_current(rq, p));
- BUG_ON(p->nr_cpus_allowed <= 1);
+ WARN_ON_ONCE(rq->cpu != task_cpu(p));
+ WARN_ON_ONCE(task_current(rq, p));
+ WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
- BUG_ON(!task_on_rq_queued(p));
- BUG_ON(!dl_task(p));
+ WARN_ON_ONCE(!task_on_rq_queued(p));
+ WARN_ON_ONCE(!dl_task(p));
return p;
}
@@ -2492,7 +2492,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
struct root_domain *src_rd;
struct rq *rq;
- BUG_ON(!dl_task(p));
+ WARN_ON_ONCE(!dl_task(p));
rq = task_rq(p);
src_rd = rq->rd;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 914096c5b1ae..28f10dccd194 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2600,7 +2600,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
if (!join)
return;
- BUG_ON(irqs_disabled());
+ WARN_ON_ONCE(irqs_disabled());
double_lock_irq(&my_grp->lock, &grp->lock);
for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
@@ -7279,7 +7279,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
return;
find_matching_se(&se, &pse);
- BUG_ON(!pse);
+ WARN_ON_ONCE(!pse);
cse_is_idle = se_is_idle(se);
pse_is_idle = se_is_idle(pse);
@@ -8159,7 +8159,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
{
lockdep_assert_rq_held(rq);
- BUG_ON(task_rq(p) != rq);
+ WARN_ON_ONCE(task_rq(p) != rq);
activate_task(rq, p, ENQUEUE_NOCLOCK);
check_preempt_curr(rq, p, 0);
}
@@ -10134,7 +10134,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
goto out_balanced;
}
- BUG_ON(busiest == env.dst_rq);
+ WARN_ON_ONCE(busiest == env.dst_rq);
schedstat_add(sd->lb_imbalance[idle], env.imbalance);
@@ -10430,7 +10430,7 @@ static int active_load_balance_cpu_stop(void *data)
* we need to fix it. Originally reported by
* Bjorn Helgaas on a 128-CPU setup.
*/
- BUG_ON(busiest_rq == target_rq);
+ WARN_ON_ONCE(busiest_rq == target_rq);
/* Search for an sd spanning us and the target CPU. */
rcu_read_lock();
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 55f39c8f4203..2936fe55cef7 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -843,7 +843,7 @@ static void __disable_runtime(struct rq *rq)
* We cannot be left wanting - that would mean some runtime
* leaked out of the system.
*/
- BUG_ON(want);
+ WARN_ON_ONCE(want);
balanced:
/*
* Disable all the borrow logic by pretending we have inf
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e26688d387ae..7a44dceeb50a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2709,8 +2709,8 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
__acquires(rq1->lock)
__acquires(rq2->lock)
{
- BUG_ON(!irqs_disabled());
- BUG_ON(rq1 != rq2);
+ WARN_ON_ONCE(!irqs_disabled());
+ WARN_ON_ONCE(rq1 != rq2);
raw_spin_rq_lock(rq1);
__acquire(rq2->lock); /* Fake it out ;) */
double_rq_clock_clear_update(rq1, rq2);
@@ -2726,7 +2726,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
__releases(rq1->lock)
__releases(rq2->lock)
{
- BUG_ON(rq1 != rq2);
+ WARN_ON_ONCE(rq1 != rq2);
raw_spin_rq_unlock(rq1);
__release(rq2->lock);
}
On Fri, Aug 12, 2022 at 11:29:18AM +0200, Ingo Molnar wrote:
> From: Ingo Molnar <mingo@kernel.org>
> Date: Thu, 11 Aug 2022 08:54:52 +0200
> Subject: [PATCH] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
>
> There's no good reason to crash a user's system with a BUG_ON(),
> chances are high that they'll never even see the crash message on
> Xorg, and it won't make it into the syslog either.
>
> By using a WARN_ON_ONCE() we at least give the user a chance to report
> any bugs triggered here - instead of getting silent hangs.
>
> None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore
> cases where a NULL check is done via a BUG_ON() and we let a NULL
> pointer through after a WARN_ON_ONCE().
>
> There's one exception: WARN_ON_ONCE() arguments with side-effects,
> such as locking - in this case we use the return value of the
> WARN_ON_ONCE(), such as in:
>
> - BUG_ON(!lock_task_sighand(p, &flags));
> + if (WARN_ON_ONCE(!lock_task_sighand(p, &flags)))
> + return;
>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com
> ---
> kernel/sched/autogroup.c | 3 ++-
> kernel/sched/core.c | 2 +-
> kernel/sched/cpupri.c | 2 +-
> kernel/sched/deadline.c | 26 +++++++++++++-------------
> kernel/sched/fair.c | 10 +++++-----
> kernel/sched/rt.c | 2 +-
> kernel/sched/sched.h | 6 +++---
> 7 files changed, 26 insertions(+), 25 deletions(-)
>
> diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> index fa9ce9d83683..a286e726eb4b 100644
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -147,7 +147,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
> int task_pri = convert_prio(p->prio);
> int idx, cpu;
>
> - BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES);
> + WARN_ON_ONCE(task_pri >= CPUPRI_NR_PRIORITIES);
>
> for (idx = 0; idx < task_pri; idx++) {
>
Should the return value be used here to clamp task_pri below
CPUPRI_NR_PRIORITIES? task_pri is used as an index which, in __cpupri_find,
then accesses beyond the end of an array, and behaviour after that will be
very unpredictable.
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 0ab79d819a0d..962b169b05cf 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2017,7 +2017,7 @@ static struct task_struct *pick_task_dl(struct rq *rq)
> return NULL;
>
> dl_se = pick_next_dl_entity(dl_rq);
> - BUG_ON(!dl_se);
> + WARN_ON_ONCE(!dl_se);
> p = dl_task_of(dl_se);
>
> return p;
It's a somewhat redundant check: it'll NULL-pointer dereference shortly
afterwards anyway, but the warning makes the failure a bit more obvious.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 914096c5b1ae..28f10dccd194 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2600,7 +2600,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
> if (!join)
> return;
>
> - BUG_ON(irqs_disabled());
> + WARN_ON_ONCE(irqs_disabled());
> double_lock_irq(&my_grp->lock, &grp->lock);
>
> for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
Recoverable with a goto no_join. It'll be a terrible recovery because
there is no way IRQs should be disabled here. Something else incredibly
bad happened before this would fire.
> @@ -7279,7 +7279,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> return;
>
> find_matching_se(&se, &pse);
> - BUG_ON(!pse);
> + WARN_ON_ONCE(!pse);
>
> cse_is_idle = se_is_idle(se);
> pse_is_idle = se_is_idle(pse);
Similar to pick_task_dl.
> @@ -10134,7 +10134,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> goto out_balanced;
> }
>
> - BUG_ON(busiest == env.dst_rq);
> + WARN_ON_ONCE(busiest == env.dst_rq);
>
> schedstat_add(sd->lb_imbalance[idle], env.imbalance);
>
goto out if it triggers? It'll just continue to be unbalanced.
> @@ -10430,7 +10430,7 @@ static int active_load_balance_cpu_stop(void *data)
> * we need to fix it. Originally reported by
> * Bjorn Helgaas on a 128-CPU setup.
> */
> - BUG_ON(busiest_rq == target_rq);
> + WARN_ON_ONCE(busiest_rq == target_rq);
>
> /* Search for an sd spanning us and the target CPU. */
> rcu_read_lock();
goto out_unlock if it fires?
For the rest, I didn't see obvious recovery paths that would allow the
system to run predictably. Any of them firing will have unpredictable
consequences (e.g. move_queued_task firing would be fun if it was a per-cpu
kthread). Depending on which warning triggers, the remaining life of the
system may be very short but maybe long enough to be logged even if system
locks up shortly afterwards.
--
Mel Gorman
SUSE Labs
* Mel Gorman <mgorman@suse.de> wrote:

> For the rest, I didn't see obvious recovery paths that would allow the
> system to run predictably. Any of them firing will have unpredictable
> consequences (e.g. move_queued_task firing would be fun if it was a
> per-cpu kthread). Depending on which warning triggers, the remaining life
> of the system may be very short but maybe long enough to be logged even
> if system locks up shortly afterwards.

Correct. I'd prefer to keep all these warnings 'simple' - i.e. no attempted
recovery & control flow, unless we ever expect these to trigger.

I.e. instead of adding a 'goto' I'd prefer if we removed most of the ones
you highlighted. But wanted to keep this first patch simple.

Thanks,

	Ingo
On Sun, Aug 21, 2022 at 01:28:58PM +0200, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@suse.de> wrote:
>
> > For the rest, I didn't see obvious recovery paths that would allow the
> > system to run predictably. Any of them firing will have unpredictable
> > consequences (e.g. move_queued_task firing would be fun if it was a
> > per-cpu kthread). Depending on which warning triggers, the remaining life
> > of the system may be very short but maybe long enough to be logged even
> > if system locks up shortly afterwards.
>
> Correct. I'd prefer to keep all these warnings 'simple' - i.e. no attempted
> recovery & control flow, unless we ever expect these to trigger.
>

I generally hope we never expect a warning to trigger until of course it
does :P

> I.e. instead of adding a 'goto' I'd prefer if we removed most of the ones
> you highlighted. But wanted to keep this first patch simple.
>

The only one I would push back on is clamping in cpupri_find_fitness() so
it does not access outside the bounds of an array after a warning triggers.
For the rest, I've no strong objection to recovering given the nature of
the BUGs you are converting. If any of them *did* trigger, it would be more
important to fix the issue than recover the warning.

So other than the out-of-bounds array access which I'd like to see clamped
just in case;

Acked-by: Mel Gorman <mgorman@suse.de>

--
Mel Gorman
SUSE Labs
On 8/15/22 07:41, Mel Gorman wrote:
> For the rest, I didn't see obvious recovery paths that would allow the
> system to run predictably. Any of them firing will have unpredictable
> consequences (e.g. move_queued_task firing would be fun if it was a per-cpu
> kthread). Depending on which warning triggers, the remaining life of the
> system may be very short but maybe long enough to be logged even if system
> locks up shortly afterwards.

Hi Mel,

Are you basically saying, "WARN_ON*() seems acceptable in mm/, because
we can at least get the problem logged before it locks up, probably"?

Or are you implying that we should instead introduce and use some new
PANIC_ON() or VM_PANIC_ON() set of macros that would allow proper "log
then reboot" behavior?

thanks,
--
John Hubbard
NVIDIA
On Mon, Aug 15, 2022 at 03:12:19PM -0700, John Hubbard wrote:
> On 8/15/22 07:41, Mel Gorman wrote:
> > For the rest, I didn't see obvious recovery paths that would allow the
> > system to run predictably. Any of them firing will have unpredictable
> > consequences (e.g. move_queued_task firing would be fun if it was a per-cpu
> > kthread). Depending on which warning triggers, the remaining life of the
> > system may be very short but maybe long enough to be logged even if system
> > locks up shortly afterwards.
>
> Hi Mel,
>
> Are you basically saying, "WARN_ON*() seems acceptable in mm/, because
> we can at least get the problem logged before it locks up, probably"?
>

I don't consider this to be a subsystem-specific guideline, and I am
certainly not suggesting that mm/ is the gold standard, but my
understanding was "If a warning is recoverable in a sensible fashion then
do so". A warning is always bad, but it may be possible for the system to
continue long enough for the warning to be logged or an admin to schedule
a reboot at a controlled time. I think Ingo is doing the right thing with
this patch to avoid an unnecessary panic, but for at least some of them a
partial recovery is possible, so why not do it seeing as it's been changed
anyway?

The outcome of a warning is very variable: it might be a coding error
where state is unexpected but can be reset to a known state, maybe some
data is leaked that, if it persists, eventually leads to OOM, maybe a
particular feature will no longer work as expected (e.g. load balancing
doesn't balance load due to a warning), or maybe the whole system will
fail completely in the near future, possibly before an error can be
logged. On the scheduler side, one warning may mean that a task never runs
again, and another that a task is in the wrong scheduling class, which
means ... no idea, but it's not good if it's a deadline task that gets
manipulated by the normal scheduling class instead.

For mm/, take three examples:

1. The compaction.c warning: PFN ranges are unexpected, so reset them.
   Compaction is less effective and opportunities were missed, but the
   system will not blow up. Maybe at worst a hugetlb allocation would fail
   when it might otherwise have succeeded, or SLUB degrades because it
   can't use a high-order page for a cache that could reside in an order-0
   page instead. It varies.

2. The warning in filemap_unaccount_folio: possible loss of unwritten data
   on normal filesystems. Bad, might cause inconsistency of state, but
   maybe the data will be dirtied again and it'll recover eventually. It's
   certainly not good if that data got lost forever.

3. The warning in isolate_movable_page is bad: page flags state is either
   corrupted by a driver or maybe there was a bit-flip error, but a page
   may be migrated unexpectedly, leading to Who Knows What, corruption of
   a page that gets freed and reused by another process? It's much worse
   than compaction restoring its internal state of where it is scanning,
   but maybe not a complete disaster.

> Or are you implying that we should instead introduce and use some new
> PANIC_ON() or VM_PANIC_ON() set of macros that would allow proper "log
> then reboot" behavior?
>

No, not as a general rule. VM_PANIC_ON would suggest it's under DEBUG_VM,
which is not always set, and I do not see how either PANIC_ON or
VM_PANIC_ON could be coded in such a way that it always gets logged,
because that depends on the context in which the bad condition occurred. I
also don't think the kernel could conditionally warn or panic for a
specific instance, because if an admin wants a panic on a warning (which
can be very useful for a crash dump), then panic_on_warn is available.
Either way, PANIC_ON or VM_PANIC_ON is outside the scope of this patch.

My understanding was that a panic() should only be used in the case where
the kernel is screwed and, if it tries to continue, the outcome will be
much worse, or it's early in boot and the error condition means it'll be
impossible to boot anyway.

--
Mel Gorman
SUSE Labs