We have encountered a bug where the load average displayed in top is
abnormally high and obviously incorrect. The real values look like this
(this is a production env, not a simulated one):
top - 13:54:24 up 68 days, 14:33, 7 users, load average:
4294967298.80, 4294967298.55, 4294967298.58
Threads: 5764 total, 5 running, 5759 sleeping, 0 stopped, 0 zombie
From digging a bit into the vmcore:
crash> p calc_load_tasks
calc_load_tasks = $1 = {
  counter = 4294967297
}
which is:
crash> eval 4294967297
hexadecimal: 100000001
It seems like an overflow, since the value exceeds UINT_MAX.
Checking further:
The nr_uninterruptible values for each of the CPU runqueues are large,
and when they are summed up, the sum exceeds UINT_MAX. The result is
stored in a long, which preserves this overflow.
long calc_load_fold_active(struct rq *this_rq, long adjust)
{
	long nr_active, delta = 0;

	nr_active = this_rq->nr_running - adjust;
	nr_active += (int)this_rq->nr_uninterruptible;
...
From the vmcore:
>>> sum = 0
>>> for cpu in for_each_online_cpu(prog):
...     rq = per_cpu(prog["runqueues"], cpu)
...     nr_unint = rq.nr_uninterruptible.value_()
...     sum += nr_unint
...     print(f"CPU {cpu}: nr_uninterruptible = {hex(nr_unint)}")
...     print(f"sum {hex(sum)}")
...
CPU 0: nr_uninterruptible = 0x638dd3
sum 0x638dd3
CPU 1: nr_uninterruptible = 0x129fb26
sum 0x18d88f9
CPU 2: nr_uninterruptible = 0xd8281f
sum 0x265b118
...
CPU 94: nr_uninterruptible = 0xe0a86
sum 0xfff1e855
CPU 95: nr_uninterruptible = 0xe17ab
sum 0x100000000
This is what we see, stored in calc_load_tasks. The correct sum here would be 0.
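The wraparound can be reproduced with a small standalone user-space sketch
(hypothetical two-CPU values on a 64-bit box, not the real per-CPU values from
the vmcore above): one counter holds a large positive count of sleepers, the
other has been decremented below zero and wrapped, and summing the raw 32-bit
values into a long, as the drgn loop above does, lands on 0x100000000 instead
of 0:

#include <stdio.h>

int main(void)
{
	/*
	 * Hypothetical: 1000 tasks go uninterruptible on CPU 0 (counter
	 * incremented there) and are all woken on CPU 1 (counter
	 * decremented there), so the per-CPU "truth" is +1000 and -1000.
	 */
	unsigned int cpu0 = 1000;
	unsigned int cpu1 = (unsigned int)-1000;	/* wraps to 0xfffffc18 */
	long sum = 0;

	/* Sum the raw unsigned 32-bit values into a 64-bit long: */
	sum += cpu0;
	sum += cpu1;

	printf("%#lx\n", sum);	/* prints 0x100000000, not 0 */
	return 0;
}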
From kernel/sched/loadavg.c:
 * - cpu_rq()->nr_uninterruptible isn't accurately tracked per-CPU because
 *   this would add another cross-CPU cacheline miss and atomic operation
 *   to the wakeup path. Instead we increment on whatever CPU the task ran
 *   when it went into uninterruptible state and decrement on whatever CPU
 *   did the wakeup. This means that only the sum of nr_uninterruptible over
 *   all CPUs yields the correct result.
 *
It seems that rq->nr_uninterruptible can grow to a large (positive) value
on one CPU if a lot of tasks were migrated off of that CPU after going
into an uninterruptible state. If they are then woken up on other CPUs,
those target CPUs end up with negative nr_uninterruptible values. I think
the cast of the unsigned int to a signed int, added to a long, does not
preserve the sign, and the result is a large positive value rather than
the correct sum of zero.
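For reference, the conversion in calc_load_fold_active() can be checked in
isolation with a user-space sketch (made-up wrapped value; a narrowing
conversion of an out-of-range value is formally implementation-defined, but
gcc/clang produce the 2's-complement result):

#include <stdio.h>

int main(void)
{
	unsigned int nr_uninterruptible = (unsigned int)-16;	/* 0xfffffff0 */
	long nr_active = 0;

	/* The same cast-then-add the kernel does: */
	nr_active += (int)nr_uninterruptible;
	printf("%ld\n", nr_active);	/* -16 with gcc/clang */

	/* Without the (int) cast, the value would be added as 4294967280. */
	return 0;
}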
I suspect the bug surfaced as a side effect of this commit:
commit e6fe3f422be128b7d65de607f6ae67bedc55f0ca
Author: Alexey Dobriyan <adobriyan@gmail.com>
Date: Thu Apr 22 23:02:28 2021 +0300
    sched: Make multiple runqueue task counters 32-bit

    Make:

    	struct dl_rq::dl_nr_migratory
    	struct dl_rq::dl_nr_running

    	struct rt_rq::rt_nr_boosted
    	struct rt_rq::rt_nr_migratory
    	struct rt_rq::rt_nr_total

    	struct rq::nr_uninterruptible

    32-bit.

    If total number of tasks can't exceed 2**32 (and less due to futex pid
    limits), then per-runqueue counters can't as well.

    This patchset has been sponsored by REX Prefix Eradication Society.
...
which changed the counter nr_uninterruptible from unsigned long to unsigned
int.
Since nr_uninterruptible can be a positive or negative number, change
the type from unsigned int to signed int.
Another possible solution would be to partially rollback e6fe3f422be1,
and change nr_uninterruptible back to unsigned long.
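For comparison, that partial revert would be roughly the hunk below (sketch
only; the (int) cast at the consumer in calc_load_fold_active() would
presumably have to go back to (long) as well):

-	unsigned int		nr_uninterruptible;
+	unsigned long		nr_uninterruptible;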
Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
---
kernel/sched/sched.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295..f6d21278e64e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1149,7 +1149,7 @@ struct rq {
 	 * one CPU and if it got migrated afterwards it may decrease
 	 * it on another CPU. Always updated under the runqueue lock:
 	 */
-	unsigned int		nr_uninterruptible;
+	int			nr_uninterruptible;
 
 	union {
 		struct task_struct __rcu *donor; /* Scheduler context */
base-commit: 86731a2a651e58953fc949573895f2fa6d456841
prerequisite-patch-id: dd6db7012c5094dec89e689ba56fd3551d2b4a40
--
2.43.5
(Please, be careful not to wrap quoted text, unwrapped it for you)

On Wed, Jun 25, 2025 at 04:48:36AM +0000, Aruna Ramakrishna wrote:
> We have encountered a bug where the load average displayed in top is
> abnormally high and obviously incorrect. The real values look like this
> (this is a production env, not a simulated one):

Whoopie..

> The nr_uninterruptible values for each of the CPU runqueues are large,
> and when they are summed up, the sum exceeds UINT_MAX. The result is
> stored in a long, which preserves this overflow.

Right, that's the problem spot.

> long calc_load_fold_active(struct rq *this_rq, long adjust)
> {
> 	long nr_active, delta = 0;
>
> 	nr_active = this_rq->nr_running - adjust;
> 	nr_active += (int)this_rq->nr_uninterruptible;
> ...
>
> From kernel/sched/loadavg.c:
>
>  * - cpu_rq()->nr_uninterruptible isn't accurately tracked per-CPU because
>  *   this would add another cross-CPU cacheline miss and atomic operation
>  *   to the wakeup path. Instead we increment on whatever CPU the task ran
>  *   when it went into uninterruptible state and decrement on whatever CPU
>  *   did the wakeup. This means that only the sum of nr_uninterruptible over
>  *   all CPUs yields the correct result.
>  *
>
> It seems that rq->nr_uninterruptible can grow to a large (positive) value
> on one CPU if a lot of tasks were migrated off of that CPU after going
> into an uninterruptible state. If they are then woken up on other CPUs,
> those target CPUs end up with negative nr_uninterruptible values. I think
> the cast of the unsigned int to a signed int, added to a long, does not
> preserve the sign, and the result is a large positive value rather than
> the correct sum of zero.

So very close, yet so far...

> I suspect the bug surfaced as a side effect of this commit:
>
> commit e6fe3f422be128b7d65de607f6ae67bedc55f0ca
> Author: Alexey Dobriyan <adobriyan@gmail.com>
> Date: Thu Apr 22 23:02:28 2021 +0300
>
>     sched: Make multiple runqueue task counters 32-bit
>
>     Make:
>
>     	struct dl_rq::dl_nr_migratory
>     	struct dl_rq::dl_nr_running
>
>     	struct rt_rq::rt_nr_boosted
>     	struct rt_rq::rt_nr_migratory
>     	struct rt_rq::rt_nr_total
>
>     	struct rq::nr_uninterruptible
>
>     32-bit.
>
>     If total number of tasks can't exceed 2**32 (and less due to futex pid
>     limits), then per-runqueue counters can't as well.
>
>     This patchset has been sponsored by REX Prefix Eradication Society.
> ...
>
> which changed the counter nr_uninterruptible from unsigned long to unsigned
> int.
>
> Since nr_uninterruptible can be a positive or negative number, change
> the type from unsigned int to signed int.

(Strictly speaking it's making things worse, since signed overflow is UB
in regular C -- luckily we kernel folks have our own dialect and signed
and unsigned are both expected to wrap 2s-complement).

Also, we're already casting to (int) in the only place where we consume
the value. So changing the type should make no difference what so ever,
right?

> Another possible solution would be to partially rollback e6fe3f422be1,
> and change nr_uninterruptible back to unsigned long.

I think I prefer this.