From: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
    vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org,
    bsegall@google.com, mgorman@suse.de, vschneid@redhat.com
Subject: [RFC PATCH] sched: Change nr_uninterruptible from unsigned to signed int
Date: Wed, 25 Jun 2025 04:48:36 +0000
Message-ID: <20250625044836.3939605-1-aruna.ramakrishna@oracle.com>
X-Mailer: git-send-email 2.43.5

We have encountered a bug where the load average displayed in top is
abnormally high and obviously incorrect. The real values look like this
(this is a production environment, not a simulated one):

top - 13:54:24 up 68 days, 14:33,  7 users,  load average: 4294967298.80, 4294967298.55, 4294967298.58
Threads: 5764 total,   5 running, 5759 sleeping,   0 stopped,   0 zombie

From digging a bit into the vmcore:

crash> p calc_load_tasks
calc_load_tasks = $1 = {
  counter = 4294967297
}

which is:

crash> eval 4294967297
hexadecimal: 100000001

It seems like an overflow, since the value exceeds UINT_MAX. Checking
further: the nr_uninterruptible values for each of the CPU runqueues are
large, and when they are summed up, the sum exceeds UINT_MAX; the result
is stored in a long, which preserves this overflowed value:

long calc_load_fold_active(struct rq *this_rq, long adjust)
{
	long nr_active, delta = 0;

	nr_active = this_rq->nr_running - adjust;
	nr_active += (int)this_rq->nr_uninterruptible;
	...

From the vmcore:

>>> sum = 0
>>> for cpu in for_each_online_cpu(prog):
...     rq = per_cpu(prog["runqueues"], cpu)
...     nr_unint = rq.nr_uninterruptible.value_()
...     sum += nr_unint
...     print(f"CPU {cpu}: nr_uninterruptible = {hex(nr_unint)}")
...     print(f"sum {hex(sum)}")
...
CPU 0: nr_uninterruptible = 0x638dd3
sum 0x638dd3
CPU 1: nr_uninterruptible = 0x129fb26
sum 0x18d88f9
CPU 2: nr_uninterruptible = 0xd8281f
sum 0x265b118
...
CPU 94: nr_uninterruptible = 0xe0a86
sum 0xfff1e855
CPU 95: nr_uninterruptible = 0xe17ab
sum 0x100000000

This is what we see, stored in calc_load_tasks. The correct sum here
would be 0.

From kernel/sched/loadavg.c:

 * - cpu_rq()->nr_uninterruptible isn't accurately tracked per-CPU because
 *   this would add another cross-CPU cacheline miss and atomic operation
 *   to the wakeup path. Instead we increment on whatever CPU the task ran
 *   when it went into uninterruptible state and decrement on whatever CPU
 *   did the wakeup. This means that only the sum of nr_uninterruptible over
 *   all CPUs yields the correct result.

It seems that rq->nr_uninterruptible can reach large (positive) values
on one CPU if a lot of tasks were migrated off of that CPU after going
into an uninterruptible state. If they're woken up on other CPUs, those
target CPUs will have negative nr_uninterruptible values. I think the
casting of an unsigned int to a signed int and adding it to a long is
not preserving the sign, and results in a large positive value rather
than the correct sum of zero.
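To make the wrap-around concrete, here is a minimal userspace sketch of the
accounting described above, using made-up task counts (not the vmcore
values): sleeps are charged to one CPU and wakeups to another, so each
per-CPU 32-bit counter ends up far from zero, the raw unsigned total lands
just above UINT_MAX (much like the 4294967297 seen in calc_load_tasks), and
only a signed interpretation of each per-CPU value recovers the small
logical total:

/*
 * Illustrative sketch only -- made-up counts, not the vmcore values.
 * Mimics the accounting above: sleeps are counted on one CPU, wakeups
 * on another, so the per-CPU u32 counters wrap and only the sum of all
 * of them, interpreted as signed, is meaningful.
 */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
	uint32_t nr_unint[2] = { 0, 0 };  /* like rq->nr_uninterruptible */

	/* 5,000,000 tasks enter D state on CPU 0 and are woken on CPU 1 */
	for (int i = 0; i < 5000000; i++) {
		nr_unint[0]++;		/* sleep accounted on CPU 0 */
		nr_unint[1]--;		/* wakeup accounted on CPU 1, wraps */
	}
	nr_unint[0]++;			/* one task still uninterruptible */

	unsigned long raw_sum = 0;	/* widen each raw u32, like the loop above */
	long signed_sum = 0;		/* treat each per-CPU value as signed */

	for (int cpu = 0; cpu < 2; cpu++) {
		raw_sum += nr_unint[cpu];
		signed_sum += (int32_t)nr_unint[cpu];
	}

	printf("per-CPU counters: 0x%" PRIx32 " 0x%" PRIx32 "\n",
	       nr_unint[0], nr_unint[1]);
	printf("raw u32 sum     : 0x%lx\n", raw_sum);	/* 0x100000001 on LP64 */
	printf("signed i32 sum  : %ld\n", signed_sum);	/* 1 */
	return 0;
}

With nr_uninterruptible declared as a signed int (as in the patch below),
each per-CPU value is already negative where it should be and sign-extends
naturally when folded into the long accumulator.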
I suspect the bug surfaced as a side effect of this commit:

commit e6fe3f422be128b7d65de607f6ae67bedc55f0ca
Author: Alexey Dobriyan
Date:   Thu Apr 22 23:02:28 2021 +0300

    sched: Make multiple runqueue task counters 32-bit

    Make:

            struct dl_rq::dl_nr_migratory
            struct dl_rq::dl_nr_running
            struct rt_rq::rt_nr_boosted
            struct rt_rq::rt_nr_migratory
            struct rt_rq::rt_nr_total
            struct rq::nr_uninterruptible

    32-bit.

    If total number of tasks can't exceed 2**32 (and less due to futex pid
    limits), then per-runqueue counters can't as well.

    This patchset has been sponsored by REX Prefix Eradication Society.

... which changed the counter nr_uninterruptible from unsigned long to
unsigned int.

Since nr_uninterruptible can be a positive or negative number, change
its type from unsigned int to signed int. Another possible solution
would be to partially roll back e6fe3f422be1 and change
nr_uninterruptible back to unsigned long.

Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
---
 kernel/sched/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295..f6d21278e64e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1149,7 +1149,7 @@ struct rq {
 	 * one CPU and if it got migrated afterwards it may decrease
 	 * it on another CPU. Always updated under the runqueue lock:
 	 */
-	unsigned int		nr_uninterruptible;
+	int			nr_uninterruptible;
 
 	union {
 		struct task_struct __rcu *donor;	/* Scheduler context */

base-commit: 86731a2a651e58953fc949573895f2fa6d456841
prerequisite-patch-id: dd6db7012c5094dec89e689ba56fd3551d2b4a40
-- 
2.43.5