From: luohaiyang10243395
Subject: [PATCH 5.10.y] sched/fair: Fix task starvation caused by incorrect vruntime_normalized check
Date: Wed, 11 Mar 2026 17:47:11 +0800 (CST)
Message-ID: <20260311174711163Nk6ITx4M_Jno8mdC7-iYz@zte.com.cn>
X-Mailing-List: linux-kernel@vger.kernel.org

When a previous update_rq_clock() happened inside a {soft,}irq region, we
stop advancing ->clock_task and only update the prev_irq_time stamp. This
means clock_task can remain unchanged for a period of time. As a result,
we may observe that a task has actually been running but its
sum_exec_runtime == 0, as confirmed by ftrace:

<...>-3615452 [040] d... 1081329.791592: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9387a clock=0x3d776b17204b0 prev_irq_time=0x45480b8cc36
<...>-3615452 [040] ..s. 1081329.791596: softirq_entry: vec=3 [action=NET_RX]
<...>-3615452 [040] d.s. 1081329.791619: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b943a2 clock=0x3d776b1720fd8 prev_irq_time=0x45480b8cc36
<...>-3615452 [040] ..s. 1081329.791631: softirq_exit: vec=3 [action=NET_RX]
<...>-3615452 [040] ..s. 1081329.791631: softirq_entry: vec=6 [action=TASKLET]
<...>-3615452 [040] ..s. 1081329.791632: softirq_exit: vec=6 [action=TASKLET]
<...>-3615452 [040] d... 1081329.791633: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b17278ec prev_irq_time=0x45480b8d065
<...>-3615452 [040] d... 1081329.791637: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b172b0cf prev_irq_time=0x45480b90848
<...>-3615452 [040] d... 1081329.791639: sched_switch: prev_comm=futex prev_pid=3615452 prev_prio=120 prev_state=S ==> next_comm=futex next_pid=3615454 next_prio=120
<...>-3615454 [040] d... 1081329.791643: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b172c1fd prev_irq_time=0x45480b9197
<...>-3615454 [040] d... 1081329.791644: sched_switch: prev_comm=futex prev_pid=3615454 prev_prio=120 prev_state=S ==> next_comm=sched_yield next_pid=2439180 next_prio=12
sched_yield-2439180 [040] d... 1081329.791645: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b172d7b3 prev_irq_time=0x45480b92f2c

Note how clock_task stays at 0x3d32230b9a887 across several updates while
clock keeps advancing: the whole delta was attributed to irq time.

In our production environment, we have two tasks bound to one CPU:

nginxA:
int main()
{
	pthread_mutex_lock(&mutex);
	do_something();
	pthread_mutex_unlock(&mutex);
}

nginxB:
int main()
{
	while (nginxA not exit)
		sched_yield();
}

nginxA starved due to the following sequence of events:

1. Another task forks two tasks, nginxA and nginxB. Because the system
   has been running for a long time, their vruntime is a very large
   value.
2. nginxA immediately goes to sleep due to lock contention, but its
   sum_exec_runtime == 0 (its runtime was swallowed by irq time
   accounting as shown above).
3. nginxA and nginxB attach to a new cgroup. nginxA's vruntime is not
   adjusted to the new cfs_rq, because vruntime_normalized() sees
   sum_exec_runtime == 0 and wrongly concludes the task never ran.
4. The new cfs_rq's min_vruntime overflows quickly as nginxB continues
   running. When nginxA is woken up, since nginxA's vruntime >> nginxB's
   vruntime, nginxA cannot get the CPU for a long period of time.
Signed-off-by: Luo Haiyang
Tested-by: Lu Zhongjun
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c11d59bea0ea..e82806a48ef3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11073,7 +11073,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
 static inline bool vruntime_normalized(struct task_struct *p)
 {
-	struct sched_entity *se = &p->se;
+	unsigned long nr_switches = p->nvcsw + p->nivcsw;
 
 	/*
 	 * In both the TASK_ON_RQ_QUEUED and TASK_ON_RQ_MIGRATING cases,
@@ -11092,7 +11092,7 @@ static inline bool vruntime_normalized(struct task_struct *p)
 	 * - A task which has been woken up by try_to_wake_up() and
 	 * waiting for actually being woken up by sched_ttwu_pending().
 	 */
-	if (!se->sum_exec_runtime ||
+	if (!nr_switches ||
 	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
 		return true;
-- 
2.25.1