From: luohaiyang10243395
Subject: [PATCH 5.10.y] sched/fair: Fix task starvation caused by incorrect vruntime_normalized check
Date: Wed, 11 Mar 2026 17:47:11 +0800 (CST)
Message-ID: <20260311174711163Nk6ITx4M_Jno8mdC7-iYz@zte.com.cn>
X-Mailing-List: linux-kernel@vger.kernel.org

When a previous update_rq_clock() happened inside a {soft,}irq region, we
stop advancing ->clock_task and only update the prev_irq_time stamp. This
means clock_task can remain unchanged for a period of time. As a result,
we may observe that a task has actually been running but its
sum_exec_runtime == 0, as confirmed by ftrace:

<...>-3615452 [040] d... 1081329.791592: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9387a clock=0x3d776b17204b0 prev_irq_time=0x45480b8cc36
<...>-3615452 [040] ..s. 1081329.791596: softirq_entry: vec=3 [action=NET_RX]
<...>-3615452 [040] d.s. 1081329.791619: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b943a2 clock=0x3d776b1720fd8 prev_irq_time=0x45480b8cc36
<...>-3615452 [040] ..s. 1081329.791631: softirq_exit: vec=3 [action=NET_RX]
<...>-3615452 [040] ..s. 1081329.791631: softirq_entry: vec=6 [action=TASKLET]
<...>-3615452 [040] ..s. 1081329.791632: softirq_exit: vec=6 [action=TASKLET]
<...>-3615452 [040] d... 1081329.791633: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b17278ec prev_irq_time=0x45480b8d065
<...>-3615452 [040] d... 1081329.791637: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b172b0cf prev_irq_time=0x45480b90848
<...>-3615452 [040] d... 1081329.791639: sched_switch: prev_comm=futex prev_pid=3615452 prev_prio=120 prev_state=S ==> next_comm=futex next_pid=3615454 next_prio=120
<...>-3615454 [040] d... 1081329.791643: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b172c1fd prev_irq_time=0x45480b9197
<...>-3615454 [040] d... 1081329.791644: sched_switch: prev_comm=futex prev_pid=3615454 prev_prio=120 prev_state=S ==> next_comm=sched_yield next_pid=2439180 next_prio=12
sched_yield-2439180 [040] d... 1081329.791645: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b172d7b3 prev_irq_time=0x45480b92f2c

Note how clock_task stays at 0x3d32230b9a887 across several updates while
clock keeps advancing: the whole delta was attributed to irq time.

In our production environment, we have two tasks bound to one CPU:

nginxA:
int main()
{
	pthread_mutex_lock(&mutex);
	do_something();
	pthread_mutex_unlock(&mutex);
}

nginxB:
int main()
{
	while (nginxA not exit)
		sched_yield();
}

nginxA starved due to the following sequence of events:

1. Another task forks two tasks, nginxA and nginxB. Because the system
   has been running for a long time, their vruntime is a very large
   value.
2. nginxA immediately goes to sleep due to lock contention, but its
   sum_exec_runtime == 0 (its runtime was swallowed by irq time
   accounting as shown above).
3. nginxA and nginxB attach to a new cgroup. nginxA's vruntime is not
   adjusted to the new cfs_rq, because vruntime_normalized() sees
   sum_exec_runtime == 0 and wrongly concludes the task never ran.
4. The new cfs_rq's min_vruntime overflows quickly as nginxB continues
   running. When nginxA is woken up, since nginxA's vruntime >> nginxB's
   vruntime, nginxA cannot get the CPU for a long period of time.
Signed-off-by: Luo Haiyang
Tested-by: Lu Zhongjun
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c11d59bea0ea..e82806a48ef3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11073,7 +11073,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
 static inline bool vruntime_normalized(struct task_struct *p)
 {
-	struct sched_entity *se = &p->se;
+	unsigned long nr_switches = p->nvcsw + p->nivcsw;
 
 	/*
 	 * In both the TASK_ON_RQ_QUEUED and TASK_ON_RQ_MIGRATING cases,
@@ -11092,7 +11092,7 @@ static inline bool vruntime_normalized(struct task_struct *p)
 	 * - A task which has been woken up by try_to_wake_up() and
 	 * waiting for actually being woken up by sched_ttwu_pending().
 	 */
-	if (!se->sum_exec_runtime ||
+	if (!nr_switches ||
 	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
 		return true;
-- 
2.25.1