[PATCH 5.10.y] sched/fair: Fix task starvation caused by incorrect vruntime_normalized check

hu.shengming@zte.com.cn posted 1 patch 3 weeks, 6 days ago
From: luohaiyang10243395 <luo.haiyang@zte.com.cn>

When a previous update_rq_clock() happened inside a {soft,}irq
region, we stop ->clock_task and only update the prev_irq_time
stamp. This means clock_task remains unchanged for a period of
time. As a result, we may observe a task that has actually been
running but whose sum_exec_runtime == 0, as confirmed by ftrace:

  <...>-3615452 [040] d... 1081329.791592: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9387a clock=0x3d776b17204b0 prev_irq_time=0x45480b8cc36
  <...>-3615452 [040] ..s. 1081329.791596: softirq_entry: vec=3 [action=NET_RX]
  <...>-3615452 [040] d.s. 1081329.791619: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b943a2 clock=0x3d776b1720fd8 prev_irq_time=0x45480b8cc36
  <...>-3615452 [040] ..s. 1081329.791631: softirq_exit: vec=3 [action=NET_RX]
  <...>-3615452 [040] ..s. 1081329.791631: softirq_entry: vec=6 [action=TASKLET]
  <...>-3615452 [040] ..s. 1081329.791632: softirq_exit: vec=6 [action=TASKLET]
  <...>-3615452 [040] d... 1081329.791633: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b17278ec prev_irq_time=0x45480b8d065
  <...>-3615452 [040] d... 1081329.791637: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b172b0cf prev_irq_time=0x45480b90848
  <...>-3615452 [040] d... 1081329.791639: sched_switch: prev_comm=futex prev_pid=3615452 prev_prio=120 prev_state=S ==> next_comm=futex next_pid=3615454 next_prio=120
  <...>-3615454 [040] d... 1081329.791643: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b172c1fd prev_irq_time=0x45480b9197
  <...>-3615454 [040] d... 1081329.791644: sched_switch: prev_comm=futex prev_pid=3615454 prev_prio=120 prev_state=S ==> next_comm=sched_yield next_pid=2439180 next_prio=12
  sched_yield-2439180 [040] d... 1081329.791645: update_rq_clock: (update_rq_clock+0x0/0x180) clock_task=0x3d32230b9a887 clock=0x3d776b172d7b3 prev_irq_time=0x45480b92f2c

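The freeze can be modeled in a few lines. Below is a hypothetical,
simplified sketch (illustrative struct and function names, not the real
struct rq or update_rq_clock_task()) of how time spent in {soft,}irq
context is carved out of clock_task, so a window spent entirely in irq
context leaves clock_task, and hence sum_exec_runtime, unchanged:

```c
#include <assert.h>

/* Hypothetical, simplified model of the kernel's IRQ time accounting.
 * Field and function names are illustrative only. */
struct rq_model {
	unsigned long long clock;         /* raw rq clock, ns */
	unsigned long long clock_task;    /* clock seen by task accounting */
	unsigned long long prev_irq_time; /* irq time stamp at last update */
};

static void update_rq_clock_model(struct rq_model *rq,
				  unsigned long long delta,
				  unsigned long long irq_time)
{
	/* Portion of this window that was spent in {soft,}irq context. */
	unsigned long long irq_delta = irq_time - rq->prev_irq_time;

	if (irq_delta > delta)
		irq_delta = delta;

	rq->clock += delta;
	rq->prev_irq_time = irq_time;
	/* irq time does not advance clock_task; if the whole window was
	 * irq time, clock_task stays frozen. */
	rq->clock_task += delta - irq_delta;
}
```

In the trace above this corresponds to the samples where clock keeps
advancing while clock_task stays pinned at 0x3d32230b9a887: the whole
window was accounted as irq time, so any task that ran and switched out
inside it accrues sum_exec_runtime == 0.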
In our production environment, we have two tasks bound to one CPU:
  nginxA:
     int main()
     {
             pthread_mutex_lock(&mutex);
             do_something();
             pthread_mutex_unlock(&mutex);
     }

  nginxB:
     int main()
     {
             while (nginxA has not exited)
                     sched_yield();
     }

nginxA starved due to the following sequence of events:
  1. Another task forks two tasks: nginxA and nginxB. Since the system has been
     running for a long time, their inherited vruntime is a very large value.
  2. nginxA immediately goes to sleep due to lock contention, but its
     sum_exec_runtime == 0 because it ran only while clock_task was frozen.
  3. nginxA and nginxB are attached to a new cgroup, but nginxA's vruntime is
     not adjusted to the new cfs_rq because vruntime_normalized() sees
     sum_exec_runtime == 0.
  4. The new cfs_rq's min_vruntime advances quickly as nginxB keeps running.
     When nginxA is woken up by the scheduler, nginxA's vruntime >> nginxB's
     vruntime, so nginxA cannot get the CPU for a long period of time.

Signed-off-by: Luo Haiyang <luo.haiyang@zte.com.cn>
Tested-by: Lu Zhongjun <lu.zhongjun@zte.com.cn>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c11d59bea0ea..e82806a48ef3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11073,7 +11073,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)

 static inline bool vruntime_normalized(struct task_struct *p)
 {
-	struct sched_entity *se = &p->se;
+	unsigned long nr_switches = p->nvcsw + p->nivcsw;

 	/*
 	 * In both the TASK_ON_RQ_QUEUED and TASK_ON_RQ_MIGRATING cases,
@@ -11092,7 +11092,7 @@ static inline bool vruntime_normalized(struct task_struct *p)
 	 * - A task which has been woken up by try_to_wake_up() and
 	 *   waiting for actually being woken up by sched_ttwu_pending().
 	 */
-	if (!se->sum_exec_runtime ||
+	if (!nr_switches ||
 	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
 		return true;

-- 
2.25.1