[v5] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled

[PATCH v5 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING

Posted by Yafang Shao 1 year, 3 months ago

After enabling CONFIG_IRQ_TIME_ACCOUNTING to monitor IRQ pressure in our
container environment, we observed several noticeable behavioral changes.

One of our IRQ-heavy services, such as Redis, reported a significant
reduction in CPU usage after upgrading to the new kernel with
CONFIG_IRQ_TIME_ACCOUNTING enabled. However, despite adding more threads
to handle an increased workload, the CPU usage could not be raised. In
other words, even though the container’s CPU usage appeared low, it was
unable to process more workloads to utilize additional CPU resources, which
caused issues.

This behavior can be demonstrated using netperf:

  function start_server() {
      for j in `seq 1 3`; do
          netserver -p $[12345+j] > /dev/null &
      done
  }

  server_ip=$1
  function start_client() {
    # That applies to cgroup2 as well.
    mkdir -p /sys/fs/cgroup/cpuacct/test
    echo $$ > /sys/fs/cgroup/cpuacct/test/cgroup.procs
    for j in `seq 1 3`; do
        port=$[12345+j]
        taskset -c 0 netperf -H ${server_ip} -l ${run_time:-30000}   \
                -t TCP_STREAM -p $port -- -D -m 1k -M 1K -s 8k -S 8k \
                > /dev/null &
    done
  }

  start_server
  start_client

We can verify the CPU usage of the test cgroup using cpuacct.stat. The
output shows:

  system: 53
  user: 2

The CPU usage of the cgroup is relatively low at around 55%, but this usage
doesn't increase, even with more netperf tasks. The reason is that CPU0 is
at 100% utilization, as confirmed by mpstat:

  02:56:22 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
  02:56:23 PM    0    0.99    0.00   55.45    0.00    0.99   42.57    0.00    0.00    0.00    0.00

  02:56:23 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
  02:56:24 PM    0    2.00    0.00   55.00    0.00    0.00   43.00    0.00    0.00    0.00    0.00

It is clear that the %soft is excluded in the cgroup of the interrupted
task. This behavior is unexpected. We should include IRQ time in the
cgroup to reflect the pressure the group is under.

After a thorough analysis, I discovered that this change in behavior is due
to commit 305e6835e055 ("sched: Do not account irq time to current task"),
which altered whether IRQ time should be charged to the interrupted task.
While I agree that a task should not be penalized by random interrupts, the
task itself cannot progress while interrupted. Therefore, the interrupted
time should be reported to the user.

The system metric in cpuacct.stat is crucial in indicating whether a
container is under heavy system pressure, including IRQ/softirq activity.
Hence, IRQ/softirq time should be included in the cpuacct system usage,
which also applies to cgroup2’s rstat.

The reason it doesn't just add the cgroup_account_*() to
irqtime_account_irq() is that it might result in performance hit to hold
the rq_lock in the critical path. Taking inspiration from
commit ddae0ca2a8fe ("sched: Move psi_account_irqtime() out of
update_rq_clock_task() hotpath"), I've now adapted the approach to handle
it in a non-critical path, reducing the performance impact.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/core.c  | 33 +++++++++++++++++++++++++++++++--
 kernel/sched/psi.c   | 13 +++----------
 kernel/sched/sched.h |  2 +-
 kernel/sched/stats.h |  7 ++++---
 4 files changed, 39 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a75dad9be4b9..61545db8ca4b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5579,6 +5579,35 @@ __setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
 static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
 #endif /* CONFIG_SCHED_DEBUG */
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static void account_irqtime(struct rq *rq, struct task_struct *curr,
+			    struct task_struct *prev)
+{
+	int cpu = smp_processor_id();
+	s64 delta;
+	u64 irq;
+
+	if (!irqtime_enabled())
+		return;
+
+	irq = irq_time_read(cpu);
+	delta = (s64)(irq - rq->irq_time);
+	if (delta < 0)
+		return;
+
+	rq->irq_time = irq;
+	psi_account_irqtime(rq, curr, prev, delta);
+	cgroup_account_cputime(curr, delta);
+	/* We account both softirq and irq into CPUTIME_IRQ */
+	cgroup_account_cputime_field(curr, CPUTIME_IRQ, delta);
+}
+#else
+static inline void account_irqtime(struct rq *rq, struct task_struct *curr,
+				   struct task_struct *prev)
+{
+}
+#endif
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -5600,7 +5629,7 @@ void sched_tick(void)
 	rq_lock(rq, &rf);
 
 	curr = rq->curr;
-	psi_account_irqtime(rq, curr, NULL);
+	account_irqtime(rq, curr, NULL);
 
 	update_rq_clock(rq);
 	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
@@ -6683,7 +6712,7 @@ static void __sched notrace __schedule(int sched_mode)
 		++*switch_count;
 
 		migrate_disable_switch(rq, prev);
-		psi_account_irqtime(rq, prev, next);
+		account_irqtime(rq, prev, next);
 		psi_sched_switch(prev, next, block);
 
 		trace_sched_switch(preempt, prev, next, prev_state);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 210aec717dc7..1adb41b2ae1d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -990,15 +990,14 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev)
+void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev,
+			 s64 delta)
 {
 	int cpu = task_cpu(curr);
 	struct psi_group *group;
 	struct psi_group_cpu *groupc;
-	s64 delta;
-	u64 irq;
 
-	if (static_branch_likely(&psi_disabled) || !irqtime_enabled())
+	if (static_branch_likely(&psi_disabled))
 		return;
 
 	if (!curr->pid)
@@ -1009,12 +1008,6 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 	if (prev && task_psi_group(prev) == group)
 		return;
 
-	irq = irq_time_read(cpu);
-	delta = (s64)(irq - rq->psi_irq_time);
-	if (delta < 0)
-		return;
-	rq->psi_irq_time = irq;
-
 	do {
 		u64 now;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0c83ab35256e..690fc6f9d97c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1223,7 +1223,7 @@ struct rq {
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	u64			prev_irq_time;
-	u64			psi_irq_time;
+	u64			irq_time;
 #endif
 #ifdef CONFIG_PARAVIRT
 	u64			prev_steal_time;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 767e098a3bd1..17eefe5876a5 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -111,10 +111,11 @@ void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev);
+void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
+			 struct task_struct *prev, s64 delta);
 #else
 static inline void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
-				       struct task_struct *prev) {}
+				       struct task_struct *prev, s64 delta) {}
 #endif /*CONFIG_IRQ_TIME_ACCOUNTING */
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
@@ -215,7 +216,7 @@ static inline void psi_sched_switch(struct task_struct *prev,
 				    struct task_struct *next,
 				    bool sleep) {}
 static inline void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
-				       struct task_struct *prev) {}
+				       struct task_struct *prev, s64 delta) {}
 #endif /* CONFIG_PSI */
 
 #ifdef CONFIG_SCHED_INFO
-- 
2.43.5

Re: [PATCH v5 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING

Posted by Michal Koutný 1 year, 2 months ago

On Fri, Nov 08, 2024 at 09:29:04PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> The system metric in cpuacct.stat is crucial in indicating whether a
> container is under heavy system pressure, including IRQ/softirq activity.
> Hence, IRQ/softirq time should be included in the cpuacct system usage,
> which also applies to cgroup2’s rstat.

(snipped from cover letter thread)

On Mon, Nov 18, 2024 at 08:12:03PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> The key issue here is determining how to reliably get the IRQ. I don't
> believe there is a dependable way to achieve this.
> 
> For example, consider a server with 16 CPUs. My cgroup contains 4
> threads that can freely migrate across CPUs, while other tasks are
> also running on the system simultaneously. In this scenario, how can
> we accurately determine the IRQ to subtract?

I understand there's some IRQ noise which is a property of CPU (noise is
a function of cpu).

Then there're cgroup workloads, on a single CPU impact per-cgroup
depends how much that given cgroup runs on the CPU (it's more exposed).
Whole cgroup's impact is sum of these (i.e. it's kind of scalar product
between that IRQ noise per-cpu and cgroup's CPU consuption per-cpu).

(In your usage, there's some correlation of IRQ noise with CPU
consumption).

> That is precisely what the user wants. If my tasks are frequently
> interrupted by IRQs, it indicates that my service may be experiencing
> poor quality. In response, I would likely reduce the traffic sent to
> it. If the issue persists and IRQ interruptions remain high, I would
> then consider migrating the service to other servers.

If I look at
	52b1364ba0b10 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure")
(where it's clearer than after
	ddae0ca2a8fe1 ("sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath")
)

the IRQ pressure accounting takes into account the task (its cgroup) vs
IRQ time (~multiplication) and then its summed a) over time
(psi_account_irqtime()), b) over CPUs (collect_percpu_times()) so it's
the scalar product (squinting) I mentioned above.

Therefore I think irq.pressure provides exactly the information that's
relevant for your scheduling decisions and the info cannot be fit into
cpuacct.stat.

Or what is irq.pressure missing or distorting for your scenario?

HTH,
Michal

Re: [PATCH v5 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING

Posted by Yafang Shao 1 year, 2 months ago

On Tue, Dec 3, 2024 at 6:01 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Nov 08, 2024 at 09:29:04PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > The system metric in cpuacct.stat is crucial in indicating whether a
> > container is under heavy system pressure, including IRQ/softirq activity.
> > Hence, IRQ/softirq time should be included in the cpuacct system usage,
> > which also applies to cgroup2’s rstat.
>
> (snipped from cover letter thread)
>
> On Mon, Nov 18, 2024 at 08:12:03PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > The key issue here is determining how to reliably get the IRQ. I don't
> > believe there is a dependable way to achieve this.
> >
> > For example, consider a server with 16 CPUs. My cgroup contains 4
> > threads that can freely migrate across CPUs, while other tasks are
> > also running on the system simultaneously. In this scenario, how can
> > we accurately determine the IRQ to subtract?
>
> I understand there's some IRQ noise which is a property of CPU (noise is
> a function of cpu).
>
> Then there're cgroup workloads, on a single CPU impact per-cgroup
> depends how much that given cgroup runs on the CPU (it's more exposed).
> Whole cgroup's impact is sum of these (i.e. it's kind of scalar product
> between that IRQ noise per-cpu and cgroup's CPU consuption per-cpu).
>
> (In your usage, there's some correlation of IRQ noise with CPU
> consumption).
>
> > That is precisely what the user wants. If my tasks are frequently
> > interrupted by IRQs, it indicates that my service may be experiencing
> > poor quality. In response, I would likely reduce the traffic sent to
> > it. If the issue persists and IRQ interruptions remain high, I would
> > then consider migrating the service to other servers.
>
> If I look at
>         52b1364ba0b10 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure")
> (where it's clearer than after
>         ddae0ca2a8fe1 ("sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath")
> )
>
> the IRQ pressure accounting takes into account the task (its cgroup) vs
> IRQ time (~multiplication) and then its summed a) over time
> (psi_account_irqtime()), b) over CPUs (collect_percpu_times()) so it's
> the scalar product (squinting) I mentioned above.
>
> Therefore I think irq.pressure provides exactly the information that's
> relevant for your scheduling decisions and the info cannot be fit into
> cpuacct.stat.
>
> Or what is irq.pressure missing or distorting for your scenario?

irq.pressure is a metric that operates independently of CPU
utilization. It can be used to monitor whether latency spikes in a
cgroup are caused by high IRQ pressure—this is exactly how we utilize
it in our production environment.

However, this metric alone doesn’t provide clear guidance on whether
to add or reduce CPUs for a workload. To address this, we’ve attempted
to combine irq.pressure with CPU utilization into a unified metric,
but achieving this has proven to be challenging.

--
Regards
Yafang

[PATCH v5 1/4] sched: Define sched_clock_irqtime as static key
[PATCH v5 2/4] sched: Don't account irq time if sched_clock_irqtime is disabled
[PATCH v5 3/4] sched, psi: Don't account irq time if sched_clock_irqtime is disabled
[PATCH v5 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING