[v4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING

[PATCH v4 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING

Posted by Yafang Shao 3 weeks, 2 days ago

After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
container environment, we encountered several user-visible behavioral
changes:

- Interrupted IRQ/softirq time is not accounted for in the cpuacct cgroup

  This breaks userspace applications that rely on CPU usage data from
  cgroups to monitor CPU pressure. This patchset resolves the issue by
  ensuring that IRQ/softirq time is accounted for in the cgroup of the
  interrupted tasks.

- getrusage(2) does not include time interrupted by IRQ/softirq

  Some services use getrusage(2) to check if workloads are experiencing CPU
  pressure. Since IRQ/softirq time is no longer charged to task runtime,
  getrusage(2) can no longer reflect the CPU pressure caused by heavy
  interrupts.

This patchset addresses the first issue, which is relatively
straightforward. However, the second issue remains unresolved, as there
might be debate over whether interrupted time should be considered part of
a task’s usage. Nonetheless, it is important to report interrupted time to
the user via some metric, though that is a separate discussion.
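
For readers unfamiliar with the getrusage(2)-based check mentioned above, the
following is a minimal sketch of what such a monitoring service might do. It
is not part of this patchset. The helper names self_cpu_seconds() and
cpu_util() are hypothetical and chosen only for illustration. With
CONFIG_IRQ_TIME_ACCOUNTING enabled, the value computed here no longer
includes the time the task was interrupted by IRQ/softirq, as described
above.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to seconds. */
static double tv_to_sec(struct timeval tv)
{
	return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/*
 * Return the user + system CPU time charged to the calling process,
 * in seconds, or -1.0 if getrusage(2) fails.
 */
static double self_cpu_seconds(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1.0;
	return tv_to_sec(ru.ru_utime) + tv_to_sec(ru.ru_stime);
}

/*
 * Hypothetical helper: CPU utilization over a wall-clock interval,
 * given two samples of self_cpu_seconds().
 */
static double cpu_util(double cpu_before, double cpu_after, double wall_sec)
{
	assert(wall_sec > 0.0);
	return (cpu_after - cpu_before) / wall_sec;
}
```

A service would sample self_cpu_seconds() periodically and feed two
consecutive samples, together with the elapsed wall-clock time, to
cpu_util().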

Changes:
v3->v4:
- Rebase

v2->v3:
- Add a helper account_irqtime() to avoid redundant code (Johannes)

v1->v2: https://lore.kernel.org/cgroups/20241008061951.3980-1-laoar.shao@gmail.com/
- Fix lockdep issues reported by kernel test robot <oliver.sang@intel.com>

v1: https://lore.kernel.org/all/20240923090028.16368-1-laoar.shao@gmail.com/

Yafang Shao (4):
  sched: Define sched_clock_irqtime as static key
  sched: Don't account irq time if sched_clock_irqtime is disabled
  sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING

 kernel/sched/core.c    | 77 +++++++++++++++++++++++++++++-------------
 kernel/sched/cputime.c | 16 ++++-----
 kernel/sched/psi.c     | 11 ++----
 kernel/sched/sched.h   |  1 +
 kernel/sched/stats.h   |  7 ++--
 5 files changed, 68 insertions(+), 44 deletions(-)

-- 
2.43.5

Re: [PATCH v4 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING

Posted by Peter Zijlstra 3 weeks, 2 days ago

On Fri, Nov 01, 2024 at 11:17:46AM +0800, Yafang Shao wrote:
> After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
> container environment, we encountered several user-visible behavioral
> changes:
> 
> - Interrupted IRQ/softirq time is not accounted for in the cpuacct cgroup
> 
>   This breaks userspace applications that rely on CPU usage data from
>   cgroups to monitor CPU pressure. This patchset resolves the issue by
>   ensuring that IRQ/softirq time is accounted for in the cgroup of the
>   interrupted tasks.
> 
> - getrusage(2) does not include time interrupted by IRQ/softirq
> 
>   Some services use getrusage(2) to check if workloads are experiencing CPU
>   pressure. Since IRQ/softirq time is no longer charged to task runtime,
>   getrusage(2) can no longer reflect the CPU pressure caused by heavy
>   interrupts.
> 
> This patchset addresses the first issue, which is relatively
> straightforward. 

So I don't think it is. I think they're both the same issue. You cannot
know for whom the work done by the (soft) interrupt is.

For instance, if you were to create 2 cgroups, and have one cgroup do a
while(1) loop, while you'd have that other cgroup do your netperf
workload, I suspect you'll see significant (soft)irq load on the
while(1) cgroup, even though it's guaranteed to not be from it.

Same with rusage -- rusage is fully task centric, and the work done by
(soft) irqs is not necessarily related to the task they interrupt.

So while you're trying to make the world conform to your legacy
monitoring view, perhaps you should fix your view of things.

Re: [PATCH v4 0/4] sched: Fix irq accounting for CONFIG_IRQ_TIME_ACCOUNTING

Posted by Yafang Shao 3 weeks, 2 days ago

On Fri, Nov 1, 2024 at 6:54 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 01, 2024 at 11:17:46AM +0800, Yafang Shao wrote:
> > After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
> > container environment, we encountered several user-visible behavioral
> > changes:
> >
> > - Interrupted IRQ/softirq time is not accounted for in the cpuacct cgroup
> >
> >   This breaks userspace applications that rely on CPU usage data from
> >   cgroups to monitor CPU pressure. This patchset resolves the issue by
> >   ensuring that IRQ/softirq time is accounted for in the cgroup of the
> >   interrupted tasks.
> >
> > - getrusage(2) does not include time interrupted by IRQ/softirq
> >
> >   Some services use getrusage(2) to check if workloads are experiencing CPU
> >   pressure. Since IRQ/softirq time is no longer charged to task runtime,
> >   getrusage(2) can no longer reflect the CPU pressure caused by heavy
> >   interrupts.
> >
> > This patchset addresses the first issue, which is relatively
> > straightforward.
>
> So I don't think it is. I think they're both the same issue. You cannot
> know for whom the work done by the (soft) interrupt is.
>
> For instance, if you were to create 2 cgroups, and have one cgroup do a
> while(1) loop, while you'd have that other cgroup do your netperf
> workload, I suspect you'll see significant (soft)irq load on the
> while(1) cgroup, even though it's guaranteed to not be from it.
>
> Same with rusage -- rusage is fully task centric, and the work done by
> (soft) irqs is not necessarily related to the task they interrupt.
>
> So while you're trying to make the world conform to your legacy
> monitoring view, perhaps you should fix your view of things.

The issue here can't simply be addressed by adjusting my view of
things. Enabling CONFIG_IRQ_TIME_ACCOUNTING results in the CPU
utilization metric excluding the time spent in IRQs. This means we
lose visibility into how long the CPU was actually interrupted in
comparison to its total utilization. Currently, the only ways to
monitor interrupt time are through IRQ PSI or the IRQ time recorded in
delay accounting. However, these metrics are independent of CPU
utilization, which makes it difficult to combine them into a single,
unified measure.

CPU utilization is a critical metric for almost all workloads, and
it's problematic if it fails to reflect the full extent of system
pressure. This situation is similar to iowait: when a task is in
iowait, it could be due to other tasks performing I/O. It doesn’t
matter if the I/O is being done by one of your tasks or by someone
else's; what matters is that your task is stalled and waiting on I/O.
Similarly, a comprehensive CPU utilization metric should reflect all
sources of pressure, including IRQ time, to provide a more accurate
representation of workload behavior.
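
To illustrate the arithmetic only, here is a minimal sketch that computes
utilization with and without IRQ time. It works on the system-wide "cpu"
line of /proc/stat (fields as documented in proc(5)), which is an assumption
made for the example: the discussion above concerns per-task and per-cgroup
accounting, and this is not what the patchset implements. All function and
struct names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* First eight fields of the aggregate "cpu" line in /proc/stat. */
struct cpu_times {
	unsigned long long user, nice, system, idle;
	unsigned long long iowait, irq, softirq, steal;
};

/*
 * Parse the aggregate "cpu" line. Pass only that line: per-CPU lines
 * such as "cpu0 ..." are not handled. Returns 0 on success, -1 otherwise.
 */
static int parse_cpu_line(const char *line, struct cpu_times *t)
{
	int n;

	assert(line && t);
	n = sscanf(line, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &t->user, &t->nice, &t->system, &t->idle,
		   &t->iowait, &t->irq, &t->softirq, &t->steal);
	return n == 8 ? 0 : -1;
}

/* Sum of all eight fields. */
static unsigned long long total_time(const struct cpu_times *t)
{
	return t->user + t->nice + t->system + t->idle +
	       t->iowait + t->irq + t->softirq + t->steal;
}

/* Busy fraction counting irq + softirq as utilization. */
static double busy_incl_irq(const struct cpu_times *t)
{
	unsigned long long total = total_time(t);

	if (total == 0)
		return 0.0;
	return (double)(total - t->idle - t->iowait) / (double)total;
}

/* Busy fraction with irq + softirq left out of utilization. */
static double busy_excl_irq(const struct cpu_times *t)
{
	unsigned long long total = total_time(t);

	if (total == 0)
		return 0.0;
	return (double)(total - t->idle - t->iowait - t->irq - t->softirq) /
	       (double)total;
}
```

The values in /proc/stat are cumulative since boot, so for a live
measurement one would read the line twice, store the field-by-field
differences in a struct cpu_times, and pass that to the two functions. The
gap between the two results is the share of time that a utilization metric
without IRQ time does not show.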

-- 
Regards
Yafang