[PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled

Posted by Yafang Shao 2 weeks, 1 day ago
After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
container environment, we encountered several user-visible behavioral
changes:

- Interrupted IRQ/softirq time is excluded from the cpuacct cgroup

  This breaks userspace applications that rely on CPU usage data from
  cgroups to monitor CPU pressure. This patchset resolves the issue by
  ensuring that IRQ/softirq time is included in the cgroup of the
  interrupted tasks.

- getrusage(2) does not include time interrupted by IRQ/softirq

  Some services use getrusage(2) to check if workloads are experiencing CPU
  pressure. Since IRQ/softirq time is no longer included in task runtime,
  getrusage(2) can no longer reflect the CPU pressure caused by heavy
  interrupts.

This patchset addresses the first issue, which is relatively
straightforward. Once this solution is accepted, I will address the second
issue in a follow-up patchset.

Enabling CONFIG_IRQ_TIME_ACCOUNTING results in the CPU
utilization metric excluding the time spent in IRQs. This means we
lose visibility into how long the CPU was actually interrupted in
comparison to its total utilization. Currently, the only ways to
monitor interrupt time are through IRQ PSI or the IRQ time recorded in
delay accounting. However, these metrics are independent of CPU
utilization, which makes it difficult to combine them into a single,
unified measure.

CPU utilization is a critical metric for almost all workloads, and
it's problematic if it fails to reflect the full extent of system
pressure. This situation is similar to iowait: when a task is in
iowait, it could be due to other tasks performing I/O. It doesn’t
matter if the I/O is being done by one of your tasks or by someone
else's; what matters is that your task is stalled and waiting on I/O.
Similarly, a comprehensive CPU utilization metric should reflect all
sources of pressure, including IRQ time, to provide a more accurate
representation of workload behavior.

One of the applications impacted by this issue is our Redis load-balancing
service. The setup operates as follows:

                   ----------------
                   | Load Balancer|
                   ----------------
                /    |      |        \
               /     |      |         \ 
          Server1 Server2 Server3 ... ServerN

Although the load balancer's algorithm is complex, it follows some core
principles:

- When server CPU utilization increases, it adds more servers and deploys
  additional instances to meet SLA requirements.
- When server CPU utilization decreases, it scales down by decommissioning
  servers and reducing the number of instances to save on costs.

The load balancer is malfunctioning due to the exclusion of IRQ time from
CPU utilization calculations.

Changes:
v4->v5:
- Don't use static key in the IRQ_TIME_ACCOUNTING=n case (Peter)
- Rename psi_irq_time to irq_time (Peter)
- Use CPUTIME_IRQ instead of CPUTIME_SOFTIRQ (Peter)

v3->v4: https://lore.kernel.org/all/20241101031750.1471-1-laoar.shao@gmail.com/
- Rebase

v2->v3:
- Add a helper account_irqtime() to avoid redundant code (Johannes)

v1->v2: https://lore.kernel.org/cgroups/20241008061951.3980-1-laoar.shao@gmail.com/
- Fix lockdep issues reported by kernel test robot <oliver.sang@intel.com>

v1: https://lore.kernel.org/all/20240923090028.16368-1-laoar.shao@gmail.com/



Yafang Shao (4):
  sched: Define sched_clock_irqtime as static key
  sched: Don't account irq time if sched_clock_irqtime is disabled
  sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING

 kernel/sched/core.c    | 77 +++++++++++++++++++++++++++++-------------
 kernel/sched/cputime.c | 16 ++++-----
 kernel/sched/psi.c     | 11 ++----
 kernel/sched/sched.h   | 15 +++++++-
 kernel/sched/stats.h   |  7 ++--
 5 files changed, 81 insertions(+), 45 deletions(-)

-- 
2.43.5

Re: [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled
Posted by Michal Koutný 1 week, 1 day ago
Hello Yafang.

On Fri, Nov 08, 2024 at 09:29:00PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
> container environment, we encountered several user-visible behavioral
> changes:
> 
> - Interrupted IRQ/softirq time is excluded in the cpuacct cgroup
> 
>   This breaks userspace applications that rely on CPU usage data from
>   cgroups to monitor CPU pressure. This patchset resolves the issue by
>   ensuring that IRQ/softirq time is included in the cgroup of the
>   interrupted tasks.
> 
> - getrusage(2) does not include time interrupted by IRQ/softirq
> 
>   Some services use getrusage(2) to check if workloads are experiencing CPU
>   pressure. Since IRQ/softirq time is no longer included in task runtime,
>   getrusage(2) can no longer reflect the CPU pressure caused by heavy
>   interrupts.
 
I understand that IRQ/softirq time is difficult to attribute to an
"accountable" entity and it's technically simplest to attribute it to
everyone/no one, i.e. to the root cgroup (or through a global stat
without cgroups).

> This patchset addresses the first issue, which is relatively
> straightforward. Once this solution is accepted, I will address the second
> issue in a follow-up patchset.

Is the first issue about cpuacct data or irq.pressure?

It sounds kind of both, and I noticed the docs for irq.pressure are
lacking in Documentation/accounting/psi.rst. When you're touching this,
could you please add a paragraph or sentence explaining what this
value represents?

(Also, the same change applies to both cpuacct and
cgroup_base_stat_cputime_show(), right?)

>                    ----------------
>                    | Load Balancer|
>                    ----------------
>                 /    |      |        \
>                /     |      |         \ 
>           Server1 Server2 Server3 ... ServerN
> 
> Although the load balancer's algorithm is complex, it follows some core
> principles:
> 
> - When server CPU utilization increases, it adds more servers and deploys
>   additional instances to meet SLA requirements.
> - When server CPU utilization decreases, it scales down by decommissioning
>   servers and reducing the number of instances to save on costs.

Does a server here refer to a whole node (whole kernel) or to a cgroup
(i.e. more servers on top of one kernel)?

> The load balancer is malfunctioning due to the exclusion of IRQ time from
> CPU utilization calculations.

Could this be fixed by subtracting (global) IRQ time from (presumed
total) system capacity that the balancer uses for its decisions? (i.e.
without exact per-cgroup breakdown of IRQ time)

Thanks,
Michal
Re: [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled
Posted by Yafang Shao 6 days, 18 hours ago
On Fri, Nov 15, 2024 at 9:41 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello Yafang.
>
> On Fri, Nov 08, 2024 at 09:29:00PM GMT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
> > container environment, we encountered several user-visible behavioral
> > changes:
> >
> > - Interrupted IRQ/softirq time is excluded in the cpuacct cgroup
> >
> >   This breaks userspace applications that rely on CPU usage data from
> >   cgroups to monitor CPU pressure. This patchset resolves the issue by
> >   ensuring that IRQ/softirq time is included in the cgroup of the
> >   interrupted tasks.
> >
> > - getrusage(2) does not include time interrupted by IRQ/softirq
> >
> >   Some services use getrusage(2) to check if workloads are experiencing CPU
> >   pressure. Since IRQ/softirq time is no longer included in task runtime,
> >   getrusage(2) can no longer reflect the CPU pressure caused by heavy
> >   interrupts.
>
> I understand that IRQ/softirq time is difficult to attribute to an
> "accountable" entity and it's technically simplest to attribute it
> everyone/noone, i.e. to root cgroup (or through a global stat w/out
> cgroups).

This issue is not about deciding which IRQ/softIRQ events should be
accounted for. Instead, it focuses on reflecting the interrupted
runtime of a task or a cgroup. I might be misunderstanding the
distinction between *charge* and *account*—or perhaps there is no
difference between them—but PATCH #4 captures exactly what I mean.
While IRQ/softIRQ time should not be attributed to the interrupted
task, it is crucial to have a metric that reflects this interrupted
runtime in CPU utilization.

The purpose of this patchset is to address this issue, conceptually
represented as:

   |<----Runtime---->|<----Interrupted time---->|<----Runtime---->|<---Sleep-->|

Without reflecting the *interrupted time* in CPU utilization, a gap—or
hole—is created:

    |<----Runtime---->|<----HOLE---->|<----Runtime---->|<---Sleep-->|

This gap will misleadingly appear as sleep time to the user:

  |<----Runtime---->|<----Sleep---->|<----Runtime---->|<---Sleep-->|

As a result, users may interpret this as underutilized CPU time and
attempt to increase their workloads to raise CPU runtime. However,
these efforts will be futile, as the observed runtime cannot increase
due to the missing metric for interrupted time.
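
The arithmetic behind this can be sketched with toy numbers (illustrative
values only, not measurements from any real system):

```python
# Toy illustration of the accounting "hole" described above.
# All numbers are made up for illustration.

wall_ms = 100          # observed wall-clock interval
true_runtime_ms = 60   # time the task actually ran
irq_ms = 20            # time the task spent interrupted by IRQ/softirq
true_sleep_ms = wall_ms - true_runtime_ms - irq_ms   # 20 ms of real sleep

# With CONFIG_IRQ_TIME_ACCOUNTING=y, the IRQ time is charged to neither
# the task nor its cgroup, so userspace sees only the runtime:
reported_runtime_ms = true_runtime_ms
apparent_sleep_ms = wall_ms - reported_runtime_ms

print(apparent_sleep_ms)  # 40: the 20 ms of IRQ time is misread as sleep
```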

>
> > This patchset addresses the first issue, which is relatively
> > straightforward. Once this solution is accepted, I will address the second
> > issue in a follow-up patchset.
>
> Is the first issue about cpuacct data or irq.pressure?

The data in question is from cpu.stat. Below is the cpu.stat file for cgroup2:

  $ cat cpu.stat
  usage_usec 0
  user_usec 0
  system_usec 0                <<<< We should reflect the interrupted time here.
  core_sched.force_idle_usec 0
  nr_periods 0
  nr_throttled 0
  throttled_usec 0
  nr_bursts 0
  burst_usec 0
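
For reference, a minimal sketch of how a monitoring agent might parse these
fields (cgroup v2 exposes cpu.stat as flat "key value" pairs; the sample
values are the ones shown above, and the cgroupfs path in the comment is an
assumption):

```python
def parse_cpu_stat(text):
    """Parse cgroup v2 cpu.stat contents: one 'key value' pair per line."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats

# In production this text would be read from e.g.
# /sys/fs/cgroup/<group>/cpu.stat instead of a literal string.
sample = """usage_usec 0
user_usec 0
system_usec 0
core_sched.force_idle_usec 0
nr_periods 0
nr_throttled 0
throttled_usec 0
nr_bursts 0
burst_usec 0"""

stats = parse_cpu_stat(sample)
print(stats["system_usec"])  # 0
```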

>
> It sounds kind of both and I noticed the docs for irq.pressure is
> lacking in Documentation/accounting/psi.rst. When you're touching this,
> could you please add a paragraph or sentence explaining what does this
> value represent?

I believe I have explained this clearly in the comments above.
However, if anything remains unclear, please feel free to ask for
further clarification.

>
> (Also, there is same change both for cpuacct and
> cgroup_base_stat_cputime_show(), right?)
>
> >                    ----------------
> >                    | Load Balancer|
> >                    ----------------
> >                 /    |      |        \
> >                /     |      |         \
> >           Server1 Server2 Server3 ... ServerN
> >
> > Although the load balancer's algorithm is complex, it follows some core
> > principles:
> >
> > - When server CPU utilization increases, it adds more servers and deploys
> >   additional instances to meet SLA requirements.
> > - When server CPU utilization decreases, it scales down by decommissioning
> >   servers and reducing the number of instances to save on costs.
>
> A server here references to a whole node (whole kernel) or to a cgroup
> (i.e. more servers on top of one kernel)?

It is, in fact, a cgroup. These cgroups may be deployed across
different servers.

>
> > The load balancer is malfunctioning due to the exclusion of IRQ time from
> > CPU utilization calculations.
>
> Could this be fixed by subtracting (global) IRQ time from (presumed
> total) system capacity that the balancer uses for its decisions? (i.e.
> without exact per-cgroup breakdown of IRQ time)

The issue here is that the global IRQ time may include the interrupted
time of tasks outside the target cgroup. As a result, I don't believe
it's possible to find a reliable solution without modifying the
kernel.

--
Regards
Yafang
Re: [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled
Posted by Peter Zijlstra 5 days, 11 hours ago
On Sun, Nov 17, 2024 at 10:56:21AM +0800, Yafang Shao wrote:
> On Fri, Nov 15, 2024 at 9:41 PM Michal Koutný <mkoutny@suse.com> wrote:

> > > The load balancer is malfunctioning due to the exclusion of IRQ time from
> > > CPU utilization calculations.
> >
> > Could this be fixed by subtracting (global) IRQ time from (presumed
> > total) system capacity that the balancer uses for its decisions? (i.e.
> > without exact per-cgroup breakdown of IRQ time)
> 
> The issue here is that the global IRQ time may include the interrupted
> time of tasks outside the target cgroup. As a result, I don't believe
> it's possible to find a reliable solution without modifying the
> kernel.

Since there is no relation between the interrupt and the interrupted
task (and through that its cgroup) -- all time might or might not be
part of your cgroup of interest. Consider it a random distribution if
you will.

What Michal suggests seems no less fair, and possibly more fair, than
what you propose:

 \Sum cgroup = total - IRQ

As opposed to what you propose:

 \Sum (cgroup + cgroup-IRQ) = total - remainder-IRQ

Like I argued earlier, if you have two cgroups, one doing a while(1)
loop (proxy for doing computation) and one cgroup doing heavy IO or
networking, then per your accounting the computation cgroup will get a
significant amount of IRQ time 'injected', even though it is evidently
not of that group.
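
The contrast between the two schemes can be sketched with toy numbers (a
made-up example, not data from the thread):

```python
# Toy model of the two accounting schemes; all numbers are invented.
# One cgroup spins in while(1) (compute), one does heavy networking; assume
# all IRQ/softirq time in the interval is caused by the networking cgroup.

total_ms = 1000
irq_ms = 200                        # total IRQ/softirq time
compute_ms, network_ms = 500, 300   # pure task runtime per cgroup

# Scheme 1 (keep IRQ time out of every cgroup):
#   \Sum cgroup = total - IRQ
assert compute_ms + network_ms == total_ms - irq_ms

# Scheme 2 (charge IRQ time to whichever cgroup was interrupted): if the
# interrupts land roughly uniformly, the compute cgroup absorbs about half
# the IRQ time despite generating none of it.
injected = irq_ms // 2
compute_charged = compute_ms + injected
network_charged = network_ms + (irq_ms - injected)
print(compute_charged, network_charged)  # 600 400
```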

Injecting 'half' of the interrupts in the computation group, and missing
'half' of the interrupts from the network group will get 'wrong'
load-balance results too.

I remain unconvinced that any of this makes sense.
Re: [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled
Posted by Yafang Shao 1 day, 18 hours ago
On Mon, Nov 18, 2024 at 6:10 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Nov 17, 2024 at 10:56:21AM +0800, Yafang Shao wrote:
> > On Fri, Nov 15, 2024 at 9:41 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> > > > The load balancer is malfunctioning due to the exclusion of IRQ time from
> > > > CPU utilization calculations.
> > >
> > > Could this be fixed by subtracting (global) IRQ time from (presumed
> > > total) system capacity that the balancer uses for its decisions? (i.e.
> > > without exact per-cgroup breakdown of IRQ time)
> >
> > The issue here is that the global IRQ time may include the interrupted
> > time of tasks outside the target cgroup. As a result, I don't believe
> > it's possible to find a reliable solution without modifying the
> > kernel.
>
> Since there is no relation between the interrupt and the interrupted
> task (and through that its cgroup) -- all time might or might not be
> part of your cgroup of interest. Consider it a random distribution if
> you will.

Some points require further clarification.

On our servers, the majority of IRQ/softIRQ activity originates from
network traffic, and we consistently enable Receive Flow Steering
(RFS) [0]. This configuration ensures that softIRQs are more likely to
interrupt the tasks responsible for processing the corresponding
packets. As a result, the distribution of softIRQs is not random but
instead closely aligned with the packet-handling tasks.

[0]. https://lwn.net/Articles/381955/
Re: [PATCH v5 0/4] sched: Fix missing irq time when CONFIG_IRQ_TIME_ACCOUNTING is enabled
Posted by Yafang Shao 5 days, 9 hours ago
On Mon, Nov 18, 2024 at 6:10 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Nov 17, 2024 at 10:56:21AM +0800, Yafang Shao wrote:
> > On Fri, Nov 15, 2024 at 9:41 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> > > > The load balancer is malfunctioning due to the exclusion of IRQ time from
> > > > CPU utilization calculations.
> > >
> > > Could this be fixed by subtracting (global) IRQ time from (presumed
> > > total) system capacity that the balancer uses for its decisions? (i.e.
> > > without exact per-cgroup breakdown of IRQ time)
> >
> > The issue here is that the global IRQ time may include the interrupted
> > time of tasks outside the target cgroup. As a result, I don't believe
> > it's possible to find a reliable solution without modifying the
> > kernel.
>
> Since there is no relation between the interrupt and the interrupted
> task (and through that its cgroup) -- all time might or might not be
> part of your cgroup of interest. Consider it a random distribution if
> you will.
>
> What Michal suggests seems no less fair, and possibly more fair, than
> what you propose:
>
>  \Sum cgroup = total - IRQ

The key issue here is determining how to reliably obtain that IRQ time. I
don't believe there is a dependable way to achieve this.

For example, consider a server with 16 CPUs. My cgroup contains 4
threads that can freely migrate across CPUs, while other tasks are
also running on the system simultaneously. In this scenario, how can
we accurately determine the IRQ to subtract?

>
> As opposed to what you propose:
>
>  \Sum (cgroup + cgroup-IRQ) = total - remainder-IRQ
>
> Like I argued earlier, if you have two cgroups, one doing a while(1)
> loop (proxy for doing computation) and one cgroup doing heavy IO or
> networking, then per your accounting the computation cgroup will get a
> significant amount of IRQ time 'injected', even though it is evidently
> not of that group.

That is precisely what the user wants. If my tasks are frequently
interrupted by IRQs, it indicates that my service may be experiencing
poor quality. In response, I would likely reduce the traffic sent to
it. If the issue persists and IRQ interruptions remain high, I would
then consider migrating the service to other servers.

>
> Injecting 'half' of the interrupts in the computation group, and missing
> 'half' of the interrupts from the network group will get 'wrong'
> load-balance results too.
>
> I remain unconvinced that any of this makes sense.

If we are uncertain about which choice makes more sense, it might be
better to align this behavior with the case where
CONFIG_IRQ_TIME_ACCOUNTING=n.

-- 
Regards
Yafang