[PATCH 00/15] tick/sched: Refactor idle cputime accounting

Frederic Weisbecker posted 15 patches 3 weeks, 1 day ago
There is a newer version of this series
[PATCH 00/15] tick/sched: Refactor idle cputime accounting
Posted by Frederic Weisbecker 3 weeks, 1 day ago
Hi,

After the issue reported here:

	https://lore.kernel.org/all/20251210083135.3993562-1-jackzxcui1989@163.com/

It appears that the idle cputime accounting is a big mess that
accumulates within two concurrent statistics, each having its own
shortcomings:

* The accounting for online CPUs which is based on the delta between
  tick_nohz_start_idle() and tick_nohz_stop_idle().

  Pros:
       - Works when the tick is off

       - Has nsecs granularity

  Cons:
       - Accounts idle steal time but doesn't subtract it from idle
         cputime.

       - Assumes CONFIG_IRQ_TIME_ACCOUNTING by not accounting IRQs, so
         the IRQ time is simply ignored when
         CONFIG_IRQ_TIME_ACCOUNTING=n

       - The windows 1) between idle task scheduling and the first call
         to tick_nohz_start_idle() and 2) between the last call to
         tick_nohz_stop_idle() and the end of the idle task's run are
         blind spots wrt. cputime accounting (though mostly an
         insignificant amount)

       - Relies on private fields outside of kernel stats, with specific
         accessors.

* The accounting for offline CPUs which is based on ticks and the
  jiffies delta during which the tick was stopped.

  Pros:
       - Handles steal time correctly

       - Handles CONFIG_IRQ_TIME_ACCOUNTING=y and
         CONFIG_IRQ_TIME_ACCOUNTING=n correctly.

       - Handles the whole idle task

       - Accounts directly to kernel stats, without midlayer accumulator.

  Cons:
       - Doesn't elapse when the tick is off, which makes it
         unsuitable for online CPUs.

       - Has TICK_NSEC granularity (jiffies)

       - Needs to track the dyntick-idle ticks that were accounted and
         subtract them from the total jiffies time spent while the tick
         was stopped. This is an ugly workaround.

Having two different accountings for a single context is not the only
problem: since those accountings are of different natures, it is
possible to observe the global idle time going backward after a CPU goes
offline, as reported by Xin Zhao.
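
The backward jump can be reproduced with a toy model. The sketch below is
not kernel code: struct cpu_idle_model and both helpers are hypothetical
names, and it only assumes that a reader picks the delta-based counter for
an online CPU and the tick-based counter for an offline one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU model holding the two independent idle counters. */
struct cpu_idle_model {
	bool online;
	uint64_t nohz_idle_ns;     /* delta-based accumulator, nsec granularity */
	uint64_t kcpustat_idle_ns; /* tick-based kernel stat, tick granularity */
};

/* The reader selects its source depending on the CPU's online state. */
static uint64_t read_idle_ns(const struct cpu_idle_model *c)
{
	return c->online ? c->nohz_idle_ns : c->kcpustat_idle_ns;
}

/* Sample, take the CPU offline, sample again, and compare both values. */
static bool idle_goes_backward_on_offline(struct cpu_idle_model *c)
{
	uint64_t before = read_idle_ns(c);

	c->online = false;
	return read_idle_ns(c) < before;
}
```

With nohz_idle_ns slightly ahead of kcpustat_idle_ns, the second sample is
smaller than the first, which is the kind of regression the report
describes.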

Clean up the situation by introducing a hybrid approach that stays
coherent, fixes the backward jumps and works for both online and offline
CPUs:

* Tick-based or native vtime accounting operates before the tick is
  stopped and resumes once the tick is restarted.

* When the idle loop starts, switch to dynticks-idle accounting as is
  done currently, except that the statistics accumulate directly to the
  relevant kernel stat fields.

* Private dyntick cputime accounting fields are removed.

* Works in both the online and offline cases.

* Move most of the relevant code to the common sched/cputime subsystem

* Handle CONFIG_IRQ_TIME_ACCOUNTING=n correctly such that the
  dynticks-idle accounting still elapses while on IRQs.

* Correctly subtract idle steal cputime from idle time
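
The points above can be sketched as a user-space model. This is an
illustration under assumptions, not the code from the series: struct
idle_acct, the three helpers and TICK_NSEC_MODEL (an assumed 250 Hz tick)
are hypothetical. Tick-based accounting runs while the tick is alive,
delta-based accounting covers the stopped period, and both feed the same
idle field with steal time subtracted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TICK_NSEC_MODEL 4000000ULL /* assumed 250 Hz tick, illustration only */

/* Hypothetical model: one idle field shared by both accounting modes. */
struct idle_acct {
	uint64_t idle_ns;           /* stands in for the kernel stat idle field */
	uint64_t steal_ns;          /* steal time observed while idle */
	bool tick_stopped;
	uint64_t dyntick_start_ns;  /* timestamp taken when the tick stopped */
	uint64_t steal_at_start_ns; /* steal clock value at that point */
};

/* Tick-based accounting: only elapses while the tick is running. */
static void tick_account_idle(struct idle_acct *a)
{
	if (!a->tick_stopped)
		a->idle_ns += TICK_NSEC_MODEL;
}

/* The tick is stopped: switch to delta-based accounting. */
static void dyntick_idle_start(struct idle_acct *a, uint64_t now_ns,
			       uint64_t steal_clock_ns)
{
	a->tick_stopped = true;
	a->dyntick_start_ns = now_ns;
	a->steal_at_start_ns = steal_clock_ns;
}

/* The tick is restarted: flush the delta, minus steal, to the same field. */
static void dyntick_idle_stop(struct idle_acct *a, uint64_t now_ns,
			      uint64_t steal_clock_ns)
{
	assert(a->tick_stopped && now_ns >= a->dyntick_start_ns);

	uint64_t delta = now_ns - a->dyntick_start_ns;
	uint64_t steal = steal_clock_ns - a->steal_at_start_ns;

	if (steal > delta) /* never account more steal than elapsed time */
		steal = delta;

	a->steal_ns += steal;
	a->idle_ns += delta - steal;
	a->tick_stopped = false;
}
```

Because every path only ever adds to idle_ns, a reader of that single
field cannot observe it going backward, whatever the CPU's online state.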

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	timers/core

HEAD: 6a3d814ef2f6142714bef862be36def5ca4c9d96

Thanks,
	Frederic
---

Frederic Weisbecker (15):
      sched/idle: Handle offlining first in idle loop
      sched/cputime: Remove superfluous and error prone kcpustat_field() parameter
      sched/cputime: Correctly support generic vtime idle time
      powerpc/time: Prepare to stop elapsing in dynticks-idle
      s390/time: Prepare to stop elapsing in dynticks-idle
      tick/sched: Unify idle cputime accounting
      cpufreq: ondemand: Simplify idle cputime granularity test
      tick/sched: Remove nohz disabled special case in cputime fetch
      tick/sched: Move dyntick-idle cputime accounting to cputime code
      tick/sched: Remove unused fields
      tick/sched: Account tickless idle cputime only when tick is stopped
      tick/sched: Consolidate idle time fetching APIs
      sched/cputime: Consolidate get_cpu_[idle|iowait]_time_us()
      sched/cputime: Handle idle irqtime gracefully
      sched/cputime: Handle dyntick-idle steal time correctly

 arch/powerpc/kernel/time.c         |  41 +++++
 arch/s390/include/asm/idle.h       |  11 +-
 arch/s390/kernel/idle.c            |  13 +-
 arch/s390/kernel/vtime.c           |  57 ++++++-
 drivers/cpufreq/cpufreq.c          |  29 +---
 drivers/cpufreq/cpufreq_governor.c |   6 +-
 drivers/cpufreq/cpufreq_ondemand.c |   7 +-
 drivers/macintosh/rack-meter.c     |   2 +-
 fs/proc/stat.c                     |  40 +----
 fs/proc/uptime.c                   |   8 +-
 include/linux/kernel_stat.h        |  76 ++++++++--
 include/linux/tick.h               |   4 -
 include/linux/vtime.h              |  20 ++-
 kernel/rcu/tree.c                  |   9 +-
 kernel/rcu/tree_stall.h            |   7 +-
 kernel/sched/cputime.c             | 302 +++++++++++++++++++++++++++++++------
 kernel/sched/idle.c                |  11 +-
 kernel/sched/sched.h               |   1 +
 kernel/time/tick-sched.c           | 203 +++++--------------------
 kernel/time/tick-sched.h           |  12 --
 kernel/time/timer_list.c           |   6 +-
 scripts/gdb/linux/timerlist.py     |   4 -
 22 files changed, 505 insertions(+), 364 deletions(-)
Re: [PATCH 00/15] tick/sched: Refactor idle cputime accounting
Posted by Peter Zijlstra 2 weeks, 5 days ago
On Fri, Jan 16, 2026 at 03:51:53PM +0100, Frederic Weisbecker wrote:
>  kernel/sched/cputime.c             | 302 +++++++++++++++++++++++++++++++------

My editor feels strongly about the below; with that it still has one
complaint about paravirt_steal_clock() which does not have a proper
declaration.


diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 7ff8dbec7ee3..248232fa6e27 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -2,6 +2,7 @@
 /*
  * Simple CPU accounting cgroup controller
  */
+#include <linux/sched/clock.h>
 #include <linux/sched/cputime.h>
 #include <linux/tsacct_kern.h>
 #include "sched.h"
Re: [PATCH 00/15] tick/sched: Refactor idle cputime accounting
Posted by Frederic Weisbecker 2 weeks, 5 days ago
On Mon, Jan 19, 2026 at 03:53:30PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 16, 2026 at 03:51:53PM +0100, Frederic Weisbecker wrote:
> >  kernel/sched/cputime.c             | 302 +++++++++++++++++++++++++++++++------
> 
> My editor feels strongly about the below; with that it still has one
> complaint about paravirt_steal_clock() which does not have a proper
> declaration.

I guess it happens to be somehow included in the <linux/sched*.h> wave

> 
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 7ff8dbec7ee3..248232fa6e27 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -2,6 +2,7 @@
>  /*
>   * Simple CPU accounting cgroup controller
>   */
> +#include <linux/sched/clock.h>
>  #include <linux/sched/cputime.h>
>  #include <linux/tsacct_kern.h>
>  #include "sched.h"

Ok I'll include that.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH 00/15] tick/sched: Refactor idle cputime accounting
Posted by Frederic Weisbecker 3 weeks, 1 day ago
I forgot to mention I haven't yet tested CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
(s390 and powerpc).

Thanks.
Re: [PATCH 00/15] tick/sched: Refactor idle cputime accounting
Posted by Shrikanth Hegde 2 weeks, 4 days ago
Hi Frederic.

On 1/16/26 8:27 PM, Frederic Weisbecker wrote:
> I forgot to mention I haven't yet tested CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> (s390 and powerpc).
> 
> Thanks.


tl;dr

I ran this on PowerNV (non-virtualized) with 144 CPUs, using the config below (the default options).
The patch *breaks* the CPU idle stats most of the time; the idle values are wrong.


Detailed info:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

In the config I have this:
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
# CONFIG_BSD_PROCESS_ACCT is not set

+++++++++

When the system is fully idle, I see this:

06:44:26 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:44:27 AM  all    0.01    0.00    0.01    0.00   57.20    0.00    0.00    0.00    0.00   42.79
06:44:28 AM  all    0.02    0.00    0.03    0.00   55.73    0.00    0.00    0.00    0.00   44.22
06:44:29 AM  all    0.01    0.00    0.00    0.00   56.23    0.00    0.00    0.00    0.00   43.77

- Seeing 50%+ in irq time, which is clearly wrong.

+++++++++
When running stress-ng --cpu=72 (expectation is 50% idle time)
06:48:12 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:48:13 AM  all   49.98    0.00    0.01    0.00   15.81    0.00    0.00    0.00    0.00   34.20
06:48:14 AM  all   49.93    0.00    0.00    0.00   15.15    0.00    0.00    0.00    0.00   34.91
06:48:15 AM  all   49.99    0.00    0.01    0.00   15.29    0.00    0.00    0.00    0.00   34.72

- Wrong values again. 50% is expected idle time.

+++++++++
The system is idle again.
06:48:46 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:48:47 AM  all    0.00    0.00    0.00    0.00   63.93    0.00    0.00    0.00    0.00   36.07
06:48:48 AM  all    0.02    0.00    0.00    0.00   63.78    0.01    0.00    0.00    0.00   36.18
06:48:49 AM  all    0.00    0.00    0.00    0.00   63.77    0.00    0.00    0.00    0.00   36.23

- Wrong values again. irq increased further.

+++++++++

I have seen the below warnings too.
WARNING: kernel/time/tick-sched.c:1353 at tick_nohz_idle_exit
[    T0] WARNING: kernel/time/tick-sched.c:1353 at tick_nohz_idle_exit+0x148/0x150, CPU#4: swapper/4/0
[    T0] Modules linked in: vmx_crypto gf128mul
[    T0] CPU: 4 UID: 0 PID: 0 Comm: swapper/4 Tainted: G        W           6.19.0-rc5-00683-gbe7e8f3d5116 #61 PREEMPT(full)
[    T0] Tainted: [W]=WARN
[    T0] Hardware name: 0000000000000000 POWER9 0x4e1202 opal:v7.1 PowerNV
[    T0] NIP [c0000000002c8210] tick_nohz_idle_exit+0x148/0x150
[    T0] LR [c00000000022f10c] do_idle+0x1dc/0x328


WARNING: kernel/time/tick-sched.c:1274 at tick_nohz_get_sleep_length
[    T0] NIP [c0000000002c7fc0] tick_nohz_get_sleep_length+0x108/0x110
[    T0] LR [c000000000ca1548] menu_select+0x3c0/0x7b4
[    T0] Call Trace:
[    T0] [c000000003197e10] [c000000003197e50] 0xc000000003197e50 (unreliable)
[    T0] [c000000003197e50] [c000000000ca1548] menu_select+0x3c0/0x7b4
[    T0] [c000000003197ed0] [c000000000c9f120] cpuidle_select+0x34/0x48
[    T0] [c000000003197ef0] [c00000000022f184] do_idle+0x254/0x328


+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

I went back to baseline to confirm the original behaviour.
(d613f96096e4) Merge timers/vdso into tip/master

07:02:17 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
07:02:18 AM  all    0.01    0.00    0.01    0.01    1.19    0.00    0.00    0.00    0.00   98.77
07:02:19 AM  all    0.01    0.00    0.01    0.00    0.84    0.00    0.00    0.00    0.00   99.14
07:02:20 AM  all    0.00    0.00    0.01    0.00    0.99    0.00    0.00    0.00    0.00   99.00
07:02:21 AM  all    0.01    0.00    0.00    0.00    0.83    0.00    0.00    0.00    0.00   99.16

Which is working as expected.
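
For reference, the percentages in the tables above are ratios of per-field
deltas between two samples. A minimal sketch of that arithmetic, assuming
the field order of the aggregate "cpu" line in /proc/stat (user, nice,
system, idle, iowait, irq, softirq, steal); field_pct() and the enum names
are hypothetical, and real tools such as mpstat may apply further
corrections, e.g. for guest time:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed order of the first eight fields of the "cpu" line in /proc/stat. */
enum {
	F_USER, F_NICE, F_SYSTEM, F_IDLE,
	F_IOWAIT, F_IRQ, F_SOFTIRQ, F_STEAL,
	F_NR
};

/*
 * Share of one field in the total time elapsed between two samples,
 * as a percentage. Returns 0 when no time elapsed at all.
 */
static double field_pct(const uint64_t prev[F_NR], const uint64_t cur[F_NR],
			int field)
{
	uint64_t total = 0;

	assert(field >= 0 && field < F_NR);

	for (int i = 0; i < F_NR; i++)
		total += cur[i] - prev[i];

	if (total == 0)
		return 0.0;

	return 100.0 * (double)(cur[field] - prev[field]) / (double)total;
}
```

Since each column is a share of the same total, idle time that gets
misattributed to another field lowers %idle by the same amount, which
matches the irq and idle columns moving in opposite directions above.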



PS: Initial data. I haven't gone through the series yet.
Re: [PATCH 00/15] tick/sched: Refactor idle cputime accounting
Posted by Frederic Weisbecker 2 weeks, 3 days ago
On Tue, Jan 20, 2026 at 06:12:08PM +0530, Shrikanth Hegde wrote:
> 
> Hi Frederic.
> 
> On 1/16/26 8:27 PM, Frederic Weisbecker wrote:
> > I forgot to mention I haven't yet tested CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> > (s390 and powerpc).
> > 
> > Thanks.
> 
> 
> tl;dr
> 
> I ran this on PowerNV (non-virtualized) with 144 CPUs, using the config below (the default options).
> The patch *breaks* the CPU idle stats most of the time; the idle values are wrong.

Right, I somehow lost the TS_FLAG_INIDLE setting in tick_nohz_idle_enter(),
which ruins the whole thing.

You probably think I should have detected that with light testing and you're
right. Not checking dmesg was a bit sloppy on my end...

I'm fixing that and will send a v2 soonish.

Thanks a lot for testing!

-- 
Frederic Weisbecker
SUSE Labs