[PATCH 0/3] sched/tick: Decouple sched_tick() from HZ

Qais Yousef posted 3 patches 1 week, 1 day ago
kernel/sched/clock.c             |  4 +-
kernel/sched/core.c              | 78 +++++++++++++++++++++++++++++---
kernel/sched/cpufreq_schedutil.c |  2 +-
kernel/sched/fair.c              |  6 +--
kernel/sched/features.h          |  8 ++++
kernel/sched/sched.h             |  7 +++
6 files changed, 93 insertions(+), 12 deletions(-)
[PATCH 0/3] sched/tick: Decouple sched_tick() from HZ
Posted by Qais Yousef 1 week, 1 day ago
The previous attempt to make HZ=1000 the default [1] to help with scheduler
responsiveness didn't get merged. But maybe that is for the best, as I think
this idea of decoupling sched_tick() from HZ makes more sense. We shouldn't need
to choose between how often timers trigger and how often the scheduler updates
its stats and takes decisions.

With HRTICK now on by default, preemption checks are less of a problem. But
tasks could still need to migrate (specifically on HMP systems) and frequencies
could need updating for solo long-running tasks.

The conversion is half complete. We still need to invent a new jiffy (piffy?)
and move the existing load balancing logic (and potentially other users, if
they want) over to it to speed them up. BUT, with the push load balancer
patches circulating, I am not sure it is worthwhile or desirable to speed it
up, since we have a better, faster mechanism to keep the system balanced
without needing the 'heavy handed' full load balance to trigger. The push lb
can benefit from this ptick too, to ensure more responsiveness.
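
To make the split between the timer tick and a scheduler-private tick concrete,
here is a minimal user-space sketch. It is not code from the series. The 1 ms
period behind SKETCH_PTICK_NSEC and both function names are assumptions made up
for illustration; the cover letter does not state which period the series uses.
The timer tick period is derived from HZ with round-to-nearest division.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical scheduler tick period (1 ms); an assumption, not the series' value. */
#define SKETCH_PTICK_NSEC 1000000ULL

/* Timer tick period in nanoseconds for a given HZ, rounded to nearest. */
uint64_t sketch_tick_nsec(unsigned int hz)
{
	assert(hz > 0);
	return (NSEC_PER_SEC + hz / 2) / hz;
}

/*
 * Number of scheduler-tick periods that fit into one timer tick.
 * A result above 1 means the scheduler would want to update its stats
 * more often than the HZ-driven timer tick fires.
 */
uint64_t sketch_pticks_per_tick(unsigned int hz)
{
	return sketch_tick_nsec(hz) / SKETCH_PTICK_NSEC;
}
```

Under these assumptions, HZ=250 gives a 4 ms timer tick that spans four
scheduler periods, while HZ=1000 gives a one-to-one match.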

[1] https://lore.kernel.org/lkml/20250226000810.459547-1-qyousef@layalina.io/

Qais Yousef (3):
  sched/tick: Decouple sched_tick() from HZ
  sched/tick: Move TICK_NSEC users to PTICK_NSEC
  sched/tick: Turn on PTICK by default

-- 
2.34.1
Re: [PATCH 0/3] sched/tick: Decouple sched_tick() from HZ
Posted by Peter Zijlstra 6 days, 22 hours ago
On Sun, May 17, 2026 at 05:07:37AM +0100, Qais Yousef wrote:
> With HRTICK now is default on,

.. for archs that use GENERIC_ENTRY. So for example, ARM64 doesn't (yet)
have this.
Re: [PATCH 0/3] sched/tick: Decouple sched_tick() from HZ
Posted by Qais Yousef 6 days, 5 hours ago
On 05/18/26 09:40, Peter Zijlstra wrote:
> On Sun, May 17, 2026 at 05:07:37AM +0100, Qais Yousef wrote:
> > With HRTICK now is default on,
> 
> .. for archs that use GENERIC_ENTRY. So for example, ARM64 doesn't (yet)
> have this.

I remember trying to trace down the deps on HRTIMER_REARM_DEFERRED and then
got distracted.

HRTIMERs on arm are cheap. Do we need this dep on generic entry?

Anyway, more reason to decouple, I take it :)
Re: [PATCH 0/3] sched/tick: Decouple sched_tick() from HZ
Posted by Peter Zijlstra 5 days, 20 hours ago
On Tue, May 19, 2026 at 01:42:56AM +0100, Qais Yousef wrote:
> On 05/18/26 09:40, Peter Zijlstra wrote:
> > On Sun, May 17, 2026 at 05:07:37AM +0100, Qais Yousef wrote:
> > > With HRTICK now is default on,
> > 
> > .. for archs that use GENERIC_ENTRY. So for example, ARM64 doesn't (yet)
> > have this.
> 
> I remembering trying to trace down the deps on HRTIMER_REARM_DEFERRED and then
> got distracted.
> 
> HRTIMER on arm are cheap. Do we need this deps on generic entry?
> 
> Anyways. More reason to decouple I take :)

ARM64 Generic Entry patches are en-route somewhere. I'll get sorted.
Re: [PATCH 0/3] sched/tick: Decouple sched_tick() from HZ
Posted by David Laight 1 week ago
On Sun, 17 May 2026 05:07:37 +0100
Qais Yousef <qyousef@layalina.io> wrote:

> Previous attempt to make HZ 1000 the default [1] to help with scheduler
> responsiveness didn't get merged. But maybe for the best, as I think this idea
> of decoupling sched_tick() from HZ makes more sense. We shouldn't need to make
> a choice between how often timers should trigger vs how often should the
> scheduler update its stats/take decisions.

Have you also looked at decoupling HZ/jiffies from the timer interrupt rate?
It ought to be reasonable to set HZ to 1000 and to do 'jiffies += 4' when the
timer interrupts 250 times a second.
I'd expect most code to handle that fine.
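
As an illustration of the 'jiffies += 4' idea, the following user-space sketch
keeps the jiffies resolution fixed at 1000 and advances the counter by a larger
step on each simulated interrupt. All names (SKETCH_HZ, struct sketch_clock and
the sketch_* functions) are hypothetical; this is not how the kernel updates
jiffies, only a model of the arithmetic.

```c
#include <assert.h>

#define SKETCH_HZ 1000U	/* fixed jiffies resolution (assumption) */

struct sketch_clock {
	unsigned long jiffies;
	unsigned int irq_rate;	/* timer interrupts per second */
};

/* Jiffies to add per interrupt; the rate must divide SKETCH_HZ evenly. */
unsigned int sketch_jiffies_step(unsigned int irq_rate)
{
	assert(irq_rate > 0 && SKETCH_HZ % irq_rate == 0);
	return SKETCH_HZ / irq_rate;
}

void sketch_clock_init(struct sketch_clock *c, unsigned int irq_rate)
{
	c->jiffies = 0;
	c->irq_rate = irq_rate;
}

/* One simulated timer interrupt. */
void sketch_timer_interrupt(struct sketch_clock *c)
{
	c->jiffies += sketch_jiffies_step(c->irq_rate);
}

/* Simulate one second worth of interrupts and return the jiffies count. */
unsigned long sketch_jiffies_after_one_second(unsigned int irq_rate)
{
	struct sketch_clock c;
	unsigned int i;

	sketch_clock_init(&c, irq_rate);
	for (i = 0; i < irq_rate; i++)
		sketch_timer_interrupt(&c);
	return c.jiffies;
}
```

With 250 interrupts per second the step is 4, and the counter still advances by
1000 per second, so code that converts with HZ sees the same elapsed time. What
changes is the granularity: jiffies only moves in steps of 4.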

I know one architecture (forgotten which) traditionally used a 1024Hz clock,
and some very old ones 60Hz; but I don't think Linux supports either.

-- David
Re: [PATCH 0/3] sched/tick: Decouple sched_tick() from HZ
Posted by Qais Yousef 1 week ago
On 05/17/26 15:10, David Laight wrote:
> On Sun, 17 May 2026 05:07:37 +0100
> Qais Yousef <qyousef@layalina.io> wrote:
> 
> > Previous attempt to make HZ 1000 the default [1] to help with scheduler
> > responsiveness didn't get merged. But maybe for the best, as I think this idea
> > of decoupling sched_tick() from HZ makes more sense. We shouldn't need to make
> > a choice between how often timers should trigger vs how often should the
> > scheduler update its stats/take decisions.
> 
> Have you also looked a decoupling HZ/jiffies from the timer interrupt rate?
> It ought to be reasonable set HZ to 1000 and to do 'jiffies += 4' when the
> timer interrupts 250 times a second.
> I'd expect most code to handle that fine.

John actually had a stab at that [1].

I am not sure we can fully say we don't care about what TICK_NSEC means. And
all the open coding with HZ.

I don't see the value in a potential treewide conversion, nor in inviting the
'runtime overhead' complaint someone might throw out once compile-time math
turns into runtime math.

I did have a stab at making HZ a variable and updating TICK_NSEC to depend on
it. Unfortunately I lost that work. It required a lot of fixups to the math and
conversions (all of which I did). And since HZ is assumed to be a label/define,
lots of code fails to compile when it is a variable. I was stuck getting all
those conversions fixed when my machine died; I had forgotten to save the work
anywhere else and it was all gone.
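
A small sketch of why a variable HZ breaks the build: C requires constant
expressions in file-scope initializers and file-scope array sizes. The names
below (SKETCH_HZ, sketch_timeout, sketch_history) are made up for illustration
and do not come from the kernel tree.

```c
#include <assert.h>

#define SKETCH_HZ 250	/* compile-time constant, standing in for HZ */

/* File-scope initializer: must be a constant expression. */
static const unsigned long sketch_timeout = 5 * SKETCH_HZ;

/* File-scope array size: must also be a constant expression. */
unsigned char sketch_history[SKETCH_HZ];

/*
 * If SKETCH_HZ were a runtime variable instead of a macro, e.g.
 *     extern unsigned int sketch_hz;
 * then neither declaration above would compile: the initializer is not
 * constant, and a variably sized array is not allowed at file scope.
 * Each such user would need converting to runtime setup code.
 */
unsigned long sketch_timeout_jiffies(void)
{
	assert(SKETCH_HZ > 0);
	return sketch_timeout;
}

unsigned long sketch_history_len(void)
{
	return sizeof(sketch_history) / sizeof(sketch_history[0]);
}
```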

> I know one architecture (forgotten which) traditionally used a 1024Hz clock,
> and some very old ones 60Hz; but I don't think Linux supports either.

Personally I think HZ=1000 is the only sensible option. But I am not going to
fight this battle :)

[1] https://lore.kernel.org/lkml/20250128063301.3879317-1-jstultz@google.com/
Re: [PATCH 0/3] sched/tick: Decouple sched_tick() from HZ
Posted by David Laight 1 week ago
On Sun, 17 May 2026 16:44:01 +0100
Qais Yousef <qyousef@layalina.io> wrote:

> On 05/17/26 15:10, David Laight wrote:
> > On Sun, 17 May 2026 05:07:37 +0100
> > Qais Yousef <qyousef@layalina.io> wrote:
> >   
> > > Previous attempt to make HZ 1000 the default [1] to help with scheduler
> > > responsiveness didn't get merged. But maybe for the best, as I think this idea
> > > of decoupling sched_tick() from HZ makes more sense. We shouldn't need to make
> > > a choice between how often timers should trigger vs how often should the
> > > scheduler update its stats/take decisions.  
> > 
> > Have you also looked a decoupling HZ/jiffies from the timer interrupt rate?
> > It ought to be reasonable set HZ to 1000 and to do 'jiffies += 4' when the
> > timer interrupts 250 times a second.
> > I'd expect most code to handle that fine.  
> 
> John actually had a stab at that [1].

I'd forgotten about that - even though I commented :-)

> I am not sure we can fully say we don't care about what TICK_NSEC means. And
> all the open coding with HZ.
> 
> I don't see value in the overall potential treewide conversion and the
> potential compile time math becoming 'runtime overhead' complaint someone might
> throw out.

If HZ is fixed at 1000 then there is no extra maths.

> I did have a stab at making HZ a variable and update TICK_NSEC to depend on it.

I do remember that, and thinking it would add a lot of extra maths.

...
> > I know one architecture (forgotten which) traditionally used a 1024Hz clock,
> > and some very old ones 60Hz; but I don't think Linux supports either.  
> 
> Personally I think HZ=1000 is the only sensible option. But I am not going to
> fight this battle :)

I'd go for only allowing values that divide evenly into 1000.
That really only gives you interrupt rates of 1000Hz, 500Hz, 250Hz, 200Hz
and 100Hz (and maybe 50Hz).
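
The rule can be written as a one-line check. This is a sketch with a
hypothetical function name, assuming the jiffies resolution is fixed at 1000:

```c
#include <assert.h>

#define SKETCH_JIFFIES_HZ 1000U	/* assumed fixed jiffies resolution */

/* Nonzero if the interrupt rate yields a whole number of jiffies per interrupt. */
int sketch_irq_rate_allowed(unsigned int irq_rate)
{
	assert(irq_rate > 0);
	return SKETCH_JIFFIES_HZ % irq_rate == 0;
}
```

A rate such as 300Hz would fail this check, because 1000 / 300 is not a whole
number of jiffies per interrupt.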

-- David

> 
> [1] https://lore.kernel.org/lkml/20250128063301.3879317-1-jstultz@google.com/
Re: [PATCH 0/3] sched/tick: Decouple sched_tick() from HZ
Posted by Qais Yousef 1 week ago
On 05/17/26 17:09, David Laight wrote:
> On Sun, 17 May 2026 16:44:01 +0100
> Qais Yousef <qyousef@layalina.io> wrote:
> 
> > On 05/17/26 15:10, David Laight wrote:
> > > On Sun, 17 May 2026 05:07:37 +0100
> > > Qais Yousef <qyousef@layalina.io> wrote:
> > >   
> > > > Previous attempt to make HZ 1000 the default [1] to help with scheduler
> > > > responsiveness didn't get merged. But maybe for the best, as I think this idea
> > > > of decoupling sched_tick() from HZ makes more sense. We shouldn't need to make
> > > > a choice between how often timers should trigger vs how often should the
> > > > scheduler update its stats/take decisions.  
> > > 
> > > Have you also looked a decoupling HZ/jiffies from the timer interrupt rate?
> > > It ought to be reasonable set HZ to 1000 and to do 'jiffies += 4' when the
> > > timer interrupts 250 times a second.
> > > I'd expect most code to handle that fine.  
> > 
> > John actually had a stab at that [1].
> 
> I'd forgotten about that - even though I commented :-)
> 
> > I am not sure we can fully say we don't care about what TICK_NSEC means. And
> > all the open coding with HZ.
> > 
> > I don't see value in the overall potential treewide conversion and the
> > potential compile time math becoming 'runtime overhead' complaint someone might
> > throw out.
> 
> If HZ is fixed at 1000 then there is no extra maths.
> 
> > I did have a stab at making HZ a variable and update TICK_NSEC to depend on it.
> 
> I do remember that, and thinking it would add a lot of extra maths.
> 
> ...
> > > I know one architecture (forgotten which) traditionally used a 1024Hz clock,
> > > and some very old ones 60Hz; but I don't think Linux supports either.  
> > 
> > Personally I think HZ=1000 is the only sensible option. But I am not going to
> > fight this battle :)
> 
> I go for only allowing values that divide evenly into 1000.
> That really only gives you interrupt rates of 1000Hz, 500Hz, 250Hz, 200Hz
> and 100Hz (and maybe 50Hz).

I don't know. I think HZ being tightly coupled with timers is fine and makes
sense logically. Trying to undo this decades-long assumption might come with
a lot of unintended consequences, for something that can be solved much more
easily by decoupling the users that need better consistency and guarantees
across all configurations. I don't see a great benefit in trying to play games
with jiffies. The load balancer will still need work to decouple it from jiffies
in this instance too, though as I stated, with push lb I think this work can be
skipped given we will hopefully have a superior alternative soon.

The PTICK can potentially be generalized to allow other users who can benefit
from decoupling from the timer TICK.

But I'll defer to the gurus :)
Re: [PATCH 0/3] sched/tick: Decouple sched_tick() from HZ
Posted by Qais Yousef 1 week, 1 day ago
On 05/17/26 05:07, Qais Yousef wrote:
> Previous attempt to make HZ 1000 the default [1] to help with scheduler
> responsiveness didn't get merged. But maybe for the best, as I think this idea
> of decoupling sched_tick() from HZ makes more sense. We shouldn't need to make
> a choice between how often timers should trigger vs how often should the
> scheduler update its stats/take decisions.
> 
> With HRTICK now is default on, preemption checks are less of problem. But tasks
> could need to migrate (specifically on HMP systems) and frequencies could need
> updating for solo long running tasks.
> 
> The conversion is half complete. We need to invent a new jiffy (piffy?) and
> move existing load balancing logic (and potentially other users if they want
> to) to speed them up. BUT, with push load balancer patches circulating around,
> I am not sure it is worth while or desired to speed it since we have a better
> faster mechanism to keep the system balanced without the need for the 'heavy
> handed' full load balance to trigger. The push lb can benefit from this ptick
> too to ensure more responsiveness.

Err, this whole series is RFC. Forgot to add the tag, sorry.

> 
> [1] https://lore.kernel.org/lkml/20250226000810.459547-1-qyousef@layalina.io/
> 
> Qais Yousef (3):
>   sched/tick: Decouple sched_tick() from HZ
>   sched/tick: Move TICK_NSEC users to PTICK_NSEC
>   sched/tick: Turn on PTICK by default
> 
>  kernel/sched/clock.c             |  4 +-
>  kernel/sched/core.c              | 78 +++++++++++++++++++++++++++++---
>  kernel/sched/cpufreq_schedutil.c |  2 +-
>  kernel/sched/fair.c              |  6 +--
>  kernel/sched/features.h          |  8 ++++
>  kernel/sched/sched.h             |  7 +++
>  6 files changed, 93 insertions(+), 12 deletions(-)
> 
> -- 
> 2.34.1
>