hrtimer,sched: General optimizations and hrtick enablement

[patch 00/48] hrtimer,sched: General optimizations and hrtick enablement

Posted by Thomas Gleixner 1 month, 3 weeks ago

Peter recently posted a series tweaking the hrtimer subsystem to reduce the
overhead of the scheduler hrtick timer so it can be enabled by default:

   https://lore.kernel.org/20260121162010.647043073@infradead.org

That turned out to be incomplete and led to a deeper investigation of the
related bits and pieces.

The problem is that the hrtick deadline changes on every context switch and
is also modified by wakeups and balancing. On a hackbench run this results
in about 2500 clockevent reprogramming cycles per second, which is
especially hurtful in a VM as accessing the clockevent device implies a
VM-Exit.

The following series addresses various aspects of the overall related
problem space:

    1) Scheduler

       Aside from some trivial fixes, the handling of the hrtick timer in
       the scheduler is suboptimal:

        - schedule() modifies the hrtick when picking the next task

	- schedule() can modify the hrtick when the balance callback runs
          before releasing rq:lock

	- the expiry time is unfiltered and can result in really tiny
          changes of the expiry time, which are functionally completely
          irrelevant

       Solve this by deferring the hrtick update to the end of schedule()
       and filtering out tiny changes.
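
A minimal userspace sketch of the "defer and filter" idea, not the code
from the series. The structure, the function names and the 5us threshold
are hypothetical and only illustrate the principle: paths inside
schedule() merely record the wanted expiry, and a single commit at the
end rearms the timer only if the change is large enough to matter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: smaller changes are not worth a rearm. */
#define HRTICK_MIN_DELTA_NS 5000

struct hrtick_state {
	uint64_t programmed_ns;  /* expiry currently armed in the timer */
	uint64_t requested_ns;   /* latest expiry computed during schedule() */
	bool     pending;        /* an update was requested, not yet applied */
	unsigned int rearms;     /* counts the actual timer (re)arms */
};

/* Called from the pick/balance paths: only record the request. */
static void hrtick_request(struct hrtick_state *hs, uint64_t expires_ns)
{
	hs->requested_ns = expires_ns;
	hs->pending = true;
}

/* Called once at the end of schedule(): apply the last request if relevant. */
static bool hrtick_commit(struct hrtick_state *hs)
{
	uint64_t delta;

	if (!hs->pending)
		return false;
	hs->pending = false;

	delta = hs->requested_ns > hs->programmed_ns ?
		hs->requested_ns - hs->programmed_ns :
		hs->programmed_ns - hs->requested_ns;
	if (delta < HRTICK_MIN_DELTA_NS)
		return false;	/* functionally irrelevant change */

	hs->programmed_ns = hs->requested_ns;
	hs->rearms++;
	return true;
}
```

Several requests within one schedule() invocation collapse into at most
one rearm, and a request which moves the expiry by less than the
threshold causes none.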


    2) Clocksource, clockevents, timekeeping

        - Reading the current clocksource involves an indirect call, which
          is expensive especially for clocksources where the actual read is
          a single instruction like the TSC read on x86.

	  This could be solved with a static call, but the architecture
	  coverage for static calls is meager and that still has the
	  overhead of a function call and in the worst case a return
	  speculation mitigation.

	  As x86 and other architectures like S390 have one preferred
	  clocksource which is normally used on all contemporary systems,
	  this begs for a fully inlined solution.

	  This is achieved by a config option which tells the core code to
	  use the architecture provided inline read, guarded by a static
	  branch.

	  If the branch is disabled, the indirect function call is used as
	  before. If enabled the inlined read is utilized.

	  The branch is disabled by default and only enabled after a
	  clocksource is installed which has the INLINE feature flag
	  set. When the clocksource is replaced the branch is disabled
	  before the clocksource change happens.
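
The enable/disable ordering described above can be sketched in plain C
as follows. This is an illustration under stated assumptions: a bool
stands in for the static branch, a variable stands in for the hardware
counter, and all names are hypothetical rather than taken from the
series.

```c
#include <stdbool.h>
#include <stdint.h>

struct clocksource {
	uint64_t (*read)(struct clocksource *cs);
	bool     inline_capable;  /* stands in for the INLINE feature flag */
};

/* Stand-in for a static branch. In the kernel this is patched code. */
static bool cs_inline_enabled;
static struct clocksource *cur_cs;

/* Stand-in for the architecture provided inline read (e.g. a TSC read). */
static uint64_t fake_counter;
static inline uint64_t arch_inlined_read(void)
{
	return fake_counter;
}

/* Example read callback of a clocksource without inline support. */
static uint64_t generic_read(struct clocksource *cs)
{
	(void)cs;
	return 42;
}

static inline uint64_t cs_read_now(void)
{
	if (cs_inline_enabled)
		return arch_inlined_read();	/* no indirect call */
	return cur_cs->read(cur_cs);		/* regular indirect call */
}

static void cs_install(struct clocksource *cs)
{
	cs_inline_enabled = false;		/* disable before the switch */
	cur_cs = cs;
	if (cs->inline_capable)
		cs_inline_enabled = true;	/* enable only after install */
}
```

Disabling the branch before the switch ensures that a reader never uses
the inlined read while a clocksource without the INLINE flag is
installed.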


        - Programming clock events is based on calculating a relative
          expiry time, converting it to the clock cycles corresponding to
          the clockevent device frequency and invoking the set_next_event()
          callback of the clockevent device.

	  That works perfectly fine as most hardware timers are count down
	  implementations which require a relative time for programming.

	  But clockevent devices which are coupled to the clocksource and
	  provide a less than or equal comparator suffer from this scheme.
	  The core calculates the relative expiry time based on a clock
	  read and the set_next_event() callback has to read the same clock
	  again to convert it back to an absolute time which can be
	  programmed into the comparator.

	  The other issue is that the conversion factor of the clockevent
	  device is calculated at boot time and does not take the NTP/PTP
	  adjustments of the clocksource frequency into account. Depending
	  on the direction of the adjustment this can cause timers to fire
	  early or late. Early is the more problematic case as the timer
	  interrupt has to reprogram the device with a very short delta as
	  it can't expire timers early.

	  This can be optimized by introducing a 'coupled' mode for the
	  clocksource and the clockevent device.

	    A) If the clocksource indicates support for 'coupled' mode, the
	       timekeeping core calculates a (NTP adjusted) reverse
	       conversion factor, i.e. nanoseconds to clocksource cycles,
	       from the clocksource to nanoseconds conversion. This takes
	       NTP adjustments into account and keeps the conversion in
	       sync.

	    B) The timekeeping core provides a function to convert an
	       absolute CLOCK_MONOTONIC expiry time into an absolute time in
	       clocksource cycles which can be programmed directly into the
	       comparator without reading the clocksource at all.

	       This is possible because timekeeping keeps a time pair of
	       the base cycle count and the corresponding CLOCK_MONOTONIC base
	       time at the last update of the timekeeper.

	       So the absolute cycle time can be calculated by calculating
	       the relative time to the CLOCK_MONOTONIC base time,
	       converting the delta into cycles with the help of #A and
	       adding the base cycle count. Pure math, no hardware access.

	    C) The clockevent reprogramming code invokes this conversion
	       function when the clockevent device indicates 'coupled'
	       mode.  The function returns false when the corresponding
	       clocksource is not the current system clocksource (based on
	       a clocksource ID check) and true if the clocksource matches
	       and the conversion is successful.

	       If false, the regular relative set_next_event() mechanism is
	       used. Otherwise a new set_next_coupled() callback is invoked,
	       which takes the calculated absolute expiry time as argument.

	       Similar to the clocksource, this new callback can optionally
	       be inlined.
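
Steps A) to C) can be sketched in userspace C as shown below. The
structure layout, the mult/shift representation of the reverse factor
and all names are assumptions made for illustration. Only
set_next_event() and set_next_coupled() are callback names from the
description above, and the sketch merely records what would be passed to
them. Overflow of the multiplication for very large deltas is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

struct tk_snapshot {
	int      cs_id;         /* ID of the current system clocksource */
	uint64_t base_cycles;   /* cycle count at the last timekeeper update */
	uint64_t base_ns;       /* CLOCK_MONOTONIC at that same update */
	uint32_t ns2cyc_mult;   /* NTP adjusted reverse factor from #A */
	uint32_t ns2cyc_shift;
};

/* #B: absolute CLOCK_MONOTONIC ns -> absolute cycles, no hardware access. */
static bool tk_ns_to_cycles(const struct tk_snapshot *tk, int cs_id,
			    uint64_t expires_ns, uint64_t *cycles)
{
	uint64_t delta_ns;

	if (tk->cs_id != cs_id)
		return false;	/* not the current system clocksource */

	delta_ns = expires_ns > tk->base_ns ? expires_ns - tk->base_ns : 0;
	*cycles = tk->base_cycles +
		  ((delta_ns * tk->ns2cyc_mult) >> tk->ns2cyc_shift);
	return true;
}

struct clockevent {
	bool     coupled;          /* device indicates 'coupled' mode */
	int      cs_id;            /* clocksource the comparator is tied to */
	bool     used_coupled;     /* which path the last programming took */
	uint64_t last_abs_cycles;  /* argument for set_next_coupled() */
	uint64_t last_rel_ns;      /* delta for set_next_event() */
};

/* #C: try the coupled path first, fall back to the relative scheme. */
static void ce_program(struct clockevent *ce, const struct tk_snapshot *tk,
		       uint64_t expires_ns, uint64_t now_ns)
{
	uint64_t cyc;

	if (ce->coupled && tk_ns_to_cycles(tk, ce->cs_id, expires_ns, &cyc)) {
		ce->last_abs_cycles = cyc;
		ce->used_coupled = true;
		return;
	}
	ce->last_rel_ns = expires_ns > now_ns ? expires_ns - now_ns : 0;
	ce->used_coupled = false;
}
```

With a reverse factor of 2 cycles per nanosecond (mult 2048, shift 10),
an expiry 1000ns after the base time maps to the base cycle count plus
2000 cycles.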


    3) hrtimers

       It turned out that the hrtimer code needed a long overdue spring
       cleaning independent of the problem at hand. That was conducted
       before tackling the actual performance issues:

       - Timer locality

	 The handling of timer locality is suboptimal and often results in
	 pointless invocations of switch_hrtimer_base() which end up
	 keeping the CPU base unchanged.

	 Aside from the pointless overhead, this prevents further
	 optimizations for the common local case.

	 Address this by improving the decision logic for keeping the clock
	 base local and splitting out the (re)arm handling into a unified
	 operation.


       - Evaluation of the clock base expiries

	 The clock bases (MONOTONIC, REALTIME, BOOT, TAI) cache the first
	 expiring timer, but not the corresponding expiry time, which means
	 a re-evaluation of the clock bases for the next expiring timer on
	 the CPU requires touching up to four extra cache lines.

	 Trivial to solve by caching the earliest expiry time in the clock
	 base itself.
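
A simplified sketch of the caching idea, with hypothetical names. It
ignores the per clock offsets and the update on removal, both of which
real code has to handle. The point is that the evaluation loop only
reads the clock base array and never dereferences a timer.

```c
#include <stdint.h>

#define NR_BASES     4		/* MONOTONIC, REALTIME, BOOT, TAI */
#define EXPIRES_NONE UINT64_MAX

struct clock_base {
	/* Cached expiry of the earliest queued timer or EXPIRES_NONE. */
	uint64_t first_expires;
};

/* Enqueue side: keep the cached value up to date. */
static void base_enqueue(struct clock_base *base, uint64_t expires)
{
	if (expires < base->first_expires)
		base->first_expires = expires;
}

/* Evaluation side: touches only the bases, not the timers. */
static uint64_t next_event(const struct clock_base bases[NR_BASES])
{
	uint64_t next = EXPIRES_NONE;

	for (int i = 0; i < NR_BASES; i++) {
		if (bases[i].first_expires < next)
			next = bases[i].first_expires;
	}
	return next;
}
```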


       - Reprogramming of the clock event device

	 The hrtimer interrupt already defers reprogramming until the
	 interrupt handler completes, but in case of the hrtick timer
	 that's not sufficient because the hrtick timer callback only sets
	 the NEED_RESCHED flag but has no information about the next hrtick
	 timer expiry time, which can only be determined in the scheduler.

	 Expand the deferred reprogramming so it can ideally be handled in
	 the subsequent schedule() after the new hrtick value has been
	 established. If there is no schedule, if soft interrupts have to
	 be processed on return from interrupt, or if a nested interrupt
	 hits before reaching schedule(), the deferred programming is
	 handled in those contexts.
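
The control flow can be modelled roughly as below. The deferred_rearm
flag name follows the patch titles of the series. Everything else is a
hypothetical simplification which reduces the involved contexts to plain
function calls.

```c
#include <stdbool.h>
#include <stdint.h>

struct cpu_base {
	bool     deferred_rearm;   /* a rearm is pending */
	uint64_t next_expiry;      /* earliest expiry known to the core */
	uint64_t programmed;       /* what the device is armed for */
	unsigned int reprograms;   /* counts the device accesses */
};

/* End of the hrtimer interrupt: do not touch the device yet. */
static void hrtimer_irq_done(struct cpu_base *cb)
{
	cb->deferred_rearm = true;
}

/* Whoever gets here first performs the one and only rearm. */
static void hrtimer_rearm_deferred(struct cpu_base *cb)
{
	if (!cb->deferred_rearm)
		return;
	cb->deferred_rearm = false;
	cb->programmed = cb->next_expiry;
	cb->reprograms++;
}

/* Ideal case: schedule() establishes the new hrtick, then rearms once. */
static void schedule_end(struct cpu_base *cb, uint64_t new_hrtick)
{
	if (new_hrtick < cb->next_expiry)
		cb->next_expiry = new_hrtick;
	hrtimer_rearm_deferred(cb);
}

/* Fallback: no schedule, softirq processing or a nested interrupt. */
static void irq_exit_fallback(struct cpu_base *cb)
{
	hrtimer_rearm_deferred(cb);
}
```

In the ideal case the device is programmed exactly once per interrupt
and already with the new hrtick expiry. The fallback ensures that the
rearm is not lost when schedule() is not reached.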


       - Modification of queued timers

	 If a timer is already queued, modifying the expiry time requires
	 dequeueing it from the RB tree and requeueing it after the new
	 expiry value has been updated. It turned out that the hrtick timer
	 modifications very often end up at the same spot in the RB tree as
	 before, which means the dequeue/enqueue cycle along with the
	 related rebalancing could have been avoided. The timer wheel
	 timers have a similar mechanism by checking upfront whether the
	 resulting expiry time keeps them in the same hash bucket.

	 An attempt to check this by using rb_prev() and rb_next() to
	 evaluate whether the modification keeps the timer in the same
	 spot turned out to be really inefficient.

	 Solve this by providing a RB tree variant which extends the node
	 with links to the previous and next nodes, which are established
	 when the node is linked into the tree and adjusted when it is
	 removed. These links allow a quick peek into the previous and next
	 expiry time and if the new expiry stays within those boundaries
	 the whole RB tree operation can be avoided.

	 This also simplifies the caching and update of the leftmost node
	 as on remove the rb_next() walk can be completely avoided. It
	 would obviously provide a cached rightmost pointer too, but there
	 is no use case for that (yet).

	 On a hackbench run this results in about 35% of the updates being
	 handled that way, which cuts the execution time of
	 hrtimer_start_range_ns() down to 50ns on a 2GHz machine.
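
The neighbour check can be illustrated with a stripped down node which
only carries the two extra links. The names are hypothetical and the RB
tree itself is left out, as the point is that the check does not need
it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct linked_node {
	struct linked_node *prev;  /* neighbour with the next earlier expiry */
	struct linked_node *next;  /* neighbour with the next later expiry */
	uint64_t expires;
};

/*
 * Returns true when the new expiry keeps the node between its
 * neighbours, so the expiry can be updated without a dequeue/enqueue
 * cycle. Returns false when the caller has to requeue the node.
 */
static bool timer_modify_in_place(struct linked_node *node, uint64_t expires)
{
	if (node->prev && expires < node->prev->expires)
		return false;
	if (node->next && expires > node->next->expires)
		return false;

	node->expires = expires;
	return true;
}
```

For three timers expiring at 100, 200 and 300, moving the middle one to
250 is handled in place, while moving it to 350 requires the regular
dequeue and enqueue.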


       - Cancellation of queued timers

	 Cancelling a timer or moving its expiry time past the programmed
	 time can result in reprogramming the clock event device. With
	 frequent modifications of a queued timer this results in
	 substantial overhead, especially in VMs.

	 Provide an option for hrtimers to tell the core to handle
	 reprogramming lazily in those cases, which means it trades frequent
	 reprogramming against an occasional pointless hrtimer interrupt.

	 It turned out that for the hrtick timer this is a reasonable
	 tradeoff. It's especially valuable when transitioning to idle,
	 where the timer has to be cancelled but then the NOHZ idle code
	 will reprogram it in case of a long idle sleep anyway. It also
	 turned out to be beneficial in high frequency scheduling
	 scenarios.
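
The tradeoff can be modelled with a small sketch. The names and the
counters are hypothetical. The model only distinguishes whether the
first expiring timer moves earlier or later than what the device is
currently programmed for.

```c
#include <stdbool.h>
#include <stdint.h>

struct event_dev {
	uint64_t programmed;      /* expiry the hardware is armed for */
	unsigned int reprograms;  /* device accesses (VM exits in a guest) */
	unsigned int spurious;    /* interrupts which found nothing expired */
};

static void dev_program(struct event_dev *dev, uint64_t expires)
{
	dev->programmed = expires;
	dev->reprograms++;
}

/* The first expiring timer changed due to a cancel or a modification. */
static void first_expiry_changed(struct event_dev *dev, uint64_t first,
				 bool lazy)
{
	if (first < dev->programmed)
		dev_program(dev, first);  /* earlier: always reprogram */
	else if (first > dev->programmed && !lazy)
		dev_program(dev, first);  /* later: only when not lazy */
	/* lazy and later: keep the stale, too early programming */
}

/* Timer interrupt: expire what is due, then arm for the real expiry. */
static void timer_interrupt(struct event_dev *dev, uint64_t now,
			    uint64_t first)
{
	if (first > now)
		dev->spurious++;	  /* the price of being lazy */
	dev_program(dev, first);
}
```

A lazy timer whose expiry is pushed out several times causes no device
access at all until the stale programming fires, at which point a single
spurious interrupt corrects it.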


With all the above modifications in place, enabling hrtick no longer
results in regressions compared to the hrtick disabled mode.

The reprogramming frequency of the clockevent device went down from
~2500/sec to ~100/sec for a hackbench run, with a spurious hrtimer
interrupt ratio of about 25%.

What's interesting is the astonishing improvement of a hackbench run with
the following command line parameters: '-l$LOOPS -p -s8'. That uses pipes
with a message size of 8 bytes. On a 112 CPU SKL machine this results in:

       	   NO HRTICK[_DL]		HRTICK[_DL]
runtime:   0.840s			0.481s		~-42%

With other message sizes up to 256, HRTICK still results in improvements,
but not in that magnitude. Haven't investigated the cause of that yet.

While quite a few parts of the series are independent enhancements, I've
decided to keep them together in one big pile for now as all of the
components are required to actually achieve the overall goal.

The patches have already been structured in a way that they can be
distributed to different subsystem branches without causing major cross
subsystem contamination or merge conflict headaches.

The series applies on v7.0-rc1 and is also available from git:

   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git sched/hrtick

Thanks,

	tglx
---
 arch/x86/Kconfig                      |    2 
 arch/x86/include/asm/clock_inlined.h  |   22 
 arch/x86/kernel/apic/apic.c           |   41 -
 arch/x86/kernel/tsc.c                 |    4 
 include/asm-generic/thread_info_tif.h |    5 
 include/linux/clockchips.h            |    8 
 include/linux/clocksource.h           |    3 
 include/linux/hrtimer.h               |   59 -
 include/linux/hrtimer_defs.h          |   79 +-
 include/linux/hrtimer_rearm.h         |   83 ++
 include/linux/hrtimer_types.h         |   19 
 include/linux/irq-entry-common.h      |   25 
 include/linux/rbtree.h                |   81 ++
 include/linux/rbtree_types.h          |   16 
 include/linux/rseq_entry.h            |   14 
 include/linux/timekeeper_internal.h   |    8 
 include/linux/timerqueue.h            |   56 +
 include/linux/timerqueue_types.h      |   15 
 include/trace/events/timer.h          |   35 -
 kernel/entry/common.c                 |    4 
 kernel/sched/core.c                   |   89 ++
 kernel/sched/deadline.c               |    2 
 kernel/sched/fair.c                   |   55 -
 kernel/sched/features.h               |    5 
 kernel/sched/sched.h                  |   41 -
 kernel/softirq.c                      |   15 
 kernel/time/Kconfig                   |   16 
 kernel/time/clockevents.c             |   48 +
 kernel/time/hrtimer.c                 | 1116 +++++++++++++++++++---------------
 kernel/time/tick-broadcast-hrtimer.c  |    1 
 kernel/time/tick-sched.c              |   27 
 kernel/time/timekeeping.c             |  184 +++++
 kernel/time/timekeeping.h             |    2 
 kernel/time/timer_list.c              |   12 
 lib/rbtree.c                          |   17 
 lib/timerqueue.c                      |   14 
 36 files changed, 1497 insertions(+), 728 deletions(-)

Re: [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement

Posted by Christian Loehle 1 month, 1 week ago

On 2/24/26 16:35, Thomas Gleixner wrote:
> [...]

FWIW I tested various workloads for this on an arm64 rk3399, comparing:

  - mainline NO_HRTICK
  - mainline HRTICK
  - rearm NO_HRTICK
  - rearm HRTICK

rearm being $SUBJECT + arm64 generic entry + enabling generic TIF bits.
https://lore.kernel.org/lkml/20260203133728.848283-1-ruanjinjie@huawei.com/

There's nothing statistically significant with 1000HZ (it has 6 CPUs, so base
slice granularity is 2.1ms).
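
The 2.1ms figure can be reproduced with a small calculation. It assumes
the mainline default base slice of 0.7ms and the default logarithmic
tunable scaling with a factor of 1 + ilog2(min(ncpus, 8)). Both are
assumptions about the tested configuration and not stated in this
thread, and the helper names are made up for this sketch.

```c
#include <stdint.h>

/* Integer log2, rounded down. Returns 0 for v <= 1. */
static unsigned int ilog2_u32(unsigned int v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Assumed default: 0.7ms base slice, scaled by 1 + ilog2(min(ncpus, 8)). */
static uint64_t scaled_base_slice_ns(unsigned int ncpus)
{
	unsigned int cpus = ncpus < 8 ? ncpus : 8;

	return 700000ULL * (1 + ilog2_u32(cpus));
}
```

For 6 CPUs the factor is 1 + ilog2(6) = 3, which results in
3 * 0.7ms = 2.1ms.
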
With 250HZ I get at least something, a selection:
+-------------+---------------------+---------------------+----------------------+----------------------+----------------------+----------------------+
| Test        | mainline NO_HRTICK  | mainline HRTICK     | rearm NO_HRTICK      | rearm HRTICK         | subject NO_HRTICK    | subject HRTICK       |
+-------------+---------------------+---------------------+----------------------+----------------------+----------------------+----------------------+
| schbench    | 306.83 ± 3.10       | 301.81 ± 1.07       | 298.67 ± 3.33        | (304.87 ± 3.29)      | (305.79 ± 3.64)      | (307.07 ± 1.05)      |
| ebizzy      | 10664 ± 19          | (10565 ± 285)       | (10510 ± 245)        | (10580 ± 240)        | (10674 ± 259)        | 10816 ± 27           |
| hackbench   | 19.715 ± 0.11       | (19.707 ± 0.10)     | (19.826 ± 0.15)      | (19.81 ± 0.12)       | 19.98 ± 0.10         | (19.74 ± 0.11)       |
| nullb0 IOPS | 102525 ± 367        | (101850 ± 262)      | 92209 ± 7624         | (103385 ± 422)       | (101854 ± 473)       | (102141 ± 149)       |
+-------------+---------------------+---------------------+----------------------+----------------------+----------------------+----------------------+
(subject is $SUBJECT only, so no REARM_DEFERRED on arm64.)
But at least there is no regression with sched_feat HRTICK.

Re: [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement

Posted by Peter Zijlstra 1 month, 2 weeks ago

On Tue, Feb 24, 2026 at 05:35:12PM +0100, Thomas Gleixner wrote:

> The series applies on v7.0-rc1 and is also available from git:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git sched/hrtick

If you'd have added the shortlog, you'd have made nearly 350 lines :-)

Peter Zijlstra (11):
      sched/eevdf: Fix HRTICK duration
      hrtimer: Avoid pointless reprogramming in __hrtimer_start_range_ns()
      hrtimer: Provide LAZY_REARM mode
      sched/hrtick: Mark hrtick timer LAZY_REARM
      hrtimer: Re-arrange hrtimer_interrupt()
      hrtimer: Prepare stubs for deferred rearming
      entry: Prepare for deferred hrtimer rearming
      softirq: Prepare for deferred hrtimer rearming
      sched/core: Prepare for deferred hrtimer rearming
      hrtimer: Push reprogramming timers into the interrupt return path
      sched: Default enable HRTICK when deferred rearming is enabled

Peter Zijlstra (Intel) (2):
      sched/fair: Simplify hrtick_update()
      sched/fair: Make hrtick resched hard

Thomas Gleixner (35):
      sched: Avoid ktime_get() indirection
      hrtimer: Provide a static branch based hrtimer_hres_enabled()
      sched: Use hrtimer_highres_enabled()
      sched: Optimize hrtimer handling
      sched/hrtick: Avoid tiny hrtick rearms
      tick/sched: Avoid hrtimer_cancel/start() sequence
      clockevents: Remove redundant CLOCK_EVT_FEAT_KTIME
      timekeeping: Allow inlining clocksource::read()
      x86: Inline TSC reads in timekeeping
      x86/apic: Remove pointless fence in lapic_next_deadline()
      x86/apic: Avoid the PVOPS indirection for the TSC deadline timer
      timekeeping: Provide infrastructure for coupled clockevents
      clockevents: Provide support for clocksource coupled comparators
      x86/apic: Enable TSC coupled programming mode
      hrtimer: Add debug object init assertion
      hrtimer: Reduce trace noise in hrtimer_start()
      hrtimer: Use guards where appropriate
      hrtimer: Cleanup coding style and comments
      hrtimer: Evaluate timer expiry only once
      hrtimer: Replace the bitfield in hrtimer_cpu_base
      hrtimer: Convert state and properties to boolean
      hrtimer: Optimize for local timers
      hrtimer: Use NOHZ information for locality
      hrtimer: Separate remove/enqueue handling for local timers
      hrtimer: Add hrtimer_rearm tracepoint
      hrtimer: Rename hrtimer_cpu_base::in_hrtirq to deferred_rearm
      hrtimer: Avoid re-evaluation when nothing changed
      hrtimer: Keep track of first expiring timer per clock base
      hrtimer: Rework next event evaluation
      hrtimer: Simplify run_hrtimer_queues()
      hrtimer: Optimize for_each_active_base()
      rbtree: Provide rbtree with links
      timerqueue: Provide linked timerqueue
      hrtimer: Use linked timerqueue
      hrtimer: Try to modify timers in place


Anyway, since I've been staring at these patches for over a week now:

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

You want me to go queue them in tip/sched/hrtick, tip/timer/hrtick and
then merge both into tip/sched/core and have tip/timer/core only include
tip/timer/hrtick or something?

Re: [patch 00/48] hrtimer,sched: General optimizations and hrtick enablement

Posted by Thomas Gleixner 1 month, 2 weeks ago

On Wed, Feb 25 2026 at 16:25, Peter Zijlstra wrote:
> You want me to go queue them in tip/sched/hrtick, tip/timer/hrtick and
> then merge both into tip/sched/core and have tip/timer/core only include
> tip/timer/hrtick or something?

I'd like to split them up and only pull the minimal stuff into the
subsystem branches. I made a plan already, but I can't find the notes
right now. I'll dig them out later.

Thanks,

        tglx