[v1] hrtimers: Prevent hrtimer interrupt starvation

[patch 00/12] hrtimers: Prevent hrtimer interrupt starvation

Posted by Thomas Gleixner 2 months, 1 week ago

Calvin reported an odd NMI watchdog lockup which claims that the CPU locked
up in user space:

  https://lore.kernel.org/lkml/acMe-QZUel-bBYUh@mozart.vkv.me/

He provided a reproducer, which sets up a timerfd based timer and then
rearms it in a loop with an absolute expiry time of 1ns.
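
For illustration, a bounded user space sketch of that pattern is shown below. It is not Calvin's actual reproducer (see the linked thread for that); the choice of CLOCK_MONOTONIC, the function name and the iteration limit are assumptions made for this example.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/*
 * Re-arm a timerfd 'iterations' times with an absolute expiry of 1ns.
 * On CLOCK_MONOTONIC (an assumption of this sketch) that expiry is always
 * in the past. Returns 0 on success, -1 on failure.
 */
int rearm_expired_timerfd(unsigned long iterations)
{
	struct itimerspec its = {
		.it_value = { .tv_sec = 0, .tv_nsec = 1 },
	};
	int fd = timerfd_create(CLOCK_MONOTONIC, 0);

	if (fd < 0)
		return -1;

	for (unsigned long i = 0; i < iterations; i++) {
		/* TFD_TIMER_ABSTIME makes it_value an absolute expiry time */
		if (timerfd_settime(fd, TFD_TIMER_ABSTIME, &its, NULL) < 0) {
			close(fd);
			return -1;
		}
	}

	close(fd);
	return 0;
}
```

The loop is bounded on purpose; the reported problem needs the re-arming to continue without a limit.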

As the expiry time is in the past, the timer ends up as the first expiring
timer in the per CPU hrtimer base and the clockevent device is programmed
with the minimum delta value. If the machine is fast enough, this ends up
in an endless loop of programming the delta value to the minimum value
defined by the clock event device, before the timer interrupt can fire,
which starves the interrupt and consequently triggers the lockup detector
because the hrtimer callback of the lockup mechanism is never invoked.
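
The starvation can be sketched with a small user space model. All names below are hypothetical and the model only captures the arithmetic: as long as the re-arm period is shorter than the minimum delta, the programmed expiry keeps moving ahead of the current time.

```c
#include <stdint.h>

/* Hypothetical model of a one-shot clockevent device; all times in ns. */
struct model_dev {
	uint64_t min_delta_ns;	/* smallest delta the device accepts */
	uint64_t fires_at_ns;	/* absolute time at which the interrupt fires */
};

/* Program the device for 'expires_ns', clamping the delta to the minimum. */
static void model_program(struct model_dev *dev, uint64_t now_ns,
			  uint64_t expires_ns)
{
	uint64_t delta = expires_ns > now_ns ? expires_ns - now_ns : 0;

	if (delta < dev->min_delta_ns)
		delta = dev->min_delta_ns;
	dev->fires_at_ns = now_ns + delta;
}

/*
 * Re-arm an already expired timer (expiry 1ns) every 'rearm_period_ns'
 * and count how often the interrupt gets a chance to fire in between.
 */
unsigned int model_count_interrupts(uint64_t min_delta_ns,
				    uint64_t rearm_period_ns,
				    unsigned int rearms)
{
	struct model_dev dev = { .min_delta_ns = min_delta_ns };
	unsigned int fired = 0;
	uint64_t now = 1000;

	for (unsigned int i = 0; i < rearms; i++) {
		model_program(&dev, now, 1);
		now += rearm_period_ns;
		/* Fires only if the expiry is reached before the next re-arm */
		if (now >= dev.fires_at_ns)
			fired++;
	}
	return fired;
}
```

With a minimum delta of 1000ns and a re-arm period of 500ns the model never counts an interrupt, no matter how many iterations are run.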

The first patch in the series changes the clockevent set next event
mechanism to prevent reprogramming of the clockevent device when the
minimum delta value was programmed unless the new delta is larger than
that. It's a less convoluted variant of the patch which was posted in the
above linked thread and was confirmed to prevent the starvation problem.
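
A minimal user space sketch of that rule, assuming a per device flag which records that the minimum delta is currently programmed. The struct, the flag name and the functions are hypothetical and not taken from the patch itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a clockevent device; all times in ns. */
struct guarded_dev {
	uint64_t min_delta_ns;
	uint64_t fires_at_ns;
	bool	 min_delta_armed;	/* minimum delta event is pending */
};

/* Returns true if the device was programmed, false if the request was skipped. */
bool guarded_program(struct guarded_dev *dev, uint64_t now_ns,
		     uint64_t expires_ns)
{
	uint64_t delta = expires_ns > now_ns ? expires_ns - now_ns : 0;

	if (delta > dev->min_delta_ns) {
		/* Larger than the minimum: program normally */
		dev->min_delta_armed = false;
		dev->fires_at_ns = now_ns + delta;
		return true;
	}

	/* Minimum delta already pending: leave the device alone */
	if (dev->min_delta_armed)
		return false;

	dev->min_delta_armed = true;
	dev->fires_at_ns = now_ns + dev->min_delta_ns;
	return true;
}

/* Interrupt side of the model: the pending event has been delivered. */
void guarded_interrupt(struct guarded_dev *dev)
{
	dev->min_delta_armed = false;
}

/* Re-arm an expired timer every 'rearm_period_ns' and count interrupts. */
unsigned int guarded_count_interrupts(uint64_t min_delta_ns,
				      uint64_t rearm_period_ns,
				      unsigned int rearms)
{
	struct guarded_dev dev = { .min_delta_ns = min_delta_ns };
	unsigned int fired = 0;
	uint64_t now = 1000;

	for (unsigned int i = 0; i < rearms; i++) {
		guarded_program(&dev, now, 1);
		now += rearm_period_ns;
		if (dev.min_delta_armed && now >= dev.fires_at_ns) {
			guarded_interrupt(&dev);
			fired++;
		}
	}
	return fired;
}
```

In this model the interrupt fires on every second re-arm when the re-arm period is half the minimum delta, instead of never.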

But that's only to be considered the last resort because it results in an
insane amount of avoidable hrtimer interrupts.

The problem with user controlled timers is that the input value is only
sanity checked for validity of the provided timespec and clamped to the
maximum allowable range. For performance reasons, the in-kernel arming
path does not check whether a timer which is about to be armed has already
expired at enqueue time.
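
The kind of input check described here can be modelled in user space as follows. The helper names and the maximum value are assumptions for this sketch; the point is that nothing compares the expiry against the current time, so an absolute expiry of 1ns is accepted.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define MODEL_NSEC_PER_SEC	1000000000LL
#define MODEL_EXPIRY_MAX_NS	INT64_MAX	/* assumed upper bound */

/* Valid means: non-negative seconds, nanoseconds within [0, 1s) */
bool user_timespec_valid(const struct timespec *ts)
{
	return ts->tv_sec >= 0 && ts->tv_nsec >= 0 &&
	       ts->tv_nsec < MODEL_NSEC_PER_SEC;
}

/* Convert to nanoseconds, clamping to the maximum instead of overflowing. */
int64_t user_timespec_to_ns(const struct timespec *ts)
{
	if ((int64_t)ts->tv_sec >= MODEL_EXPIRY_MAX_NS / MODEL_NSEC_PER_SEC)
		return MODEL_EXPIRY_MAX_NS;
	return (int64_t)ts->tv_sec * MODEL_NSEC_PER_SEC + ts->tv_nsec;
}

/*
 * Accept or reject a user supplied expiry. Note that there is no check
 * against the current time: an expiry in the past passes.
 */
bool user_expiry_accept(const struct timespec *ts, int64_t *expiry_ns)
{
	if (!user_timespec_valid(ts))
		return false;
	*expiry_ns = user_timespec_to_ns(ts);
	return true;
}
```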

The rest of the series addresses this by providing a separate interface to
arm user controlled timers. This works the same way as the existing
hrtimer_start_range_ns(), but if the timer ends up as the first timer in
the clock base after enqueue, it performs additional checks:

      - Whether the timer becomes the first expiring timer in the CPU base.

        If not, the timer is considered to expire in the future as there is
        already an earlier event programmed.

      - Whether the timer has already expired, by comparing the expiry value
        against the current time.

        If it has expired, the timer is removed from the clock base and the
        function returns false, so that the caller can handle it. That's
        required because the function cannot invoke the callback, as that
        might need to acquire a lock which is held by the caller.
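
The two checks can be sketched with a toy user space model of a timer base. Every name here is hypothetical (the cover letter does not name the new function), and an unsorted array stands in for the real timer queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_TIMERS 8

struct model_timer {
	int64_t expires_ns;
	bool	queued;
};

struct model_base {
	struct model_timer *timers[MODEL_MAX_TIMERS];
	size_t		    nr;
	int64_t		    programmed_ns;	/* expiry the event device is armed for */
};

/* Earliest expiring queued timer; on a tie the earlier queued timer wins */
static struct model_timer *model_first(const struct model_base *base)
{
	struct model_timer *first = NULL;

	for (size_t i = 0; i < base->nr; i++) {
		if (!first || base->timers[i]->expires_ns < first->expires_ns)
			first = base->timers[i];
	}
	return first;
}

static void model_remove(struct model_base *base, struct model_timer *t)
{
	for (size_t i = 0; i < base->nr; i++) {
		if (base->timers[i] == t) {
			base->timers[i] = base->timers[--base->nr];
			t->queued = false;
			return;
		}
	}
}

/* Returns true if the timer is queued, false if it had already expired. */
bool model_start_user_timer(struct model_base *base, struct model_timer *t,
			    int64_t expires_ns, int64_t now_ns)
{
	assert(base->nr < MODEL_MAX_TIMERS);

	t->expires_ns = expires_ns;
	t->queued = true;
	base->timers[base->nr++] = t;

	/* Check 1: not first, so an earlier event is already programmed */
	if (model_first(base) != t)
		return true;

	/* Check 2: first and already expired, so dequeue and tell the caller */
	if (expires_ns <= now_ns) {
		model_remove(base, t);
		return false;
	}

	base->programmed_ns = expires_ns;
	return true;
}
```

In the model an expired timer never reaches the point where the event device is programmed; the caller sees false and handles the expiry itself.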

This function is then used for the user controlled timer arming interfaces
mainly by converting hrtimer sleeper over to it. That affects a few
in-kernel users too, but the overhead is minimal in that case and it spares
a tedious whack-a-mole game all over the tree.

The other usage sites in posixtimers, alarmtimers and timerfd are converted
as well, which should cover the vast majority of user space controllable
timers as far as my investigation goes.
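
On the caller side, a converted usage site could look roughly like the following user space model of a timerfd-like context. The struct, the stub arming function and the details of the expiry accounting are assumptions for illustration and are not taken from the actual patches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timerfd-like context */
struct model_tfd {
	bool	 locked;	/* stands in for the context lock */
	uint64_t ticks;		/* expirations not yet consumed by a reader */
	bool	 armed;		/* timer is queued */
};

/* Stand-in for the new arming interface: false means already expired */
static bool stub_start_user_timer(int64_t expires_ns, int64_t now_ns)
{
	return expires_ns > now_ns;
}

/*
 * Arm the timer with the context lock held. If the timer has already
 * expired, the expiry is accounted right here under the held lock, which
 * is what the expiry callback would otherwise do after taking that lock.
 */
void model_tfd_settime(struct model_tfd *ctx, int64_t expires_ns,
		       int64_t now_ns)
{
	ctx->locked = true;
	ctx->ticks = 0;

	ctx->armed = stub_start_user_timer(expires_ns, now_ns);
	if (!ctx->armed)
		ctx->ticks = 1;

	ctx->locked = false;
}
```

This is why the arming function reports the expiry instead of invoking the callback: in this model the callback would have to take the lock which the caller of the arming function already holds.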

The series applies against Linus' tree and is also available from git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git hrtimer-exp-v1

There needs to be some discussion about the scope of backporting. The first
patch preventing the stall is obviously a backport candidate. The remaining
series can be obviously argued about, but in my opinion it should be
backported as well as it prevents stupid or malicious user space from
generating tons of pointless timer interrupts.

Thanks,

	tglx
---
 drivers/power/supply/charger-manager.c |   12 +-
 fs/timerfd.c                           |  115 +++++++++++++++-----------
 include/linux/alarmtimer.h             |    9 +-
 include/linux/clockchips.h             |    2 
 include/linux/hrtimer.h                |   20 +++-
 include/trace/events/timer.h           |   13 +++
 kernel/time/alarmtimer.c               |   70 +++++++---------
 kernel/time/clockevents.c              |   23 +++--
 kernel/time/hrtimer.c                  |  142 +++++++++++++++++++++++++++++----
 kernel/time/posix-cpu-timers.c         |   18 ++--
 kernel/time/posix-timers.c             |   35 +++++---
 kernel/time/posix-timers.h             |    4 
 kernel/time/tick-common.c              |    1 
 kernel/time/tick-sched.c               |    1 
 net/netfilter/xt_IDLETIMER.c           |   24 ++++-
 15 files changed, 341 insertions(+), 148 deletions(-)

Re: [patch 00/12] hrtimers: Prevent hrtimer interrupt starvation

Posted by Thomas Gleixner 2 months, 1 week ago

On Tue, Apr 07 2026 at 10:54, Thomas Gleixner wrote:
> There needs to be some discussion about the scope of backporting. The first
> patch preventing the stall is obviously a backport candidate. The remaining
> series can be obviously argued about, but in my opinion it should be
> backported as well as it prevents stupid or malicious user space from
> generating tons of pointless timer interrupts.

Peter and I just discussed it over IRC. With the clockevents prevention
in place, the effect of stupid/malicious code is pretty much limited to
the user space task itself. As the timer is forced to expire once
the clockevent device has been force armed, it won't have other side
effects: device interrupts and IPIs are not blocked out and are, in the
worst case, marginally delayed by the high frequency timer interrupt.

Once the task is scheduled out that subsides as there is nothing which
re-arms the timer anymore.

So we should be fine with backporting the clockevents fix and leave the
other parts of the series for upstream only. I still need to investigate
how all of that affects the pending changes vs. TSC deadline timer (and
similar devices) which are not going to reach that modified clockevents
code anymore.

Thanks,

        tglx

Re: [patch 00/12] hrtimers: Prevent hrtimer interrupt starvation

Posted by Thomas Gleixner 2 months, 1 week ago

On Tue, Apr 07 2026 at 16:43, Thomas Gleixner wrote:
> On Tue, Apr 07 2026 at 10:54, Thomas Gleixner wrote:
>> There needs to be some discussion about the scope of backporting. The first
>> patch preventing the stall is obviously a backport candidate. The remaining
>> series can be obviously argued about, but in my opinion it should be
>> backported as well as it prevents stupid or malicious user space from
>> generating tons of pointless timer interrupts.
>
> Peter and I just discussed it over IRC. With the clockevents prevention
> in place, the effect of stupid/malicious code is pretty much limited to
> the user space task itself. As the timer is forced to expire once
> the clockevent device has been force armed, it won't have other side
> effects: device interrupts and IPIs are not blocked out and are, in the
> worst case, marginally delayed by the high frequency timer interrupt.
>
> Once the task is scheduled out that subsides as there is nothing which
> re-arms the timer anymore.
>
> So we should be fine with backporting the clockevents fix and leave the
> other parts of the series for upstream only. I still need to investigate
> how all of that affects the pending changes vs. TSC deadline timer (and
> similar devices) which are not going to reach that modified clockevents
> code anymore.

It's pretty much the same as the above with the difference that a timer
armed in the past will result in an instantaneous interrupt as the
coupled event devices must provide a less than or equal comparator. So
again the task can only delay itself with hrtimer interrupts.
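
As a sketch of that difference, a deadline style device can be modelled as a comparator against a free running counter. The function names are hypothetical; the only property taken from the text above is the less than or equal comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* The event is pending as soon as the deadline is <= the counter value */
bool deadline_pending(uint64_t deadline, uint64_t counter)
{
	return deadline <= counter;
}

/* Counter ticks until the interrupt is raised; 0 for a deadline in the past */
uint64_t deadline_latency(uint64_t deadline, uint64_t counter)
{
	return deadline_pending(deadline, counter) ? 0 : deadline - counter;
}
```

Unlike the delta based model, there is no minimum delta which could be pushed forward by re-arming: a deadline in the past is pending immediately.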

Thanks,

        tglx

Re: [patch 00/12] hrtimers: Prevent hrtimer interrupt starvation

Posted by Calvin Owens 2 months, 1 week ago

On Tuesday 04/07 at 10:54 +0200, Thomas Gleixner wrote:
> Calvin reported an odd NMI watchdog lockup which claims that the CPU locked
> up in user space:
> 
>   https://lore.kernel.org/lkml/acMe-QZUel-bBYUh@mozart.vkv.me/
> 
> He provided a reproducer, which sets up a timerfd based timer and then
> rearms it in a loop with an absolute expiry time of 1ns.

The original AMD machines survive the reproducer with this series.

Tested-by: Calvin Owens <calvin@wbinvd.org>

I'm happy to test subsets of it and stable backports too, if that's
helpful, just let me know.

Thanks,
Calvin

Re: [patch 00/12] hrtimers: Prevent hrtimer interrupt starvation

Posted by Thomas Gleixner 2 months, 1 week ago

On Tue, Apr 07 2026 at 10:38, Calvin Owens wrote:
> On Tuesday 04/07 at 10:54 +0200, Thomas Gleixner wrote:
>> He provided a reproducer, which sets up a timerfd based timer and then
>> rearms it in a loop with an absolute expiry time of 1ns.
>
> The original AMD machines survive the reproducer with this series.
>
> Tested-by: Calvin Owens <calvin@wbinvd.org>
>
> I'm happy to test subsets of it and stable backports too, if that's
> helpful, just let me know.

We'll only backport the first patch, so confirming that it still
prevents the issue would be nice. The rest is slated for upstream only.

Thanks,

        tglx

Re: [patch 00/12] hrtimers: Prevent hrtimer interrupt starvation

Posted by Calvin Owens 2 months, 1 week ago

On Tuesday 04/07 at 20:03 +0200, Thomas Gleixner wrote:
> On Tue, Apr 07 2026 at 10:38, Calvin Owens wrote:
> > On Tuesday 04/07 at 10:54 +0200, Thomas Gleixner wrote:
> >> He provided a reproducer, which sets up a timerfd based timer and then
> >> rearms it in a loop with an absolute expiry time of 1ns.
> >
> > The original AMD machines survive the reproducer with this series.
> >
> > Tested-by: Calvin Owens <calvin@wbinvd.org>
> >
> > I'm happy to test subsets of it and stable backports too, if that's
> > helpful, just let me know.
> 
> We'll only backport the first patch, so confirming that it still
> prevents the issue would be nice. The rest is slated for upstream only.

Confirmed, [1/12] alone passes.

Thanks,
Calvin

Re: [patch 00/12] hrtimers: Prevent hrtimer interrupt starvation

Posted by Thomas Gleixner 2 months, 1 week ago

On Tue, Apr 07 2026 at 11:35, Calvin Owens wrote:
> On Tuesday 04/07 at 20:03 +0200, Thomas Gleixner wrote:
>> On Tue, Apr 07 2026 at 10:38, Calvin Owens wrote:
>> > On Tuesday 04/07 at 10:54 +0200, Thomas Gleixner wrote:
>> >> He provided a reproducer, which sets up a timerfd based timer and then
>> >> rearms it in a loop with an absolute expiry time of 1ns.
>> >
>> > The original AMD machines survive the reproducer with this series.
>> >
>> > Tested-by: Calvin Owens <calvin@wbinvd.org>
>> >
>> > I'm happy to test subsets of it and stable backports too, if that's
>> > helpful, just let me know.
>> 
>> We'll only backport the first patch, so confirming that it still
>> prevents the issue would be nice. The rest is slated for upstream only.
>
> Confirmed, [1/12] alone passes.

Thanks a lot for all your help. Much appreciated.

       tglx