[PATCH v2] tick/sched: Limit non-timekeeper CPUs calling jiffies update
Posted by Steve Wahl 3 months, 1 week ago
On large NUMA systems, while running a test program that saturates the
inter-processor and inter-NUMA links, acquiring the jiffies_lock can
be very expensive.  If the CPU designated to do jiffies updates
(tick_do_timer_cpu) gets delayed and other CPUs decide to do the
jiffies update themselves, a large number of them decide to do so at
the same time.  The inexpensive check against tick_next_period is far
quicker than actually acquiring the lock, so most of these get in line
to obtain the lock.  If obtaining the lock is slow enough, this
spirals into the vast majority of CPUs continuously being stuck
waiting for this lock, just to obtain it and find out that time has
already been updated by another CPU. For example, on one random entry
to kdb by manually-injected NMI, I saw 2912 of 3840 CPUs stuck here.
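
For reference, the check-then-lock pattern involved looks roughly like
this (a simplified sketch of tick_do_update_jiffies64(); the real
function also handles !CONFIG_64BIT and the jiffies seqcount):

void tick_do_update_jiffies64(ktime_t now)
{
	/* Inexpensive check: bail out if the period has not elapsed. */
	if (ktime_before(now, smp_load_acquire(&tick_next_period)))
		return;

	/* Every CPU that got past the quick check queues up here. */
	raw_spin_lock(&jiffies_lock);
	/* Recheck under the lock: another CPU may have advanced time. */
	if (ktime_before(now, tick_next_period)) {
		raw_spin_unlock(&jiffies_lock);
		return;
	}
	/* ... advance jiffies_64 and tick_next_period ... */
	raw_spin_unlock(&jiffies_lock);
}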

To avoid this, allow only one non-timekeeper CPU to call
tick_do_update_jiffies64() at any given time, resetting ts->stalled_jiffies
only if the jiffies update function is actually called.

With this change, manually interrupting the test, I find at most two
CPUs in the tick_do_update_jiffies64 function (the timekeeper and one
other).

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
---

v2: Rewritten to use an atomic to gate non-timekeeping CPUs calling the
    jiffies update, as suggested by tglx. Title of patch has changed
    since trylock is no longer used.

v1 discussion: https://lore.kernel.org/all/20251013150959.298288-1-steve.wahl@hpe.com/

 kernel/time/tick-sched.c | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c527b421c865..3ff3eb1f90d0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -201,6 +201,27 @@ static inline void tick_sched_flag_clear(struct tick_sched *ts,
 	ts->flags &= ~flag;
 }
 
+/*
+ * Allow only one non-timekeeper CPU at a time to update jiffies from
+ * the timer tick.
+ *
+ * Returns true if update was run.
+ */
+static bool tick_limited_update_jiffies64(struct tick_sched *ts, ktime_t now)
+{
+	static atomic_t in_progress;
+	int inp;
+
+	inp = atomic_read(&in_progress);
+	if (inp || !atomic_try_cmpxchg(&in_progress, &inp, 1))
+		return false;
+
+	if (ts->last_tick_jiffies == jiffies)
+		tick_do_update_jiffies64(now);
+	atomic_set(&in_progress, 0);
+	return true;
+}
+
 #define MAX_STALLED_JIFFIES 5
 
 static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
@@ -239,10 +260,11 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
 		ts->stalled_jiffies = 0;
 		ts->last_tick_jiffies = READ_ONCE(jiffies);
 	} else {
-		if (++ts->stalled_jiffies == MAX_STALLED_JIFFIES) {
-			tick_do_update_jiffies64(now);
-			ts->stalled_jiffies = 0;
-			ts->last_tick_jiffies = READ_ONCE(jiffies);
+		if (++ts->stalled_jiffies >= MAX_STALLED_JIFFIES) {
+			if (tick_limited_update_jiffies64(ts, now)) {
+				ts->stalled_jiffies = 0;
+				ts->last_tick_jiffies = READ_ONCE(jiffies);
+			}
 		}
 	}
 
-- 
2.43.0
Re: [PATCH v2] tick/sched: Limit non-timekeeper CPUs calling jiffies update
Posted by Shrikanth Hegde 3 months, 1 week ago

On 10/28/25 12:04 AM, Steve Wahl wrote:
> On large NUMA systems, while running a test program that saturates the
> inter-processor and inter-NUMA links, acquiring the jiffies_lock can
> be very expensive.  If the CPU designated to do jiffies updates
> (tick_do_timer_cpu) gets delayed and other CPUs decide to do the
> jiffies update themselves, a large number of them decide to do so at
> the same time.  The inexpensive check against tick_next_period is far
> quicker than actually acquiring the lock, so most of these get in line
> to obtain the lock.  If obtaining the lock is slow enough, this
> spirals into the vast majority of CPUs continuously being stuck
> waiting for this lock, just to obtain it and find out that time has
> already been updated by another CPU. For example, on one random entry
> to kdb by manually-injected NMI, I saw 2912 of 3840 CPUs stuck here.
> 
> To avoid this, allow only one non-timekeeper CPU to call
> tick_do_update_jiffies64() at any given time, resetting ts->stalled_jiffies
> only if the jiffies update function is actually called.
> 
> With this change, manually interrupting the test, I find at most two
> CPUs in the tick_do_update_jiffies64 function (the timekeeper and one
> other).
> 
> Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
> ---
> 
> v2: Rewritten to use an atomic to gate non-timekeeping CPUs calling the
>      jiffies update, as suggested by tglx. Title of patch has changed
>      since trylock is no longer used.
> 
> v1 discussion: https://lore.kernel.org/all/20251013150959.298288-1-steve.wahl@hpe.com/
> 
>   kernel/time/tick-sched.c | 30 ++++++++++++++++++++++++++----
>   1 file changed, 26 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index c527b421c865..3ff3eb1f90d0 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -201,6 +201,27 @@ static inline void tick_sched_flag_clear(struct tick_sched *ts,
>   	ts->flags &= ~flag;
>   }
>   
> +/*
> + * Allow only one non-timekeeper CPU at a time to update jiffies from
> + * the timer tick.
> + *
> + * Returns true if update was run.
> + */
> +static bool tick_limited_update_jiffies64(struct tick_sched *ts, ktime_t now)
> +{
> +	static atomic_t in_progress;
> +	int inp;
> +
> +	inp = atomic_read(&in_progress);
> +	if (inp || !atomic_try_cmpxchg(&in_progress, &inp, 1))
> +		return false;
> +

You come here if (ts->last_tick_jiffies == jiffies). So it may not be necessary to check again.

> +	if (ts->last_tick_jiffies == jiffies)
> +		tick_do_update_jiffies64(now);
> +	atomic_set(&in_progress, 0);
> +	return true;
> +}
> +
>   #define MAX_STALLED_JIFFIES 5
>   
>   static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
> @@ -239,10 +260,11 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
>   		ts->stalled_jiffies = 0;
>   		ts->last_tick_jiffies = READ_ONCE(jiffies);
>   	} else {
> -		if (++ts->stalled_jiffies == MAX_STALLED_JIFFIES) {
> -			tick_do_update_jiffies64(now);
> -			ts->stalled_jiffies = 0;
> -			ts->last_tick_jiffies = READ_ONCE(jiffies);
> +		if (++ts->stalled_jiffies >= MAX_STALLED_JIFFIES) {
> +			if (tick_limited_update_jiffies64(ts, now)) {
> +				ts->stalled_jiffies = 0;
> +				ts->last_tick_jiffies = READ_ONCE(jiffies);
> +			}
>   		}
>   	}
>   


Yes. This could help large systems.

Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Re: [PATCH v2] tick/sched: Limit non-timekeeper CPUs calling jiffies update
Posted by Steve Wahl 3 months, 1 week ago
On Tue, Oct 28, 2025 at 11:39:30AM +0530, Shrikanth Hegde wrote:
> 
> 
> On 10/28/25 12:04 AM, Steve Wahl wrote:
> > On large NUMA systems, while running a test program that saturates the
> > inter-processor and inter-NUMA links, acquiring the jiffies_lock can
> > be very expensive.  If the CPU designated to do jiffies updates
> > (tick_do_timer_cpu) gets delayed and other CPUs decide to do the
> > jiffies update themselves, a large number of them decide to do so at
> > the same time.  The inexpensive check against tick_next_period is far
> > quicker than actually acquiring the lock, so most of these get in line
> > to obtain the lock.  If obtaining the lock is slow enough, this
> > spirals into the vast majority of CPUs continuously being stuck
> > waiting for this lock, just to obtain it and find out that time has
> > already been updated by another CPU. For example, on one random entry
> > to kdb by manually-injected NMI, I saw 2912 of 3840 CPUs stuck here.
> > 
> > To avoid this, allow only one non-timekeeper CPU to call
> > tick_do_update_jiffies64() at any given time, resetting ts->stalled_jiffies
> > only if the jiffies update function is actually called.
> > 
> > With this change, manually interrupting the test, I find at most two
> > CPUs in the tick_do_update_jiffies64 function (the timekeeper and one
> > other).
> > 
> > Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
> > ---
> > 
> > v2: Rewritten to use an atomic to gate non-timekeeping CPUs calling the
> >      jiffies update, as suggested by tglx. Title of patch has changed
> >      since trylock is no longer used.
> > 
> > v1 discussion:
> > https://lore.kernel.org/all/20251013150959.298288-1-steve.wahl@hpe.com/
> > 
> >   kernel/time/tick-sched.c | 30 ++++++++++++++++++++++++++----
> >   1 file changed, 26 insertions(+), 4 deletions(-)
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index c527b421c865..3ff3eb1f90d0 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -201,6 +201,27 @@ static inline void tick_sched_flag_clear(struct tick_sched *ts,
> >   	ts->flags &= ~flag;
> >   }
> > +/*
> > + * Allow only one non-timekeeper CPU at a time to update jiffies from
> > + * the timer tick.
> > + *
> > + * Returns true if update was run.
> > + */
> > +static bool tick_limited_update_jiffies64(struct tick_sched *ts, ktime_t now)
> > +{
> > +	static atomic_t in_progress;
> > +	int inp;
> > +
> > +	inp = atomic_read(&in_progress);
> > +	if (inp || !atomic_try_cmpxchg(&in_progress, &inp, 1))
> > +		return false;
> > +
> 
> You come here if (ts->last_tick_jiffies == jiffies). So it may not be necessary to check again.

TGLX had this in his rewrite suggestion, and I looked pretty intensely
at this test.

The situation I'm looking to resolve is caused by inter-NUMA links
being abnormally swamped with traffic.  Especially for writes,
accesses to shared memory locations, such as the atomic operations on
in_progress right above this, take longer than one would usually
expect.  So to me it makes sense that things may have changed since
the atomic_try_cmpxchg was initiated, and so I left the check in
place.
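
To make that concrete, here is the gate with that reasoning spelled
out in comments (same code as the patch; the comments are annotation
for this discussion, not part of the submission):

static bool tick_limited_update_jiffies64(struct tick_sched *ts, ktime_t now)
{
	static atomic_t in_progress;
	int inp;

	/* Plain read first, so contending CPUs bail without a cmpxchg. */
	inp = atomic_read(&in_progress);
	if (inp || !atomic_try_cmpxchg(&in_progress, &inp, 1))
		return false;

	/*
	 * On a swamped fabric the cmpxchg itself can take a long time,
	 * so jiffies may already have advanced while this CPU was
	 * acquiring the gate; recheck before doing the update.
	 */
	if (ts->last_tick_jiffies == jiffies)
		tick_do_update_jiffies64(now);
	atomic_set(&in_progress, 0);
	return true;
}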

> > +	if (ts->last_tick_jiffies == jiffies)
> > +		tick_do_update_jiffies64(now);
> > +	atomic_set(&in_progress, 0);
> > +	return true;
> > +}
> > +
> >   #define MAX_STALLED_JIFFIES 5
> >   static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
> > @@ -239,10 +260,11 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
> >   		ts->stalled_jiffies = 0;
> >   		ts->last_tick_jiffies = READ_ONCE(jiffies);
> >   	} else {
> > -		if (++ts->stalled_jiffies == MAX_STALLED_JIFFIES) {
> > -			tick_do_update_jiffies64(now);
> > -			ts->stalled_jiffies = 0;
> > -			ts->last_tick_jiffies = READ_ONCE(jiffies);
> > +		if (++ts->stalled_jiffies >= MAX_STALLED_JIFFIES) {
> > +			if (tick_limited_update_jiffies64(ts, now)) {
> > +				ts->stalled_jiffies = 0;
> > +				ts->last_tick_jiffies = READ_ONCE(jiffies);
> > +			}
> >   		}
> >   	}
> 
> 
> Yes. This could help large systems.
> 
> Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>

Thanks for your time reviewing!

--> Steve Wahl

-- 
Steve Wahl, Hewlett Packard Enterprise
Re: [PATCH v2] tick/sched: Limit non-timekeeper CPUs calling jiffies update
Posted by Shrikanth Hegde 3 months, 1 week ago

On 10/28/25 7:54 PM, Steve Wahl wrote:
> On Tue, Oct 28, 2025 at 11:39:30AM +0530, Shrikanth Hegde wrote:
>>
>>
>> On 10/28/25 12:04 AM, Steve Wahl wrote:
>>> On large NUMA systems, while running a test program that saturates the
>>> inter-processor and inter-NUMA links, acquiring the jiffies_lock can
>>> be very expensive.  If the CPU designated to do jiffies updates
>>> (tick_do_timer_cpu) gets delayed and other CPUs decide to do the
>>> jiffies update themselves, a large number of them decide to do so at
>>> the same time.  The inexpensive check against tick_next_period is far
>>> quicker than actually acquiring the lock, so most of these get in line
>>> to obtain the lock.  If obtaining the lock is slow enough, this
>>> spirals into the vast majority of CPUs continuously being stuck
>>> waiting for this lock, just to obtain it and find out that time has
>>> already been updated by another CPU. For example, on one random entry
>>> to kdb by manually-injected NMI, I saw 2912 of 3840 CPUs stuck here.
>>>
>>> To avoid this, allow only one non-timekeeper CPU to call
>>> tick_do_update_jiffies64() at any given time, resetting ts->stalled_jiffies
>>> only if the jiffies update function is actually called.
>>>
>>> With this change, manually interrupting the test, I find at most two
>>> CPUs in the tick_do_update_jiffies64 function (the timekeeper and one
>>> other).
>>>
>>> Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
>>> ---
>>>
>>> v2: Rewritten to use an atomic to gate non-timekeeping CPUs calling the
>>>       jiffies update, as suggested by tglx. Title of patch has changed
>>>       since trylock is no longer used.
>>>
>>> v1 discussion:
>>> https://lore.kernel.org/all/20251013150959.298288-1-steve.wahl@hpe.com/
>>>
>>>    kernel/time/tick-sched.c | 30 ++++++++++++++++++++++++++----
>>>    1 file changed, 26 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>>> index c527b421c865..3ff3eb1f90d0 100644
>>> --- a/kernel/time/tick-sched.c
>>> +++ b/kernel/time/tick-sched.c
>>> @@ -201,6 +201,27 @@ static inline void tick_sched_flag_clear(struct tick_sched *ts,
>>>    	ts->flags &= ~flag;
>>>    }
>>> +/*
>>> + * Allow only one non-timekeeper CPU at a time to update jiffies from
>>> + * the timer tick.
>>> + *
>>> + * Returns true if update was run.
>>> + */
>>> +static bool tick_limited_update_jiffies64(struct tick_sched *ts, ktime_t now)
>>> +{
>>> +	static atomic_t in_progress;
>>> +	int inp;
>>> +
>>> +	inp = atomic_read(&in_progress);
>>> +	if (inp || !atomic_try_cmpxchg(&in_progress, &inp, 1))
>>> +		return false;
>>> +
>>
>> You come here if (ts->last_tick_jiffies == jiffies). So it may not be necessary to check again.
> 
> TGLX had this in his rewrite suggestion, and I looked pretty intensely
> at this test.
> 
> The situation I'm looking to resolve is caused by inter-NUMA links
> being abnormally swamped with traffic.  Especially for writes,
> accesses to shared memory locations, such as the atomic operations on
> in_progress right above this, take longer than one would usually
> expect.  So to me it makes sense that things may have changed since
> the atomic_try_cmpxchg was initiated, and so I left the check in
> place.
> 

I see, one possibility is:

- it runs in parallel by that time on the tick CPU (which always updates it).

>>> +	if (ts->last_tick_jiffies == jiffies)
>>> +		tick_do_update_jiffies64(now);
>>> +	atomic_set(&in_progress, 0);
>>> +	return true;
>>> +}
>>> +
>>>    #define MAX_STALLED_JIFFIES 5
>>>    static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
>>> @@ -239,10 +260,11 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
>>>    		ts->stalled_jiffies = 0;
>>>    		ts->last_tick_jiffies = READ_ONCE(jiffies);
>>>    	} else {
>>> -		if (++ts->stalled_jiffies == MAX_STALLED_JIFFIES) {
>>> -			tick_do_update_jiffies64(now);
>>> -			ts->stalled_jiffies = 0;
>>> -			ts->last_tick_jiffies = READ_ONCE(jiffies);
>>> +		if (++ts->stalled_jiffies >= MAX_STALLED_JIFFIES) {
>>> +			if (tick_limited_update_jiffies64(ts, now)) {
>>> +				ts->stalled_jiffies = 0;
>>> +				ts->last_tick_jiffies = READ_ONCE(jiffies);
>>> +			}
>>>    		}
>>>    	}
>>
>>
>> Yes. This could help large systems.
>>
>> Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> 
> Thanks for your time reviewing!
> 
> --> Steve Wahl
>
[tip: timers/core] tick/sched: Limit non-timekeeper CPUs calling jiffies update
Posted by tip-bot2 for Steve Wahl 3 months, 1 week ago
The following commit has been merged into the timers/core branch of tip:

Commit-ID:     4138787408aa47e9e107f28876cb59b42d78bb99
Gitweb:        https://git.kernel.org/tip/4138787408aa47e9e107f28876cb59b42d78bb99
Author:        Steve Wahl <steve.wahl@hpe.com>
AuthorDate:    Mon, 27 Oct 2025 13:34:56 -05:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sat, 01 Nov 2025 20:25:53 +01:00

tick/sched: Limit non-timekeeper CPUs calling jiffies update

On large NUMA systems, while running a test program that saturates the
inter-processor and inter-NUMA links, acquiring the jiffies_lock can be
very expensive.

If the CPU designated to do jiffies updates (tick_do_timer_cpu) gets
delayed and other CPUs decide to do the jiffies update themselves, a large
number of them decide to do so at the same time.

The inexpensive check against tick_next_period is far quicker than actually
acquiring the lock, so most of these get in line to obtain the lock.  If
obtaining the lock is slow enough, this spirals into the vast majority of
CPUs continuously being stuck waiting for this lock, just to obtain it and
find out that time has already been updated by another CPU. For example, on
one random entry to kdb by manually-injected NMI, 2912 of 3840 CPUs were
observed to be stuck there.

To avoid this, allow only one non-timekeeper CPU to call
tick_do_update_jiffies64() at any given time, resetting ts->stalled_jiffies
only if the jiffies update function is actually called.

With this change, when manually interrupting the test, at most two CPUs
are observed to invoke tick_do_update_jiffies64() - the timekeeper and one
other.

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20251027183456.343407-1-steve.wahl@hpe.com
---
 kernel/time/tick-sched.c | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c527b42..3ff3eb1 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -201,6 +201,27 @@ static inline void tick_sched_flag_clear(struct tick_sched *ts,
 	ts->flags &= ~flag;
 }
 
+/*
+ * Allow only one non-timekeeper CPU at a time to update jiffies from
+ * the timer tick.
+ *
+ * Returns true if update was run.
+ */
+static bool tick_limited_update_jiffies64(struct tick_sched *ts, ktime_t now)
+{
+	static atomic_t in_progress;
+	int inp;
+
+	inp = atomic_read(&in_progress);
+	if (inp || !atomic_try_cmpxchg(&in_progress, &inp, 1))
+		return false;
+
+	if (ts->last_tick_jiffies == jiffies)
+		tick_do_update_jiffies64(now);
+	atomic_set(&in_progress, 0);
+	return true;
+}
+
 #define MAX_STALLED_JIFFIES 5
 
 static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
@@ -239,10 +260,11 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
 		ts->stalled_jiffies = 0;
 		ts->last_tick_jiffies = READ_ONCE(jiffies);
 	} else {
-		if (++ts->stalled_jiffies == MAX_STALLED_JIFFIES) {
-			tick_do_update_jiffies64(now);
-			ts->stalled_jiffies = 0;
-			ts->last_tick_jiffies = READ_ONCE(jiffies);
+		if (++ts->stalled_jiffies >= MAX_STALLED_JIFFIES) {
+			if (tick_limited_update_jiffies64(ts, now)) {
+				ts->stalled_jiffies = 0;
+				ts->last_tick_jiffies = READ_ONCE(jiffies);
+			}
 		}
 	}