[patch V2 10/17] posix-timers: Make signal_struct::next_posix_timer_id an atomic_t

Thomas Gleixner posted 17 patches 9 months, 3 weeks ago
There is a newer version of this series
[patch V2 10/17] posix-timers: Make signal_struct::next_posix_timer_id an atomic_t
Posted by Thomas Gleixner 9 months, 3 weeks ago
From: Eric Dumazet <edumazet@google.com>

The global hash_lock protecting the posix timer hash table can be heavily
contended especially when there is an extensive linear search for a timer
ID.

Timer IDs are handed out by monotonically increasing next_posix_timer_id
and then validating that there is no timer with the same ID in the hash
table. Both operations happen with the global hash lock held.

To reduce the hash lock contention the hash will be reworked to a scaled
hash with per bucket locks, which requires to handle the ID counter
lockless.

Prepare for this by making next_posix_timer_id an atomic_t, which can be
used lockless with atomic_inc_return().

[ tglx: Adopted from Eric's series, massaged change log and simplified it ]

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250219125522.2535263-2-edumazet@google.com
---
V2: Use atomic_fetch_inc() - PeterZ
---
 include/linux/sched/signal.h |    2 +-
 kernel/time/posix-timers.c   |   14 +++++---------
 2 files changed, 6 insertions(+), 10 deletions(-)

--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -136,7 +136,7 @@ struct signal_struct {
 #ifdef CONFIG_POSIX_TIMERS
 
 	/* POSIX.1b Interval Timers */
-	unsigned int		next_posix_timer_id;
+	atomic_t		next_posix_timer_id;
 	struct hlist_head	posix_timers;
 	struct hlist_head	ignored_posix_timers;
 
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -118,21 +118,17 @@ static bool posix_timer_hashed(struct hl
 static int posix_timer_add(struct k_itimer *timer)
 {
 	struct signal_struct *sig = current->signal;
-	struct hlist_head *head;
-	unsigned int cnt, id;
 
 	/*
 	 * FIXME: Replace this by a per signal struct xarray once there is
 	 * a plan to handle the resulting CRIU regression gracefully.
 	 */
-	for (cnt = 0; cnt <= INT_MAX; cnt++) {
-		spin_lock(&hash_lock);
-		id = sig->next_posix_timer_id;
-
-		/* Write the next ID back. Clamp it to the positive space */
-		sig->next_posix_timer_id = (id + 1) & INT_MAX;
+	for (unsigned int cnt = 0; cnt <= INT_MAX; cnt++) {
+		/* Get the next timer ID and clamp it to positive space */
+		unsigned int id = atomic_fetch_inc(&sig->next_posix_timer_id) & INT_MAX;
+		struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
 
-		head = &posix_timers_hashtable[hash(sig, id)];
+		spin_lock(&hash_lock);
 		if (!posix_timer_hashed(head, sig, id)) {
 			/*
 			 * Set the timer ID and the signal pointer to make
Re: [patch V2 10/17] posix-timers: Make signal_struct::next_posix_timer_id an atomic_t
Posted by Cyrill Gorcunov 9 months, 3 weeks ago
On Sun, Mar 02, 2025 at 08:36:59PM +0100, Thomas Gleixner wrote:
>  static int posix_timer_add(struct k_itimer *timer)
>  {
...
> +	for (unsigned int cnt = 0; cnt <= INT_MAX; cnt++) {

Hi, Thomas! This moment slightly confusing -- are we allowed to declare
type inside 'for' statement? (c99 allows, just never seen it before inside
the kernel code). Not a big deal though, thanks a huge for fixing this
problem with timers and criu.

	Cyrill
Re: [patch V2 10/17] posix-timers: Make signal_struct::next_posix_timer_id an atomic_t
Posted by Thomas Gleixner 9 months, 3 weeks ago
On Mon, Mar 03 2025 at 23:21, Cyrill Gorcunov wrote:
> On Sun, Mar 02, 2025 at 08:36:59PM +0100, Thomas Gleixner wrote:
>>  static int posix_timer_add(struct k_itimer *timer)
>>  {
> ...
>> +	for (unsigned int cnt = 0; cnt <= INT_MAX; cnt++) {
>
> Hi, Thomas! This moment slightly confusing -- are we allowed to declare
> type inside 'for' statement? (c99 allows, just never seen it before inside
> the kernel code).

We're slowly moving towards scope based variables. It's not widely used
yet, but it's coming along.

> Not a big deal though, thanks a huge for fixing this problem with
> timers and criu.

Welcome. Some quick validation with CRIU would be appreciated.

Thanks,

        tglx
Re: [patch V2 10/17] posix-timers: Make signal_struct::next_posix_timer_id an atomic_t
Posted by Cyrill Gorcunov 9 months, 2 weeks ago
On Mon, Mar 03, 2025 at 10:24:28PM +0100, Thomas Gleixner wrote:
> 
> Welcome. Some quick validation with CRIU would be appreciated.

Just tested in criu: works without problem, both modes -- with new
prctl and without it. Note that I only have ran separate posix-timers
test case, probably virtuozzo team might do more deep tesing.
---
root@fc:/home/cyrill/criu# test/zdtm.py run -t zdtm/static/posix_timers
userns is supported
The kernel is tainted: '512'
=== Run 1/1 ================ zdtm/static/posix_timers
====================== Run zdtm/static/posix_timers in h =======================
Start test
./posix_timers --pidfile=posix_timers.pid --outfile=posix_timers.out
Run criu dump
Run criu restore
Send the 15 signal to  56
Wait for zdtm/static/posix_timers(56) to die for 0.100000
Removing dump/zdtm/static/posix_timers/56
====================== Test zdtm/static/posix_timers PASS ======================
Re: [patch V2 10/17] posix-timers: Make signal_struct::next_posix_timer_id an atomic_t
Posted by Thomas Gleixner 9 months, 2 weeks ago
On Tue, Mar 04 2025 at 20:56, Cyrill Gorcunov wrote:
> On Mon, Mar 03, 2025 at 10:24:28PM +0100, Thomas Gleixner wrote:
>> 
>> Welcome. Some quick validation with CRIU would be appreciated.
>
> Just tested in criu: works without problem, both modes -- with new
> prctl and without it. Note that I only have ran separate posix-timers
> test case, probably virtuozzo team might do more deep tesing.

Thank you very much!

      tglx
Re: [patch V2 10/17] posix-timers: Make signal_struct::next_posix_timer_id an atomic_t
Posted by Cyrill Gorcunov 9 months, 2 weeks ago
On Tue, Mar 04, 2025 at 09:30:04PM +0100, Thomas Gleixner wrote:
> On Tue, Mar 04 2025 at 20:56, Cyrill Gorcunov wrote:
> > On Mon, Mar 03, 2025 at 10:24:28PM +0100, Thomas Gleixner wrote:
> >> 
> >> Welcome. Some quick validation with CRIU would be appreciated.
> >
> > Just tested in criu: works without problem, both modes -- with new
> > prctl and without it. Note that I only have ran separate posix-timers
> > test case, probably virtuozzo team might do more deep tesing.
> 
> Thank you very much!

Thanks for handling this) Also looking into this series I wonder why can't
we instead of mangling ::it_signal zero bit just use ::it_id with negative
value as a sign of not yet fully initialized timer? This would allow to not
read-modify action while traversing bucket hash chain. I mean we could do

static bool posix_timer_add_at(struct k_itimer *timer, struct signal_struct *sig, unsigned int id)
{
	struct timer_hash_bucket *bucket = hash_bucket(sig, id);

	scoped_guard (spinlock, &bucket->lock) {
		if (!posix_timer_hashed(bucket, sig, id)) {
--->			timer->it_id = -(timer_t)id;
			timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
			hlist_add_head_rcu(&timer->t_hash, &bucket->head);
			return true;
		}
	}
	return false;
}

Then hash traverse won't find the timer until the do_timer_create will do

	scoped_guard (spinlock_irq, &current->sighand->siglock) {
		WRITE_ONCE(new_timer->it_id, abs(new_timer->it_id));
		hlist_add_head_rcu(&new_timer->list, &current->signal->posix_timers);
	}

Or I miss something obvious? (Of course when deleting timer we will have to pass
abs it_id for hash traversing).

It looks that in case of many many timers present in the system traversing hash
in read-modify way might be heavy (though I didn't measure of course).

	Cyrill
Re: [patch V2 10/17] posix-timers: Make signal_struct::next_posix_timer_id an atomic_t
Posted by Thomas Gleixner 9 months, 2 weeks ago
On Wed, Mar 05 2025 at 01:16, Cyrill Gorcunov wrote:
> Thanks for handling this) Also looking into this series I wonder why can't
> we instead of mangling ::it_signal zero bit just use ::it_id with negative
> value as a sign of not yet fully initialized timer? This would allow to not
> read-modify action while traversing bucket hash chain. I mean we could do
>
> static bool posix_timer_add_at(struct k_itimer *timer, struct signal_struct *sig, unsigned int id)
> {
> 	struct timer_hash_bucket *bucket = hash_bucket(sig, id);
>
> 	scoped_guard (spinlock, &bucket->lock) {
> 		if (!posix_timer_hashed(bucket, sig, id)) {
> --->			timer->it_id = -(timer_t)id;
> 			timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
> 			hlist_add_head_rcu(&timer->t_hash, &bucket->head);
> 			return true;
> 		}
> 	}
> 	return false;
> }
>
> Then hash traverse won't find the timer until the do_timer_create will do
>
> 	scoped_guard (spinlock_irq, &current->sighand->siglock) {
> 		WRITE_ONCE(new_timer->it_id, abs(new_timer->it_id));
> 		hlist_add_head_rcu(&new_timer->list, &current->signal->posix_timers);
> 	}
>
> Or I miss something obvious? (Of course when deleting timer we will have to pass
> abs it_id for hash traversing).
>
> It looks that in case of many many timers present in the system traversing hash
> in read-modify way might be heavy (though I didn't measure of course).

The traversal does not RMW the timer itself, it unmangles the signal
pointer for comparison in posix_timer_hashed(). posix_timer_by_id() does
straight comparisons. So both only read.

Sure, we can mangle timer ID instead of the signal pointer, but the
outcome is pretty much the same. The only difference is in
posix_timer_hashed(), which must detect a taken timer ID independent of
the timers valid state to prevent collisions.

With the signal pointer mangling we have:

   if ((timer->signal & ~1) == sig && timer->id == id)

and with the negative ID value this becomes:

   if (timer->signal == sig && (timer->id == id || timer->id == -id))

which is obviously worse. You'd need to do:

      timer->id = id | (1 << 31);

and then the posix_timer_hashed() check becomes:

   if (timer->signal == sig && (timer->id & ~(1 << 31)) == id)

Granted, the timer ID mangling spares the AND operation on the signal in
case the timer is not owned by the currrent process, but I doubt that
this is even measurable.

Thanks,

        tglx
Re: [patch V2 10/17] posix-timers: Make signal_struct::next_posix_timer_id an atomic_t
Posted by Cyrill Gorcunov 9 months, 2 weeks ago
On Wed, Mar 05, 2025 at 08:31:21AM +0100, Thomas Gleixner wrote:
...
> 
> The traversal does not RMW the timer itself, it unmangles the signal
> pointer for comparison in posix_timer_hashed(). posix_timer_by_id() does
> straight comparisons. So both only read.

No, I mean that we read the value then allocate a temp value with 0 bit
excluded implicitly, so it is not a straight read, but whatever.

> Sure, we can mangle timer ID instead of the signal pointer, but the
> outcome is pretty much the same. The only difference is in
> posix_timer_hashed(), which must detect a taken timer ID independent of
> the timers valid state to prevent collisions.

Bah, I managed to miss that we need to lookup for not yet fully initialized
timers as well, and indeed it makes no much difference which exactly field
to mangle. Thanks a huge for explanations, Thomas!

	Cyrill