[PATCH V2 4/4] posix-timers: Use RCU in posix_timer_add()

Posted by Eric Dumazet 10 months ago
If many posix timers are hashed in posix_timers_hashtable,
hash_lock can be held for long durations.

This can be really bad in some cases, as Thomas
explained in https://lore.kernel.org/all/87ednpyyeo.ffs@tglx/

We can perform all searches under RCU, then acquire
the lock only when there is a good chance of needing it,
and after CPU caches have been populated.

Also add a cond_resched() in the possibly long loop.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 kernel/time/posix-timers.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index ed27c7eab456..bd73bc4707c1 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -125,7 +125,19 @@ static int posix_timer_add(struct k_itimer *timer)
 
 		head = &posix_timers_hashtable[hash(sig, id)];
 
+		rcu_read_lock();
+		if (posix_timers_find(head, sig, id)) {
+			rcu_read_unlock();
+			cond_resched();
+			continue;
+		}
+		rcu_read_unlock();
 		spin_lock(&hash_lock);
+		/*
+		 * We must perform the lookup under hash_lock protection
+		 * because another thread could have used the same id.
+		 * This is very unlikely, but possible.
+		 */
 		if (!posix_timers_find(head, sig, id)) {
 			timer->it_id = (timer_t)id;
 			timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
-- 
2.48.1.601.g30ceb7b040-goog
Re: [PATCH V2 4/4] posix-timers: Use RCU in posix_timer_add()
Posted by Thomas Gleixner 9 months, 3 weeks ago
On Wed, Feb 19 2025 at 12:55, Eric Dumazet wrote:
> --- a/kernel/time/posix-timers.c
> +++ b/kernel/time/posix-timers.c
> @@ -125,7 +125,19 @@ static int posix_timer_add(struct k_itimer *timer)
>  
>  		head = &posix_timers_hashtable[hash(sig, id)];
>  
> +		rcu_read_lock();
> +		if (posix_timers_find(head, sig, id)) {
> +			rcu_read_unlock();
> +			cond_resched();
> +			continue;
> +		}
> +		rcu_read_unlock();
>  		spin_lock(&hash_lock);
> +		/*
> +		 * We must perform the lookup under hash_lock protection
> +		 * because another thread could have used the same id.
> +		 * This is very unlikely, but possible.
> +		 */
>  		if (!posix_timers_find(head, sig, id)) {
>  			timer->it_id = (timer_t)id;
>  			timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);

So I played with that because I wanted to understand what's going on.

Actually this change is just voodoo programming. Why?

  The only case where this lockless lookup would help is when the timer
  ID wrapped around and the lookup has to validate a large range of
  already occupied IDs, but even then the lookup is dropping hash_lock
  after each ID search. I seriously doubt that this is the case at hand
  because according to Ben this happens when a process is started or
  restored, which means there is no timer ID wraparound and therefore
  no timer ID collision in the process itself.

In fact the extra lookup is counterproductive in most cases:

 With a single process, which creates 50000 timers in a loop, this
 results in:

 [1]        mainline     +lookup
 create        98 ms      219 ms

 and it gets worse with 100000 timers:

 [2]        mainline     +lookup
 create       313 ms      650 ms

Why? Because with 100k timers spread over the hash table's 512 buckets,
each bucket contains ~200 timers (100000 / 512 is roughly 195), which
means in the worst case 200 timers have to be walked and checked. Doing
this twice obviously at least doubles the time.

Now there is a case where the change improves the situation, but for the
very wrong reasons:

 A process forks 63 times and all forks wait on a barrier before each
 instance creates 20000 timers. The processes are pinned on separate CPUs
 to achieve maximum concurrency. The numbers are the average times per
 process:

 [3]        mainline     +lookup
 create    157253 ms    40008 ms

But that improvement has absolutely nothing to do with the timer ID
collision. The extra lookup (and I verified this with tracing) creates
just enough interleaving so that all CPUs can make progress on the hash
lock instead of being stuck in a cache-line bouncing shitshow. The very same
can be achieved by:

        ndelay(total_number_of_timers / $CONSTANT);

So, no. This is not a solution. The proper solution is to replace the
hash by a scaled hash with per-bucket locks, like we have in the futex
code already (a rough sketch of the idea follows the numbers below).
I've implemented this and the result proves the point with the three
test cases:

 [1]        mainline     +lookup   scaled hash
 create        98 ms      219 ms         20 ms

 [2]        mainline     +lookup   scaled hash
 create       313 ms      650 ms         48 ms

 [3]        mainline     +lookup   scaled hash
 create    157253 ms    40008 ms         83 ms
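
For illustration, here is a minimal sketch of what such a scaled hash
with per-bucket locks can look like. All names, the sizing heuristic and
the hash function are hypothetical, not the actual implementation:

	/* One lock per bucket instead of the single global hash_lock. */
	struct timer_hash_bucket {
		spinlock_t		lock;
		struct hlist_head	head;
	};

	static struct timer_hash_bucket *timer_buckets;
	static unsigned long timer_hash_mask;

	static int __init posix_timer_hash_init(void)
	{
		/* Scale the table with the number of possible CPUs. */
		unsigned long size = roundup_pow_of_two(512 * num_possible_cpus());
		unsigned long i;

		timer_buckets = kvmalloc_array(size, sizeof(*timer_buckets),
					       GFP_KERNEL);
		if (!timer_buckets)
			return -ENOMEM;

		timer_hash_mask = size - 1;
		for (i = 0; i < size; i++) {
			spin_lock_init(&timer_buckets[i].lock);
			INIT_HLIST_HEAD(&timer_buckets[i].head);
		}
		return 0;
	}

	static struct timer_hash_bucket *hash_bucket(struct signal_struct *sig,
						     unsigned int id)
	{
		u32 h = hash_32(hash32_ptr(sig) ^ id, 32);

		return &timer_buckets[h & timer_hash_mask];
	}

	/* Insertion then contends on one bucket, not on the whole table: */
	static bool posix_timer_hash_insert(struct k_itimer *timer,
					    struct signal_struct *sig,
					    unsigned int id)
	{
		struct timer_hash_bucket *bucket = hash_bucket(sig, id);
		bool added = false;

		spin_lock(&bucket->lock);
		if (!posix_timers_find(&bucket->head, sig, id)) {
			hlist_add_head_rcu(&timer->t_hash, &bucket->head);
			added = true;
		}
		spin_unlock(&bucket->lock);
		return added;
	}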

This is really only a problem of the hash collisions and the resulting
list walk length, which is easy to prove by extending the tests above.

After creating the timers and synchronizing again (for the fork case),
the test invokes timer_getoverrun(2) for all created timers in a loop.

 [1]        mainline     +lookup   scaled hash
 create        98 ms      219 ms         20 ms
 getoverrun    62 ms       62 ms         10 ms

 [2]        mainline     +lookup   scaled hash
 create       313 ms      650 ms         48 ms
 getoverrun   261 ms      260 ms         20 ms

 [3]        mainline     +lookup   scaled hash
 create    157253 ms    40008 ms         83 ms
 getoverrun  2611 ms     2614 ms         40 ms

I picked timer_getoverrun(2) because that's just doing the lookup and
reading from the k_itimer struct after locking it. So the syscall has no
contention on the timer lock nor on anything else. According to perf the
vast majority of time is spent in the list walk to find the timer and of
course the cache footprint becomes worse the larger the bucket lists
become.
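
For reference, the getoverrun measurement loop can look roughly like
the following illustrative userspace sketch (not the actual test code;
error handling omitted, link with -lrt on older glibc):

	#include <signal.h>
	#include <stdio.h>
	#include <time.h>

	#define NTIMERS 50000

	int main(void)
	{
		static timer_t ids[NTIMERS];
		struct sigevent sev = { .sigev_notify = SIGEV_NONE };
		struct timespec t0, t1;
		long ms;

		/* Create the timers; with SIGEV_NONE they never deliver signals. */
		for (int i = 0; i < NTIMERS; i++)
			timer_create(CLOCK_MONOTONIC, &sev, &ids[i]);

		/* Time the pure lookup path: hash walk plus a locked read. */
		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (int i = 0; i < NTIMERS; i++)
			timer_getoverrun(ids[i]);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		ms = (t1.tv_sec - t0.tv_sec) * 1000 +
		     (t1.tv_nsec - t0.tv_nsec) / 1000000;
		printf("getoverrun: %ld ms\n", ms);
		return 0;
	}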

Let me write proper changelogs and a cover letter and post that stuff.

Thanks,

        tglx
Re: [PATCH V2 4/4] posix-timers: Use RCU in posix_timer_add()
Posted by Eric Dumazet 9 months, 3 weeks ago
On Mon, Feb 24, 2025 at 10:33 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> [ full quote of the patch and measurements snipped; see above ]
>
> Let me write proper changelogs and a cover letter and post that stuff.
>

Great, we are looking forward to your patches.
Re: [PATCH V2 4/4] posix-timers: Use RCU in posix_timer_add()
Posted by David Laight 10 months ago
On Wed, 19 Feb 2025 12:55:22 +0000
Eric Dumazet <edumazet@google.com> wrote:

> If many posix timers are hashed in posix_timers_hashtable,
> hash_lock can be held for long durations.
> 
> This can be really bad in some cases, as Thomas
> explained in https://lore.kernel.org/all/87ednpyyeo.ffs@tglx/
> 
> We can perform all searches under RCU, then acquire
> the lock only when there is a good chance of needing it,
> and after CPU caches have been populated.
> 
> Also add a cond_resched() in the possibly long loop.

Since this code fragment has a 'free choice' of the timer id, why not
select an empty table slot and then pick a value that maps to it?

You can run a free-list through the empty table slots so that allocation
is (almost always) fixed-cost and trivial.
The only complexity arises when the table is full and needs to be
reallocated at twice the size.

The high bits of the 'id' can be incremented every time the id is
allocated, so stale ids can be detected (until a fairly large number of
allocate/free cycles).
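
A minimal sketch of that scheme, to make it concrete (all names, types
and sizes here are hypothetical):

	#define SLOT_BITS	16
	#define NSLOTS		(1u << SLOT_BITS)
	#define SLOT_MASK	(NSLOTS - 1)
	#define FREE_END	NSLOTS		/* end-of-free-list marker */

	struct slot {
		unsigned short	gen;		/* bumped on every allocation */
		unsigned int	next_free;	/* free-list link while unused */
		void		*timer;		/* NULL while on the free list */
	};

	static struct slot table[NSLOTS];	/* init: chain slots 0..NSLOTS-1 */
	static unsigned int free_head;

	/* O(1) allocation: pop the free list, encode the generation in the id. */
	static int alloc_id(void *timer, unsigned int *id)
	{
		unsigned int idx = free_head;

		if (idx == FREE_END)
			return -1;	/* table full: reallocate at twice the size */
		free_head = table[idx].next_free;
		table[idx].gen++;
		table[idx].timer = timer;
		*id = ((unsigned int)table[idx].gen << SLOT_BITS) | idx;
		return 0;
	}

	static void free_id(unsigned int id)
	{
		unsigned int idx = id & SLOT_MASK;

		table[idx].timer = NULL;	/* stale lookups now return NULL */
		table[idx].next_free = free_head;
		free_head = idx;
	}

	/* Lookup rejects stale ids via the generation bits. */
	static void *lookup_id(unsigned int id)
	{
		struct slot *s = &table[id & SLOT_MASK];

		return (s->gen == id >> SLOT_BITS) ? s->timer : NULL;
	}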

	David

Re: [PATCH V2 4/4] posix-timers: Use RCU in posix_timer_add()
Posted by Eric Dumazet 10 months ago
On Wed, Feb 19, 2025 at 8:38 PM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Wed, 19 Feb 2025 12:55:22 +0000
> Eric Dumazet <edumazet@google.com> wrote:
>
> > [ patch description snipped ]
>
> Since this code fragment has a 'free choice' of the timer id, why not
> select an empty table slot and then pick a value that maps to it?
>
> You can run a free-list through the empty table slots so that allocation
> is (almost always) fixed-cost and trivial.
> The only complexity arises when the table is full and needs to be
> reallocated at twice the size.
>
> The high bits of the 'id' can be incremented every time the id is
> allocated, so stale ids can be detected (until a fairly large number of
> allocate/free cycles).

We have to understand that the RCU lookup can only fail after 2^31
timer allocations for a given process,
so I am not sure why we would bother.

CRIU wants predictable timer_id sequences.

We have to try N, N+1, N+2 ... until we find a suitable id.

The alternative would be to add a new system call, I guess.