[v9] futex: Add support task local hash maps.

[PATCH v9 00/11] futex: Add support task local hash maps.

Posted by Sebastian Andrzej Siewior 11 months, 2 weeks ago

Hi,

this is a follow up on
        https://lore.kernel.org/ZwVOMgBMxrw7BU9A@jlelli-thinkpadt14gen4.remote.csb

and adds support for task local futex_hash_bucket.

This is a rebase of v8 on top of PeterZ's futex_class. The complete tree
is at
	https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git/log/?h=futex_local_v9
	https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git futex_local_v9

v8…v9 https://lore.kernel.org/all/20250203135935.440018-1-bigeasy@linutronix.de
  - Rebase on top of PeterZ's futex_class
  - A few patches vanished due to class rework.
  - struct futex_hash_bucket now has a pointer to futex_private_hash
    instead of a slot number
  - CONFIG_BASE_SMALL now removes support for the "futex local hash"
    instead of restricting it to 2 slots.
  - Number of threads, used to determine the number of slots, is capped
    at num_online_cpus.

v7…v8 https://lore.kernel.org/all/20250123202446.610203-1-bigeasy@linutronix.de/
  - Rebase on v6.14-rc1

Sebastian Andrzej Siewior (11):
  futex: fixup futex_wait_setup [fold futex: Move futex_queue() into
    futex_wait_setup()]
  futex: Create helper function to initialize a hash slot.
  futex: Add basic infrastructure for local task local hash.
  futex: Hash only the address for private futexes.
  futex: Allow automatic allocation of process wide futex hash.
  futex: Decrease the waiter count before the unlock operation.
  futex: Introduce futex_q_lockptr_lock().
  futex: Acquire a hash reference in futex_wait_multiple_setup().
  futex: Allow to re-allocate the private local hash.
  futex: Resize local futex hash table based on number of threads.
  futex: Use a hashmask instead of hashsize.

 include/linux/futex.h      |  32 ++-
 include/linux/mm_types.h   |   7 +-
 include/uapi/linux/prctl.h |   5 +
 kernel/fork.c              |  24 ++
 kernel/futex/core.c        | 450 +++++++++++++++++++++++++++++++++++--
 kernel/futex/futex.h       |  15 +-
 kernel/futex/pi.c          |  15 +-
 kernel/futex/requeue.c     |  29 ++-
 kernel/futex/waitwake.c    |  31 ++-
 kernel/sys.c               |   4 +
 10 files changed, 574 insertions(+), 38 deletions(-)

-- 
2.47.2

Re: [PATCH v9 00/11] futex: Add support task local hash maps.

Posted by Peter Zijlstra 11 months, 1 week ago

On Tue, Feb 25, 2025 at 06:09:03PM +0100, Sebastian Andrzej Siewior wrote:

> Sebastian Andrzej Siewior (11):
>   futex: fixup futex_wait_setup [fold futex: Move futex_queue() into
>     futex_wait_setup()]
>   futex: Create helper function to initialize a hash slot.
>   futex: Add basic infrastructure for local task local hash.
>   futex: Hash only the address for private futexes.
>   futex: Allow automatic allocation of process wide futex hash.
>   futex: Decrease the waiter count before the unlock operation.
>   futex: Introduce futex_q_lockptr_lock().
>   futex: Acquire a hash reference in futex_wait_multiple_setup().
>   futex: Allow to re-allocate the private local hash.
>   futex: Resize local futex hash table based on number of threads.

Right, I've been going over this and been poking at the patches for the
past few days, and I'm not quite sure where to start.

There's a bunch of simple things, that can be trivially fixed, but
there's also some more fundamental things.

I've written a pile of patches on top of this while playing around with
things. The latest pile sits in:

  queue/locking/futex

I'm not sure I should post the patches as a reply to this email (I can,
if people want), but let me try and summarize what I did and why.

Primarily, the reason I started poking at it is that I think the prctl()
as implemented is completely useless. Notably its effect is entirely
ephemeral, one pthread_create() call can re-size the hash, destroying
the user requested size. Also, I still feel one should be able to set
the hash size to 0 and have it revert to global hash.

Finally prctl() should not return until the rehash is complete.

I think my implementation now does all that -- but I've not tested it
yet -- I've to write a prctl() testcase and it was too nice outside :-)

So, on the way to reworking the prctl(), I ran into:

 - naming; hb_p is a terrible name, the way I read that is
   hash-bucket-private, or hash-bucket pointer, neither makes much sense,
   because it is a pointer to struct futex_private_hash, which is a
   hash-table.

   Rather uninspired, I've done s/hb_p/fph/g with the exception of
   hb->hb_p, which is now hb->priv.

 - more naming; you had:

    hb = __futex_hash(key);
    futex_hash_get(hb);
    futex_hash_put(hb);

    fph = futex_get_private_hash();
    futex_put_private_hash();

   which is all sorts of inconsistent, and I've made that:

    hb = __futex_hash(key);	/* hash, no get */
    hb = futex_hash(key);	/* hash and get */
    futex_hash_get(hb);		/* get */
    futex_hash_put(hb);		/* put */

    fph = futex_private_hash();
    futex_private_hash_get(fph);
    futex_private_hash_put(fph);

 - There was some superfluous state; notably, AFAICT
   futex_private_hash::{initial_ref_dropped,released} are unneeded and
   made the code unnecessarily complicated.

   You can drop the initial ref when phash && !phash_new, eg on the
   first time around when you allocate a new hash-table.

   We don't need to track released because we can simply check for that
   state using rcuref_read() == 0.

 - As alluded to in a previous point, there was no means of only
   hashing, the fph get was both non-obviously hidden inside the private
   hash and unconditional. Untangled that.

My current prctl() thing does:

 - reject !power-of-two and 1
 - accepts 0
 - returns once rehash is done

Notably, having done a prctl() disables the auto-sizing.
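
The size rules listed above can be sketched in plain C. This is only an
illustration of the described behaviour, not code from the patches;
futex_hash_size_valid() is a hypothetical name.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not from the series): models the size check
 * described above. 0 is accepted and means "revert to the global
 * hash", 1 is rejected, every other value must be a power of two.
 */
static bool futex_hash_size_valid(unsigned int slots)
{
	if (slots == 0)
		return true;			/* revert to the global hash */
	if (slots == 1)
		return false;			/* a single slot is rejected */
	return (slots & (slots - 1)) == 0;	/* exactly one bit set */
}
```

With this sketch, 0, 2 and 16 pass while 1, 3 and 24 are rejected.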

When allocating a new private hash table and there is already one
pending, it compares the tables. The compare function checks in order:

 - custom (user provided / prctl())
 - zero size
 - biggest size

IOW, any user-requested size always wins, a 0 size is final; otherwise
go with the largest.
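
A minimal userspace sketch of that ordering, assuming a simplified
descriptor with only a slot count and a "custom" flag. struct fph_model
and fph_model_less() are hypothetical names, and what happens when both
tables are user-requested is my assumption (they fall through to the
size rules).

```c
#include <stdbool.h>

/* Hypothetical stand-in for a private hash table descriptor. */
struct fph_model {
	unsigned int slots;	/* 0 means: fall back to the global hash */
	bool custom;		/* size was requested via prctl() */
};

/* Return true if @b should win over @a, checking the rules in order. */
static bool fph_model_less(const struct fph_model *a,
			   const struct fph_model *b)
{
	/* 1. a user-requested size beats an automatic one */
	if (a->custom != b->custom)
		return b->custom;

	/* 2. a zero size is final */
	if (b->slots == 0)
		return true;
	if (a->slots == 0)
		return false;

	/* 3. otherwise keep the largest */
	return a->slots < b->slots;
}
```

For example, a prctl()-requested table with 4 slots wins over an
auto-sized one with 16 slots, while between two auto-sized tables the
larger one is kept.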

After that I rebased my FUTEX2_NUMA patch on top of all this and added
a new FUTEX2_MPOL, which is something Christoph Lameter asked for a
while back, and something we can now actually do sanely, since we have
lockless vma lookups working.

Anyway, the entire stack builds and boots, but is otherwise very much
untested.

WDYT?

Re: [PATCH v9 00/11] futex: Add support task local hash maps.

Posted by Sebastian Andrzej Siewior 11 months ago

On 2025-03-03 11:54:16 [+0100], Peter Zijlstra wrote:
> After that I rebased my FUTEX2_NUMA patch on top of all this and added
> a new FUTEX2_MPOL, which is something Christoph Lameter asked for a
> while back, and something we can now actually do sanely, since we have
> lockless vma lookups working.

I'm going to keep the changes mostly as-is (except the few
compile fallouts). One thing I wanted to mention in case someone has a
simple idea: We have this now:
|struct {
|         unsigned long            hashmask;
|         unsigned int             hashshift;
|         struct futex_hash_bucket *queues[MAX_NUMNODES];
| } __futex_data __read_mostly __aligned(2*sizeof(long));

This MAX_NUMNODES will be set to 1 << 10 due to MAXSMP, for instance on
Debian. This in turn leads to an 8KiB queues array which will be
largely unused on a simple machine which has no / 1 nodes. I don't have
access to a machine with more than 4 nodes, so I _assumed_ this is the limit.
Anyway. I'm also not aware of the corner cases, say we have that many
nodes (1024) but just two CPUs. That would lead to roundup_pow_of_two(0) in
futex_init().
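
The arithmetic behind both numbers, as a userspace sketch. The MODEL_*
names and both helpers are hypothetical. The 8KiB figure assumes 8-byte
pointers, and the per-node formula (256 slots per CPU, divided across
the nodes) is my assumption about how futex_init() sizes the tables in
that tree.

```c
#include <stddef.h>

#define MODEL_NODES_SHIFT	10	/* assumed value with MAXSMP */
#define MODEL_MAX_NUMNODES	(1UL << MODEL_NODES_SHIFT)

struct futex_hash_bucket;	/* opaque here, only pointers are stored */

/* Bytes used by an array holding one bucket pointer per possible node. */
static size_t queues_array_bytes(void)
{
	return MODEL_MAX_NUMNODES * sizeof(struct futex_hash_bucket *);
}

/*
 * Assumed sizing rule: 256 slots per CPU, split across the nodes.
 * Integer division makes this 0 once nodes > 256 * cpus.
 */
static unsigned long per_node_slots(unsigned long cpus, unsigned long nodes)
{
	return 256 * cpus / nodes;
}
```

With 1024 entries and 8-byte pointers the array takes 8192 bytes, and
per_node_slots(2, 1024) evaluates to 0. The kernel documents the result
of roundup_pow_of_two() as undefined for 0, so that case would need an
explicit lower bound.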

> WDYT?

Sebastian

Re: [PATCH v9 00/11] futex: Add support task local hash maps.

Posted by Sebastian Andrzej Siewior 11 months, 1 week ago

On 2025-03-03 11:54:16 [+0100], Peter Zijlstra wrote:
> Right, I've been going over this and been poking at the patches for the
> past few days, and I'm not quite sure where to start.
> 
> There's a bunch of simple things, that can be trivially fixed, but
> there's also some more fundamental things.
> 
> I've written a pile of patches on top of this while playing around with
> things. The latest pile sits in:
> 
>   queue/locking/futex
> 
> I'm not sure I should post the patches as a reply to this email (I can,
> if people want), but let me try and summarize what I did and why.
> 
> 
> Primarily, the reason I started poking at it is that I think the prctl()
> as implemented is completely useless. Notably its effect is entirely
> ephemeral, one pthread_create() call can re-size the hash, destroying
> the user requested size. Also, I still feel one should be able to set
> the hash size to 0 and have it revert to global hash.
> 
> Finally prctl() should not return until the rehash is complete.
> 
> I think my implementation now does all that -- but I've not tested it
> yet -- I've to write a prctl() testcase and it was too nice outside :-)

I kept prctl() mostly around for testing with a few hacks to be able to
always resize it, even if the size is the same/ smaller. tglx wanted to
have it only increasing. However, let me take this and do some testing.

…
> Anyway, the entire stack builds and boots, but is otherwise very much
> untested.
> 
> WDYT?

Well, let me take a look and do a bit of hammering.

Sebastian

Re: [PATCH v9 00/11] futex: Add support task local hash maps.

Posted by Sebastian Andrzej Siewior 11 months, 1 week ago

On 2025-03-03 15:17:55 [+0100], To Peter Zijlstra wrote:
> > Anyway, the entire stack builds and boots, but is otherwise very much
> > untested.
> > 
> > WDYT?
> 
> Well, let me take a look and do a bit of hammering.

so you kept the q.drop_hb_ref logic and the reference get. You kept the
private reference but renamed it and hid it behind the CLASS. I meant to
do it, just wanted to check if you had another idea regarding it. But
okay.

You avoided the two states by dropping the refcount only when there is no
new pointer (!new). That should work.

There is no refcount check in futex_hash_free(). It wouldn't hurt to
check futex_phash for 0/1, right?

My first few tests succeeded. And I have a few RCU annotations, which I
will post once I complete them and finish my requeue-pi tests.

Sebastian

Re: [PATCH v9 00/11] futex: Add support task local hash maps.

Posted by Sebastian Andrzej Siewior 11 months, 1 week ago

On 2025-03-03 17:40:16 [+0100], To Peter Zijlstra wrote:
…
> You avoided the two states by dropping the refcount only when there is no
> new pointer (!new). That should work.
…
> My first few tests succeeded. And I have a few RCU annotations, which I
> will post once I complete them and finish my requeue-pi tests.

get_futex_key() has this:
|…
|         if (!fshared) {
|…
|                 if (IS_ENABLED(CONFIG_MMU))
|                         key->private.mm = mm;
|                 else
|                         key->private.mm = NULL;
|
|                 key->private.address = address;
|

and now __futex_hash_private() has this:
| {
|         if (!futex_key_is_private(key))
|                 return NULL;
|
|         if (!fph)
|                 fph = rcu_dereference(key->private.mm->futex_phash);

Dereferencing mm won't work on !CONFIG_MMU. We could limit private hash
to !CONFIG_BASE_SMALL && CONFIG_MMU.

Ignoring this, I managed to crash the box on top of 49fd6b8f5d59
("futex: Implement FUTEX2_MPOL"). I had one commit on top to make the
prctl non-blocking (make futex_hash_allocate(, false)). This is to simulate
the fork resize. The backtrace:
| [   T8658] BUG: unable to handle page fault for address: fffffffffffffff0
| [   T8658] #PF: supervisor read access in kernel mode
| [   T8658] #PF: error_code(0x0000) - not-present page
| [   T8658] PGD 2c5a067 P4D 2c5a067 PUD 2c5c067 PMD 0
| [   T8658] Oops: Oops: 0000 [#1] PREEMPT_RT SMP NOPTI
| [   T8658] CPU: 6 UID: 1001 PID: 8658 Comm: thread-create-l Not tainted 6.14.0-rc4+ #188 676565269ee73396c27dead3a66b3f774bd9af57
| [   T8658] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.03.0003.041920141333 04/19/2014
| [   T8658] RIP: 0010:plist_check_list+0xb/0xa0
| [   T8658] Code: cc cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 41 54 49 89 fc 55 53 48 83 ec 10 <48> 8b 1f 48 8b 43 08 48 39 c7  74 27 48 8b 4f 08 50 49 89 f8 48 89
| [   T8658] RSP: 0018:ffffc90022e27c90 EFLAGS: 00010286
| [   T8658] RAX: 0000000000000000 RBX: ffffc90022e27e00 RCX: 0000000000000000
| [   T8658] RDX: ffff888558da02a8 RSI: ffff888558da02a8 RDI: fffffffffffffff0
| [   T8658] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff8885680dc980
| [   T8658] R10: 0000031e8e1a7200 R11: ffff888574990028 R12: fffffffffffffff0
| [   T8658] R13: ffff888558da02a8 R14: ffffc90022e27e48 R15: ffffc90022e27d38
| [   T8658] FS:  00007f741af9e6c0(0000) GS:ffff8885a7c2b000(0000) knlGS:0000000000000000
| [   T8658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
| [   T8658] CR2: fffffffffffffff0 CR3: 00000006d7aca005 CR4: 00000000000626f0
| [   T8658] Call Trace:
| [   T8658]  <TASK>
| [   T8658]  plist_del+0x28/0x100
| [   T8658]  __futex_unqueue+0x29/0x40
| [   T8658]  futex_unqueue_pi+0x1f/0x40
| [   T8658]  futex_lock_pi+0x24d/0x420
| [   T8658]  do_futex+0x57/0x190
| [   T8658]  __x64_sys_futex+0xfe/0x1a0

It takes about 1h+ to reproduce. And only on one particular stubborn
box. This originates from futex_unqueue_pi() after
futex_q_lockptr_lock(). I have another crash within
futex_q_lockptr_lock() (in spin_lock()).

This looks like the locking task was not enqueued in the hash bucket
during the resize. This means there was a timeout and the unlocking task
removed it while looking for the next owner. But the unlocking part
acquired an additional reference to avoid a resize in that case. So,
confused I am.
I reverted to 50ca0ec83226 ("futex: Resize local futex hash table based
on number of threads."), have another "always resize hack" and so
far it looks good.
Looking at __futex_pivot_hash() there is this:
|         if (fph) {
|                 if (rcuref_read(&fph->users) != 0) {
|                         mm->futex_phash_new = new;
|                         return false;
|                 }
|
|                 futex_rehash_private(fph, new);
|         }

So we stash the new pointer as long as rcuref_read() does not return 0.
How stable is rcuref_read()'s 0 return actually? The code says:

| static inline unsigned int rcuref_read(rcuref_t *ref)
| {
|         unsigned int c = atomic_read(&ref->refcnt);
|
|         /* Return 0 if within the DEAD zone. */
|         return c >= RCUREF_RELEASED ? 0 : c + 1;
| }

so if it got negative on its final put, the c becomes -1/ 0xff…ff. This
+1 will be 0 and we do a resize. But it is negative and did not reach
RCUREF_DEAD yet so it can be bumped back to positive. It will not be
deconstructed because the cmpxchg in rcuref_put_slowpath() fails. So it
will remain active. But we do a resize here and end up with two private
hashes. That is why I had the `released' member.

Sebastian

Re: [PATCH v9 00/11] futex: Add support task local hash maps.

Posted by Peter Zijlstra 11 months ago

On Tue, Mar 04, 2025 at 03:58:37PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-03 17:40:16 [+0100], To Peter Zijlstra wrote:
> …
> > You avoided the two states by dropping the refcount only when there is no
> > new pointer (!new). That should work.
> …
> > My first few tests succeeded. And I have a few RCU annotations, which I
> > will post once I complete them and finish my requeue-pi tests.
> 
> get_futex_key() has this:
> |…
> |         if (!fshared) {
> |…
> |                 if (IS_ENABLED(CONFIG_MMU))
> |                         key->private.mm = mm;
> |                 else
> |                         key->private.mm = NULL;
> |
> |                 key->private.address = address;
> |
> 
> and now __futex_hash_private() has this:
> | {
> |         if (!futex_key_is_private(key))
> |                 return NULL;
> |
> |         if (!fph)
> |                 fph = rcu_dereference(key->private.mm->futex_phash);
> 
> Dereferencing mm won't work on !CONFIG_MMU. We could limit private hash
> to !CONFIG_BASE_SMALL && CONFIG_MMU.

Humph, yeah, not sure we should care about !MMU.

> Ignoring this, I managed to crash the box on top of 49fd6b8f5d59
> ("futex: Implement FUTEX2_MPOL"). I had one commit on top to make the
> prctl non-blocking (make futex_hash_allocate(, false)). This is to simulate
> the fork resize. The backtrace:
> | [   T8658] BUG: unable to handle page fault for address: fffffffffffffff0
> | [   T8658] #PF: supervisor read access in kernel mode
> | [   T8658] #PF: error_code(0x0000) - not-present page
> | [   T8658] PGD 2c5a067 P4D 2c5a067 PUD 2c5c067 PMD 0
> | [   T8658] Oops: Oops: 0000 [#1] PREEMPT_RT SMP NOPTI
> | [   T8658] CPU: 6 UID: 1001 PID: 8658 Comm: thread-create-l Not tainted 6.14.0-rc4+ #188 676565269ee73396c27dead3a66b3f774bd9af57
> | [   T8658] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.03.0003.041920141333 04/19/2014
> | [   T8658] RIP: 0010:plist_check_list+0xb/0xa0
> | [   T8658] Code: cc cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 41 54 49 89 fc 55 53 48 83 ec 10 <48> 8b 1f 48 8b 43 08 48 39 c7  74 27 48 8b 4f 08 50 49 89 f8 48 89
> | [   T8658] RSP: 0018:ffffc90022e27c90 EFLAGS: 00010286
> | [   T8658] RAX: 0000000000000000 RBX: ffffc90022e27e00 RCX: 0000000000000000
> | [   T8658] RDX: ffff888558da02a8 RSI: ffff888558da02a8 RDI: fffffffffffffff0
> | [   T8658] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff8885680dc980
> | [   T8658] R10: 0000031e8e1a7200 R11: ffff888574990028 R12: fffffffffffffff0
> | [   T8658] R13: ffff888558da02a8 R14: ffffc90022e27e48 R15: ffffc90022e27d38
> | [   T8658] FS:  00007f741af9e6c0(0000) GS:ffff8885a7c2b000(0000) knlGS:0000000000000000
> | [   T8658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> | [   T8658] CR2: fffffffffffffff0 CR3: 00000006d7aca005 CR4: 00000000000626f0
> | [   T8658] Call Trace:
> | [   T8658]  <TASK>
> | [   T8658]  plist_del+0x28/0x100
> | [   T8658]  __futex_unqueue+0x29/0x40
> | [   T8658]  futex_unqueue_pi+0x1f/0x40
> | [   T8658]  futex_lock_pi+0x24d/0x420
> | [   T8658]  do_futex+0x57/0x190
> | [   T8658]  __x64_sys_futex+0xfe/0x1a0
> 
> It takes about 1h+ to reproduce. And only on one particular stubborn
> box. This originates from futex_unqueue_pi() after
> futex_q_lockptr_lock(). I have another crash within
> futex_q_lockptr_lock() (in spin_lock()).
> 
> This looks like the locking task was not enqueued in the hash bucket
> during the resize. This means there was a timeout and the unlocking task
> removed it while looking for the next owner. But the unlocking part
> acquired an additional reference to avoid a resize in that case. So,
> confused I am.

Yeah, weird that.

> I reverted to 50ca0ec83226 ("futex: Resize local futex hash table based
> on number of threads."), have another "always resize hack" and so
> far it looks good.
> Looking at __futex_pivot_hash() there is this:
> |         if (fph) {
> |                 if (rcuref_read(&fph->users) != 0) {
> |                         mm->futex_phash_new = new;
> |                         return false;
> |                 }
> |
> |                 futex_rehash_private(fph, new);
> |         }
> 
> So we stash the new pointer as long as rcuref_read() does not return 0.
> How stable is rcuref_read()'s 0 return actually? The code says:
> 
> | static inline unsigned int rcuref_read(rcuref_t *ref)
> | {
> |         unsigned int c = atomic_read(&ref->refcnt);
> |
> |         /* Return 0 if within the DEAD zone. */
> |         return c >= RCUREF_RELEASED ? 0 : c + 1;
> | }
> 
> so if it got negative on its final put, the c becomes -1/ 0xff…ff. This
> +1 will be 0 and we do a resize. But it is negative and did not reach
> RCUREF_DEAD yet so it can be bumped back to positive. It will not be
> deconstructed because the cmpxchg in rcuref_put_slowpath() fails. So it
> will remain active. But we do a resize here and end up with two private
> hashes. That is why I had the `released' member.

I am not quite sure I follow. If rcuref_put_slowpath() returns true,
then the value has been set to DEAD (high nibble E), any concurrent
inc/dec will move it away from that a little, but it will always be set
back to DEAD (IOW, you need 1<<29 concurrent modifications in the same
direction to push it out of the DEAD range).

As long as it is within those 29 bits of DEAD, rcuref_read() should
return 0.
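
The 29-bit margin can be checked with a small userspace model. The zone
constants are the values I understand the kernel's rcuref header to
use; model_rcuref_read() mirrors the rcuref_read() body quoted earlier
in this thread but takes a plain value, and the dead_margin_*() helpers
are hypothetical.

```c
/* Zone boundaries, assumed to match include/linux/rcuref.h. */
#define RCUREF_RELEASED	0xC0000000U
#define RCUREF_DEAD	0xE0000000U
#define RCUREF_NOREF	0xFFFFFFFFU

/* Same logic as the quoted rcuref_read(), applied to a raw value. */
static unsigned int model_rcuref_read(unsigned int c)
{
	/* Return 0 if within the DEAD zone. */
	return c >= RCUREF_RELEASED ? 0 : c + 1;
}

/* Distance from DEAD down to the RELEASED boundary. */
static unsigned int dead_margin_down(void)
{
	return RCUREF_DEAD - RCUREF_RELEASED;
}

/* Distance from DEAD up to NOREF. */
static unsigned int dead_margin_up(void)
{
	return RCUREF_NOREF - RCUREF_DEAD;
}
```

Under these values the margin is 1 << 29 downwards and (1 << 29) - 1
upwards, and a counter that drifted a little around DEAD still reads
as 0.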

Re: [PATCH v9 00/11] futex: Add support task local hash maps.

Posted by Sebastian Andrzej Siewior 11 months, 1 week ago

On 2025-03-04 15:58:39 [+0100], To Peter Zijlstra wrote:
> hash. That is why I had the `released' member.

The box was still alive this morning so it did survive >12h testing. I
would bring the `released' member back unless you have other
preferences.
Depending on those I could fold the fixes directly into the patches and
repost the whole thing or prepare patches for you that can be folded back
and send those.

Sebastian

Re: [PATCH v9 00/11] futex: Add support task local hash maps.

Posted by Peter Zijlstra 11 months ago

On Wed, Mar 05, 2025 at 10:02:37AM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-04 15:58:39 [+0100], To Peter Zijlstra wrote:
> > hash. That is why I had the `released' member.
> 
> The box was still alive this morning so it did survive >12h testing. I
> would bring the `released' member back unless you have other
> preferences.

Like I just wrote in that other email; I'm a bit confused as to how this
can happen. If rcuref_put() returns success, then the value is DEAD. It must
then either be decremented below RELEASED or incremented past NOREF in
order for rcuref_read() to no longer return 0.

> Depending on those I could fold the fixes directly into the patches and
> repost the whole thing or prepare patches for you that can be folded back
> and send those.

Please, it appears I don't have as much time as I would like :/

Re: [PATCH v9 00/11] futex: Add support task local hash maps.

Posted by Sebastian Andrzej Siewior 11 months ago

On 2025-03-10 17:01:02 [+0100], Peter Zijlstra wrote:
> On Wed, Mar 05, 2025 at 10:02:37AM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-03-04 15:58:39 [+0100], To Peter Zijlstra wrote:
> > > hash. That is why I had the `released' member.
> > 
> > The box was still alive this morning so it did survive >12h testing. I
> > would bring the `released' member back unless you have other
> > preferences.
> 
> Like I just wrote in that other email; I'm a bit confused as to how this
> can happen. If rcuref_put() returns success, then the value is DEAD. It must
> then either be decremented below RELEASED or incremented past NOREF in
> order for rcuref_read() to no longer return 0.

We can't rely on 0 to be released as it might become active. We could
change rcuref_read() to return 0 if it could be obtained and -1 if it
can not.
We don't have many users atm so an audit should be quick.

> > Depending on those I could fold the fixes directly into the patches and
> > repost the whole thing or prepare patches for you that can be folded back
> > and send those.
> 
> Please, it appears I don't have as much time as I would like :/

Sebastian

Re: [PATCH v9 00/11] futex: Add support task local hash maps.

Posted by Peter Zijlstra 11 months ago

On Mon, Mar 10, 2025 at 05:27:10PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-10 17:01:02 [+0100], Peter Zijlstra wrote:
> > On Wed, Mar 05, 2025 at 10:02:37AM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2025-03-04 15:58:39 [+0100], To Peter Zijlstra wrote:
> > > > hash. That is why I had the `released' member.
> > > 
> > > The box was still alive this morning so it did survive >12h testing. I
> > > would bring the `released' member back unless you have other
> > > preferences.
> > 
> > Like I just wrote in that other email; I'm a bit confused as to how this
> > can happen. If rcuref_put() returns success, then the value is DEAD. It must
> > then either be decremented below RELEASED or incremented past NOREF in
> > order for rcuref_read() to no longer return 0.
> 
> We can't rely on 0 to be released as it might become active. We could
> change rcuref_read() to return 0 if it could be obtained and -1 if it
> can not.
> We don't have many users atm so an audit should be quick.

Right, so I failed to understand initially. When DEAD it stays 0, but
there is indeed the one case where it isn't yet DEAD but still returns
0.

Making the DEAD return -1 seems like a good solution.

Re: [PATCH v9 00/11] futex: Add support task local hash maps.

Posted by Sebastian Andrzej Siewior 11 months ago

On 2025-03-11 11:17:14 [+0100], Peter Zijlstra wrote:
> Right, so I failed to understand initially. When DEAD it stays 0, but
> there is indeed the one case where it isn't yet DEAD but still returns
> 0.
> 
> Making the DEAD return -1 seems like a good solution.

The patch below is what I have/ tglx asked for. I intend to use it in the
series and repost it once I have fixed it up.

-------------->8--------------

Subject: [PATCH] rcuref: Provide rcuref_is_dead().

rcuref_read() returns the number of references that are currently held.
If 0 is returned then it is not safe to assume that the object can be
scheduled for deconstruction because it is marked DEAD. This happens if
the return value of rcuref_put() is ignored and assumptions are made.

If 0 is returned then the counter transitioned from 0 to RCUREF_NOREF.
If rcuref_put() did not return to the caller then the counter did not
yet transition from RCUREF_NOREF to RCUREF_DEAD. This means that there
is still a chance that the counter will transition from
RCUREF_NOREF to 0 meaning it is still valid and must not be
deconstructed. In this brief window rcuref_read() will return 0.

Provide rcuref_is_dead() to determine if the counter is marked as
RCUREF_DEAD.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/rcuref.h | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/rcuref.h b/include/linux/rcuref.h
index 6322d8c1c6b42..2fb2af6d98249 100644
--- a/include/linux/rcuref.h
+++ b/include/linux/rcuref.h
@@ -30,7 +30,11 @@ static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
  * rcuref_read - Read the number of held reference counts of a rcuref
  * @ref:	Pointer to the reference count
  *
- * Return: The number of held references (0 ... N)
+ * Return: The number of held references (0 ... N). The value 0 does not
+ * indicate that it is safe to schedule the object, protected by this reference
+ * counter, for deconstruction.
+ * If you want to know if the reference counter has been marked DEAD (as
+ * signaled by rcuref_put()) please use rcuref_is_dead().
  */
 static inline unsigned int rcuref_read(rcuref_t *ref)
 {
@@ -40,6 +44,22 @@ static inline unsigned int rcuref_read(rcuref_t *ref)
 	return c >= RCUREF_RELEASED ? 0 : c + 1;
 }

+/**
+ * rcuref_is_dead -	Check if the rcuref has been already marked dead
+ * @ref:		Pointer to the reference count
+ *
+ * Return: True if the object has been marked DEAD. This signals that a previous
+ * invocation of rcuref_put() returned true on this reference counter meaning
+ * the protected object can safely be scheduled for deconstruction.
+ * Otherwise, returns false.
+ */
+static inline bool rcuref_is_dead(rcuref_t *ref)
+{
+	unsigned int c = atomic_read(&ref->refcnt);
+
+	return (c >= RCUREF_RELEASED) && (c < RCUREF_NOREF);
+}
+
 extern __must_check bool rcuref_get_slowpath(rcuref_t *ref);

 /**
-- 
2.47.2
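
To illustrate the window the patch description talks about, here is a
userspace model that works on plain values instead of an rcuref_t. The
constants are assumed to match the kernel's rcuref header;
model_read() and model_is_dead() are hypothetical names that copy the
logic of rcuref_read() and the proposed rcuref_is_dead().

```c
#include <stdbool.h>

/* Zone boundaries, assumed to match include/linux/rcuref.h. */
#define RCUREF_RELEASED	0xC0000000U
#define RCUREF_DEAD	0xE0000000U
#define RCUREF_NOREF	0xFFFFFFFFU

/* Logic of rcuref_read(): any value at or above RELEASED reads as 0. */
static unsigned int model_read(unsigned int c)
{
	return c >= RCUREF_RELEASED ? 0 : c + 1;
}

/* Logic of the proposed rcuref_is_dead(): NOREF itself is excluded. */
static bool model_is_dead(unsigned int c)
{
	return (c >= RCUREF_RELEASED) && (c < RCUREF_NOREF);
}
```

In this model a counter sitting at RCUREF_NOREF reads as 0 but is not
reported dead, while one at RCUREF_DEAD reads as 0 and is reported
dead. That first case is the one where the counter can still become
active again.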