[PATCH] sched/membarrier: Fix redundant load of membarrier_state
Posted by Nysal Jan K.A. 1 month, 3 weeks ago
From: "Nysal Jan K.A" <nysal@linux.ibm.com>

On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
is not selected, sync_core_before_usermode() is a no-op.
In membarrier_mm_sync_core_before_usermode() the compiler does not
eliminate redundant branches and the load of mm->membarrier_state
for this case as the atomic_read() cannot be optimized away.

Here's a snippet of the code generated for finish_task_switch() on powerpc:

1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
.......
1b78c8:   cmpdi   cr7,r26,0
1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
1b78d0:   ld      r9,2312(r13)    # current
1b78d4:   ld      r9,1888(r9)     # current->mm
1b78d8:   cmpd    cr7,r26,r9
1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
1b78e0:   hwsync
1b78e4:   cmplwi  cr7,r27,128
.......
1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
1b7a74:   b       1b78e0 <finish_task_switch+0xcc>

This was found while analyzing "perf c2c" reports on kernels prior
to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
where mm_count was false sharing with membarrier_state.

There is a minor improvement in the size of finish_task_switch().
The following are results from bloat-o-meter:

GCC 7.5.0:
----------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                                     old     new   delta
finish_task_switch                           884     852     -32

GCC 12.2.1:
-----------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                                     old     new   delta
finish_task_switch.isra                      852     820     -32

LLVM 17.0.6:
------------
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
Function                                     old     new   delta
rt_mutex_schedule                            120     104     -16
finish_task_switch                           792     772     -20

Signed-off-by: Nysal Jan K.A <nysal@linux.ibm.com>
---
 include/linux/sched/mm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 07bb8d4181d7..042e60ab853a 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -540,6 +540,8 @@ enum {
 
 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 {
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
+		return;
 	if (current->mm != mm)
 		return;
 	if (likely(!(atomic_read(&mm->membarrier_state) &
-- 
2.35.3
Re: [PATCH] sched/membarrier: Fix redundant load of membarrier_state
Posted by Michael Ellerman 1 month ago
[To += Mathieu]

"Nysal Jan K.A." <nysal@linux.ibm.com> writes:
> From: "Nysal Jan K.A" <nysal@linux.ibm.com>
>
> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> is not selected, sync_core_before_usermode() is a no-op.
> In membarrier_mm_sync_core_before_usermode() the compiler does not
> eliminate redundant branches and the load of mm->membarrier_state
> for this case as the atomic_read() cannot be optimized away.

I was wondering if this was caused by powerpc's arch_atomic_read() which
uses asm volatile.

But replacing arch_atomic_read() with READ_ONCE() makes no difference,
presumably because the compiler still can't see that the READ_ONCE() is
unnecessary (which is kind of by design).
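
As a rough illustration (userspace sketch only, not the kernel's real
definitions), a single volatile load is enough to keep the whole path
alive even when the sync step compiles to nothing:

  struct mm { int membarrier_state; };

  /* models atomic_read()/READ_ONCE(): the volatile access is a side
   * effect the compiler must perform on every path that reaches it */
  static inline int relaxed_read(const volatile int *p)
  {
          return *p;
  }

  static inline void sync_core(void) { }  /* no-op, like the !ARCH_HAS_... case */

  void sync_core_before_usermode_sketch(struct mm *cur, struct mm *mm)
  {
          if (cur != mm)
                  return;
          if (!(relaxed_read(&mm->membarrier_state) & 0x80))
                  return;
          sync_core();    /* empty, yet the load and branches above remain */
  }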

> Here's a snippet of the code generated for finish_task_switch() on powerpc:
>
> 1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
> .......
> 1b78c8:   cmpdi   cr7,r26,0
> 1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
> 1b78d0:   ld      r9,2312(r13)    # current
> 1b78d4:   ld      r9,1888(r9)     # current->mm
> 1b78d8:   cmpd    cr7,r26,r9
> 1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
> 1b78e0:   hwsync
> 1b78e4:   cmplwi  cr7,r27,128
> .......
> 1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
> 1b7a74:   b       1b78e0 <finish_task_switch+0xcc>
>
> This was found while analyzing "perf c2c" reports on kernels prior
> to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
> where mm_count was false sharing with membarrier_state.

So it was causing a noticeable performance blip? But isn't anymore?

> There is a minor improvement in the size of finish_task_switch().
> The following are results from bloat-o-meter:
>
> GCC 7.5.0:
> ----------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch                           884     852     -32
>
> GCC 12.2.1:
> -----------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch.isra                      852     820     -32

GCC 12 is a couple of years old, I assume GCC 14 behaves similarly?

> LLVM 17.0.6:
> ------------
> add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
> Function                                     old     new   delta
> rt_mutex_schedule                            120     104     -16
> finish_task_switch                           792     772     -20
>
> Signed-off-by: Nysal Jan K.A <nysal@linux.ibm.com>
> ---
>  include/linux/sched/mm.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 07bb8d4181d7..042e60ab853a 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -540,6 +540,8 @@ enum {
>  
>  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>  {
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> +		return;
>  	if (current->mm != mm)
>  		return;
>  	if (likely(!(atomic_read(&mm->membarrier_state) &

The other option would be to have a completely separate stub, eg:

  #ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
  {
          if (current->mm != mm)
                  return;
          if (likely(!(atomic_read(&mm->membarrier_state) &
                       MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
                  return;
          sync_core_before_usermode();
  }
  #else
  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { }
  #endif

Not sure what folks prefer.

In either case I think it's probably worth a short comment explaining
why it's worth the trouble (ie. that the atomic_read() prevents the
compiler from doing DCE).

cheers
Re: [PATCH] sched/membarrier: Fix redundant load of membarrier_state
Posted by Nysal Jan K.A. 1 month ago
On Fri, Oct 25, 2024 at 11:29:38AM +1100, Michael Ellerman wrote:
> [To += Mathieu]
> 
> "Nysal Jan K.A." <nysal@linux.ibm.com> writes:
> > From: "Nysal Jan K.A" <nysal@linux.ibm.com>
> >
> > On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> > is not selected, sync_core_before_usermode() is a no-op.
> > In membarrier_mm_sync_core_before_usermode() the compiler does not
> > eliminate redundant branches and the load of mm->membarrier_state
> > for this case as the atomic_read() cannot be optimized away.
> 
> I was wondering if this was caused by powerpc's arch_atomic_read() which
> uses asm volatile.
> 

Yes, that's my understanding as well.

> But replacing arch_atomic_read() with READ_ONCE() makes no difference,
> presumably because the compiler still can't see that the READ_ONCE() is
> unnecessary (which is kind of by design).
> 

In READ_ONCE() we cast to a volatile pointer, so I think the compiler cannot
eliminate the load in that case.
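
For reference, the idea behind READ_ONCE() is roughly the following
(simplified sketch; the real macro, in include/asm-generic/rwonce.h these
days, also strips qualifiers and adds type checks):

  #define READ_ONCE_SKETCH(x) (*(const volatile typeof(x) *)&(x))

Every expansion is an access through a volatile-qualified lvalue, which the
compiler has to treat as a side effect, so it can neither CSE it nor drop
it as dead code.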

> > Here's a snippet of the code generated for finish_task_switch() on powerpc:
> >
> > 1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
> > .......
> > 1b78c8:   cmpdi   cr7,r26,0
> > 1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
> > 1b78d0:   ld      r9,2312(r13)    # current
> > 1b78d4:   ld      r9,1888(r9)     # current->mm
> > 1b78d8:   cmpd    cr7,r26,r9
> > 1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
> > 1b78e0:   hwsync
> > 1b78e4:   cmplwi  cr7,r27,128
> > .......
> > 1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
> > 1b7a74:   b       1b78e0 <finish_task_switch+0xcc>
> >
> > This was found while analyzing "perf c2c" reports on kernels prior
> > to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
> > where mm_count was false sharing with membarrier_state.
> 
> So it was causing a noticeable performance blip? But isn't anymore?
> 

It was noticeable in that it showed up amongst the top entries in perf c2c reports.
There was similar false sharing with other fields that share the cache line with
mm_count, so the gains were minimal with just this patch. c1753fd02a00 addresses
these cases too.

> > There is a minor improvement in the size of finish_task_switch().
> > The following are results from bloat-o-meter:
> >
> > GCC 7.5.0:
> > ----------
> > add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> > Function                                     old     new   delta
> > finish_task_switch                           884     852     -32
> >
> > GCC 12.2.1:
> > -----------
> > add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> > Function                                     old     new   delta
> > finish_task_switch.isra                      852     820     -32
> 
> GCC 12 is a couple of years old, I assume GCC 14 behaves similarly?
> 

I cross compiled for aarch64 with gcc 14.1.1 and see similar results:

add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
Function                                     old     new   delta
get_nohz_timer_target                        352     356      +4
e843419@0b02_0000d7e7_408                      8       -      -8
e843419@01bb_000021d2_868                      8       -      -8
finish_task_switch.isra                      592     548     -44
Total: Before=31013792, After=31013736, chg -0.00%

> > LLVM 17.0.6:
> > ------------
> > add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
> > Function                                     old     new   delta
> > rt_mutex_schedule                            120     104     -16
> > finish_task_switch                           792     772     -20
> >
> > Signed-off-by: Nysal Jan K.A <nysal@linux.ibm.com>
> > ---
> >  include/linux/sched/mm.h | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > index 07bb8d4181d7..042e60ab853a 100644
> > --- a/include/linux/sched/mm.h
> > +++ b/include/linux/sched/mm.h
> > @@ -540,6 +540,8 @@ enum {
> >  
> >  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
> >  {
> > +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> > +		return;
> >  	if (current->mm != mm)
> >  		return;
> >  	if (likely(!(atomic_read(&mm->membarrier_state) &
> 
> The other option would be to have a completely separate stub, eg:
> 
>   #ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>   {
>           if (current->mm != mm)
>                   return;
>           if (likely(!(atomic_read(&mm->membarrier_state) &
>                        MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
>                   return;
>           sync_core_before_usermode();
>   }
>   #else
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { }
>   #endif
> 
> Not sure what folks prefer.
> 
> In either case I think it's probably worth a short comment explaining
> why it's worth the trouble (ie. that the atomic_read() prevents the
> compiler from doing DCE).
>

I'll send a v2 with a comment added in there. Thanks for the review.

--Nysal
Re: [PATCH] sched/membarrier: Fix redundant load of membarrier_state
Posted by Mathieu Desnoyers 1 month ago
On 2024-10-24 20:29, Michael Ellerman wrote:
> [To += Mathieu]
> 
> "Nysal Jan K.A." <nysal@linux.ibm.com> writes:
>> From: "Nysal Jan K.A" <nysal@linux.ibm.com>
>>
>> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
>> is not selected, sync_core_before_usermode() is a no-op.
>> In membarrier_mm_sync_core_before_usermode() the compiler does not
>> eliminate redundant branches and the load of mm->membarrier_state
>> for this case as the atomic_read() cannot be optimized away.
> 
> I was wondering if this was caused by powerpc's arch_atomic_read() which
> uses asm volatile.
> 
> But replacing arch_atomic_read() with READ_ONCE() makes no difference,
> presumably because the compiler still can't see that the READ_ONCE() is
> unnecessary (which is kind of by design).
> 
>> Here's a snippet of the code generated for finish_task_switch() on powerpc:
>>
>> 1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
>> .......
>> 1b78c8:   cmpdi   cr7,r26,0
>> 1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
>> 1b78d0:   ld      r9,2312(r13)    # current
>> 1b78d4:   ld      r9,1888(r9)     # current->mm
>> 1b78d8:   cmpd    cr7,r26,r9
>> 1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
>> 1b78e0:   hwsync
>> 1b78e4:   cmplwi  cr7,r27,128
>> .......
>> 1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
>> 1b7a74:   b       1b78e0 <finish_task_switch+0xcc>
>>
>> This was found while analyzing "perf c2c" reports on kernels prior
>> to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
>> where mm_count was false sharing with membarrier_state.
> 
> So it was causing a noticeable performance blip? But isn't anymore?

I indeed moved mm_count into its own cacheline in response to
performance regression reports, which were caused by simply
loading the pcpu_cid pointer frequently enough. So if membarrier_state
was also sharing that cache line, it makes sense that moving
mm_count away helped there as well.
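
A userspace sketch of the idea behind that commit (not the upstream layout,
field names illustrative only): the heavily written counter gets its own
64-byte line, so bumping it no longer invalidates the line that read-mostly
fields live on.

  #include <stdalign.h>
  #include <stdatomic.h>

  struct mm_sketch {
          alignas(64) atomic_int mm_count;          /* hot: mmgrab()/mmdrop() */
          alignas(64) atomic_int membarrier_state;  /* read-mostly */
  };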

[...]

>> ---
>>   include/linux/sched/mm.h | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
>> index 07bb8d4181d7..042e60ab853a 100644
>> --- a/include/linux/sched/mm.h
>> +++ b/include/linux/sched/mm.h
>> @@ -540,6 +540,8 @@ enum {
>>   
>>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>>   {
>> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
>> +		return;
>>   	if (current->mm != mm)
>>   		return;
>>   	if (likely(!(atomic_read(&mm->membarrier_state) &

I prefer the approach above, because it requires fewer kernel
configurations to reach the same compilation code coverage.

Thanks,

Mathieu


> 
> The other option would be to have a completely separate stub, eg:
> 
>    #ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
>    static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>    {
>            if (current->mm != mm)
>                    return;
>            if (likely(!(atomic_read(&mm->membarrier_state) &
>                         MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
>                    return;
>            sync_core_before_usermode();
>    }
>    #else
>    static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { }
>    #endif
> 
> Not sure what folks prefer.
> 
> In either case I think it's probably worth a short comment explaining
> why it's worth the trouble (ie. that the atomic_read() prevents the
> compiler from doing DCE).
> 
> cheers

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH] sched/membarrier: Fix redundant load of membarrier_state
Posted by Segher Boessenkool 1 month ago
Hi!

On Fri, Oct 25, 2024 at 11:29:38AM +1100, Michael Ellerman wrote:
> [To += Mathieu]
> 
> "Nysal Jan K.A." <nysal@linux.ibm.com> writes:
> > From: "Nysal Jan K.A" <nysal@linux.ibm.com>
> >
> > On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> > is not selected, sync_core_before_usermode() is a no-op.
> > In membarrier_mm_sync_core_before_usermode() the compiler does not
> > eliminate redundant branches and the load of mm->membarrier_state
> > for this case as the atomic_read() cannot be optimized away.
> 
> I was wondering if this was caused by powerpc's arch_atomic_read() which
> uses asm volatile.
> 
> But replacing arch_atomic_read() with READ_ONCE() makes no difference,
> presumably because the compiler still can't see that the READ_ONCE() is
> unnecessary (which is kind of by design).

Exactly.

> > GCC 12.2.1:
> > -----------
> > add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> > Function                                     old     new   delta
> > finish_task_switch.isra                      852     820     -32
> 
> GCC 12 is a couple of years old, I assume GCC 14 behaves similarly?

GCC 12 is still being actively developed.  There will be a 12.5,
probably around the middle of next year (and that will be the last 12.x
release, yes).  The GCC homepage (<https://gcc.gnu.org>) will tell you
which releases are still maintained/supported, and sometimes you can
derive our plans from there as well :-)

But yes, 14 is similar (I did not test, but I feel confident making that
assertion :-) )

> >  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
> >  {
> > +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> > +		return;
> >  	if (current->mm != mm)
> >  		return;
> >  	if (likely(!(atomic_read(&mm->membarrier_state) &
> 
> The other option would be to have a completely separate stub, eg:
> 
>   #ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>   {
>           if (current->mm != mm)
>                   return;
>           if (likely(!(atomic_read(&mm->membarrier_state) &
>                        MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
>                   return;
>           sync_core_before_usermode();
>   }
>   #else
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { }
>   #endif
> 
> Not sure what folks prefer.
> 
> In either case I think it's probably worth a short comment explaining
> why it's worth the trouble (ie. that the atomic_read() prevents the
> compiler from doing DCE).

Since you ask, I like the proposed change (the inline one) best.  But
yeah, comment please!

(And it is not about DCE -- the definition of __READ_ONCE by itself
makes it impossible to CSE any expression involving it: it (standards-
violatingly) casts the pointer to a pointer to volatile, and you cannot
CSE any accesses to volatile objects!)

So what are the actual semantics the kernel wants from its READ_ONCE,
and from its atomics in general?  GCC has perfectly fine in-compiler
support for such things; there is no need to make a second-rate manual
implementation of parts of this when you can use a good implementation
of everything instead!
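
Presumably the in-compiler support meant here is the __atomic builtins; a
relaxed atomic load, for example, expresses "read this once, without
tearing" without volatile semantics, so the compiler keeps whatever freedom
to combine or move loads the memory model allows (userspace sketch):

  int read_state_relaxed(const int *p)
  {
          return __atomic_load_n(p, __ATOMIC_RELAXED);
  }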


Segher
Re: [PATCH] sched/membarrier: Fix redundant load of membarrier_state
Posted by Stephen Rothwell 1 month ago
Hi Michael,

On Fri, 25 Oct 2024 11:29:38 +1100 Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > index 07bb8d4181d7..042e60ab853a 100644
> > --- a/include/linux/sched/mm.h
> > +++ b/include/linux/sched/mm.h
> > @@ -540,6 +540,8 @@ enum {
> >  
> >  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
> >  {
> > +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> > +		return;
> >  	if (current->mm != mm)
> >  		return;
> >  	if (likely(!(atomic_read(&mm->membarrier_state) &  
> 
> The other option would be to have a completely separate stub, eg:
> 
>   #ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>   {
>           if (current->mm != mm)
>                   return;
>           if (likely(!(atomic_read(&mm->membarrier_state) &
>                        MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
>                   return;
>           sync_core_before_usermode();
>   }
>   #else
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { }
>   #endif
> 
> Not sure what folks prefer.

In case it matters, in general I prefer the first, as there is less
code to get out of sync and all the code still gets parsed by the
compiler whether the CONFIG option is set or not.
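
A sketch of the difference (CONFIG_FOO and do_foo() are hypothetical;
IS_ENABLED() expands to a compile-time 0 or 1):

  static inline void maybe_do_foo(struct mm_struct *mm)
  {
          if (!IS_ENABLED(CONFIG_FOO))
                  return;
          /* parsed and type-checked in every configuration, then
           * discarded as dead code when CONFIG_FOO is not set */
          do_foo(mm);
  }

With the #ifdef variant the disabled body is never even seen by the
compiler, so it can silently bit-rot.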
-- 
Cheers,
Stephen Rothwell
[PATCH v2] sched/membarrier: Fix redundant load of membarrier_state
Posted by Nysal Jan K.A. 4 weeks, 1 day ago
On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
is not selected, sync_core_before_usermode() is a no-op.
In membarrier_mm_sync_core_before_usermode() the compiler does not
eliminate redundant branches and load of mm->membarrier_state
for this case as the atomic_read() cannot be optimized away.

Here's a snippet of the code generated for finish_task_switch() on powerpc
prior to this change:

1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
.......
1b78c8:   cmpdi   cr7,r26,0
1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
1b78d0:   ld      r9,2312(r13)    # current
1b78d4:   ld      r9,1888(r9)     # current->mm
1b78d8:   cmpd    cr7,r26,r9
1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
1b78e0:   hwsync
1b78e4:   cmplwi  cr7,r27,128
.......
1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
1b7a74:   b       1b78e0 <finish_task_switch+0xcc>

This was found while analyzing "perf c2c" reports on kernels prior
to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
where mm_count was false sharing with membarrier_state.

There is a minor improvement in the size of finish_task_switch().
The following are results from bloat-o-meter for ppc64le:

GCC 7.5.0
---------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                                     old     new   delta
finish_task_switch                           884     852     -32

GCC 12.2.1
----------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                                     old     new   delta
finish_task_switch.isra                      852     820     -32

LLVM 17.0.6
-----------
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
Function                                     old     new   delta
rt_mutex_schedule                            120     104     -16
finish_task_switch                           792     772     -20

Results on aarch64:

GCC 14.1.1
----------
add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
Function                                     old     new   delta
get_nohz_timer_target                        352     356      +4
e843419@0b02_0000d7e7_408                      8       -      -8
e843419@01bb_000021d2_868                      8       -      -8
finish_task_switch.isra                      592     548     -44

Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
---
V1 -> V2:
- Add results for aarch64
- Add a comment describing the changes
---
 include/linux/sched/mm.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 928a626725e6..b13474825130 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -531,6 +531,13 @@ enum {
 
 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 {
+	/*
+	 * The atomic_read() below prevents CSE. The following should
+	 * help the compiler generate more efficient code on architectures
+	 * where sync_core_before_usermode() is a no-op.
+	 */
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
+		return;
 	if (current->mm != mm)
 		return;
 	if (likely(!(atomic_read(&mm->membarrier_state) &
-- 
2.47.0
Re: [PATCH v2] sched/membarrier: Fix redundant load of membarrier_state
Posted by Michal Hocko 1 week, 2 days ago
I do not see this patch staged in any tree (e.g. linux-next). Is this on
its way to be merged?

Thanks!

On Tue 29-10-24 11:21:28, Nysal Jan K.A. wrote:
> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> is not selected, sync_core_before_usermode() is a no-op.
> In membarrier_mm_sync_core_before_usermode() the compiler does not
> eliminate redundant branches and load of mm->membarrier_state
> for this case as the atomic_read() cannot be optimized away.
> 
> Here's a snippet of the code generated for finish_task_switch() on powerpc
> prior to this change:
> 
> 1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
> .......
> 1b78c8:   cmpdi   cr7,r26,0
> 1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
> 1b78d0:   ld      r9,2312(r13)    # current
> 1b78d4:   ld      r9,1888(r9)     # current->mm
> 1b78d8:   cmpd    cr7,r26,r9
> 1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
> 1b78e0:   hwsync
> 1b78e4:   cmplwi  cr7,r27,128
> .......
> 1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
> 1b7a74:   b       1b78e0 <finish_task_switch+0xcc>
> 
> This was found while analyzing "perf c2c" reports on kernels prior
> to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
> where mm_count was false sharing with membarrier_state.
> 
> There is a minor improvement in the size of finish_task_switch().
> The following are results from bloat-o-meter for ppc64le:
> 
> GCC 7.5.0
> ---------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch                           884     852     -32
> 
> GCC 12.2.1
> ----------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch.isra                      852     820     -32
> 
> LLVM 17.0.6
> -----------
> add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
> Function                                     old     new   delta
> rt_mutex_schedule                            120     104     -16
> finish_task_switch                           792     772     -20
> 
> Results on aarch64:
> 
> GCC 14.1.1
> ----------
> add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
> Function                                     old     new   delta
> get_nohz_timer_target                        352     356      +4
> e843419@0b02_0000d7e7_408                      8       -      -8
> e843419@01bb_000021d2_868                      8       -      -8
> finish_task_switch.isra                      592     548     -44
> 
> Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
> ---
> V1 -> V2:
> - Add results for aarch64
> - Add a comment describing the changes
> ---
>  include/linux/sched/mm.h | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 928a626725e6..b13474825130 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -531,6 +531,13 @@ enum {
>  
>  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>  {
> +	/*
> +	 * The atomic_read() below prevents CSE. The following should
> +	 * help the compiler generate more efficient code on architectures
> +	 * where sync_core_before_usermode() is a no-op.
> +	 */
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> +		return;
>  	if (current->mm != mm)
>  		return;
>  	if (likely(!(atomic_read(&mm->membarrier_state) &
> -- 
> 2.47.0
> 

-- 
Michal Hocko
SUSE Labs
Re: [PATCH v2] sched/membarrier: Fix redundant load of membarrier_state
Posted by Peter Zijlstra 1 week, 2 days ago
On Mon, Nov 18, 2024 at 10:04:18AM +0100, Michal Hocko wrote:
> I do not see this patch staged in any tree (e.g. linux-next). Is this on
> its way to be merged?

I only now found it -- it doesn't look super urgent. I'll get it into a
git tree after -rc1.
Re: [PATCH v2] sched/membarrier: Fix redundant load of membarrier_state
Posted by Michal Hocko 1 week, 2 days ago
On Mon 18-11-24 10:25:17, Peter Zijlstra wrote:
> On Mon, Nov 18, 2024 at 10:04:18AM +0100, Michal Hocko wrote:
> > I do not see this patch staged in any tree (e.g. linux-next). Is this on
> > its way to be merged?
> 
> I only now found it -- it doesn't look super urgent. I'll get it into a
> git tree after -rc1.

Thanks!

-- 
Michal Hocko
SUSE Labs
Re: [PATCH v2] sched/membarrier: Fix redundant load of membarrier_state
Posted by Segher Boessenkool 4 weeks ago
Hi!

On Tue, Oct 29, 2024 at 11:21:28AM +0530, Nysal Jan K.A. wrote:
> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> is not selected, sync_core_before_usermode() is a no-op.
> In membarrier_mm_sync_core_before_usermode() the compiler does not
> eliminate redundant branches and load of mm->membarrier_state
> for this case as the atomic_read() cannot be optimized away.
> 
> Here's a snippet of the code generated for finish_task_switch() on powerpc
> prior to this change:
> 
> 1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
> .......
> 1b78c8:   cmpdi   cr7,r26,0
> 1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
> 1b78d0:   ld      r9,2312(r13)    # current
> 1b78d4:   ld      r9,1888(r9)     # current->mm
> 1b78d8:   cmpd    cr7,r26,r9
> 1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
> 1b78e0:   hwsync
> 1b78e4:   cmplwi  cr7,r27,128
> .......
> 1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
> 1b7a74:   b       1b78e0 <finish_task_switch+0xcc>
> 
> This was found while analyzing "perf c2c" reports on kernels prior
> to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
> where mm_count was false sharing with membarrier_state.
> 
> There is a minor improvement in the size of finish_task_switch().
> The following are results from bloat-o-meter for ppc64le:
> 
> GCC 7.5.0
> ---------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch                           884     852     -32
> 
> GCC 12.2.1
> ----------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch.isra                      852     820     -32
> 
> LLVM 17.0.6
> -----------
> add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
> Function                                     old     new   delta
> rt_mutex_schedule                            120     104     -16
> finish_task_switch                           792     772     -20
> 
> Results on aarch64:
> 
> GCC 14.1.1
> ----------
> add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
> Function                                     old     new   delta
> get_nohz_timer_target                        352     356      +4
> e843419@0b02_0000d7e7_408                      8       -      -8
> e843419@01bb_000021d2_868                      8       -      -8
> finish_task_switch.isra                      592     548     -44
> 
> Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
> ---
> V1 -> V2:
> - Add results for aarch64
> - Add a comment describing the changes
> ---
>  include/linux/sched/mm.h | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 928a626725e6..b13474825130 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -531,6 +531,13 @@ enum {
>  
>  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>  {
> +	/*
> +	 * The atomic_read() below prevents CSE. The following should
> +	 * help the compiler generate more efficient code on architectures
> +	 * where sync_core_before_usermode() is a no-op.
> +	 */
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> +		return;
>  	if (current->mm != mm)
>  		return;
>  	if (likely(!(atomic_read(&mm->membarrier_state) &

I'd say "CSE and similar transformations", but yeah, in this case CSE.
The point is that any access to a volatile object is a necessary side-
effect, so it has to be performed on the actual machine just as on the
abstract machine (on all the same paths, and as often).  It might be
nice to have an atomic_read (for PowerPC) that can generate better
machine code.  Not a trivial task though!
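
A two-line userspace illustration of that point: both volatile loads below
have to be emitted, while the two plain loads are folded into one.

  int twice_volatile(const volatile int *p) { return *p + *p; }  /* two loads */
  int twice_plain(const int *p)             { return *p + *p; }  /* one load  */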

Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>


Segher
Re: [PATCH v2] sched/membarrier: Fix redundant load of membarrier_state
Posted by Michael Ellerman 4 weeks ago
"Nysal Jan K.A." <nysal@linux.ibm.com> writes:
> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> is not selected, sync_core_before_usermode() is a no-op.
> In membarrier_mm_sync_core_before_usermode() the compiler does not
> eliminate redundant branches and load of mm->membarrier_state
> for this case as the atomic_read() cannot be optimized away.
>
> Here's a snippet of the code generated for finish_task_switch() on powerpc
> prior to this change:
>
> 1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
> .......
> 1b78c8:   cmpdi   cr7,r26,0
> 1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
> 1b78d0:   ld      r9,2312(r13)    # current
> 1b78d4:   ld      r9,1888(r9)     # current->mm
> 1b78d8:   cmpd    cr7,r26,r9
> 1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
> 1b78e0:   hwsync
> 1b78e4:   cmplwi  cr7,r27,128
> .......
> 1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
> 1b7a74:   b       1b78e0 <finish_task_switch+0xcc>

Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>

cheers
Re: [PATCH v2] sched/membarrier: Fix redundant load of membarrier_state
Posted by Mathieu Desnoyers 4 weeks, 1 day ago
On 2024-10-29 01:51, Nysal Jan K.A. wrote:
> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> is not selected, sync_core_before_usermode() is a no-op.
> In membarrier_mm_sync_core_before_usermode() the compiler does not
> eliminate redundant branches and load of mm->membarrier_state
> for this case as the atomic_read() cannot be optimized away.
> 
> Here's a snippet of the code generated for finish_task_switch() on powerpc
> prior to this change:
> 
> 1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
> .......
> 1b78c8:   cmpdi   cr7,r26,0
> 1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
> 1b78d0:   ld      r9,2312(r13)    # current
> 1b78d4:   ld      r9,1888(r9)     # current->mm
> 1b78d8:   cmpd    cr7,r26,r9
> 1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
> 1b78e0:   hwsync
> 1b78e4:   cmplwi  cr7,r27,128
> .......
> 1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
> 1b7a74:   b       1b78e0 <finish_task_switch+0xcc>
> 
> This was found while analyzing "perf c2c" reports on kernels prior
> to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
> where mm_count was false sharing with membarrier_state.
> 
> There is a minor improvement in the size of finish_task_switch().
> The following are results from bloat-o-meter for ppc64le:
> 
> GCC 7.5.0
> ---------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch                           884     852     -32
> 
> GCC 12.2.1
> ----------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch.isra                      852     820     -32
> 
> LLVM 17.0.6
> -----------
> add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
> Function                                     old     new   delta
> rt_mutex_schedule                            120     104     -16
> finish_task_switch                           792     772     -20
> 
> Results on aarch64:
> 
> GCC 14.1.1
> ----------
> add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
> Function                                     old     new   delta
> get_nohz_timer_target                        352     356      +4
> e843419@0b02_0000d7e7_408                      8       -      -8
> e843419@01bb_000021d2_868                      8       -      -8
> finish_task_switch.isra                      592     548     -44
> 
> Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>

Thanks!

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> ---
> V1 -> V2:
> - Add results for aarch64
> - Add a comment describing the changes
> ---
>   include/linux/sched/mm.h | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 928a626725e6..b13474825130 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -531,6 +531,13 @@ enum {
>   
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>   {
> +	/*
> +	 * The atomic_read() below prevents CSE. The following should
> +	 * help the compiler generate more efficient code on architectures
> +	 * where sync_core_before_usermode() is a no-op.
> +	 */
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> +		return;
>   	if (current->mm != mm)
>   		return;
>   	if (likely(!(atomic_read(&mm->membarrier_state) &

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com