From: "Nysal Jan K.A" <nysal@linux.ibm.com>
On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
is not selected, sync_core_before_usermode() is a no-op.
In membarrier_mm_sync_core_before_usermode() the compiler does not
eliminate redundant branches and the load of mm->membarrier_state
for this case as the atomic_read() cannot be optimized away.
Here's a snippet of the code generated for finish_task_switch() on powerpc:
1b786c: ld r26,2624(r30) # mm = rq->prev_mm;
.......
1b78c8: cmpdi cr7,r26,0
1b78cc: beq cr7,1b78e4 <finish_task_switch+0xd0>
1b78d0: ld r9,2312(r13) # current
1b78d4: ld r9,1888(r9) # current->mm
1b78d8: cmpd cr7,r26,r9
1b78dc: beq cr7,1b7a70 <finish_task_switch+0x25c>
1b78e0: hwsync
1b78e4: cmplwi cr7,r27,128
.......
1b7a70: lwz r9,176(r26) # atomic_read(&mm->membarrier_state)
1b7a74: b 1b78e0 <finish_task_switch+0xcc>
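For reference, the effect can be reproduced outside the kernel with a
minimal standalone C sketch (illustrative only, not kernel code; the
function name and the 0x80 mask are made up). A volatile-qualified load
is a mandatory side effect, so the compiler has to keep both the load
and the branch guarding it, even though the value only feeds a no-op:

/* sketch.c - illustrative only, not kernel code */
static inline void sync_core_stub(void) { }	/* models the no-op sync_core_before_usermode() */

void demo(int *membarrier_state, void *cur_mm, void *prev_mm)
{
	if (cur_mm != prev_mm)
		return;
	/* The volatile cast models atomic_read(): this load is a required
	 * side effect, so it and the compare/branch around it survive
	 * optimization even though sync_core_stub() does nothing. */
	if (!(*(volatile int *)membarrier_state & 0x80))
		return;
	sync_core_stub();
}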
This was found while analyzing "perf c2c" reports on kernels prior
to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
where mm_count was false sharing with membarrier_state.
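For illustration, the false-sharing pattern looks roughly like the
following standalone sketch; this is not the real mm_struct layout and
the 64-byte cache line size is an assumption:

#include <stdalign.h>
#include <stdatomic.h>

/* A frequently written reference count sharing a cache line with a
 * mostly-read field forces readers of that field to keep re-fetching
 * the line; "perf c2c" reports this as HITM traffic. Giving the hot
 * counter its own line removes the interference. */
struct demo_mm {
	atomic_int membarrier_state;		/* mostly read at context switch */
	alignas(64) atomic_int mm_count;	/* hot, frequently updated */
};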
There is a minor improvement in the size of finish_task_switch().
The following are results from bloat-o-meter:
GCC 7.5.0:
----------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                  old     new   delta
finish_task_switch        884     852     -32
GCC 12.2.1:
-----------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                  old     new   delta
finish_task_switch.isra   852     820     -32
LLVM 17.0.6:
------------
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
Function                  old     new   delta
rt_mutex_schedule         120     104     -16
finish_task_switch        792     772     -20
Signed-off-by: Nysal Jan K.A <nysal@linux.ibm.com>
---
include/linux/sched/mm.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 07bb8d4181d7..042e60ab853a 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -540,6 +540,8 @@ enum {
 
 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 {
+       if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
+               return;
        if (current->mm != mm)
                return;
        if (likely(!(atomic_read(&mm->membarrier_state) &
--
2.35.3
[To += Mathieu]

"Nysal Jan K.A." <nysal@linux.ibm.com> writes:
> From: "Nysal Jan K.A" <nysal@linux.ibm.com>
>
> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> is not selected, sync_core_before_usermode() is a no-op.
> In membarrier_mm_sync_core_before_usermode() the compiler does not
> eliminate redundant branches and the load of mm->membarrier_state
> for this case as the atomic_read() cannot be optimized away.

I was wondering if this was caused by powerpc's arch_atomic_read() which
uses asm volatile.

But replacing arch_atomic_read() with READ_ONCE() makes no difference,
presumably because the compiler still can't see that the READ_ONCE() is
unnecessary (which is kind of by design).

> Here's a snippet of the code generated for finish_task_switch() on powerpc:
[...]
> This was found while analyzing "perf c2c" reports on kernels prior
> to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
> where mm_count was false sharing with membarrier_state.

So it was causing a noticeable performance blip? But isn't anymore?

> There is a minor improvement in the size of finish_task_switch().
> The following are results from bloat-o-meter:
>
> GCC 7.5.0:
> ----------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                  old     new   delta
> finish_task_switch        884     852     -32
>
> GCC 12.2.1:
> -----------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                  old     new   delta
> finish_task_switch.isra   852     820     -32

GCC 12 is a couple of years old, I assume GCC 14 behaves similarly?

> LLVM 17.0.6:
[...]
> @@ -540,6 +540,8 @@ enum {
>
>  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>  {
> +       if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> +               return;
>         if (current->mm != mm)
>                 return;
>         if (likely(!(atomic_read(&mm->membarrier_state) &

The other option would be to have a completely separate stub, eg:

#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
{
        if (current->mm != mm)
                return;
        if (likely(!(atomic_read(&mm->membarrier_state) &
                     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
                return;
        sync_core_before_usermode();
}
#else
static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { }
#endif

Not sure what folks prefer.

In either case I think it's probably worth a short comment explaining
why it's worth the trouble (ie. that the atomic_read() prevents the
compiler from doing DCE).

cheers
On Fri, Oct 25, 2024 at 11:29:38AM +1100, Michael Ellerman wrote:
> [To += Mathieu]
>
> "Nysal Jan K.A." <nysal@linux.ibm.com> writes:
> > From: "Nysal Jan K.A" <nysal@linux.ibm.com>
> >
> > On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> > is not selected, sync_core_before_usermode() is a no-op.
> > In membarrier_mm_sync_core_before_usermode() the compiler does not
> > eliminate redundant branches and the load of mm->membarrier_state
> > for this case as the atomic_read() cannot be optimized away.
>
> I was wondering if this was caused by powerpc's arch_atomic_read() which
> uses asm volatile.
>

Yes, that's my understanding as well.

> But replacing arch_atomic_read() with READ_ONCE() makes no difference,
> presumably because the compiler still can't see that the READ_ONCE() is
> unnecessary (which is kind of by design).
>

In READ_ONCE() we cast to a volatile pointer, so I think the compiler
cannot eliminate the code in that case.

> > Here's a snippet of the code generated for finish_task_switch() on powerpc:
[...]
> > This was found while analyzing "perf c2c" reports on kernels prior
> > to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
> > where mm_count was false sharing with membarrier_state.
>
> So it was causing a noticeable performance blip? But isn't anymore?
>

It was noticeable in that it showed up amongst the top entries in perf c2c
reports. There was similar false sharing with other fields that share the
cache line with mm_count, so the gains were minimal with just this patch.
c1753fd02a00 addresses these cases too.

> > There is a minor improvement in the size of finish_task_switch().
> > The following are results from bloat-o-meter:
[...]
> GCC 12 is a couple of years old, I assume GCC 14 behaves similarly?
>

I cross compiled for aarch64 with gcc 14.1.1 and see similar results:

add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
Function                       old     new   delta
get_nohz_timer_target          352     356      +4
e843419@0b02_0000d7e7_408        8       -      -8
e843419@01bb_000021d2_868        8       -      -8
finish_task_switch.isra        592     548     -44
Total: Before=31013792, After=31013736, chg -0.00%

> The other option would be to have a completely separate stub, eg:
[...]
> Not sure what folks prefer.
>
> In either case I think it's probably worth a short comment explaining
> why it's worth the trouble (ie. that the atomic_read() prevents the
> compiler from doing DCE).
>

I'll send a v2 with a comment added in there.
Thanks for the review.
--Nysal
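For context, the volatile cast referred to above is, in simplified form,
the following; the real kernel macro (include/asm-generic/rwonce.h) is
more elaborate, stripping qualifiers and rejecting non-scalar types, so
treat this as a sketch only:

/* Simplified sketch of what READ_ONCE() boils down to. The volatile-
 * qualified access is what stops the compiler from merging or
 * eliminating the load. */
#define READ_ONCE_SKETCH(x)	(*(const volatile typeof(x) *)&(x))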
On 2024-10-24 20:29, Michael Ellerman wrote:
> [To += Mathieu]
>
> "Nysal Jan K.A." <nysal@linux.ibm.com> writes:
>> From: "Nysal Jan K.A" <nysal@linux.ibm.com>
>>
>> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
>> is not selected, sync_core_before_usermode() is a no-op.
>> In membarrier_mm_sync_core_before_usermode() the compiler does not
>> eliminate redundant branches and the load of mm->membarrier_state
>> for this case as the atomic_read() cannot be optimized away.
>
> I was wondering if this was caused by powerpc's arch_atomic_read() which
> uses asm volatile.
>
> But replacing arch_atomic_read() with READ_ONCE() makes no difference,
> presumably because the compiler still can't see that the READ_ONCE() is
> unnecessary (which is kind of by design).
>
>> Here's a snippet of the code generated for finish_task_switch() on powerpc:
[...]
>> This was found while analyzing "perf c2c" reports on kernels prior
>> to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
>> where mm_count was false sharing with membarrier_state.
>
> So it was causing a noticeable performance blip? But isn't anymore?

I indeed moved mm_count into its own cacheline in response to performance
regression reports, which were caused by simply loading the pcpu_cid
pointer frequently enough. So if membarrier_state was also sharing that
cache line, it makes sense that moving mm_count away helped there as well.

[...]

>> @@ -540,6 +540,8 @@ enum {
>>
>>  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>>  {
>> +       if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
>> +               return;
>>         if (current->mm != mm)
>>                 return;
>>         if (likely(!(atomic_read(&mm->membarrier_state) &

I prefer the approach above, because it requires fewer kernel configurations
to reach the same compilation code coverage.

Thanks,

Mathieu

> The other option would be to have a completely separate stub, eg:
[...]
> In either case I think it's probably worth a short comment explaining
> why it's worth the trouble (ie. that the atomic_read() prevents the
> compiler from doing DCE).
>
> cheers

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Hi!

On Fri, Oct 25, 2024 at 11:29:38AM +1100, Michael Ellerman wrote:
> [To += Mathieu]
>
> "Nysal Jan K.A." <nysal@linux.ibm.com> writes:
> > From: "Nysal Jan K.A" <nysal@linux.ibm.com>
> >
> > On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> > is not selected, sync_core_before_usermode() is a no-op.
> > In membarrier_mm_sync_core_before_usermode() the compiler does not
> > eliminate redundant branches and the load of mm->membarrier_state
> > for this case as the atomic_read() cannot be optimized away.
>
> I was wondering if this was caused by powerpc's arch_atomic_read() which
> uses asm volatile.
>
> But replacing arch_atomic_read() with READ_ONCE() makes no difference,
> presumably because the compiler still can't see that the READ_ONCE() is
> unnecessary (which is kind of by design).

Exactly.

> > GCC 12.2.1:
> > -----------
> > add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> > Function                  old     new   delta
> > finish_task_switch.isra   852     820     -32
>
> GCC 12 is a couple of years old, I assume GCC 14 behaves similarly?

GCC 12 is still being actively developed. There will be a 12.5 probably
halfway next year (and that will be the last 12.x release, yes).

The GCC homepage (<https://gcc.gnu.org>) will tell you what releases are
still maintained/supported, and sometimes you can derive our planned plans
from there as well :-)

But yes, 14 is similar (I did not test, but I feel confident making that
assertion :-) )

> >  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
> >  {
> > +       if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> > +               return;
> >         if (current->mm != mm)
> >                 return;
> >         if (likely(!(atomic_read(&mm->membarrier_state) &
>
> The other option would be to have a completely separate stub, eg:
[...]
> Not sure what folks prefer.
>
> In either case I think it's probably worth a short comment explaining
> why it's worth the trouble (ie. that the atomic_read() prevents the
> compiler from doing DCE).

Since you ask, I like the proposed change (the inline one) best. But yeah,
comment please!

(And it is not about DCE -- just the definition of __READ_ONCE makes it
directly impossible to CSE any expressions with this, it (standards-
violatingly) casts the pointers to pointers to volatile, and you cannot
CSE any accesses to volatile objects!)

So what are the actual semantics the kernel wants from its READ_ONCE, and
from its atomics in general? GCC has perfectly fine in-compiler support
for such things, there is no need for making a second rate manual
implementation of parts of this, when you can use a good implementation
of everything instead!


Segher
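A standalone example of the CSE point made above (sketch only; the
function names are made up): two plain loads of the same location may be
merged into one, but volatile accesses are side effects of the abstract
machine and must all be performed.

/* cse_demo.c - sketch only */
int sum_plain(const int *p)
{
	return *p + *p;			/* compiler may emit a single load (CSE) */
}

int sum_volatile(const int *p)
{
	const volatile int *vp = p;	/* volatile-qualified view of the same object */
	return *vp + *vp;		/* both loads must be performed */
}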
Hi Michael,

On Fri, 25 Oct 2024 11:29:38 +1100 Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> > @@ -540,6 +540,8 @@ enum {
> >
> >  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
> >  {
> > +       if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> > +               return;
> >         if (current->mm != mm)
> >                 return;
> >         if (likely(!(atomic_read(&mm->membarrier_state) &
>
> The other option would be to have a completely separate stub, eg:
[...]
> Not sure what folks prefer.

In case it matters, in general I prefer the first as there is less code to
get out of sync and all the code still gets parsed by the compiler whether
the CONFIG option is set or not.

-- 
Cheers,
Stephen Rothwell
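A small sketch of the coverage argument (not kernel code; the macro here
is a stand-in for the real IS_ENABLED(), which expands to a constant 0
or 1 depending on the config option):

#define IS_ENABLED_SKETCH(opt)	(opt)

void example(int *val)
{
	if (!IS_ENABLED_SKETCH(0))	/* option disabled in this build */
		return;
	/* Everything below is still parsed and type-checked, then discarded
	 * as dead code by the optimizer. With an #ifdef-ed stub, an error
	 * here would only be caught on configs that enable the option. */
	*val = 42;
}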
On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
is not selected, sync_core_before_usermode() is a no-op.
In membarrier_mm_sync_core_before_usermode() the compiler does not
eliminate redundant branches and load of mm->membarrier_state
for this case as the atomic_read() cannot be optimized away.
Here's a snippet of the code generated for finish_task_switch() on powerpc
prior to this change:
1b786c: ld r26,2624(r30) # mm = rq->prev_mm;
.......
1b78c8: cmpdi cr7,r26,0
1b78cc: beq cr7,1b78e4 <finish_task_switch+0xd0>
1b78d0: ld r9,2312(r13) # current
1b78d4: ld r9,1888(r9) # current->mm
1b78d8: cmpd cr7,r26,r9
1b78dc: beq cr7,1b7a70 <finish_task_switch+0x25c>
1b78e0: hwsync
1b78e4: cmplwi cr7,r27,128
.......
1b7a70: lwz r9,176(r26) # atomic_read(&mm->membarrier_state)
1b7a74: b 1b78e0 <finish_task_switch+0xcc>
This was found while analyzing "perf c2c" reports on kernels prior
to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
where mm_count was false sharing with membarrier_state.
There is a minor improvement in the size of finish_task_switch().
The following are results from bloat-o-meter for ppc64le:
GCC 7.5.0
---------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                  old     new   delta
finish_task_switch        884     852     -32
GCC 12.2.1
----------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                  old     new   delta
finish_task_switch.isra   852     820     -32
LLVM 17.0.6
-----------
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
Function                  old     new   delta
rt_mutex_schedule         120     104     -16
finish_task_switch        792     772     -20
Results on aarch64:
GCC 14.1.1
----------
add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
Function                       old     new   delta
get_nohz_timer_target          352     356      +4
e843419@0b02_0000d7e7_408        8       -      -8
e843419@01bb_000021d2_868        8       -      -8
finish_task_switch.isra        592     548     -44
Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
---
V1 -> V2:
- Add results for aarch64
- Add a comment describing the changes
---
include/linux/sched/mm.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 928a626725e6..b13474825130 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -531,6 +531,13 @@ enum {
 
 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 {
+       /*
+        * The atomic_read() below prevents CSE. The following should
+        * help the compiler generate more efficient code on architectures
+        * where sync_core_before_usermode() is a no-op.
+        */
+       if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
+               return;
        if (current->mm != mm)
                return;
        if (likely(!(atomic_read(&mm->membarrier_state) &
--
2.47.0
I do not see this patch staged in any tree (e.g. linux-next). Is this on
its way to be merged?

Thanks!

On Tue 29-10-24 11:21:28, Nysal Jan K.A. wrote:
> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> is not selected, sync_core_before_usermode() is a no-op.
> In membarrier_mm_sync_core_before_usermode() the compiler does not
> eliminate redundant branches and load of mm->membarrier_state
> for this case as the atomic_read() cannot be optimized away.
[...]
> Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>

-- 
Michal Hocko
SUSE Labs
On Mon, Nov 18, 2024 at 10:04:18AM +0100, Michal Hocko wrote:
> I do not see this patch staged in any tree (e.g. linux-next). Is this on
> its way to be merged?

I only now found it -- it doesn't look super urgent. I'll get it into a
git tree after -rc1.
On Mon 18-11-24 10:25:17, Peter Zijlstra wrote:
> On Mon, Nov 18, 2024 at 10:04:18AM +0100, Michal Hocko wrote:
> > I do not see this patch staged in any tree (e.g. linux-next). Is this on
> > its way to be merged?
>
> I only now found it -- it doesn't look super urgent. I'll get it into a
> git tree after -rc1.

Thanks!

-- 
Michal Hocko
SUSE Labs
Hi!

On Tue, Oct 29, 2024 at 11:21:28AM +0530, Nysal Jan K.A. wrote:
> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> is not selected, sync_core_before_usermode() is a no-op.
> In membarrier_mm_sync_core_before_usermode() the compiler does not
> eliminate redundant branches and load of mm->membarrier_state
> for this case as the atomic_read() cannot be optimized away.
[...]
>  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>  {
> +       /*
> +        * The atomic_read() below prevents CSE. The following should
> +        * help the compiler generate more efficient code on architectures
> +        * where sync_core_before_usermode() is a no-op.
> +        */
> +       if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> +               return;
>         if (current->mm != mm)
>                 return;
>         if (likely(!(atomic_read(&mm->membarrier_state) &

I'd say "CSE and similar transformations", but yeah, in this case CSE.
The point is that any access to a volatile object is a necessary side-
effect, so it has to be performed on the actual machine just as on the
abstract machine (on all the same paths, and as often).

It might be nice to have an atomic_read (for PowerPC) that can generate
better machine code. Not a trivial task though!

Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>


Segher
"Nysal Jan K.A." <nysal@linux.ibm.com> writes: > On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE > is not selected, sync_core_before_usermode() is a no-op. > In membarrier_mm_sync_core_before_usermode() the compiler does not > eliminate redundant branches and load of mm->membarrier_state > for this case as the atomic_read() cannot be optimized away. > > Here's a snippet of the code generated for finish_task_switch() on powerpc > prior to this change: > > 1b786c: ld r26,2624(r30) # mm = rq->prev_mm; > ....... > 1b78c8: cmpdi cr7,r26,0 > 1b78cc: beq cr7,1b78e4 <finish_task_switch+0xd0> > 1b78d0: ld r9,2312(r13) # current > 1b78d4: ld r9,1888(r9) # current->mm > 1b78d8: cmpd cr7,r26,r9 > 1b78dc: beq cr7,1b7a70 <finish_task_switch+0x25c> > 1b78e0: hwsync > 1b78e4: cmplwi cr7,r27,128 > ....... > 1b7a70: lwz r9,176(r26) # atomic_read(&mm->membarrier_state) > 1b7a74: b 1b78e0 <finish_task_switch+0xcc> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> cheers
On 2024-10-29 01:51, Nysal Jan K.A. wrote:
> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> is not selected, sync_core_before_usermode() is a no-op.
> In membarrier_mm_sync_core_before_usermode() the compiler does not
> eliminate redundant branches and load of mm->membarrier_state
> for this case as the atomic_read() cannot be optimized away.
[...]
> Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>

Thanks!

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

[...]

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com