[PATCH v3 0/9] s390: Improve this_cpu operations

Heiko Carstens posted 9 patches 4 days, 14 hours ago
There is a newer version of this series
arch/s390/boot/alternative.c         |   7 +
arch/s390/include/asm/alternative.h  |   5 +
arch/s390/include/asm/entry-percpu.h |  76 ++++++++
arch/s390/include/asm/lowcore.h      |   3 +-
arch/s390/include/asm/percpu.h       | 249 +++++++++++++++++++++------
arch/s390/include/asm/ptrace.h       |   2 +
arch/s390/kernel/alternative.c       |  25 ++-
arch/s390/kernel/irq.c               |  26 ++-
arch/s390/kernel/nmi.c               |   6 +
arch/s390/kernel/traps.c             |   6 +
10 files changed, 344 insertions(+), 61 deletions(-)
create mode 100644 arch/s390/include/asm/entry-percpu.h
[PATCH v3 0/9] s390: Improve this_cpu operations
Posted by Heiko Carstens 4 days, 14 hours ago
v3:
- Fix various typos [Juergen Christ]

- Add missing kprobe detection / handling [Sashiko [3]]
  [FWIW, this made me also aware of that the current general s390 kprobes
   code seems to be racy against concurrent removal of a kprobe while a
   probe hit on a different CPU. But that is a different story.]

- Fix various minor findings [Sashiko [3]]

- All of this might be dropped / exchanged in future in favor of the percpu
  page table approach proposed by Yang Shi [4].

[3] https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca@linux.ibm.com
[4] https://lore.kernel.org/all/20260429170758.3018959-1-yang@os.amperecomputing.com/

v2:

- Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
  warnings

- Add missing __packed attribute to insn structure [Sashiko [2]]

- Fix inverted if condition [Sashiko [2]]

- Add missing user_mode() check [Sashiko [2]]

- Move percpu_entry() call in front of irqentry_enter() call in all
  entry paths to avoid that potential this_cpu() operations overwrite
  the not-yet saved percpu code section indicator  [Sashiko [2]]

[2] https://sashiko.dev/#/patchset/20260317195436.2276810-1-hca%40linux.ibm.com

v1:

This is a follow-up to Peter Zijlstra's in-kernel rseq RFC [1].

With the intended removal of PREEMPT_NONE this_cpu operations based on
atomic instructions, guarded with preempt_disable()/preempt_enable() pairs,
become more expensive: the preempt_disable() / preempt_enable() pairs are
not optimized away anymore during compile time.

In particular the conditional call to preempt_schedule_notrace() after
preempt_enable() adds additional code and register pressure.

To avoid this Peter suggested an in-kernel rseq approach. While this would
certainly work, this series tries to come up with a solution which uses
less instructions and doesn't require to repeat instruction sequences.

The idea is that this_cpu operations based on atomic instructions are
guarded with mvyi instructions:

- The first mvyi instruction writes the register number, which contains
  the percpu address variable to lowcore. This also indicates that a
  percpu code section is executed.

- The first instruction following the mvyi instruction must be the ag
  instruction which adds the percpu offset to the percpu address register.

- Afterwards the atomic percpu operation follows.

- Then a second mvyi instruction writes a zero to lowcore, which indicates
  the end of the percpu code section.

- In case of an interrupt/exception/nmi the register number which was
  written to lowcore is copied to the exception frame (pt_regs), and a zero
  is written to lowcore.

- On return to the previous context it is checked if a percpu code section
  was executed (saved register number not zero), and if the process was
  migrated to a different cpu. If the percpu offset was already added to
  the percpu address register (instruction address does _not_ point to the
  ag instruction) the content of the percpu address register is adjusted so
  it points to percpu variable of the new cpu.

All of this seems to work, but of course it could still be broken since I
missed some detail.

In total this series results in a kernel text size reduction of ~106kb. The
number of preempt_schedule_notrace() call sites is reduced from 7089 to
1577.

Note: this comes without any huge performance analysis, however all
microbenchmarks confirmed that the new code is at least as fast as the
old code, like expected.

[1] 20260223163843.GR1282955@noisy.programming.kicks-ass.net

Heiko Carstens (9):
  s390/alternatives: Add new ALT_TYPE_PERCPU type
  s390/percpu: Infrastructure for more efficient this_cpu operations
  s390/percpu: Add missing do { } while (0) constructs
  s390/percpu: Use new percpu code section for arch_this_cpu_add()
  s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
  s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
  s390/percpu: Provide arch_this_cpu_read() implementation
  s390/percpu: Provide arch_this_cpu_write() implementation
  s390/percpu: Remove one and two byte this_cpu operation implementation

 arch/s390/boot/alternative.c         |   7 +
 arch/s390/include/asm/alternative.h  |   5 +
 arch/s390/include/asm/entry-percpu.h |  76 ++++++++
 arch/s390/include/asm/lowcore.h      |   3 +-
 arch/s390/include/asm/percpu.h       | 249 +++++++++++++++++++++------
 arch/s390/include/asm/ptrace.h       |   2 +
 arch/s390/kernel/alternative.c       |  25 ++-
 arch/s390/kernel/irq.c               |  26 ++-
 arch/s390/kernel/nmi.c               |   6 +
 arch/s390/kernel/traps.c             |   6 +
 10 files changed, 344 insertions(+), 61 deletions(-)
 create mode 100644 arch/s390/include/asm/entry-percpu.h

base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
-- 
2.51.0
Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Posted by Yang Shi 4 days, 5 hours ago
Hi Heiko,

Thanks for cc'ing me the patchset. Please see the below inline comments.


On 5/20/26 2:22 AM, Heiko Carstens wrote:
> v3:
> - Fix various typos [Juergen Christ]
>
> - Add missing kprobe detection / handling [Sashiko [3]]
>    [FWIW, this made me also aware of that the current general s390 kprobes
>     code seems to be racy against concurrent removal of a kprobe while a
>     probe hit on a different CPU. But that is a different story.]
>
> - Fix various minor findings [Sashiko [3]]
>
> - All of this might be dropped / exchanged in future in favor of the percpu
>    page table approach proposed by Yang Shi [4].

Thanks for mentioning my approach. I will do some comparison with rseq 
in the following design details section of the cover letter.

>
> [3] https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca@linux.ibm.com
> [4] https://lore.kernel.org/all/20260429170758.3018959-1-yang@os.amperecomputing.com/
>
> v2:
>
> - Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
>    warnings
>
> - Add missing __packed attribute to insn structure [Sashiko [2]]
>
> - Fix inverted if condition [Sashiko [2]]
>
> - Add missing user_mode() check [Sashiko [2]]
>
> - Move percpu_entry() call in front of irqentry_enter() call in all
>    entry paths to avoid that potential this_cpu() operations overwrite
>    the not-yet saved percpu code section indicator  [Sashiko [2]]
>
> [2] https://sashiko.dev/#/patchset/20260317195436.2276810-1-hca%40linux.ibm.com
>
> v1:
>
> This is a follow-up to Peter Zijlstra's in-kernel rseq RFC [1].
>
> With the intended removal of PREEMPT_NONE this_cpu operations based on
> atomic instructions, guarded with preempt_disable()/preempt_enable() pairs,
> become more expensive: the preempt_disable() / preempt_enable() pairs are
> not optimized away anymore during compile time.
>
> In particular the conditional call to preempt_schedule_notrace() after
> preempt_enable() adds additional code and register pressure.
>
> To avoid this Peter suggested an in-kernel rseq approach. While this would
> certainly work, this series tries to come up with a solution which uses
> less instructions and doesn't require to repeat instruction sequences.
>
> The idea is that this_cpu operations based on atomic instructions are
> guarded with mvyi instructions:
>
> - The first mvyi instruction writes the register number, which contains
>    the percpu address variable to lowcore. This also indicates that a
>    percpu code section is executed.
>
> - The first instruction following the mvyi instruction must be the ag
>    instruction which adds the percpu offset to the percpu address register.
>
> - Afterwards the atomic percpu operation follows.
>
> - Then a second mvyi instruction writes a zero to lowcore, which indicates
>    the end of the percpu code section.
>
> - In case of an interrupt/exception/nmi the register number which was
>    written to lowcore is copied to the exception frame (pt_regs), and a zero
>    is written to lowcore.
>
> - On return to the previous context it is checked if a percpu code section
>    was executed (saved register number not zero), and if the process was
>    migrated to a different cpu. If the percpu offset was already added to
>    the percpu address register (instruction address does _not_ point to the
>    ag instruction) the content of the percpu address register is adjusted so
>    it points to percpu variable of the new cpu.

If I understand correctly, you replaced preempt_disable() and 
preempt_enable() with seq begin and seg end, and seq begin and seq end 
can be optimized by mvyi instruction on S390. So you just need a single 
mvyi instruction for each instead of read-modify-write the seq count.

But you need some extra overhead for context switch (save and restore 
the seq count register) and need to check whether it is still on the 
same cpu once resuming execution. And there is also penalty if it is 
migrated to another CPU (need to rerun this_cpu ops).

So it seems have more overhead than the percpu page table approach IIUC. 
We don't need all the steps with percpu page table. And there is no 
penalty for migration.

>
> All of this seems to work, but of course it could still be broken since I
> missed some detail.
>
> In total this series results in a kernel text size reduction of ~106kb. The
> number of preempt_schedule_notrace() call sites is reduced from 7089 to
> 1577.

Yeah, both approaches can reduce the number of 
preempt_schedule_notrace() call sites. And both approaches can reduce 
the number of non-preemptible critical sections.

>
> Note: this comes without any huge performance analysis, however all
> microbenchmarks confirmed that the new code is at least as fast as the
> old code, like expected.

I'm really interested in the benchmark number. I'm supposed percpu page 
table approach should have better performance per my above analysis.

Christopher Lameter is also interested in it, cc'ed him too.

Thanks,
Yang

>
> [1] 20260223163843.GR1282955@noisy.programming.kicks-ass.net
>
> Heiko Carstens (9):
>    s390/alternatives: Add new ALT_TYPE_PERCPU type
>    s390/percpu: Infrastructure for more efficient this_cpu operations
>    s390/percpu: Add missing do { } while (0) constructs
>    s390/percpu: Use new percpu code section for arch_this_cpu_add()
>    s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
>    s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
>    s390/percpu: Provide arch_this_cpu_read() implementation
>    s390/percpu: Provide arch_this_cpu_write() implementation
>    s390/percpu: Remove one and two byte this_cpu operation implementation
>
>   arch/s390/boot/alternative.c         |   7 +
>   arch/s390/include/asm/alternative.h  |   5 +
>   arch/s390/include/asm/entry-percpu.h |  76 ++++++++
>   arch/s390/include/asm/lowcore.h      |   3 +-
>   arch/s390/include/asm/percpu.h       | 249 +++++++++++++++++++++------
>   arch/s390/include/asm/ptrace.h       |   2 +
>   arch/s390/kernel/alternative.c       |  25 ++-
>   arch/s390/kernel/irq.c               |  26 ++-
>   arch/s390/kernel/nmi.c               |   6 +
>   arch/s390/kernel/traps.c             |   6 +
>   10 files changed, 344 insertions(+), 61 deletions(-)
>   create mode 100644 arch/s390/include/asm/entry-percpu.h
>
> base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Posted by David Laight 4 days, 1 hour ago
On Wed, 20 May 2026 11:42:36 -0700
Yang Shi <yang@os.amperecomputing.com> wrote:

> Hi Heiko,
> 
> Thanks for cc'ing me the patchset. Please see the below inline comments.
> 
> 
> On 5/20/26 2:22 AM, Heiko Carstens wrote:
> > v3:
> > - Fix various typos [Juergen Christ]
> >
> > - Add missing kprobe detection / handling [Sashiko [3]]
> >    [FWIW, this made me also aware of that the current general s390 kprobes
> >     code seems to be racy against concurrent removal of a kprobe while a
> >     probe hit on a different CPU. But that is a different story.]
> >
> > - Fix various minor findings [Sashiko [3]]
> >
> > - All of this might be dropped / exchanged in future in favor of the percpu
> >    page table approach proposed by Yang Shi [4].  
> 
> Thanks for mentioning my approach. I will do some comparison with rseq 
> in the following design details section of the cover letter.
> 
> >
> > [3] https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca@linux.ibm.com
> > [4] https://lore.kernel.org/all/20260429170758.3018959-1-yang@os.amperecomputing.com/
> >
> > v2:
> >
> > - Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
> >    warnings
> >
> > - Add missing __packed attribute to insn structure [Sashiko [2]]
> >
> > - Fix inverted if condition [Sashiko [2]]
> >
> > - Add missing user_mode() check [Sashiko [2]]
> >
> > - Move percpu_entry() call in front of irqentry_enter() call in all
> >    entry paths to avoid that potential this_cpu() operations overwrite
> >    the not-yet saved percpu code section indicator  [Sashiko [2]]
> >
> > [2] https://sashiko.dev/#/patchset/20260317195436.2276810-1-hca%40linux.ibm.com
> >
> > v1:
> >
> > This is a follow-up to Peter Zijlstra's in-kernel rseq RFC [1].
> >
> > With the intended removal of PREEMPT_NONE this_cpu operations based on
> > atomic instructions, guarded with preempt_disable()/preempt_enable() pairs,
> > become more expensive: the preempt_disable() / preempt_enable() pairs are
> > not optimized away anymore during compile time.
> >
> > In particular the conditional call to preempt_schedule_notrace() after
> > preempt_enable() adds additional code and register pressure.
> >
> > To avoid this Peter suggested an in-kernel rseq approach. While this would
> > certainly work, this series tries to come up with a solution which uses
> > less instructions and doesn't require to repeat instruction sequences.
> >
> > The idea is that this_cpu operations based on atomic instructions are
> > guarded with mvyi instructions:
> >
> > - The first mvyi instruction writes the register number, which contains
> >    the percpu address variable to lowcore. This also indicates that a
> >    percpu code section is executed.
> >
> > - The first instruction following the mvyi instruction must be the ag
> >    instruction which adds the percpu offset to the percpu address register.
> >
> > - Afterwards the atomic percpu operation follows.
> >
> > - Then a second mvyi instruction writes a zero to lowcore, which indicates
> >    the end of the percpu code section.
> >
> > - In case of an interrupt/exception/nmi the register number which was
> >    written to lowcore is copied to the exception frame (pt_regs), and a zero
> >    is written to lowcore.
> >
> > - On return to the previous context it is checked if a percpu code section
> >    was executed (saved register number not zero), and if the process was
> >    migrated to a different cpu. If the percpu offset was already added to
> >    the percpu address register (instruction address does _not_ point to the
> >    ag instruction) the content of the percpu address register is adjusted so
> >    it points to percpu variable of the new cpu.  
> 
> If I understand correctly, you replaced preempt_disable() and 
> preempt_enable() with seq begin and seg end, and seq begin and seq end 
> can be optimized by mvyi instruction on S390. So you just need a single 
> mvyi instruction for each instead of read-modify-write the seq count.
> 
> But you need some extra overhead for context switch (save and restore 
> the seq count register) and need to check whether it is still on the 
> same cpu once resuming execution. And there is also penalty if it is 
> migrated to another CPU (need to rerun this_cpu ops).

Not as I understand it.
What happens is the context switch code 'corrupts' the register being
used to access per-cpu data so that it is correct for the new cpu.
The write of zero after the sequence is there to stop the register
being corrupted outside of this code window.

This really just means that you can (mostly) only do single accesses,
since nothing stops pre-emption between the RW or an RMW sequence.
Although you can probably do an increment of the preempt disable count
because if you are preempted the value read will be zero.

> 
> So it seems have more overhead than the percpu page table approach IIUC. 
> We don't need all the steps with percpu page table. And there is no 
> penalty for migration.

This code looks like it relies on 'page zero' already being percpu.
So it probably isn't really that different.
Some values like the 'preemption disable count' and 'current' could be
(maybe are?) written into page zero to give fast access.

But I'm sure I remember that some cpu don't like having the same
physical address at different virtual addresses (and not just those
with VIVT caches like some sparc cpu).
I'm sure code can end up accessing the current cpu's percpu data
using the same address that other cpu use - there are definitely
places where it needs that address.
On x86-64 that means it reading the address from the array rather
than just offsetting from %gs.

-- David

> 
> >
> > All of this seems to work, but of course it could still be broken since I
> > missed some detail.
> >
> > In total this series results in a kernel text size reduction of ~106kb. The
> > number of preempt_schedule_notrace() call sites is reduced from 7089 to
> > 1577.  
> 
> Yeah, both approaches can reduce the number of 
> preempt_schedule_notrace() call sites. And both approaches can reduce 
> the number of non-preemptible critical sections.
> 
> >
> > Note: this comes without any huge performance analysis, however all
> > microbenchmarks confirmed that the new code is at least as fast as the
> > old code, like expected.  
> 
> I'm really interested in the benchmark number. I'm supposed percpu page 
> table approach should have better performance per my above analysis.
> 
> Christopher Lameter is also interested in it, cc'ed him too.
> 
> Thanks,
> Yang
> 
> >
> > [1] 20260223163843.GR1282955@noisy.programming.kicks-ass.net
> >
> > Heiko Carstens (9):
> >    s390/alternatives: Add new ALT_TYPE_PERCPU type
> >    s390/percpu: Infrastructure for more efficient this_cpu operations
> >    s390/percpu: Add missing do { } while (0) constructs
> >    s390/percpu: Use new percpu code section for arch_this_cpu_add()
> >    s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
> >    s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
> >    s390/percpu: Provide arch_this_cpu_read() implementation
> >    s390/percpu: Provide arch_this_cpu_write() implementation
> >    s390/percpu: Remove one and two byte this_cpu operation implementation
> >
> >   arch/s390/boot/alternative.c         |   7 +
> >   arch/s390/include/asm/alternative.h  |   5 +
> >   arch/s390/include/asm/entry-percpu.h |  76 ++++++++
> >   arch/s390/include/asm/lowcore.h      |   3 +-
> >   arch/s390/include/asm/percpu.h       | 249 +++++++++++++++++++++------
> >   arch/s390/include/asm/ptrace.h       |   2 +
> >   arch/s390/kernel/alternative.c       |  25 ++-
> >   arch/s390/kernel/irq.c               |  26 ++-
> >   arch/s390/kernel/nmi.c               |   6 +
> >   arch/s390/kernel/traps.c             |   6 +
> >   10 files changed, 344 insertions(+), 61 deletions(-)
> >   create mode 100644 arch/s390/include/asm/entry-percpu.h
> >
> > base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8  
> 
>
Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Posted by Yang Shi 3 days, 23 hours ago

On 5/20/26 3:34 PM, David Laight wrote:
> On Wed, 20 May 2026 11:42:36 -0700
> Yang Shi <yang@os.amperecomputing.com> wrote:
>
>> Hi Heiko,
>>
>> Thanks for cc'ing me the patchset. Please see the below inline comments.
>>
>>
>> On 5/20/26 2:22 AM, Heiko Carstens wrote:
>>> v3:
>>> - Fix various typos [Juergen Christ]
>>>
>>> - Add missing kprobe detection / handling [Sashiko [3]]
>>>     [FWIW, this made me also aware of that the current general s390 kprobes
>>>      code seems to be racy against concurrent removal of a kprobe while a
>>>      probe hit on a different CPU. But that is a different story.]
>>>
>>> - Fix various minor findings [Sashiko [3]]
>>>
>>> - All of this might be dropped / exchanged in future in favor of the percpu
>>>     page table approach proposed by Yang Shi [4].
>> Thanks for mentioning my approach. I will do some comparison with rseq
>> in the following design details section of the cover letter.
>>
>>> [3] https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca@linux.ibm.com
>>> [4] https://lore.kernel.org/all/20260429170758.3018959-1-yang@os.amperecomputing.com/
>>>
>>> v2:
>>>
>>> - Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
>>>     warnings
>>>
>>> - Add missing __packed attribute to insn structure [Sashiko [2]]
>>>
>>> - Fix inverted if condition [Sashiko [2]]
>>>
>>> - Add missing user_mode() check [Sashiko [2]]
>>>
>>> - Move percpu_entry() call in front of irqentry_enter() call in all
>>>     entry paths to avoid that potential this_cpu() operations overwrite
>>>     the not-yet saved percpu code section indicator  [Sashiko [2]]
>>>
>>> [2] https://sashiko.dev/#/patchset/20260317195436.2276810-1-hca%40linux.ibm.com
>>>
>>> v1:
>>>
>>> This is a follow-up to Peter Zijlstra's in-kernel rseq RFC [1].
>>>
>>> With the intended removal of PREEMPT_NONE this_cpu operations based on
>>> atomic instructions, guarded with preempt_disable()/preempt_enable() pairs,
>>> become more expensive: the preempt_disable() / preempt_enable() pairs are
>>> not optimized away anymore during compile time.
>>>
>>> In particular the conditional call to preempt_schedule_notrace() after
>>> preempt_enable() adds additional code and register pressure.
>>>
>>> To avoid this Peter suggested an in-kernel rseq approach. While this would
>>> certainly work, this series tries to come up with a solution which uses
>>> less instructions and doesn't require to repeat instruction sequences.
>>>
>>> The idea is that this_cpu operations based on atomic instructions are
>>> guarded with mvyi instructions:
>>>
>>> - The first mvyi instruction writes the register number, which contains
>>>     the percpu address variable to lowcore. This also indicates that a
>>>     percpu code section is executed.
>>>
>>> - The first instruction following the mvyi instruction must be the ag
>>>     instruction which adds the percpu offset to the percpu address register.
>>>
>>> - Afterwards the atomic percpu operation follows.
>>>
>>> - Then a second mvyi instruction writes a zero to lowcore, which indicates
>>>     the end of the percpu code section.
>>>
>>> - In case of an interrupt/exception/nmi the register number which was
>>>     written to lowcore is copied to the exception frame (pt_regs), and a zero
>>>     is written to lowcore.
>>>
>>> - On return to the previous context it is checked if a percpu code section
>>>     was executed (saved register number not zero), and if the process was
>>>     migrated to a different cpu. If the percpu offset was already added to
>>>     the percpu address register (instruction address does _not_ point to the
>>>     ag instruction) the content of the percpu address register is adjusted so
>>>     it points to percpu variable of the new cpu.
>> If I understand correctly, you replaced preempt_disable() and
>> preempt_enable() with seq begin and seg end, and seq begin and seq end
>> can be optimized by mvyi instruction on S390. So you just need a single
>> mvyi instruction for each instead of read-modify-write the seq count.
>>
>> But you need some extra overhead for context switch (save and restore
>> the seq count register) and need to check whether it is still on the
>> same cpu once resuming execution. And there is also penalty if it is
>> migrated to another CPU (need to rerun this_cpu ops).
> Not as I understand it.
> What happens is the context switch code 'corrupts' the register being
> used to access per-cpu data so that it is correct for the new cpu.
> The write of zero after the sequence is there to stop the register
> being corrupted outside of this code window.

Thanks for elaborating it. I misunderstood some nuance. I read the patch 
#2 commit message, now I think I understand how it works.

Borrowed the disassemble from patch #2 commit message:

   11a8e6:       c0 30 00 d0 c5 0d       larl    %r3,1b33300
   11a8ec:       b9 04 00 43             lgr     %r4,%r3
   11a8f0:       eb 00 43 c0 00 52       mviy    960,4
   11a8f6:       e3 40 03 b8 00 08       ag      %r4,952
   11a8fc:       eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
   11a902:       eb 00 03 c0 00 52       mviy    960,0
   11a908:       b9 08 00 25             agr     %r2,%r5
   11a90c        07 fe                   br      %r14

11a8f0 loads the percpu offset and mark the percpu code section begin, I 
believe this is needed with percpu page table too because we need load 
local percpu offset.
11a920 loads 0 to the register to mark the percpu code section end, this 
is not needed with percpu page table.

And you need to save the register at the irq/exception entry, then 
restore it at exit. But you also need to check whether migration happens 
or not, if it happens kernel needs to rewrite the register with correct 
percpu offset and needs to check whether the interrupted instruction is 
"ag". If it is "ag" instruction (11a8f6) , kernel needs to recalculate 
the percpu address, right?

It sounds a little bit hacky to me TBH and incur some extra overhead for 
"migration detection" and fixup.

>
> This really just means that you can (mostly) only do single accesses,
> since nothing stops pre-emption between the RW or an RMW sequence.
> Although you can probably do an increment of the preempt disable count
> because if you are preempted the value read will be zero.
>
>> So it seems have more overhead than the percpu page table approach IIUC.
>> We don't need all the steps with percpu page table. And there is no
>> penalty for migration.
> This code looks like it relies on 'page zero' already being percpu.
> So it probably isn't really that different.
> Some values like the 'preemption disable count' and 'current' could be
> (maybe are?) written into page zero to give fast access.

I don't quite get what you mean about 'page zero'.

>
> But I'm sure I remember that some cpu don't like having the same
> physical address at different virtual addresses (and not just those
> with VIVT caches like some sparc cpu).

Yeah, VIVT cache doesn't like it due to cache alias. But the mapping is 
really percpu, so the mapping to the physical address belonging to 
another CPU should never pollute the current CPU's cache if I don't miss 
something.

> I'm sure code can end up accessing the current cpu's percpu data
> using the same address that other cpu use - there are definitely
> places where it needs that address.

No, it is not. In the percpu page table approach, we use different 
address for this_cpu_*() and per_cpu_ptr() which is mainly used to 
initialize percpu data for all CPUs.

Thanks,
Yang

> On x86-64 that means it reading the address from the array rather
> than just offsetting from %gs.
>
> -- David
>
>>> All of this seems to work, but of course it could still be broken since I
>>> missed some detail.
>>>
>>> In total this series results in a kernel text size reduction of ~106kb. The
>>> number of preempt_schedule_notrace() call sites is reduced from 7089 to
>>> 1577.
>> Yeah, both approaches can reduce the number of
>> preempt_schedule_notrace() call sites. And both approaches can reduce
>> the number of non-preemptible critical sections.
>>
>>> Note: this comes without any huge performance analysis, however all
>>> microbenchmarks confirmed that the new code is at least as fast as the
>>> old code, like expected.
>> I'm really interested in the benchmark number. I'm supposed percpu page
>> table approach should have better performance per my above analysis.
>>
>> Christopher Lameter is also interested in it, cc'ed him too.
>>
>> Thanks,
>> Yang
>>
>>> [1] 20260223163843.GR1282955@noisy.programming.kicks-ass.net
>>>
>>> Heiko Carstens (9):
>>>     s390/alternatives: Add new ALT_TYPE_PERCPU type
>>>     s390/percpu: Infrastructure for more efficient this_cpu operations
>>>     s390/percpu: Add missing do { } while (0) constructs
>>>     s390/percpu: Use new percpu code section for arch_this_cpu_add()
>>>     s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
>>>     s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
>>>     s390/percpu: Provide arch_this_cpu_read() implementation
>>>     s390/percpu: Provide arch_this_cpu_write() implementation
>>>     s390/percpu: Remove one and two byte this_cpu operation implementation
>>>
>>>    arch/s390/boot/alternative.c         |   7 +
>>>    arch/s390/include/asm/alternative.h  |   5 +
>>>    arch/s390/include/asm/entry-percpu.h |  76 ++++++++
>>>    arch/s390/include/asm/lowcore.h      |   3 +-
>>>    arch/s390/include/asm/percpu.h       | 249 +++++++++++++++++++++------
>>>    arch/s390/include/asm/ptrace.h       |   2 +
>>>    arch/s390/kernel/alternative.c       |  25 ++-
>>>    arch/s390/kernel/irq.c               |  26 ++-
>>>    arch/s390/kernel/nmi.c               |   6 +
>>>    arch/s390/kernel/traps.c             |   6 +
>>>    10 files changed, 344 insertions(+), 61 deletions(-)
>>>    create mode 100644 arch/s390/include/asm/entry-percpu.h
>>>
>>> base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
>>

Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Posted by Heiko Carstens 3 days, 13 hours ago
On Wed, May 20, 2026 at 05:23:37PM -0700, Yang Shi wrote:
> > > If I understand correctly, you replaced preempt_disable() and
> > > preempt_enable() with seq begin and seg end, and seq begin and seq end
> > > can be optimized by mvyi instruction on S390. So you just need a single
> > > mvyi instruction for each instead of read-modify-write the seq count.
> > > 
> > > But you need some extra overhead for context switch (save and restore
> > > the seq count register) and need to check whether it is still on the
> > > same cpu once resuming execution. And there is also penalty if it is
> > > migrated to another CPU (need to rerun this_cpu ops).
> > Not as I understand it.
> > What happens is the context switch code 'corrupts' the register being
> > used to access per-cpu data so that it is correct for the new cpu.
> > The write of zero after the sequence is there to stop the register
> > being corrupted outside of this code window.
> 
> Thanks for elaborating it. I misunderstood some nuance. I read the patch #2
> commit message, now I think I understand how it works.

As background: s390 has so called prefix pages; the first two pages of every
CPU are percpu, via a special prefixing mechanism. Parts of the pages can be
used by operating systems as percpu data area, which we use to have fast
access to e.g. the 'current' pointer, the pid, percpu_offset of the current
cpu, etc.

Helpful is also that for instructions which access memory with a base register
zero, its contents are assumed to be zero for address generation by the
hardware, regardless of its real contents. That is, the above

        ag %r4,952

is the short version of

        ag %r4,952(%r0)

The eight bytes at offset 952 of the current CPU's prefix page are added to
register 4. Real contents of register 0 are irrelevant for such address
generations; reducing register pressure.

> Borrowed the disassemble from patch #2 commit message:
> 
>   11a8e6:       c0 30 00 d0 c5 0d       larl    %r3,1b33300
>   11a8ec:       b9 04 00 43             lgr     %r4,%r3
>   11a8f0:       eb 00 43 c0 00 52       mviy    960,4
>   11a8f6:       e3 40 03 b8 00 08       ag      %r4,952
>   11a8fc:       eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
>   11a902:       eb 00 03 c0 00 52       mviy    960,0
>   11a908:       b9 08 00 25             agr     %r2,%r5
>   11a90c        07 fe                   br      %r14
> 
> 11a8f0 loads the percpu offset and mark the percpu code section begin, I
> believe this is needed with percpu page table too because we need load local
> percpu offset.

No, 11a8f0 _writes_ the base register number, which contains the percpu
address used by the percpu atomic op at 11a8fc, to offset 960 of the first
prefix page. It could also be written like

	mviy 960(%r0),4

maybe that makes it more obvious what happens. And yes, this marks the
beginning of a percpu code section. The percpu offset is added to register r4
at 11a8f6 with the ag instruction. This could also be written like

	ag %r4,952(%r0)

This reads the eight byte percpu_offset from offset 952 of the first prefix
page, and adds it to register r4.

> 11a920 loads 0 to the register to mark the percpu code section end, this is
> not needed with percpu page table.

I guess you meant 11a902. But yes, this marks the end of the percpu code
section. Just that this is not a register, but a memory location where is
written to.

> And you need to save the register at the irq/exception entry, then restore
> it at exit. But you also need to check whether migration happens or not, if
> it happens kernel needs to rewrite the register with correct percpu offset
> and needs to check whether the interrupted instruction is "ag".

Yes.

> If it is "ag" instruction (11a8f6) , kernel needs to recalculate the percpu
> address, right?

No, if it is within the percpu code section and it is _not_ the ag instruction,
the percpu base register needs to be adjusted (that's by the way a bug in
patch two, which has this logic inverted - my mistake).

> It sounds a little bit hacky to me TBH and incur some extra overhead for
> "migration detection" and fixup.

Sure, it is hacky, and the small overhead part is of course true.

Compared to the percpu page table proposal the two mviy instructions above
would go away, as well as the extra interrupt/exception overhead. Besides
that your proposal is way less hacky.

> > > So it seems have more overhead than the percpu page table approach IIUC.
> > > We don't need all the steps with percpu page table. And there is no
> > > penalty for migration.
> > This code looks like it relies on 'page zero' already being percpu.
> > So it probably isn't really that different.
> > Some values like the 'preemption disable count' and 'current' could be
> > (maybe are?) written into page zero to give fast access.
> 
> I don't quite get what you mean about 'page zero'.

Hopefully the above description with prefix pages explains it?
Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Posted by Yang Shi 3 days, 6 hours ago

On 5/21/26 3:37 AM, Heiko Carstens wrote:
> On Wed, May 20, 2026 at 05:23:37PM -0700, Yang Shi wrote:
>>>> If I understand correctly, you replaced preempt_disable() and
>>>> preempt_enable() with seq begin and seg end, and seq begin and seq end
>>>> can be optimized by mvyi instruction on S390. So you just need a single
>>>> mvyi instruction for each instead of read-modify-write the seq count.
>>>>
>>>> But you need some extra overhead for context switch (save and restore
>>>> the seq count register) and need to check whether it is still on the
>>>> same cpu once resuming execution. And there is also penalty if it is
>>>> migrated to another CPU (need to rerun this_cpu ops).
>>> Not as I understand it.
>>> What happens is the context switch code 'corrupts' the register being
>>> used to access per-cpu data so that it is correct for the new cpu.
>>> The write of zero after the sequence is there to stop the register
>>> being corrupted outside of this code window.
>> Thanks for elaborating it. I misunderstood some nuance. I read the patch #2
>> commit message, now I think I understand how it works.
> As background: s390 has so called prefix pages; the first two pages of every
> CPU are percpu, via a special prefixing mechanism. Parts of the pages can be
> used by operating systems as percpu data area, which we use to have fast
> access to e.g. the 'current' pointer, the pid, percpu_offset of the current
> cpu, etc.
>
> Helpful is also that for instructions which access memory with a base register
> zero, its contents are assumed to be zero for address generation by the
> hardware, regardless of its real contents. That is, the above
>
>          ag %r4,952
>
> is the short version of
>
>          ag %r4,952(%r0)
>
> The eight bytes at offset 952 of the current CPU's prefix page are added to
> register 4. Real contents of register 0 are irrelevant for such address
> generations; reducing register pressure.

Aha, I see. So the prefix pages are some special memory?

>
>> Borrowed the disassemble from patch #2 commit message:
>>
>>    11a8e6:       c0 30 00 d0 c5 0d       larl    %r3,1b33300
>>    11a8ec:       b9 04 00 43             lgr     %r4,%r3
>>    11a8f0:       eb 00 43 c0 00 52       mviy    960,4
>>    11a8f6:       e3 40 03 b8 00 08       ag      %r4,952
>>    11a8fc:       eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
>>    11a902:       eb 00 03 c0 00 52       mviy    960,0
>>    11a908:       b9 08 00 25             agr     %r2,%r5
>>    11a90c        07 fe                   br      %r14
>>
>> 11a8f0 loads the percpu offset and mark the percpu code section begin, I
>> believe this is needed with percpu page table too because we need load local
>> percpu offset.
> No, 11a8f0 _writes_ the base register number, which contains the percpu
> address used by the percpu atomic op at 11a8fc, to offset 960 of the first
> prefix page. It could also be written like
>
> 	mviy 960(%r0),4
>
> maybe that makes it more obvious what happens. And yes, this marks the
> beginning of a percpu code section. The percpu offset is added to register r4
> at 11a8f6 with the ag instruction. This could also be written like
>
> 	ag %r4,952(%r0)
>
> This reads the eight byte percpu_offset from offset 952 of the first prefix
> page, and adds it to register r4.

Got it.

>
>> 11a920 loads 0 to the register to mark the percpu code section end, this is
>> not needed with percpu page table.
> I guess you meant 11a902. But yes, this marks the end of the percpu code
> section. Just that this is not a register, but a memory location where is
> written to.

So both mviy instructions actually do memory store?

>
>> And you need to save the register at the irq/exception entry, then restore
>> it at exit. But you also need to check whether migration happens or not, if
>> it happens kernel needs to rewrite the register with correct percpu offset
>> and needs to check whether the interrupted instruction is "ag".
> Yes.
>
>> If it is "ag" instruction (11a8f6) , kernel needs to recalculate the percpu
>> address, right?
> No, if it is within the percpu code section and it is _not_ the ag instruction,
> the percpu base register needs to be adjusted (that's by the way a bug in
> patch two, which has this logic inverted - my mistake).

Yeah, I see.

>
>> It sounds a little bit hacky to me TBH and incur some extra overhead for
>> "migration detection" and fixup.
> Sure, it is hacky, and the small overhead part is of course true.
>
> Compared to the percpu page table proposal the two mviy instructions above
> would go away, as well as the extra interrupt/exception overhead. Besides
> that your proposal is way less hacky.

It would be great if we can compare the performance number for the two 
approaches. rseq has been discussed for ARM64, but it seems too 
expensive and just move the overhead to somewhere else.

>
>>>> So it seems have more overhead than the percpu page table approach IIUC.
>>>> We don't need all the steps with percpu page table. And there is no
>>>> penalty for migration.
>>> This code looks like it relies on 'page zero' already being percpu.
>>> So it probably isn't really that different.
>>> Some values like the 'preemption disable count' and 'current' could be
>>> (maybe are?) written into page zero to give fast access.
>> I don't quite get what you mean about 'page zero'.
> Hopefully the above description with prefix pages explains it?

Yes, definitely, thank you so much for elaborating it.

Regards,
Yang


Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Posted by Heiko Carstens 2 days, 14 hours ago
On Thu, May 21, 2026 at 10:47:49AM -0700, Yang Shi wrote:
> > As background: s390 has so called prefix pages; the first two pages of every
> > CPU are percpu, via a special prefixing mechanism. Parts of the pages can be
> > used by operating systems as percpu data area, which we use to have fast
> > access to e.g. the 'current' pointer, the pid, percpu_offset of the current
> > cpu, etc.
> > 
> > Helpful is also that for instructions which access memory with a base register
> > zero, its contents are assumed to be zero for address generation by the
> > hardware, regardless of its real contents. That is, the above
> > 
> >          ag %r4,952
> > 
> > is the short version of
> > 
> >          ag %r4,952(%r0)
> > 
> > The eight bytes at offset 952 of the current CPU's prefix page are added to
> > register 4. Real contents of register 0 are irrelevant for such address
> > generations; reducing register pressure.
> 
> Aha, I see. So the prefix pages are some special memory?

No, it is regular memory. The CPU has a special "prefix register". If
that is set to an address not equal to zero all memory accesses to the
first two pages will be transparently redirected to the 8k memory area
specified with that register.

E.g. the prefix register contains the value 0x10000. If then a memory
access to address 0x400 happens the CPU will transparently turn that
into a memory access to address 0x10400. Or in other words, that is a
small per cpu memory area mechanism provided by the architecture.

> > >    11a8e6:       c0 30 00 d0 c5 0d       larl    %r3,1b33300
> > >    11a8ec:       b9 04 00 43             lgr     %r4,%r3
> > >    11a8f0:       eb 00 43 c0 00 52       mviy    960,4
> > >    11a8f6:       e3 40 03 b8 00 08       ag      %r4,952
> > >    11a8fc:       eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
> > >    11a902:       eb 00 03 c0 00 52       mviy    960,0
> > >    11a908:       b9 08 00 25             agr     %r2,%r5
> > >    11a90c        07 fe                   br      %r14

...

> > > 11a920 loads 0 to the register to mark the percpu code section end, this is
> > > not needed with percpu page table.
> > I guess you meant 11a902. But yes, this marks the end of the percpu code
> > section. Just that this is not a register, but a memory location where is
> > written to.
> 
> So both mviy instructions actually do memory store?

Yes.

> > > It sounds a little bit hacky to me TBH and incur some extra overhead for
> > > "migration detection" and fixup.
> > Sure, it is hacky, and the small overhead part is of course true.
> > 
> > Compared to the percpu page table proposal the two mviy instructions above
> > would go away, as well as the extra interrupt/exception overhead. Besides
> > that your proposal is way less hacky.
> 
> It would be great if we can compare the performance number for the two
> approaches. rseq has been discussed for ARM64, but it seems too expensive
> and just move the overhead to somewhere else.

I tried to implement the proposed rseq/kseq, but the required inline
assemblies resulted in code which was larger than what we have now for
s390.

Also with the current proposal I only did some quick micro benchmarks,
which resulted in 0-1% improvement, which is in the expected range.

It is amazing to see the performance improvements you see on arm64, however
I believe that is mainly because of the large amount of code which is
generated by the arm64 implementations of the preempt primitives
__preempt_count_add() and __preempt_count_dec_and_test().

That's a big difference to s390: for both primitives the result is a single
instruction.
Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Posted by David Laight 3 days, 13 hours ago
On Wed, 20 May 2026 17:23:37 -0700
Yang Shi <yang@os.amperecomputing.com> wrote:

> On 5/20/26 3:34 PM, David Laight wrote:
..
> >> So it seems have more overhead than the percpu page table approach IIUC.
> >> We don't need all the steps with percpu page table. And there is no
> >> penalty for migration.  
> > This code looks like it relies on 'page zero' already being percpu.
> > So it probably isn't really that different.
> > Some values like the 'preemption disable count' and 'current' could be
> > (maybe are?) written into page zero to give fast access.  
> 
> I don't quite get what you mean about 'page zero'.

'page zero' is (at least for some cpu) the memory that can be accessed
using a small offset embedded in the instruction.
This is equivalent to using offsets from an 'always zero' %r0.

The code relies on the accesses to 960(%r0) being per-cpu.

-- David
Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Posted by Yang Shi 3 days, 6 hours ago

On 5/21/26 3:23 AM, David Laight wrote:
> On Wed, 20 May 2026 17:23:37 -0700
> Yang Shi <yang@os.amperecomputing.com> wrote:
>
>> On 5/20/26 3:34 PM, David Laight wrote:
> ..
>>>> So it seems have more overhead than the percpu page table approach IIUC.
>>>> We don't need all the steps with percpu page table. And there is no
>>>> penalty for migration.
>>> This code looks like it relies on 'page zero' already being percpu.
>>> So it probably isn't really that different.
>>> Some values like the 'preemption disable count' and 'current' could be
>>> (maybe are?) written into page zero to give fast access.
>> I don't quite get what you mean about 'page zero'.
> 'page zero' is (at least for some cpu) the memory that can be accessed
> using a small offset embedded in the instruction.
> This is equivalent to using offsets from an 'always zero' %r0.
>
> The code relies on the accesses to 960(%r0) being per-cpu.

Thank you. Heiko also elaborated it.

Yang

>
> -- David
Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Posted by David Laight 3 days, 13 hours ago
On Wed, 20 May 2026 17:23:37 -0700
Yang Shi <yang@os.amperecomputing.com> wrote:

> On 5/20/26 3:34 PM, David Laight wrote:
...
> >
> > But I'm sure I remember that some cpu don't like having the same
> > physical address at different virtual addresses (and not just those
> > with VIVT caches like some sparc cpu).  
> 
> Yeah, VIVT cache doesn't like it due to cache alias. But the mapping is 
> really percpu, so the mapping to the physical address belonging to 
> another CPU should never pollute the current CPU's cache if I don't miss 
> something.
>
> > I'm sure code can end up accessing the current cpu's percpu data
> > using the same address that other cpu use - there are definitely
> > places where it needs that address.  
> 
> No, it is not. In the percpu page table approach, we use different 
> address for this_cpu_*() and per_cpu_ptr() which is mainly used to 
> initialize percpu data for all CPUs.

You missed something.

Look, for example, at kernel/locking/osq_lock.c
The code uses this_cpu_ptr() and then both dereferences the pointer
and writes it to places that other cpu will use.
It also uses per_cpu_ptr() to get an address it can use for the per-cpu
data of another cpu.
(That code all assumes preemption is disabled.)

-- David

> Thanks,
> Yang
Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Posted by Yang Shi 3 days, 7 hours ago

On 5/21/26 3:17 AM, David Laight wrote:
> On Wed, 20 May 2026 17:23:37 -0700
> Yang Shi <yang@os.amperecomputing.com> wrote:
>
>> On 5/20/26 3:34 PM, David Laight wrote:
> ...
>>> But I'm sure I remember that some cpu don't like having the same
>>> physical address at different virtual addresses (and not just those
>>> with VIVT caches like some sparc cpu).
>> Yeah, VIVT cache doesn't like it due to cache alias. But the mapping is
>> really percpu, so the mapping to the physical address belonging to
>> another CPU should never pollute the current CPU's cache if I don't miss
>> something.
>>
>>> I'm sure code can end up accessing the current cpu's percpu data
>>> using the same address that other cpu use - there are definitely
>>> places where it needs that address.
>> No, it is not. In the percpu page table approach, we use different
>> address for this_cpu_*() and per_cpu_ptr() which is mainly used to
>> initialize percpu data for all CPUs.
> You missed something.
>
> Look, for example, at kernel/locking/osq_lock.c
> The code uses this_cpu_ptr() and then both dereferences the pointer
> and writes it to places that other cpu will use.
> It also uses per_cpu_ptr() to get an address it can use for the per-cpu
> data of another cpu.
> (That code all assumes preemption is disabled.)

this_cpu_ptr() uses different addresses for different CPUs. It is a 
special case, it is not due to VIVT, but because it may confuse list 
API. Because list API determines list is empty by comparing pointers 
(head->next == head). this_cpu_read/write/add/sub, etc, are fine.

And per_cpu_ptr() also uses different addresses for different CPUs.

The lwn article explained it. 
https://lwn.net/SubscriberLink/1073395/12c08f128e515809/

Thanks,
Yang

>
> -- David
>
>> Thanks,
>> Yang
Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Posted by David Laight 3 days, 6 hours ago
On Thu, 21 May 2026 09:57:37 -0700
Yang Shi <yang@os.amperecomputing.com> wrote:

> On 5/21/26 3:17 AM, David Laight wrote:
> > On Wed, 20 May 2026 17:23:37 -0700
> > Yang Shi <yang@os.amperecomputing.com> wrote:
> >  
> >> On 5/20/26 3:34 PM, David Laight wrote:  
> > ...  
> >>> But I'm sure I remember that some cpu don't like having the same
> >>> physical address at different virtual addresses (and not just those
> >>> with VIVT caches like some sparc cpu).  
> >> Yeah, VIVT cache doesn't like it due to cache alias. But the mapping is
> >> really percpu, so the mapping to the physical address belonging to
> >> another CPU should never pollute the current CPU's cache if I don't miss
> >> something.
> >>  
> >>> I'm sure code can end up accessing the current cpu's percpu data
> >>> using the same address that other cpu use - there are definitely
> >>> places where it needs that address.  
> >> No, it is not. In the percpu page table approach, we use different
> >> address for this_cpu_*() and per_cpu_ptr() which is mainly used to
> >> initialize percpu data for all CPUs.  
> > You missed something.
> >
> > Look, for example, at kernel/locking/osq_lock.c
> > The code uses this_cpu_ptr() and then both dereferences the pointer
> > and writes it to places that other cpu will use.
> > It also uses per_cpu_ptr() to get an address it can use for the per-cpu
> > data of another cpu.
> > (That code all assumes preemption is disabled.)  
> 
> this_cpu_ptr() uses different addresses for different CPUs. It is a 
> special case, it is not due to VIVT, but because it may confuse list 
> API. Because list API determines list is empty by comparing pointers 
> (head->next == head). this_cpu_read/write/add/sub, etc, are fine.

But you could quite easily get code that manages to mix accesses through
this_cpu_ptr() with direct accesses to per-cpu variables in the same
cache line.
I'm sure some arm cpu really don't like you doing that.
(But it is a foggy memory from somewhere.)

You can use per-cpu page tables, but it really only helps for a
few items.
Anything that is RMW (eg add on pretty much everything except x86)
either has to disable preemption or use a compare and exchange loop.
Variables like 'current' can be written into the per-cpu page table
data area by the process switch code (as I believe s390 does).

The 'trick' here will work for reading/writing values if you don't
care that the value read is stale (or might have been written to
the memory for a different cpu).
It might work for updating the preemption disable count - because
you can only be preempted while it is zero.

-- David

> 
> And per_cpu_ptr() also uses different addresses for different CPUs.
> 
> The lwn article explained it. 
> https://lwn.net/SubscriberLink/1073395/12c08f128e515809/
> 
> Thanks,
> Yang
> 
> >
> > -- David
> >  
> >> Thanks,
> >> Yang  
>
Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Posted by Yang Shi 3 days, 3 hours ago

On 5/21/26 10:55 AM, David Laight wrote:
> On Thu, 21 May 2026 09:57:37 -0700
> Yang Shi <yang@os.amperecomputing.com> wrote:
>
>> On 5/21/26 3:17 AM, David Laight wrote:
>>> On Wed, 20 May 2026 17:23:37 -0700
>>> Yang Shi <yang@os.amperecomputing.com> wrote:
>>>   
>>>> On 5/20/26 3:34 PM, David Laight wrote:
>>> ...
>>>>> But I'm sure I remember that some cpu don't like having the same
>>>>> physical address at different virtual addresses (and not just those
>>>>> with VIVT caches like some sparc cpu).
>>>> Yeah, VIVT cache doesn't like it due to cache alias. But the mapping is
>>>> really percpu, so the mapping to the physical address belonging to
>>>> another CPU should never pollute the current CPU's cache if I don't miss
>>>> something.
>>>>   
>>>>> I'm sure code can end up accessing the current cpu's percpu data
>>>>> using the same address that other cpu use - there are definitely
>>>>> places where it needs that address.
>>>> No, it is not. In the percpu page table approach, we use different
>>>> address for this_cpu_*() and per_cpu_ptr() which is mainly used to
>>>> initialize percpu data for all CPUs.
>>> You missed something.
>>>
>>> Look, for example, at kernel/locking/osq_lock.c
>>> The code uses this_cpu_ptr() and then both dereferences the pointer
>>> and writes it to places that other cpu will use.
>>> It also uses per_cpu_ptr() to get an address it can use for the per-cpu
>>> data of another cpu.
>>> (That code all assumes preemption is disabled.)
>> this_cpu_ptr() uses different addresses for different CPUs. It is a
>> special case, it is not due to VIVT, but because it may confuse list
>> API. Because list API determines list is empty by comparing pointers
>> (head->next == head). this_cpu_read/write/add/sub, etc, are fine.
> But you could quite easily get code that manages to mix accesses through
> this_cpu_ptr() with direct accesses to per-cpu variables in the same
> cache line.

I can see potential cache alias issue with VIVT cache with the below 
pattern:

for_each_cpu(cpu)
     per_cpu_ptr(cpu) <-- Initialize per cpu data

this_cpu_inc(current_cpu) <-- Inc the current cpu copy

this_cpu_inc() may see stale copy (uninitialized) if there is no cache 
flush after initialization.

> I'm sure some arm cpu really don't like you doing that.
> (But it is a foggy memory from somewhere.)

ARMv8 requires PIPT if I remember correctly. Some old 32 bit arm 
machines may have VIVT cache, but 32 bit arm is not the target user TBH. 
I can see some potential issues with VIVT cache, I don't think they are 
the target users and VIVT cache is rare now.

>
> You can use per-cpu page tables, but it really only helps for a
> few items.
> Anything that is RMW (eg add on pretty much everything except x86)
> either has to disable preemption or use a compare and exchange loop.

It only helps this_cpu ops because they use atomic instructions (at 
least on ARM64). __this_cpu ops still require preemption disabled. But 
the performance improvement is still impressive even though it just can 
help this_cpu ops.

> Variables like 'current' can be written into the per-cpu page table
> data area by the process switch code (as I believe s390 does).

That may be a useful usecase.

>
> The 'trick' here will work for reading/writing values if you don't
> care that the value read is stale (or might have been written to
> the memory for a different cpu).

If you don't care the stale value, you can just call __this_cpu_read().

Thanks,
Yang

> It might work for updating the preemption disable count - because
> you can only be preempted while it is zero.
>
> -- David
>
>> And per_cpu_ptr() also uses different addresses for different CPUs.
>>
>> The lwn article explained it.
>> https://lwn.net/SubscriberLink/1073395/12c08f128e515809/
>>
>> Thanks,
>> Yang
>>
>>> -- David
>>>   
>>>> Thanks,
>>>> Yang

Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Posted by David Laight 3 days, 1 hour ago
On Thu, 21 May 2026 13:46:25 -0700
Yang Shi <yang@os.amperecomputing.com> wrote:

> On 5/21/26 10:55 AM, David Laight wrote:
...
> > The 'trick' here will work for reading/writing values if you don't
> > care that the value read is stale (or might have been written to
> > the memory for a different cpu).  
> 
> If you don't care the stale value, you can just call __this_cpu_read().

You can get an impossible value.
The generated code might be like this:
	this_cpu_data = xxx;
	preempt_disable_count = this_cpu_data->preempt_disable_count;
If the count was non-zero at the start you'll read the value from
the current cpu and all is fine.
But if the count is zero you can get preempted between the instructions,
the process now running on your 'old' cpu can increment the value
and you then read the new non-zero value.
That won't be good at all.

You can only use __this_cpu_read() for things that don't change.

The big problem with using per-cpu page tables is there will be absolutely
nothing stopping code taking the wrong address of a per-cpu variable and
saving it somewhere.
At the moment you have to use the helper so always get the global address.

-- David
	
> 
> Thanks,
> Yang
Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Posted by Yang Shi 3 days ago

On 5/21/26 3:13 PM, David Laight wrote:
> On Thu, 21 May 2026 13:46:25 -0700
> Yang Shi <yang@os.amperecomputing.com> wrote:
>
>> On 5/21/26 10:55 AM, David Laight wrote:
> ...
>>> The 'trick' here will work for reading/writing values if you don't
>>> care that the value read is stale (or might have been written to
>>> the memory for a different cpu).
>> If you don't care the stale value, you can just call __this_cpu_read().
> You can get an impossible value.
> The generated code might be like this:
> 	this_cpu_data = xxx;
> 	preempt_disable_count = this_cpu_data->preempt_disable_count;
> If the count was non-zero at the start you'll read the value from
> the current cpu and all is fine.
> But if the count is zero you can get preempted between the instructions,
> the process now running on your 'old' cpu can increment the value
> and you then read the new non-zero value.
> That won't be good at all.

TBH, I don't think this counts for "don't care the stale value".

>
> You can only use __this_cpu_read() for things that don't change.
>
> The big problem with using per-cpu page tables is there will be absolutely
> nothing stopping code taking the wrong address of a per-cpu variable and
> saving it somewhere.

Err... I'm lost. If you mean RW or RMW, atomic instructions are required 
for this_cpu ops. This is how this_cpu ops is implemented for ARM64 even 
though without percpu page table. If the operation is interrupted in the 
middle, the exclusion monitor will be cleared, the hardware will reload 
the value.

If you mean this_cpu_read() some value then this_cpu_write() to the same 
cpu, I don't think it can work as expected without disabling preemption 
for the whole code section even though we don't have percpu page table.

Thanks,
Yang

> At the moment you have to use the helper so always get the global address.
>
> -- David
> 	
>> Thanks,
>> Yang