[v4] s390: Improve this_cpu operations

[PATCH v4 0/8] s390: Improve this_cpu operations

Posted by Heiko Carstens 2 days, 5 hours ago

v4:

- Drop alternatives approach and extract percpu base register number for
  mviy at compile time [David Laight]

- Fix logic for percpu code section detection, as well as
  interruption/exception/nmi path [Sashiko [5]]

[5] https://sashiko.dev/#/patchset/20260520092243.264847-1-hca%40linux.ibm.com

v3:

- Fix various typos [Juergen Christ]

- Add missing kprobe detection / handling [Sashiko [3]]
  [FWIW, this also made me aware that the current general s390 kprobes
   code seems to be racy against concurrent removal of a kprobe while a
   probe hits on a different CPU. But that is a different story.]

- Fix various minor findings [Sashiko [3]]

- All of this might be dropped / exchanged in the future in favor of the
  percpu page table approach proposed by Yang Shi [4].

[3] https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca@linux.ibm.com
[4] https://lore.kernel.org/all/20260429170758.3018959-1-yang@os.amperecomputing.com/

v2:

- Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
  warnings

- Add missing __packed attribute to insn structure [Sashiko [2]]

- Fix inverted if condition [Sashiko [2]]

- Add missing user_mode() check [Sashiko [2]]

- Move percpu_entry() call in front of irqentry_enter() call in all
  entry paths so that potential this_cpu() operations cannot overwrite
  the not-yet-saved percpu code section indicator [Sashiko [2]]

[2] https://sashiko.dev/#/patchset/20260317195436.2276810-1-hca%40linux.ibm.com

v1:

This is a follow-up to Peter Zijlstra's in-kernel rseq RFC [1].

With the intended removal of PREEMPT_NONE, this_cpu operations based on
atomic instructions, guarded with preempt_disable() / preempt_enable()
pairs, become more expensive: the pairs are no longer optimized away at
compile time.

In particular the conditional call to preempt_schedule_notrace() after
preempt_enable() adds additional code and register pressure.
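For illustration, the cost described above can be modelled in plain user
space C. This is only a sketch under stated assumptions, not the kernel
implementation: all model_* names are hypothetical stand-ins, and the
preempt count, the need-resched flag and the current CPU are simple globals.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user space stand-ins for kernel state. */
static int model_preempt_count;
static bool model_need_resched;
static int model_resched_calls;
static int model_cpu;		/* CPU the model currently "runs" on */
static long model_counter[2];	/* one counter slot per modelled CPU */

/* Stand-in for preempt_schedule_notrace(): only counts the calls. */
static void model_preempt_schedule_notrace(void)
{
	model_resched_calls++;
	model_need_resched = false;
}

static inline void model_preempt_disable(void)
{
	model_preempt_count++;
}

/*
 * The conditional call below is what every guarded this_cpu operation
 * has to carry once the pair is no longer compiled away.
 */
static inline void model_preempt_enable(void)
{
	assert(model_preempt_count > 0);	/* catch unbalanced calls */
	if (--model_preempt_count == 0 && model_need_resched)
		model_preempt_schedule_notrace();
}

/* A guarded this_cpu add: disable, atomic op on own slot, enable. */
static void model_this_cpu_add(long val)
{
	model_preempt_disable();
	__atomic_fetch_add(&model_counter[model_cpu], val, __ATOMIC_RELAXED);
	model_preempt_enable();
}
```

Every caller of model_this_cpu_add() carries the compare-and-call from
model_preempt_enable(), which is the extra code and register pressure
mentioned above.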

To avoid this Peter suggested an in-kernel rseq approach. While this would
certainly work, this series tries to come up with a solution which uses
fewer instructions and doesn't require repeating instruction sequences.

The idea is that this_cpu operations based on atomic instructions are
guarded with mviy instructions:

- The first mviy instruction writes the number of the register which
  contains the percpu variable address to lowcore. This also indicates
  that a percpu code section is executed.

- The first instruction following the mviy instruction must be the ag
  instruction which adds the percpu offset to the percpu address register.

- Afterwards the atomic percpu operation follows.

- Then a second mviy instruction writes a zero to lowcore, which indicates
  the end of the percpu code section.

- In case of an interrupt/exception/nmi the register number which was
  written to lowcore is copied to the exception frame (pt_regs), and a zero
  is written to lowcore.

- On return to the previous context it is checked if a percpu code section
  was executed (saved register number not zero), and if the process was
  migrated to a different cpu. If the percpu offset was already added to
  the percpu address register (instruction address does _not_ point to the
  ag instruction) the content of the percpu address register is adjusted so
  it points to the percpu variable of the new cpu.
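
The entry and return handling described in the list above can be sketched
as a small user space model. This is not the code from the series: the
model_* structures and functions are hypothetical, the address of the ag
instruction is passed in as a parameter instead of being derived from the
instruction stream, and writing the register number back to lowcore on
return is an assumption which is not stated above.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 2

/* Hypothetical stand-ins for lowcore and pt_regs. */
struct model_lowcore {
	uint8_t percpu_reg;	/* 0: not within a percpu code section */
	uint64_t percpu_offset;	/* percpu offset of this CPU */
};

struct model_pt_regs {
	uint64_t gprs[16];
	uint64_t psw_addr;	/* address of the interrupted instruction */
	uint8_t percpu_reg;	/* indicator saved on entry */
};

static struct model_lowcore model_lc[MODEL_NR_CPUS] = {
	{ .percpu_offset = 0x10000 },
	{ .percpu_offset = 0x20000 },
};

/* Entry: copy the indicator to the exception frame, clear lowcore. */
static void model_percpu_entry(struct model_pt_regs *regs, int cpu)
{
	regs->percpu_reg = model_lc[cpu].percpu_reg;
	model_lc[cpu].percpu_reg = 0;
}

/*
 * Return: if a percpu code section was interrupted, the task moved to a
 * different CPU, and the ag instruction was already executed, rebase the
 * percpu address register to the new CPU.
 */
static void model_percpu_exit(struct model_pt_regs *regs, int old_cpu,
			      int new_cpu, uint64_t ag_addr)
{
	uint8_t reg = regs->percpu_reg;

	if (!reg)
		return;
	assert(reg < 16);
	if (old_cpu != new_cpu && regs->psw_addr != ag_addr)
		regs->gprs[reg] += model_lc[new_cpu].percpu_offset -
				   model_lc[old_cpu].percpu_offset;
	/* Assumption: the section is still active, so mark it again. */
	model_lc[new_cpu].percpu_reg = reg;
}
```

If the interrupted instruction address still points to the ag
instruction, no adjustment is made in this model, since the ag instruction
will add the offset of the new cpu when it is executed.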

All of this seems to work, but of course it could still be broken in case
I missed some detail.

In total this series results in a kernel text size reduction of ~106kb. The
number of preempt_schedule_notrace() call sites is reduced from 7089 to
1577.

Note: this comes without any extensive performance analysis, however all
microbenchmarks confirmed that the new code is at least as fast as the
old code, as expected.

[1] 20260223163843.GR1282955@noisy.programming.kicks-ass.net

Heiko Carstens (8):
  s390/percpu: Infrastructure for more efficient this_cpu operations
  s390/percpu: Add missing do { } while (0) constructs
  s390/percpu: Use new percpu code section for arch_this_cpu_add()
  s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
  s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
  s390/percpu: Provide arch_this_cpu_read() implementation
  s390/percpu: Provide arch_this_cpu_write() implementation
  s390/percpu: Remove one and two byte this_cpu operation implementation

 arch/s390/include/asm/entry-percpu.h |  78 ++++++++
 arch/s390/include/asm/lowcore.h      |   3 +-
 arch/s390/include/asm/percpu.h       | 257 +++++++++++++++++++++------
 arch/s390/include/asm/ptrace.h       |   2 +
 arch/s390/kernel/irq.c               |  24 ++-
 arch/s390/kernel/nmi.c               |   5 +
 arch/s390/kernel/traps.c             |   5 +
 7 files changed, 315 insertions(+), 59 deletions(-)
 create mode 100644 arch/s390/include/asm/entry-percpu.h

-- 
2.51.0

Re: [PATCH v4 0/8] s390: Improve this_cpu operations

Posted by David Laight 2 days, 1 hour ago

On Fri, 22 May 2026 16:12:49 +0200
Heiko Carstens <hca@linux.ibm.com> wrote:

> v4:
> 
> - Drop alternatives approach and extract percpu base register number for
>   mviy at compile time [David Laight]

Definitely looks better.
I'm sure I managed to understand it once :-)

-- David
