Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)

[RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)

Posted by Yang Shi 1 month, 2 weeks ago

Introduction
============
This patch series implements the LSFMM 2026 proposal for optimizing
this_cpu_*() ops on ARM64. For the details of the proposal, please refer to:
https://lore.kernel.org/linux-mm/CAHbLzkpcN-T8MH6=W3jCxcFj1gVZp8fRqe231yzZT-rV_E_org@mail.gmail.com/
I didn't repeat it in the cover letter because there is no change to the
proposal.

The series is based on 7.1-rc1. It is basically a minimum viable set of
patches. There are still a few hacks in this series and it may break some
things, for example, KPTI, SMT machines that share a TLB, etc. But it should
be good enough for now to demonstrate the core idea. The main purpose of the
RFC is to gather feedback early, figure out missing parts and risks, and
make sure we are on the right track. Hopefully it can also help the
discussion at the upcoming LSFMM.

I broke the patches down into arch-dependent and arch-independent parts so
that anyone interested can more easily experiment on other architectures,
for example, S390.

A new kernel config is introduced, HAVE_LOCAL_PER_CPU_MAP. Architectures
that can support this feature select it. Allocating and freeing the percpu
local mapping is guarded by this config so that other architectures don't
pay the cost.

 
Known Issues
============
1. KPTI
-------
We need to determine which CPU we are on, then switch to the right page
table. Currently the arm64 kernel fetches tramp_pg_dir via
swapper_pg_dir - fixed_offset, and fetches swapper_pg_dir from ttbr1. But
ttbr1 may not hold swapper_pg_dir anymore, except on CPU #0. So we need to
figure out another way to handle it.
Switching to tramp_pg_dir should be easy, but the reverse seems harder because
tramp_pg_dir just maps the trampoline vectors.
Maybe we can do a two-step switch: switch to swapper_pg_dir in the first
step, then switch to the per-CPU page table (for entry) or the tramp page
table (for exit).
Nobody should call this_cpu_*() at either the userspace -> kernel entry stage
or the kernel -> userspace exit stage.

2. Shared TLB machines
----------------------
Some machines may share a TLB between CPUs. For example, SMT machines may
share a TLB between the two hardware threads of a single core.
Per-CPU page tables just can't work on such machines. Maybe we need a new
cpufeature to indicate whether per-CPU page tables are allowed, and then
only enable them on machines without a shared TLB.
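
To illustrate why a shared TLB is a problem, here is a toy user-space model
in Python (not kernel code). It assumes each CPU's page table maps the same
virtual address to a different physical per-CPU copy, and that the TLB caches
translations by virtual address only. LOCAL_PERCPU_VA, the TLB class and the
"phys_cpuN" strings are all made up for the example, and ASIDs and real TLB
tagging are ignored.

```python
LOCAL_PERCPU_VA = 0x1000  # hypothetical VA of the local percpu mapping


class TLB:
    """Caches VA -> PA translations; it does not know which CPU filled it."""

    def __init__(self):
        self.entries = {}

    def translate(self, va, page_table):
        # On a miss, walk the requesting CPU's page table and cache the result.
        if va not in self.entries:
            self.entries[va] = page_table[va]
        return self.entries[va]


def run(shared_tlb):
    # Per-CPU page tables: same VA, different physical page on each CPU.
    page_tables = {
        0: {LOCAL_PERCPU_VA: "phys_cpu0"},
        1: {LOCAL_PERCPU_VA: "phys_cpu1"},
    }
    if shared_tlb:
        tlb = TLB()
        tlbs = {0: tlb, 1: tlb}        # two hardware threads, one TLB
    else:
        tlbs = {0: TLB(), 1: TLB()}    # private TLB per CPU
    # CPU 0 touches the address first, then CPU 1 does.
    return [tlbs[cpu].translate(LOCAL_PERCPU_VA, page_tables[cpu])
            for cpu in (0, 1)]
```

With private TLBs each CPU resolves the address to its own copy. With a
shared TLB, CPU 1 hits the entry that CPU 0 filled and ends up on CPU 0's
copy, which is exactly the case a cpufeature check would have to exclude.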

 
Benchmark
=========
The benchmarks were run on a 160-core AmpereOne machine. The baseline is
the v7.1-rc1 kernel.

1. Kernel Build
---------------
Run kernel build (make -j160) with the default Fedora kernel config in a
memcg.
13% - 18% sys time improvement
3% - 7% wall time improvement

2. stress-ng vm ops
-------------------
stress-ng --vm 160 --vm-bytes 128M --vm-ops 100000000
8.5% improvement

3. stress-ng vm ops + fork
----------------------
stress-ng --mmapfork 160 --mmapfork-bytes 128M --mmapfork-ops 500
15% improvement


Regression test
===============
1. memcg creation
-----------------
Create 10K memcgs. Each memcg creation needs to allocate multiple percpu
variables, for example, percpu refcnt, rstat and objcg percpu refcnt.

This consumed 2112K more virtual memory for the percpu “local mapping”, plus
a few more megabytes for the per-CPU page tables.
No noticeable regression in elapsed time was found.

2. fork test
------------
stress-ng --fork 160 --fork-ops 10000000
fork() needs to allocate multiple percpu variables, for example, rss
counters and mm_cid_cpu.

Roughly a 1% regression was found. However, the stress-ng fork test has
quite a small address space, while real-life workloads typically have much
larger address spaces and do more complicated work. The stress-ng mmapfork
benchmark saw a 15% improvement.


Yang Shi (11):
      arm64: mm: enable percpu kernel page table
      arm64: mm: define percpu virtual space area
      arm64: smp: define setup_per_cpu_areas()
      mm: percpu: prepare to use dedicated percpu area
      arm64: mm: map local percpu first chunk
      mm: percpu: set up first chunk and reserve chunk
      arm64: mm: introduce __per_cpu_local_off
      vmalloc: pass in pgd pointer for vmap{__vunmap}_range_noflush()
      mm: percpu: allocate and free local percpu vm area
      arm64: kconfig: select HAVE_LOCAL_PER_CPU_MAP
      arm64: percpu: use local percpu for this_cpu_*() APIs

 arch/arm64/Kconfig                   |   2 +-
 arch/arm64/include/asm/mmu.h         |   3 +++
 arch/arm64/include/asm/mmu_context.h |   6 +++++-
 arch/arm64/include/asm/percpu.h      |  17 ++++++++++-------
 arch/arm64/include/asm/pgtable.h     |  24 +++++++++++++++++++++---
 arch/arm64/kernel/setup.c            |   3 +++
 arch/arm64/kernel/smp.c              |  40 ++++++++++++++++++++++++++++++++++++++++
 arch/arm64/mm/mmu.c                  |  75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 arch/arm64/mm/ptdump.c               |   4 ++++
 drivers/base/arch_numa.c             |  51 +--------------------------------------------------
 include/linux/percpu.h               |   4 +++-
 include/linux/vmalloc.h              |   3 +++
 mm/Kconfig                           |   3 +++
 mm/internal.h                        |   5 ++++-
 mm/kmsan/hooks.c                     |  14 +++++++-------
 mm/percpu-internal.h                 |  15 +++++++++++++++
 mm/percpu-vm.c                       |  91 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/percpu.c                          |  46 +++++++++++++++++++++++++++++++++++++---------
 mm/vmalloc.c                         | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
 19 files changed, 419 insertions(+), 99 deletions(-)


Thanks,
Yang

Re: [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)

Posted by Yang Shi 1 month, 2 weeks ago


On 4/29/26 10:04 AM, Yang Shi wrote:
> Known Issues
> ============
> 1. KPTI
> -------
> We need to determine which CPU we are on, then switch to the right page table.
> Currently arm64 kernel fetches tramp_pg_dir via swapper_pg_dir - fixed_offset,
> and fetches swapper_pg_dir from ttbr1. But ttbr1 may not hold swapper_pg_dir
> anymore except CPU #0. So we need to figure out the other way to handle it.
> Switching to tramp_pg_dir should be easy, but the reverse seems harder because
> tramp_pg_dir just maps the trampoline vectors.
> Maybe we can do two steps switch. Switch to swapper_pg_dir at the first step,
> then switch to per cpu page table (for entry) or tramp page table (for exit).
> Nobody should call this_cpu_*() at either userspace -> kernel entry stage or
> kernel -> userspace exit stage.
>
> 2. Shared TLB machines
> ----------------------
> Some machines may share TLB between CPUs, for example, SMT machines may share
> TLB between the two hardware threads in one single core.
> The per cpu page table just can't work with it. Maybe we need a new
> cpufeature to indicate whether per cpu page table is allowed or not. Then
> just enable it for not-shared-TLB machines.

Adding a few more known issues that I forgot to list.

3. Memory hotplug/unplug
-----------------------
The linear mapping and/or vmemmap may get out of sync because memory
hotplug/unplug updates the page tables via __create_pgd_mapping() and
__remove_pgd_mapping(), which have no mechanism to sync page tables. But it
should not be hard to resolve.
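
As a rough illustration of what such a sync could look like, here is a toy
Python model (not kernel code, and not how the series does it). It assumes
the per-CPU page tables share all top-level entries with swapper_pg_dir
except for one slot that holds the local percpu mapping. The slot names,
LOCAL_PERCPU_SLOT and the helper functions are hypothetical.

```python
LOCAL_PERCPU_SLOT = "percpu_local"  # hypothetical slot that differs per CPU


def make_tables(nr_cpus):
    # swapper holds the shared kernel mappings; each per-CPU table starts
    # as a copy of it plus its own local percpu entry.
    swapper = {"linear_0": "pud_A"}
    percpu = []
    for cpu in range(nr_cpus):
        table = dict(swapper)
        table[LOCAL_PERCPU_SLOT] = f"pud_cpu{cpu}"
        percpu.append(table)
    return swapper, percpu


def hotplug_add(swapper, slot, entry):
    # Models the problem described above: only one table is updated.
    swapper[slot] = entry


def sync_tables(swapper, percpu):
    for table in percpu:
        # Propagate every shared top-level entry to the per-CPU table.
        for slot, entry in swapper.items():
            table[slot] = entry
        # Drop shared entries that were removed (memory unplug), but never
        # touch the per-CPU slot.
        for slot in list(table):
            if slot != LOCAL_PERCPU_SLOT and slot not in swapper:
                del table[slot]
```

Calling sync_tables() after every hotplug or unplug update keeps the shared
entries identical across all tables while each CPU keeps its own local
percpu entry.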

4. 2-level and 3-level page table
----------------------------
Page table sync needs to be made to work for them. Currently it should only
work with 4-level page tables. This should not be hard either.

5. Confusing /proc/vmallocinfo
---------------------------
Percpu areas used to be allocated from the vmalloc area, but now they are
not, so they should not show up in /proc/vmallocinfo anymore.


Yang


Re: [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)

Posted by David Hildenbrand (Arm) 1 month ago

> =========
> The benchmarks are done on 160 core AmpereOne machine. The baseline is
> v7.1-rc1 kernel.
> 
> 1. Kernel Build
> ---------------
> Run kernel build (make -j160) with the default Fedora kernel config in a
> memcg.
> 13% - 18% sys time improvement
> 3% - 7% wall time improvement

This is pretty impressive!

There was quite some feedback during the LSF/MM session, what's the current plan?

Also, it was raised that Linus so far didn't enjoy per-process page tables. Is
there a way forward?


Finally, in the LSF/MM session, there was the question why the preemption
handling is even required. Can you describe what the problem is?

-- 
Cheers,

David

Re: [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)

Posted by Yang Shi 1 month ago

On 5/12/26 2:02 AM, David Hildenbrand (Arm) wrote:
>> =========
>> The benchmarks are done on 160 core AmpereOne machine. The baseline is
>> v7.1-rc1 kernel.
>>
>> 1. Kernel Build
>> ---------------
>> Run kernel build (make -j160) with the default Fedora kernel config in a
>> memcg.
>> 13% - 18% sys time improvement
>> 3% - 7% wall time improvement
> This is pretty impressive!

Thank you.

>
> There was quite some feedback during the LSF/MM session, what's the current plan?

We didn't talk about the plan in the LSFMM session because time ran out.
I had a hallway conversation with Ryan. He said he will try to replicate
the performance benchmarks on some other ARM64 machines.

He raised a concern about CNP (Common not Private), but neither of us can
find a machine with a shared TLB. We do need some help running the patchset
on those machines, because disabling CNP may have performance implications.

I plan to polish up the patchset. There is still a lot of work to do to get
it into better shape. Sound like a plan?

I'm not sure whether S390 folks will implement this on S390 or not, 
anyway they are cc'ed.

>
> Also, it was raised that Linus so far didn't enjoy per-process page tables. Is
> there a way forward?

Yeah, it was discussed. My point is that it makes some sense for x86 not to
have per-CPU page tables, because userspace and the kernel share the same
page table on x86, so the number of kernel page tables is actually
unbounded. But ARM64 is different. The hardware supports separate userspace
and kernel page tables, so the number of kernel page tables is bounded by
the number of CPUs. And my regression tests didn't show a noticeable
regression from setting up the percpu local mapping for 160 cores (which
means 160 kernel page tables).

So we should maximize the hardware benefit IMHO. And it should be up to the
architecture maintainers.

>
>
> Finally, in the LSF/MM session, there was the question why the preemption
> handling is even required. Can you describe what the problem is?

Someone questioned why we can't just remove preempt_disable/enable, since we
only care about the sum of the counters. That may be OK for some cases, for
example, some simple statistics, but it may cause problems for a lot of use
cases, for example:
     - __this_cpu_*() ops don't use atomic instructions. If they happen to
       access the same counter concurrently with this_cpu_*(), the counter
       may be corrupted.
     - this_cpu_write() may write a value or pointer, so it may corrupt the
       remote CPU's copy.
     - The percpu counter may call into the slow path to flush the per-CPU
       counters to a global counter when some threshold is reached. An
       imprecise per-CPU counter may result in suboptimal behavior, for
       example, calling into the slow path more often than necessary.
     - The statistics may get out of sync or deviate more than expected
       because the counter flush is skipped when the threshold is compared
       against the wrong value.
     - AFAIK, the scheduler may use percpu counters for some percpu locks.
       An imprecise counter may cause lockups and misbehavior.
     - Some subsystems maintain percpu state and make decisions based on
       it. Corrupted percpu state may cause various problems.
     - this_cpu_cmpxchg() may compare against the remote CPU's value and
       loop indefinitely.

There are a lot of other cases that I may not be aware of because percpu is
widely used by various subsystems. Anyway, the spec is that this_cpu_*() ops
can only access the local CPU's copy. Accessing a remote CPU's data is
definitely not expected and may cause various problems.
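
To make the first case concrete, here is a toy Python model (user space,
single-threaded, not kernel code) that replays one bad interleaving by hand.
It assumes task A has already resolved the address of CPU 0's copy when it
gets migrated to CPU 1, while task B does a plain load/add/store on CPU 0.
The function names and the dict of counters are made up for the example.

```python
def lost_update_without_preempt_disable():
    counters = {0: 0, 1: 0}      # one copy of the counter per CPU

    # Task A (this_cpu_add-style, atomic) resolves "this CPU" on CPU 0,
    # then gets preempted and migrated to CPU 1 before touching the counter.
    a_target = 0

    # Task B now runs on CPU 0 and starts a plain, non-atomic
    # __this_cpu_add-style increment: load ...
    b_loaded = counters[0]

    # ... while A, now on CPU 1, atomically increments CPU 0's copy.
    counters[a_target] += 1

    # B finishes with its stale value and overwrites A's increment.
    counters[0] = b_loaded + 1
    return counters


def same_ops_with_preempt_disabled():
    counters = {0: 0, 1: 0}
    counters[0] += 1             # A completes on CPU 0 before it can migrate
    b_loaded = counters[0]       # B then runs its load/add/store on CPU 0
    counters[0] = b_loaded + 1
    return counters
```

In the first function two increments leave CPU 0's counter at 1, so even
the sum over all CPUs is wrong. In the second the counter ends at 2.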

Thanks,
Yang

>

Re: [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)

Posted by Heiko Carstens 1 month ago

On Wed, May 13, 2026 at 05:00:19PM -0700, Yang Shi wrote:
> On 5/12/26 2:02 AM, David Hildenbrand (Arm) wrote:
> > There was quite some feedback during the LSF/MM session, what's the current plan?
...
> I'm not sure whether S390 folks will implement this on S390 or not, anyway
> they are cc'ed.

I'm not sure yet, however after I had a look at the architecture documentation
a couple of weeks ago, I think it shouldn't be too hard to get this working on
s390 as well. I was a bit concerned about TLB flushing, if changes to the
kernel mapping happen with per-cpu page tables, but as of now I believe this
shouldn't cause any harm (famous last words...).

Re: [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)

Posted by Yang Shi 1 month ago

On 5/15/26 9:28 AM, Heiko Carstens wrote:
> On Wed, May 13, 2026 at 05:00:19PM -0700, Yang Shi wrote:
>> On 5/12/26 2:02 AM, David Hildenbrand (Arm) wrote:
>>> There was quite some feedback during the LSF/MM session, what's the current plan?
> ...
>> I'm not sure whether S390 folks will implement this on S390 or not, anyway
>> they are cc'ed.
> I'm not sure yet, however after I had a look at the architecture documentation
> a couple of weeks ago, I think it shouldn't be too hard to get this working on
> s390 as well. I was a bit concerned about TLB flushing, if changes to the
> kernel mapping happen with per-cpu page tables, but as of now I believe this
> shouldn't cause any harm (famous last words...).

Yeah, it shouldn't. The kernel needs to flush the TLB on all CPUs when the
kernel mapping is changed, regardless of per-CPU page tables. There should
not be any extra overhead in most cases.

Some extra TLB flushing is needed for the "percpu local mapping area", but
all CPUs use the same virtual address, so we should just need one more TLB
flush call with the same virtual address for all CPUs. In addition, percpu
chunk destruction happens asynchronously in a work queue. Unmapping page
tables, flushing the TLB and freeing pages all happen in the work queue when
the whole chunk is freed. The fast path basically just updates an allocation
bitmap.
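
A toy Python model of that split between the fast path and the deferred
teardown might look like the sketch below. It is not the kernel's percpu
allocator. Chunk, PercpuAllocator and the tlb_flushes counter are
hypothetical names, and a deque stands in for the work queue.

```python
from collections import deque


class Chunk:
    def __init__(self, nr_units):
        self.bitmap = [False] * nr_units   # True = unit allocated
        self.mapped = True                 # page tables and pages present


class PercpuAllocator:
    def __init__(self):
        self.workqueue = deque()           # stands in for a kernel work queue
        self.tlb_flushes = 0               # extra flushes for the local mapping

    def alloc(self, chunk):
        idx = chunk.bitmap.index(False)    # raises ValueError if chunk is full
        chunk.bitmap[idx] = True
        return idx

    def free(self, chunk, idx):
        # Fast path: only the allocation bitmap changes.
        chunk.bitmap[idx] = False
        if not any(chunk.bitmap):
            # The whole chunk is free: defer the expensive teardown.
            self.workqueue.append(chunk)

    def run_workqueue(self):
        while self.workqueue:
            chunk = self.workqueue.popleft()
            if any(chunk.bitmap):
                continue                   # chunk was reused in the meantime
            chunk.mapped = False           # unmap page tables, free pages
            self.tlb_flushes += 1          # one flush for the shared local VA
```

Freeing individual units never unmaps anything or flushes the TLB. That
work only happens once per chunk, later, when the work queue runs.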

Thanks,
Yang