Add a new percpu subsection for data that is frequently accessed and
exclusive to each processor. This is intended to replace the pcpu_hot
struct on X86, and is available to all architectures.

The one caveat with this approach is that it depends on the linker to
efficiently pack data that is smaller than machine word size. The
binutils linker does this properly:

ffffffff842f6000 D __per_cpu_hot_start
ffffffff842f6000 D softirq_pending
ffffffff842f6002 D hardirq_stack_inuse
ffffffff842f6008 D hardirq_stack_ptr
ffffffff842f6010 D __ref_stack_chk_guard
ffffffff842f6010 D __stack_chk_guard
ffffffff842f6018 D const_cpu_current_top_of_stack
ffffffff842f6018 D cpu_current_top_of_stack
ffffffff842f6020 D const_current_task
ffffffff842f6020 D current_task
ffffffff842f6028 D __preempt_count
ffffffff842f602c D cpu_number
ffffffff842f6030 D this_cpu_off
ffffffff842f6038 D __x86_call_depth
ffffffff842f6040 D __per_cpu_hot_end

The LLVM linker doesn't do as well with packing smaller data objects,
causing it to spill over into a second cacheline.

Brian Gerst (11):
  percpu: Introduce percpu hot section
  x86/preempt: Move preempt count to percpu hot section
  x86/smp: Move cpu number to percpu hot section
  x86/retbleed: Move call depth to percpu hot section
  x86/percpu: Move top_of_stack to percpu hot section
  x86/percpu: Move current_task to percpu hot section
  x86/softirq: Move softirq_pending to percpu hot section
  x86/irq: Move irq stacks to percpu hot section
  x86/percpu: Remove pcpu_hot
  x86/stackprotector: Move __stack_chk_guard to percpu hot section
  x86/smp: Move this_cpu_off to percpu hot section

 arch/x86/entry/entry_32.S             |  4 +--
 arch/x86/entry/entry_64.S             |  6 ++---
 arch/x86/entry/entry_64_compat.S      |  4 +--
 arch/x86/include/asm/current.h        | 35 ++++-----------------------
 arch/x86/include/asm/hardirq.h        |  3 ++-
 arch/x86/include/asm/irq_stack.h      | 12 ++++-----
 arch/x86/include/asm/nospec-branch.h  | 10 +++++---
 arch/x86/include/asm/percpu.h         |  4 +--
 arch/x86/include/asm/preempt.h        | 25 ++++++++++---------
 arch/x86/include/asm/processor.h      | 15 ++++++++++--
 arch/x86/include/asm/smp.h            |  7 +++---
 arch/x86/include/asm/stackprotector.h |  2 +-
 arch/x86/kernel/asm-offsets.c         |  5 ----
 arch/x86/kernel/callthunks.c          |  3 +++
 arch/x86/kernel/cpu/common.c          | 17 +++++++------
 arch/x86/kernel/dumpstack_32.c        |  4 +--
 arch/x86/kernel/dumpstack_64.c        |  2 +-
 arch/x86/kernel/head_64.S             |  4 +--
 arch/x86/kernel/irq.c                 |  8 ++++++
 arch/x86/kernel/irq_32.c              | 12 +++++----
 arch/x86/kernel/irq_64.c              |  6 ++---
 arch/x86/kernel/process_32.c          |  6 ++---
 arch/x86/kernel/process_64.c          |  6 ++---
 arch/x86/kernel/setup_percpu.c        |  7 ++++--
 arch/x86/kernel/smpboot.c             |  4 +--
 arch/x86/kernel/vmlinux.lds.S         |  5 +++-
 arch/x86/lib/retpoline.S              |  2 +-
 include/asm-generic/vmlinux.lds.h     | 10 ++++++++
 include/linux/percpu-defs.h           | 10 ++++++++
 kernel/bpf/verifier.c                 |  4 +--
 scripts/gdb/linux/cpus.py             |  2 +-
 31 files changed, 135 insertions(+), 109 deletions(-)

base-commit: 01157ddc58dc2fe428ec17dd5a18cc13f134639f
--
2.48.1
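For context before the replies: a minimal sketch of what such a
definition macro could look like, built on the existing
DEFINE_PER_CPU_SECTION()/DECLARE_PER_CPU_SECTION() helpers in
include/linux/percpu-defs.h. The "..hot." subsection name here is an
assumption drawn from this thread, not a quote from the patches:

/* Place each hot variable in its own per-CPU "..hot.<name>" subsection
 * so the linker script can sort and pack the individual objects (see
 * the SORT_BY_ALIGNMENT() discussion later in the thread).
 */
#define DEFINE_PER_CPU_HOT(type, name)				\
	DEFINE_PER_CPU_SECTION(type, name, "..hot." #name)

#define DECLARE_PER_CPU_HOT(type, name)				\
	DECLARE_PER_CPU_SECTION(type, name, "..hot." #name)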
* Brian Gerst <brgerst@gmail.com> wrote:
> Add a new percpu subsection for data that is frequently accessed and
> exclusive to each processor. This is intended to replace the pcpu_hot
> struct on X86, and is available to all architectures.
>
> The one caveat with this approach is that it depends on the linker to
> efficiently pack data that is smaller than machine word size. The
> binutils linker does this properly:
>
> ffffffff842f6000 D __per_cpu_hot_start
> ffffffff842f6000 D softirq_pending
> ffffffff842f6002 D hardirq_stack_inuse
> ffffffff842f6008 D hardirq_stack_ptr
> ffffffff842f6010 D __ref_stack_chk_guard
> ffffffff842f6010 D __stack_chk_guard
> ffffffff842f6018 D const_cpu_current_top_of_stack
> ffffffff842f6018 D cpu_current_top_of_stack
> ffffffff842f6020 D const_current_task
> ffffffff842f6020 D current_task
> ffffffff842f6028 D __preempt_count
> ffffffff842f602c D cpu_number
> ffffffff842f6030 D this_cpu_off
> ffffffff842f6038 D __x86_call_depth
> ffffffff842f6040 D __per_cpu_hot_end
>
> The LLVM linker doesn't do as well with packing smaller data objects,
> causing it to spill over into a second cacheline.
Ok, so I like how it decentralizes the decision about what is 'hot'
and what is not:
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
 DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
 EXPORT_PER_CPU_SYMBOL(irq_stat);
+DEFINE_PER_CPU_HOT(u16, softirq_pending);
This can also be a drawback if it's abused by random driver code - so I
think it should at minimum be documented to be used by core & arch
code. Maybe add a build #error too if it's defined in modular code?
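One possible shape for such a guard, sketched against the macro above
(MODULE is defined by kbuild when building module objects; a literal
#error cannot fire from within a macro expansion, so a static assertion
is the usual stand-in, and the exact mechanism in an eventual patch may
well differ):

#ifdef MODULE
/* Per-CPU hot data is reserved for core and arch code: any use in a
 * module becomes a compile-time error when the macro expands.
 */
#define DEFINE_PER_CPU_HOT(type, name)				\
	_Static_assert(0, "DEFINE_PER_CPU_HOT() may not be used in modules")
#else
#define DEFINE_PER_CPU_HOT(type, name)				\
	DEFINE_PER_CPU_SECTION(type, name, "..hot." #name)
#endif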
Other variants like DEFINE_PER_CPU_SHARED_ALIGNED aren't really being
abused AFAICS, so maybe this isn't too much of a concern.
One potential drawback would be that previously the section was
hand-ordered:
struct pcpu_hot {
	union {
		struct {
			struct task_struct	*current_task;
			int			preempt_count;
			int			cpu_number;
#ifdef CONFIG_MITIGATION_CALL_DEPTH_TRACKING
			u64			call_depth;
#endif
			unsigned long		top_of_stack;
			void			*hardirq_stack_ptr;
			u16			softirq_pending;
#ifdef CONFIG_X86_64
			bool			hardirq_stack_inuse;
#else
			void			*softirq_stack_ptr;
#endif
		};
		u8			pad[64];
	};
};
... while now it's linker-ordered. But on the other hand that can be an
advantage too: the linker will try to (or at least has a chance to)
order the fields optimally for cache density, while the hand-packing
always has the potential to bitrot without much of an outside,
actionable indicator for the bitrot.
One naming suggestion: wouldn't it be better to make it explicit that
the 'hot' qualifier is about cache locality:
+DEFINE_PER_CPU_CACHE_HOT(u16, softirq_pending);
Makes it more of a mouthful to write definitions/declarations, but the
actual per-cpu usage sites are unaffected as this too is otherwise part
of the generic percpu namespace.
... and yes, DEFINE_PER_CPU_ALIGNED should probably have been named
DEFINE_PER_CPU_CACHE_ALIGNED too. (Because 'aligned' often means
machine-word alignment, the naming is a bit ambiguous.)
I.e. in an ideal world the complete set of DEFINE_PER_CPU_XXX
attributes should be something like:
	DEFINE_PER_CPU_CACHE_HOT
	DEFINE_PER_CPU_CACHE_ALIGNED		# was: DEFINE_PER_CPU_ALIGNED
	DEFINE_PER_CPU_CACHE_ALIGNED_SHARED	# was: DEFINE_PER_CPU_SHARED_ALIGNED
	DEFINE_PER_CPU_PAGE_ALIGNED
	DEFINE_PER_CPU_READ_MOSTLY
	DEFINE_PER_CPU_DECRYPTED
But I digress...
Anyway, I've Cc:-ed various potentially interested parties, please
speak up now or forever hold your peace. ;-)
Thanks,
Ingo
On Sun, 23 Feb 2025 at 01:37, Ingo Molnar <mingo@kernel.org> wrote:
>
> This can also be a drawback if it's abused by random driver code - so I
> think it should at minimum be documented to be used by core & arch
> code. Maybe add a build #error too if it's defined in modular code?
Yes, please.
Everybody always thinks that *their* code is the most important code,
so making it easy for random filesystems or drivers to just say "this
is my hot piece of data" needs to be avoided.
That is also an argument for having the final size be asserted to be
smaller than one cacheline.
Because I do think that the patches look fine, but it's too much of an
invitation for random developers to go "Oh, *MY* code deserves a hot
marker".
Linus
On Sun, 23 Feb 2025 at 10:37, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Brian Gerst <brgerst@gmail.com> wrote:
>
> > Add a new percpu subsection for data that is frequently accessed and
> > exclusive to each processor. This is intended to replace the pcpu_hot
> > struct on X86, and is available to all architectures.
>
> ... now it's linker-ordered. But on the other hand that can be an
> advantage too: the linker will try to (or at least has a chance to)
> order the fields optimally for cache density, while the hand-packing
> always has the potential to bitrot without much of an outside,
> actionable indicator for the bitrot.

The linker will need some help here - by default, it just emits these
variables in the order they appear in the input.

If we emit each such variable 'foo' into .data..hot.foo, and define
the contents of the section as

    *(SORT_BY_ALIGNMENT(.data..hot.*))

we should get optimal packing as long as the alignment of these
variables does not exceed their size.
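To make that concrete: the input-section rule could sit inside the
per-CPU output section of include/asm-generic/vmlinux.lds.h roughly as
follows. This is only a sketch of the suggestion, not the posted
patches; the __per_cpu_hot_start/__per_cpu_hot_end symbols are taken
from the cover letter's symbol dump, and the 64-byte cacheline size is
hardcoded for illustration:

/* Align to a cacheline, then let the linker sort the hot subsections
 * by alignment, so larger-aligned objects come first and smaller ones
 * backfill the remaining gaps instead of forcing padding.
 */
. = ALIGN(64);
__per_cpu_hot_start = .;
*(SORT_BY_ALIGNMENT(.data..hot.*))
__per_cpu_hot_end = .;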
On Sun, Feb 23, 2025 at 5:20 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> The linker will need some help here - by default, it just emits these
> variables in the order they appear in the input.
>
> If we emit each such variable 'foo' into .data..hot.foo, and define
> the contents of the section as
>
>     *(SORT_BY_ALIGNMENT(.data..hot.*))
>
> we should get optimal packing as long as the alignment of these
> variables does not exceed their size.

Thanks for the tip on SORT_BY_ALIGNMENT().  That got the LLVM linker
to pack the data correctly.

Brian Gerst
On Sun, Feb 23, 2025 at 11:20 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> If we emit each such variable 'foo' into .data..hot.foo, and define
> the contents of the section as
>
>     *(SORT_BY_ALIGNMENT(.data..hot.*))
>
> we should get optimal packing as long as the alignment of these
> variables does not exceed their size.

Is it possible to warn/error when data is spilled over the cache line?
Previously, there was:

-static_assert(sizeof(struct pcpu_hot) == 64);

that failed the build in this case.

Uros.
On Sun, Feb 23, 2025 at 5:30 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> Is it possible to warn/error when data is spilled over the cache line?
> Previously, there was:
>
> -static_assert(sizeof(struct pcpu_hot) == 64);
>
> that failed the build in this case.

I think it should be a warning and not an error.  If the data does
spill into a second cacheline, the kernel will still boot and function
properly, so it's not a fatal error; it just could hurt performance a
bit.  Decentralizing this does make it harder to account for size,
especially with conditional builds.  Unfortunately, the linker script
language does not have a WARNING() counterpart to ASSERT().

Brian Gerst
On Sun, 23 Feb 2025 at 18:25, Brian Gerst <brgerst@gmail.com> wrote:
>
> I think it should be a warning and not an error.  If the data does
> spill into a second cacheline, the kernel will still boot and function
> properly, so it's not a fatal error; it just could hurt performance a
> bit.  Decentralizing this does make it harder to account for size,
> especially with conditional builds.  Unfortunately, the linker script
> language does not have a WARNING() counterpart to ASSERT().

Why should it even be a warning? What is the problem if the build in
question has two cachelines worth of hot per-CPU data?
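For completeness: the closest thing the linker script language offers
is a hard ASSERT(), which turns the spill into a fatal link error
rather than a warning. A sketch against the symbols from the cover
letter, with the 64-byte cacheline size hardcoded for illustration:

/* In arch/x86/kernel/vmlinux.lds.S (sketch): fail the link if the
 * hot per-CPU data outgrows a single cacheline.
 */
ASSERT(__per_cpu_hot_end - __per_cpu_hot_start <= 64,
       "per-CPU hot section exceeds one cacheline")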