Add a new percpu subsection for data that is frequently accessed and
exclusive to each processor. This is intended to replace the pcpu_hot
struct on X86, and is available to all architectures.

The one caveat with this approach is that it depends on the linker to
efficiently pack data that is smaller than machine word size. The
binutils linker does this properly:

ffffffff842f6000 D __per_cpu_hot_start
ffffffff842f6000 D softirq_pending
ffffffff842f6002 D hardirq_stack_inuse
ffffffff842f6008 D hardirq_stack_ptr
ffffffff842f6010 D __ref_stack_chk_guard
ffffffff842f6010 D __stack_chk_guard
ffffffff842f6018 D const_cpu_current_top_of_stack
ffffffff842f6018 D cpu_current_top_of_stack
ffffffff842f6020 D const_current_task
ffffffff842f6020 D current_task
ffffffff842f6028 D __preempt_count
ffffffff842f602c D cpu_number
ffffffff842f6030 D this_cpu_off
ffffffff842f6038 D __x86_call_depth
ffffffff842f6040 D __per_cpu_hot_end

The LLVM linker doesn't do as well with packing smaller data objects,
causing it to spill over into a second cacheline.

Brian Gerst (11):
  percpu: Introduce percpu hot section
  x86/preempt: Move preempt count to percpu hot section
  x86/smp: Move cpu number to percpu hot section
  x86/retbleed: Move call depth to percpu hot section
  x86/percpu: Move top_of_stack to percpu hot section
  x86/percpu: Move current_task to percpu hot section
  x86/softirq: Move softirq_pending to percpu hot section
  x86/irq: Move irq stacks to percpu hot section
  x86/percpu: Remove pcpu_hot
  x86/stackprotector: Move __stack_chk_guard to percpu hot section
  x86/smp: Move this_cpu_off to percpu hot section

 arch/x86/entry/entry_32.S             |  4 +--
 arch/x86/entry/entry_64.S             |  6 ++---
 arch/x86/entry/entry_64_compat.S      |  4 +--
 arch/x86/include/asm/current.h        | 35 ++++-----------------------
 arch/x86/include/asm/hardirq.h        |  3 ++-
 arch/x86/include/asm/irq_stack.h      | 12 ++++-----
 arch/x86/include/asm/nospec-branch.h  | 10 +++++---
 arch/x86/include/asm/percpu.h         |  4 +--
 arch/x86/include/asm/preempt.h        | 25 ++++++++++---------
 arch/x86/include/asm/processor.h      | 15 ++++++++++--
 arch/x86/include/asm/smp.h            |  7 +++---
 arch/x86/include/asm/stackprotector.h |  2 +-
 arch/x86/kernel/asm-offsets.c         |  5 ----
 arch/x86/kernel/callthunks.c          |  3 +++
 arch/x86/kernel/cpu/common.c          | 17 +++++++------
 arch/x86/kernel/dumpstack_32.c        |  4 +--
 arch/x86/kernel/dumpstack_64.c        |  2 +-
 arch/x86/kernel/head_64.S             |  4 +--
 arch/x86/kernel/irq.c                 |  8 ++++++
 arch/x86/kernel/irq_32.c              | 12 +++++----
 arch/x86/kernel/irq_64.c              |  6 ++---
 arch/x86/kernel/process_32.c          |  6 ++---
 arch/x86/kernel/process_64.c          |  6 ++---
 arch/x86/kernel/setup_percpu.c        |  7 ++++--
 arch/x86/kernel/smpboot.c             |  4 +--
 arch/x86/kernel/vmlinux.lds.S         |  5 +++-
 arch/x86/lib/retpoline.S              |  2 +-
 include/asm-generic/vmlinux.lds.h     | 10 ++++++++
 include/linux/percpu-defs.h           | 10 ++++++++
 kernel/bpf/verifier.c                 |  4 +--
 scripts/gdb/linux/cpus.py             |  2 +-
 31 files changed, 135 insertions(+), 109 deletions(-)

base-commit: 01157ddc58dc2fe428ec17dd5a18cc13f134639f
--
2.48.1
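For context before the replies: a minimal sketch of what such a
definition macro could look like, built on the existing
DEFINE_PER_CPU_SECTION()/DECLARE_PER_CPU_SECTION() helpers in
include/linux/percpu-defs.h. The "..hot." subsection name here is an
assumption drawn from this thread, not a quote from the patches:

/* Place each hot variable in its own per-CPU "..hot.<name>" subsection
 * so the linker script can sort and pack the individual objects (see
 * the SORT_BY_ALIGNMENT() discussion later in the thread).
 */
#define DEFINE_PER_CPU_HOT(type, name)				\
	DEFINE_PER_CPU_SECTION(type, name, "..hot." #name)

#define DECLARE_PER_CPU_HOT(type, name)				\
	DECLARE_PER_CPU_SECTION(type, name, "..hot." #name)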
* Brian Gerst <brgerst@gmail.com> wrote:
> Add a new percpu subsection for data that is frequently accessed and
> exclusive to each processor. This is intended to replace the pcpu_hot
> struct on X86, and is available to all architectures.
>
> The one caveat with this approach is that it depends on the linker to
> efficiently pack data that is smaller than machine word size. The
> binutils linker does this properly:
>
> ffffffff842f6000 D __per_cpu_hot_start
> ffffffff842f6000 D softirq_pending
> ffffffff842f6002 D hardirq_stack_inuse
> ffffffff842f6008 D hardirq_stack_ptr
> ffffffff842f6010 D __ref_stack_chk_guard
> ffffffff842f6010 D __stack_chk_guard
> ffffffff842f6018 D const_cpu_current_top_of_stack
> ffffffff842f6018 D cpu_current_top_of_stack
> ffffffff842f6020 D const_current_task
> ffffffff842f6020 D current_task
> ffffffff842f6028 D __preempt_count
> ffffffff842f602c D cpu_number
> ffffffff842f6030 D this_cpu_off
> ffffffff842f6038 D __x86_call_depth
> ffffffff842f6040 D __per_cpu_hot_end
>
> The LLVM linker doesn't do as well with packing smaller data objects,
> causing it to spill over into a second cacheline.
Ok, so I like how it decentralizes the decision about what is 'hot'
and what is not:
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
 DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
 EXPORT_PER_CPU_SYMBOL(irq_stat);
+DEFINE_PER_CPU_HOT(u16, softirq_pending);
This can also be a drawback if it's abused by random driver code - so I
think it should at minimum be documented to be used by core & arch
code. Maybe add a build #error too if it's defined in modular code?
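One possible shape for such a guard, sketched against the macro above
(MODULE is defined by kbuild when building module objects; a literal
#error cannot fire from within a macro expansion, so a static assertion
is the usual stand-in, and the exact mechanism in an eventual patch may
well differ):

#ifdef MODULE
/* Per-CPU hot data is reserved for core and arch code: any use in a
 * module becomes a compile-time error when the macro expands.
 */
#define DEFINE_PER_CPU_HOT(type, name)				\
	_Static_assert(0, "DEFINE_PER_CPU_HOT() may not be used in modules")
#else
#define DEFINE_PER_CPU_HOT(type, name)				\
	DEFINE_PER_CPU_SECTION(type, name, "..hot." #name)
#endif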
Other variants like DEFINE_PER_CPU_SHARED_ALIGNED aren't really being
abused AFAICS, so maybe this isn't too much of a concern.
One potential drawback would be that previously the section was
hand-ordered:
struct pcpu_hot {
	union {
		struct {
			struct task_struct	*current_task;
			int			preempt_count;
			int			cpu_number;
#ifdef CONFIG_MITIGATION_CALL_DEPTH_TRACKING
			u64			call_depth;
#endif
			unsigned long		top_of_stack;
			void			*hardirq_stack_ptr;
			u16			softirq_pending;
#ifdef CONFIG_X86_64
			bool			hardirq_stack_inuse;
#else
			void			*softirq_stack_ptr;
#endif
		};
		u8			pad[64];
	};
};
... while now it's linker-ordered. But on the other hand that can be an
advantage too: the linker will try to (or at least has a chance to)
order the fields optimally for cache density, while the hand-packing
always has the potential to bitrot without much of an outside,
actionable indicator for the bitrot.
One naming suggestion: wouldn't it be better to make it explicit that
the 'hot' qualifier is about cache locality:
+DEFINE_PER_CPU_CACHE_HOT(u16, softirq_pending);
Makes it more of a mouthful to write definitions/declarations, but the
actual per-cpu usage sites are unaffected as this too is otherwise part
of the generic percpu namespace.
... and yes, DEFINE_PER_CPU_ALIGNED should probably have been named
DEFINE_PER_CPU_CACHE_ALIGNED too. (Because 'aligned' often means
machine-word alignment, the naming is a bit ambiguous.)
I.e. in an ideal world the complete set of DEFINE_PER_CPU_XXX
attributes should be something like:
	DEFINE_PER_CPU_CACHE_HOT
	DEFINE_PER_CPU_CACHE_ALIGNED		# was: DEFINE_PER_CPU_ALIGNED
	DEFINE_PER_CPU_CACHE_ALIGNED_SHARED	# was: DEFINE_PER_CPU_SHARED_ALIGNED
	DEFINE_PER_CPU_PAGE_ALIGNED
	DEFINE_PER_CPU_READ_MOSTLY
	DEFINE_PER_CPU_DECRYPTED
But I digress...
Anyway, I've Cc:-ed various potentially interested parties, please
speak up now or forever hold your peace. ;-)
Thanks,
Ingo
On Sun, 23 Feb 2025 at 01:37, Ingo Molnar <mingo@kernel.org> wrote:
>
> This can also be a drawback if it's abused by random driver code - so I
> think it should at minimum be documented to be used by core & arch
> code. Maybe add a build #error too if it's defined in modular code?
Yes, please.
Everybody always thinks that *their* code is the most important code,
so making it easy for random filesystems or drivers to just say "this
is my hot piece of data" needs to be avoided.
That is also an argument for having the final size be asserted to be
smaller than one cacheline.
Because I do think that the patches look fine, but it's too much of an
invitation for random developers to go "Oh, *MY* code deserves a hot
marker".
Linus
On Sun, 23 Feb 2025 at 10:37, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Brian Gerst <brgerst@gmail.com> wrote:
>
> > Add a new percpu subsection for data that is frequently accessed and
> > exclusive to each processor. This is intended to replace the pcpu_hot
> > struct on X86, and is available to all architectures.
>
> ... now it's linker-ordered. But on the other hand that can be an
> advantage too: the linker will try to (or at least has a chance to)
> order the fields optimally for cache density, while the hand-packing
> always has the potential to bitrot without much of an outside,
> actionable indicator for the bitrot.

The linker will need some help here - by default, it just emits these
variables in the order they appear in the input.

If we emit each such variable 'foo' into .data..hot.foo, and define
the contents of the section as

    *(SORT_BY_ALIGNMENT(.data..hot.*))

we should get optimal packing as long as the alignment of these
variables does not exceed their size.
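To make that concrete: the input-section rule could sit inside the
per-CPU output section of include/asm-generic/vmlinux.lds.h roughly as
follows. This is only a sketch of the suggestion, not the posted
patches; the __per_cpu_hot_start/__per_cpu_hot_end symbols are taken
from the cover letter's symbol dump, and the 64-byte cacheline size is
hardcoded for illustration:

/* Align to a cacheline, then let the linker sort the hot subsections
 * by alignment, so larger-aligned objects come first and smaller ones
 * backfill the remaining gaps instead of forcing padding.
 */
. = ALIGN(64);
__per_cpu_hot_start = .;
*(SORT_BY_ALIGNMENT(.data..hot.*))
__per_cpu_hot_end = .;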
On Sun, Feb 23, 2025 at 5:20 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> The linker will need some help here - by default, it just emits these
> variables in the order they appear in the input.
>
> If we emit each such variable 'foo' into .data..hot.foo, and define
> the contents of the section as
>
>     *(SORT_BY_ALIGNMENT(.data..hot.*))
>
> we should get optimal packing as long as the alignment of these
> variables does not exceed their size.

Thanks for the tip on SORT_BY_ALIGNMENT().  That got the LLVM linker
to pack the data correctly.

Brian Gerst
On Sun, Feb 23, 2025 at 11:20 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> If we emit each such variable 'foo' into .data..hot.foo, and define
> the contents of the section as
>
>     *(SORT_BY_ALIGNMENT(.data..hot.*))
>
> we should get optimal packing as long as the alignment of these
> variables does not exceed their size.

Is it possible to warn/error when data is spilled over the cache line?
Previously, there was:

-static_assert(sizeof(struct pcpu_hot) == 64);

that failed the build in this case.

Uros.
On Sun, Feb 23, 2025 at 5:30 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> Is it possible to warn/error when data is spilled over the cache line?
> Previously, there was:
>
> -static_assert(sizeof(struct pcpu_hot) == 64);
>
> that failed the build in this case.

I think it should be a warning and not an error.  If the data does
spill into a second cacheline, the kernel will still boot and function
properly, so it's not a fatal error; it just could hurt performance a
bit.  Decentralizing this does make it harder to account for size,
especially with conditional builds.  Unfortunately, the linker script
language does not have a WARNING() counterpart to ASSERT().

Brian Gerst
On Sun, 23 Feb 2025 at 18:25, Brian Gerst <brgerst@gmail.com> wrote:
>
> I think it should be a warning and not an error.  If the data does
> spill into a second cacheline, the kernel will still boot and function
> properly, so it's not a fatal error; it just could hurt performance a
> bit.  Decentralizing this does make it harder to account for size,
> especially with conditional builds.  Unfortunately, the linker script
> language does not have a WARNING() counterpart to ASSERT().

Why should it even be a warning? What is the problem if the build in
question has two cachelines worth of hot per-CPU data?
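For completeness: the closest thing the linker script language offers
is a hard ASSERT(), which turns the spill into a fatal link error
rather than a warning. A sketch against the symbols from the cover
letter, with the 64-byte cacheline size hardcoded for illustration:

/* In arch/x86/kernel/vmlinux.lds.S (sketch): fail the link if the
 * hot per-CPU data outgrows a single cacheline.
 */
ASSERT(__per_cpu_hot_end - __per_cpu_hot_start <= 64,
       "per-CPU hot section exceeds one cacheline")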