[PATCH v2 00/11] Add a percpu subsection for cache hot data

Brian Gerst posted 11 patches 11 months, 2 weeks ago
There is a newer version of this series
Posted by Brian Gerst 11 months, 2 weeks ago
Add a new percpu subsection for data that is frequently accessed and
exclusive to each processor.  This replaces the pcpu_hot struct on x86,
and is available to all architectures and the core kernel.

ffffffff842fa000 D __per_cpu_hot_start
ffffffff842fa000 D hardirq_stack_ptr
ffffffff842fa008 D __ref_stack_chk_guard
ffffffff842fa008 D __stack_chk_guard
ffffffff842fa010 D const_cpu_current_top_of_stack
ffffffff842fa010 D cpu_current_top_of_stack
ffffffff842fa018 D const_current_task
ffffffff842fa018 D current_task
ffffffff842fa020 D __x86_call_depth
ffffffff842fa028 D this_cpu_off
ffffffff842fa030 D __preempt_count
ffffffff842fa034 D cpu_number
ffffffff842fa038 D __softirq_pending
ffffffff842fa03a D hardirq_stack_inuse
ffffffff842fa040 D __per_cpu_hot_end

This applies to the tip/x86/asm branch.

Changes in V2:
- Renamed macros to *_PER_CPU_CACHE_HOT()
- Restored 64-byte limit.
- Added note that this is only to be used by arch and core code.
- Reserve call depth space even when the mitigation is disabled.
- Use SORT_BY_ALIGNMENT() for dense data packing.
- Remove now unnecessary includes of current.h, fixing up some indirect
  includes.

Brian Gerst (11):
  percpu: Introduce percpu hot section
  x86/percpu: Move pcpu_hot to percpu hot section
  x86/preempt: Move preempt count to percpu hot section
  x86/smp: Move cpu number to percpu hot section
  x86/retbleed: Move call depth to percpu hot section
  x86/softirq: Move softirq_pending to percpu hot section
  x86/irq: Move irq stacks to percpu hot section
  x86/percpu: Move top_of_stack to percpu hot section
  x86/percpu: Move current_task to percpu hot section
  x86/stackprotector: Move __stack_chk_guard to percpu hot section
  x86/smp: Move this_cpu_off to percpu hot section

 arch/x86/entry/entry_32.S             |  4 +--
 arch/x86/entry/entry_64.S             |  6 ++---
 arch/x86/entry/entry_64_compat.S      |  4 +--
 arch/x86/include/asm/current.h        | 36 +++++----------------------
 arch/x86/include/asm/hardirq.h        |  4 +--
 arch/x86/include/asm/irq_stack.h      | 12 ++++-----
 arch/x86/include/asm/nospec-branch.h  | 11 ++++----
 arch/x86/include/asm/percpu.h         |  4 +--
 arch/x86/include/asm/preempt.h        | 25 ++++++++++---------
 arch/x86/include/asm/processor.h      | 15 +++++++++--
 arch/x86/include/asm/smp.h            |  7 +++---
 arch/x86/include/asm/stackprotector.h |  2 +-
 arch/x86/kernel/asm-offsets.c         |  5 ----
 arch/x86/kernel/cpu/common.c          | 25 +++++++++++++------
 arch/x86/kernel/dumpstack_32.c        |  4 +--
 arch/x86/kernel/dumpstack_64.c        |  2 +-
 arch/x86/kernel/head_64.S             |  4 +--
 arch/x86/kernel/irq.c                 |  8 ++++++
 arch/x86/kernel/irq_32.c              | 12 +++++----
 arch/x86/kernel/irq_64.c              |  6 ++---
 arch/x86/kernel/process_32.c          |  6 ++---
 arch/x86/kernel/process_64.c          |  6 ++---
 arch/x86/kernel/setup_percpu.c        |  7 ++++--
 arch/x86/kernel/smpboot.c             |  4 +--
 arch/x86/kernel/vmlinux.lds.S         |  6 ++++-
 arch/x86/lib/retpoline.S              |  2 +-
 include/asm-generic/vmlinux.lds.h     | 10 ++++++++
 include/linux/percpu-defs.h           | 12 +++++++++
 include/linux/preempt.h               |  1 +
 kernel/bpf/verifier.c                 |  4 +--
 scripts/gdb/linux/cpus.py             |  2 +-
 31 files changed, 145 insertions(+), 111 deletions(-)


base-commit: 79165720f31868d9a9f7e5a50a09d5fe510d1822
-- 
2.48.1
Re: [PATCH v2 00/11] Add a percpu subsection for cache hot data
Posted by Peter Zijlstra 11 months, 2 weeks ago
On Wed, Feb 26, 2025 at 01:05:19PM -0500, Brian Gerst wrote:
> Add a new percpu subsection for data that is frequently accessed and
> exclusive to each processor.  This replaces the pcpu_hot struct on x86,
> and is available to all architectures and the core kernel.
> 
> ffffffff842fa000 D __per_cpu_hot_start
> ffffffff842fa000 D hardirq_stack_ptr
> ffffffff842fa008 D __ref_stack_chk_guard
> ffffffff842fa008 D __stack_chk_guard
> ffffffff842fa010 D const_cpu_current_top_of_stack
> ffffffff842fa010 D cpu_current_top_of_stack
> ffffffff842fa018 D const_current_task
> ffffffff842fa018 D current_task
> ffffffff842fa020 D __x86_call_depth
> ffffffff842fa028 D this_cpu_off
> ffffffff842fa030 D __preempt_count
> ffffffff842fa034 D cpu_number
> ffffffff842fa038 D __softirq_pending
> ffffffff842fa03a D hardirq_stack_inuse
> ffffffff842fa040 D __per_cpu_hot_end

The above is useful, but not quite as useful as looking at:

$ pahole -C pcpu_hot defconfig-build/vmlinux.o
struct pcpu_hot {
        union {
                struct {
                        struct task_struct * current_task; /*     0     8 */
                        int        preempt_count;        /*     8     4 */
                        int        cpu_number;           /*    12     4 */
                        u64        call_depth;           /*    16     8 */
                        long unsigned int top_of_stack;  /*    24     8 */
                        void *     hardirq_stack_ptr;    /*    32     8 */
                        u16        softirq_pending;      /*    40     2 */
                        bool       hardirq_stack_inuse;  /*    42     1 */
                };                                       /*     0    48 */
                u8                 pad[64];              /*     0    64 */
        };                                               /*     0    64 */

        /* size: 64, cachelines: 1, members: 1 */
};

A slightly more useful variant of your listing would be:

$ readelf -Ws defconfig-build/vmlinux | sort -k 2 | awk 'BEGIN {p=0} /__per_cpu_hot_start/ {p=1} { if (p) print $2 " " $3 " " $8 } /__per_cpu_hot_end/ {p=0}'
ffffffff834f5000 0 __per_cpu_hot_start
ffffffff834f5000 8 hardirq_stack_ptr
ffffffff834f5008 0 __ref_stack_chk_guard
ffffffff834f5008 8 __stack_chk_guard
ffffffff834f5010 0 const_cpu_current_top_of_stack
ffffffff834f5010 8 cpu_current_top_of_stack
ffffffff834f5018 0 const_current_task
ffffffff834f5018 8 current_task
ffffffff834f5020 8 __x86_call_depth
ffffffff834f5028 8 this_cpu_off
ffffffff834f5030 4 __preempt_count
ffffffff834f5034 4 cpu_number
ffffffff834f5038 2 __softirq_pending
ffffffff834f503a 1 hardirq_stack_inuse
ffffffff834f5040 0 __per_cpu_hot_end

as it also gets the size for each symbol, allowing us to compute the
hole as 0x40-0x3b, or 5 bytes.
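
The hole arithmetic can be sketched in a few lines of Python. This is
only an illustration, assuming the three-column "address size name"
output produced by the readelf/awk pipeline above; the function name
trailing_hole() is made up for this example and is not part of the
series.

```python
# Listing copied from the readelf/awk output above: address, size, name.
LISTING = """\
ffffffff834f5000 0 __per_cpu_hot_start
ffffffff834f5000 8 hardirq_stack_ptr
ffffffff834f5008 0 __ref_stack_chk_guard
ffffffff834f5008 8 __stack_chk_guard
ffffffff834f5010 0 const_cpu_current_top_of_stack
ffffffff834f5010 8 cpu_current_top_of_stack
ffffffff834f5018 0 const_current_task
ffffffff834f5018 8 current_task
ffffffff834f5020 8 __x86_call_depth
ffffffff834f5028 8 this_cpu_off
ffffffff834f5030 4 __preempt_count
ffffffff834f5034 4 cpu_number
ffffffff834f5038 2 __softirq_pending
ffffffff834f503a 1 hardirq_stack_inuse
ffffffff834f5040 0 __per_cpu_hot_end
"""


def trailing_hole(listing, end_symbol="__per_cpu_hot_end"):
    """Bytes between the end of the last sized symbol and the end marker."""
    data_end = 0
    section_end = None
    for line in listing.strip().splitlines():
        addr_hex, size_str, name = line.split()
        addr = int(addr_hex, 16)
        if name == end_symbol:
            section_end = addr
            continue
        # Zero-sized aliases (const_current_task etc.) do not move data_end
        # past the real symbol that shares their address.
        data_end = max(data_end, addr + int(size_str, 0))
    if section_end is None:
        raise ValueError("end marker %s not found" % end_symbol)
    return section_end - data_end


print(trailing_hole(LISTING))
```

For the listing above, the last sized symbol ends at ...503b and the
end marker sits at ...5040, so the result is the same 5 bytes.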
Re: [PATCH v2 00/11] Add a percpu subsection for cache hot data
Posted by Brian Gerst 11 months, 2 weeks ago
On Wed, Feb 26, 2025 at 3:23 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> A slightly more useful variant of your listing would be:
>
> $ readelf -Ws defconfig-build/vmlinux | sort -k 2 | awk 'BEGIN {p=0} /__per_cpu_hot_start/ {p=1} { if (p) print $2 " " $3 " " $8 } /__per_cpu_hot_end/ {p=0}'
> ffffffff834f5000 0 __per_cpu_hot_start
> ffffffff834f5000 8 hardirq_stack_ptr
> ffffffff834f5008 0 __ref_stack_chk_guard
> ffffffff834f5008 8 __stack_chk_guard
> ffffffff834f5010 0 const_cpu_current_top_of_stack
> ffffffff834f5010 8 cpu_current_top_of_stack
> ffffffff834f5018 0 const_current_task
> ffffffff834f5018 8 current_task
> ffffffff834f5020 8 __x86_call_depth
> ffffffff834f5028 8 this_cpu_off
> ffffffff834f5030 4 __preempt_count
> ffffffff834f5034 4 cpu_number
> ffffffff834f5038 2 __softirq_pending
> ffffffff834f503a 1 hardirq_stack_inuse
> ffffffff834f5040 0 __per_cpu_hot_end
>
> as it also gets the size for each symbol, allowing us to compute the
> hole as 0x40-0x3b, or 5 bytes.

If all the variables in this section are scalar or pointer types,
SORT_BY_ALIGNMENT() should result in no padding between them.  I can
add a __per_cpu_hot_pad symbol to show the actual end of the data
(not aligned to the next cacheline like __per_cpu_hot_end).


Brian Gerst
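
The no-padding claim for SORT_BY_ALIGNMENT() can be checked with a small
model. This is a rough sketch, not the linker's actual algorithm: it
assumes each variable is placed independently, that its alignment equals
its size (natural alignment for scalars and pointers on x86-64), and
that sorting is from largest to smallest alignment. The sizes are taken
from the readelf listing earlier in the thread; pack() and HOT_VARS are
names invented for this example.

```python
# (name, size) pairs; alignment is assumed to equal size.
# Deliberately listed smallest-first to show the effect of sorting.
HOT_VARS = [
    ("hardirq_stack_inuse", 1),
    ("__softirq_pending", 2),
    ("cpu_number", 4),
    ("__preempt_count", 4),
    ("this_cpu_off", 8),
    ("__x86_call_depth", 8),
    ("current_task", 8),
    ("cpu_current_top_of_stack", 8),
    ("__stack_chk_guard", 8),
    ("hardirq_stack_ptr", 8),
]


def pack(variables, sort_by_alignment=True):
    """Lay variables out in order; return (end_offset, padding_bytes)."""
    if sort_by_alignment:
        # Largest alignment first, as in the assumed linker behaviour.
        variables = sorted(variables, key=lambda v: v[1], reverse=True)
    offset = 0
    padding = 0
    for _name, size in variables:
        aligned = (offset + size - 1) // size * size  # round up to alignment
        padding += aligned - offset
        offset = aligned + size
    return offset, padding


print(pack(HOT_VARS))                           # sorted by alignment
print(pack(HOT_VARS, sort_by_alignment=False))  # in the order listed
```

Under these assumptions the sorted layout ends at offset 0x3b with no
internal padding, leaving the 5 spare bytes at the end of the cacheline,
while the smallest-first order spends those 5 bytes on padding between
variables instead.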
Re: [PATCH v2 00/11] Add a percpu subsection for cache hot data
Posted by Brian Gerst 11 months, 2 weeks ago
On Wed, Feb 26, 2025 at 8:29 PM Brian Gerst <brgerst@gmail.com> wrote:
>
> If all the variables in this section are scalar or pointer types,
> SORT_BY_ALIGNMENT() should result in no padding between them.  I can
> add a __per_cpu_hot_pad symbol to show the actual end of the data
> (not aligned to the next cacheline like __per_cpu_hot_end).

Is this better? (from System.map)

ffffffff834f5000 D __per_cpu_hot_start
ffffffff834f5000 D hardirq_stack_ptr
ffffffff834f5008 D __ref_stack_chk_guard
ffffffff834f5008 D __stack_chk_guard
ffffffff834f5010 D const_cpu_current_top_of_stack
ffffffff834f5010 D cpu_current_top_of_stack
ffffffff834f5018 D const_current_task
ffffffff834f5018 D current_task
ffffffff834f5020 D __x86_call_depth
ffffffff834f5028 D this_cpu_off
ffffffff834f5030 D __preempt_count
ffffffff834f5034 D cpu_number
ffffffff834f5038 D __softirq_pending
ffffffff834f503a D hardirq_stack_inuse
ffffffff834f503b D __per_cpu_hot_pad
ffffffff834f5040 D __per_cpu_hot_end


Brian Gerst