[Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf

Dapeng Mi posted 23 patches 1 week, 3 days ago
[Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf
Posted by Dapeng Mi 1 week, 3 days ago
Patch layout:
- Patches 1-6: Bug fixes and cleanup needed before enabling XSAVES-based
  sampling in NMI context
- Patches 7-9: FPU-related preparation, including xsaves_nmi() and
  related cleanup/optimization
- Patches 10-11: PMI-based XMM sampling support through the existing
  sample_regs_intr/sample_regs_user interfaces for both
  PERF_SAMPLE_REGS_INTR and PERF_SAMPLE_REGS_USER
- Patches 12-19: New SIMD register interface and support for
  XMM/YMM/ZMM/OPMASK, APX eGPRs, and SSP through that interface
- Patch 20: Extend arch PEBS to support YMM/ZMM/OPMASK, APX eGPRs, and
  SSP with the new interface
- Patch 21: Enable new interface-based sampling
- Patches 22-23: arch PEBS bug fix and sanity check

Changes since V7:
- Validate the return value of intel_pmu_init_hybrid() (Patch 01/23).
- Replace pt_regs with x86_perf_regs in xen_pmu_irq_handler()
  (Patch 06/23).
- Improve event_has_extended_regs() (Patch 09/23).
- Explicitly ensure the allocated XSAVE area is 64-byte aligned
  (Patch 10/23, Sashiko).
- Clear the SIMD register pointers in x86_user_regs to avoid exposing
  stale register data to user space (Patch 11/23, Sashiko).
- Refine the SIMD register interface and sample data layout, and add the
  missing SIMD data reservation in perf_prepare_sample() for non-x86
  architectures (Patch 12/23, Sashiko).
- Improve perf_simd_reg_validate() for x86 (Patch 13/23, Sashiko).
- Refine SSP sampling and ensure the GPR sub-group flag is set for PEBS
  (Patch 19/23, Sashiko).
- Fix the incorrect large-PEBS check for XMM (Patch 20/23, Sashiko).
- Fix missing handling in x86_pmu_handle_guest_pebs() for back-to-back
  PMI detection (Patch 22/23, Sashiko).
- Strengthen the PEBS record header sanity checks to prevent invalid
  memory access (Patch 23/23, Sashiko).

Changes since V6:
- Fix a potential overwrite issue in the hybrid PMU structure (patch 01/24)
- Restrict PEBS events to GP counters if no PEBS baseline is suggested
  (patch 02/24)
- Use per-cpu x86_intr_regs for perf_event_nmi_handler() instead of a
  temporary variable (patch 06/24)
- Add helper update_fpu_state_and_flag() to ensure TIF_NEED_FPU_LOAD is
  set after save_fpregs_to_fpstate() call (patch 09/24)
- Optimize and simplify x86_pmu_sample_xregs(), etc. (patch 11/24)
- Add macro word_for_each_set_bit() to simplify u64 set-bit iteration
  (patch 13/24)
- Add sanity check for PEBS fragment size (patch 24/24)

Changes since V5:
- Introduce 3 commits to fix newly found PEBS issues (Patch 01~03/19)
- Address Peter's comments, including:
  * Fully support user-regs sampling of the SIMD/eGPRs/SSP registers
  * Adjust newly added fields in perf_event_attr to avoid holes
  * Fix the endian issue introduced by for_each_set_bit() in
    event/core.c
  * Remove some unnecessary macros from UAPI header perf_regs.h
  * Enhance b2b NMI detection for all PEBS handlers so that they all
    behave identically
- Split out the perf-tools patches, which will be posted in a separate
  patchset later

Changes since V4:
- Rewrite some function comments and commit messages (Dave)
- Add arch-PEBS based SIMD/eGPRs/SSP sampling support (Patch 15/19)
- Fix the "suspicious NMI" warning observed on PTL/NVL P-core and DMR by
  activating the back-to-back NMI detection mechanism (Patch 16/19)
- Fix some minor issues on perf-tool patches (Patch 18/19)

Changes since V3:
- Drop the SIMD registers if an NMI hits kernel mode for REGS_USER.
- Only dump the available regs, rather than zero and dump the
  unavailable regs. It's possible that the dumped registers are a subset
  of the requested registers.
- Some minor updates to address Dapeng's comments in V3.

Changes since V2:
- Use the FPU format for the x86_pmu.ext_regs_mask as well
- Add a check before invoking xsaves_nmi()
- Add perf_simd_reg_check() to retrieve the number of available
  registers. If the kernel fails to get the requested registers, e.g.,
  XSAVES fails, nothing is dumped to userspace (V2 dumped all zeros).
- Add POC perf tool patches

Changes since V1:
- Apply the new interfaces to configure and dump the SIMD registers
- Utilize the existing FPU functions, e.g., xstate_calculate_size,
  get_xsave_addr().


This series adds support on x86 for sampling SIMD registers, APX eGPRs,
and SSP with both PMI-based and PEBS-based sampling.

Starting with Intel Ice Lake, PEBS can sample XMM registers, but PMI-based
XMM sampling is still not available. On newer Intel platforms with
architectural PEBS support, such as Clearwater Forest and Diamond Rapids,
the hardware also gains support for sampling additional SIMD state
(XMM/YMM/ZMM/OPMASK), APX extended GPRs, and SSP.

To support these registers consistently across both PMI and PEBS, this
series makes the following changes:

1. Adds a new perf_event_attr interface for SIMD register selection.
   The existing sample_regs_user/sample_regs_intr bitmaps do not have
   enough space to represent the full SIMD register set, so this series
   introduces dedicated fields for SIMD and predicate register masks and
   element widths.

2. Introduces a new sample data layout for SIMD register data.
   SIMD register payload is appended after the GPR payload, and a new ABI
   flag, PERF_SAMPLE_REGS_ABI_SIMD, indicates its presence.

3. Adds xsaves_nmi() to allow SIMD/eGPR/SSP sampling from PMI handlers in
   NMI context.

4. Extends the arch PEBS path to support YMM/ZMM/OPMASK, APX eGPRs, and
   SSP sampling.


New perf_event_attr fields
--------------------------

This series adds the following fields to perf_event_attr:

    /*
     * Defines the sampling SIMD/PRED(predicate) register bitmaps and
     * qword (8-byte) lengths.
     *
     * sample_simd_regs_enabled != 0 indicates SIMD/PRED registers are
     * requested. The register bitmaps and element sizes are described by:
     *
     *   sample_simd_{vec,pred}_reg_{intr,user}
     *   sample_simd_{vec,pred}_reg_qwords
     *
     * sample_simd_regs_enabled == 0 indicates no SIMD/PRED registers are
     * requested.
     */
    __u16 sample_simd_regs_enabled;
    __u16 sample_simd_pred_reg_qwords;
    __u16 sample_simd_vec_reg_qwords;
    __u16 __reserved_4;

    __u32 sample_simd_pred_reg_intr;
    __u32 sample_simd_pred_reg_user;
    __u64 sample_simd_vec_reg_intr;
    __u64 sample_simd_vec_reg_user;

Field semantics:
- sample_simd_vec_reg_qwords: qword count for regular SIMD registers
- sample_simd_pred_reg_qwords: qword count for predicate registers
- sample_simd_vec_reg_{intr,user}: SIMD register masks for
  PERF_SAMPLE_REGS_INTR and PERF_SAMPLE_REGS_USER
- sample_simd_pred_reg_{intr,user}: predicate register masks for
  PERF_SAMPLE_REGS_INTR and PERF_SAMPLE_REGS_USER
- sample_simd_regs_enabled: indicates whether the new SIMD fields are in use

Examples:

To sample ZMM registers for PERF_SAMPLE_REGS_INTR:

    sample_simd_regs_enabled = 1
    sample_simd_vec_reg_qwords = 8          // 512 bits = 8 qwords
    sample_simd_vec_reg_intr = 0xffffffff   // zmm0-zmm31

To sample OPMASK registers for PERF_SAMPLE_REGS_USER:

    sample_simd_regs_enabled = 1
    sample_simd_pred_reg_qwords = 1         // 64 bits = 1 qword
    sample_simd_pred_reg_user = 0xff        // opmask0-opmask7
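
The two settings above can be sketched in C as follows. This is only an
illustration: struct simd_sample_req is a hypothetical stand-in that
mirrors the new fields in the order listed above (a real user would set
the same-named fields in struct perf_event_attr from a uapi header that
carries this series), and low_mask(), request_vec_intr() and
request_pred_user() are made-up helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the new perf_event_attr fields. It is NOT
 * the kernel structure; it only mirrors the field names and order.
 */
struct simd_sample_req {
	uint16_t sample_simd_regs_enabled;
	uint16_t sample_simd_pred_reg_qwords;
	uint16_t sample_simd_vec_reg_qwords;
	uint16_t reserved_4;

	uint32_t sample_simd_pred_reg_intr;
	uint32_t sample_simd_pred_reg_user;
	uint64_t sample_simd_vec_reg_intr;
	uint64_t sample_simd_vec_reg_user;
};

/* 4 x u16 + 2 x u32 + 2 x u64 pack into 32 bytes without padding. */
_Static_assert(sizeof(struct simd_sample_req) == 32, "unexpected padding");

/* Mask selecting registers 0 .. nr-1 (nr may be 0 .. 64). */
static uint64_t low_mask(unsigned int nr)
{
	assert(nr <= 64);
	return nr == 64 ? UINT64_MAX : (UINT64_C(1) << nr) - 1;
}

/* Request the first nr vector registers, width_bits wide, for REGS_INTR. */
static void request_vec_intr(struct simd_sample_req *req,
			     unsigned int nr, unsigned int width_bits)
{
	assert(width_bits % 64 == 0);
	req->sample_simd_regs_enabled = 1;
	req->sample_simd_vec_reg_qwords = (uint16_t)(width_bits / 64);
	req->sample_simd_vec_reg_intr = low_mask(nr);
}

/* Request the first nr predicate registers, width_bits wide, for REGS_USER. */
static void request_pred_user(struct simd_sample_req *req,
			      unsigned int nr, unsigned int width_bits)
{
	assert(nr <= 32 && width_bits % 64 == 0);
	req->sample_simd_regs_enabled = 1;
	req->sample_simd_pred_reg_qwords = (uint16_t)(width_bits / 64);
	req->sample_simd_pred_reg_user = (uint32_t)low_mask(nr);
}
```

With these helpers, request_vec_intr(&req, 32, 512) yields the ZMM
settings shown above (8 qwords, mask 0xffffffff), and
request_pred_user(&req, 8, 64) yields the OPMASK settings (1 qword,
mask 0xff).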

After introducing these fields, bits [63:32] in sample_regs_user and
sample_regs_intr are reclaimed for APX eGPRs and SSP instead of the
previous XMM0-XMM15 encoding.

Discussion of the new SIMD register interface is available at:
https://lore.kernel.org/lkml/20250617081458.GI1613376@noisy.programming.kicks-ass.net/

Sample data layout
------------------

SIMD register data is appended after the GPR data.

For PERF_SAMPLE_REGS_USER:

    { u64 abi;                      // enum perf_sample_regs_abi
      u64 regs[weight(mask)];
      struct {
            u64 nr_vectors;         // 0 ... weight(sample_simd_vec_reg_user)
            u64 vector_qwords;      // 0 ... sample_simd_vec_reg_qwords
            u64 nr_pred;            // 0 ... weight(sample_simd_pred_reg_user)
            u64 pred_qwords;        // 0 ... sample_simd_pred_reg_qwords
            u64 data[nr_vectors * vector_qwords +
                     nr_pred * pred_qwords];
      } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
    }

For PERF_SAMPLE_REGS_INTR:

    { u64 abi;                      // enum perf_sample_regs_abi
      u64 regs[weight(mask)];
      struct {
            u64 nr_vectors;         // 0 ... weight(sample_simd_vec_reg_intr)
            u64 vector_qwords;      // 0 ... sample_simd_vec_reg_qwords
            u64 nr_pred;            // 0 ... weight(sample_simd_pred_reg_intr)
            u64 pred_qwords;        // 0 ... sample_simd_pred_reg_qwords
            u64 data[nr_vectors * vector_qwords +
                     nr_pred * pred_qwords];
      } && (abi & PERF_SAMPLE_REGS_ABI_SIMD)
    }

PERF_SAMPLE_REGS_ABI_SIMD indicates that SIMD register data is present.
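
A consumer could walk one such register block roughly as sketched below.
Several things here are assumptions made for illustration and are not
taken from the patches: the value of EXAMPLE_REGS_ABI_SIMD (the real
PERF_SAMPLE_REGS_ABI_SIMD comes from the patched uapi header), the names
regs_block_view and parse_regs_block(), and the ordering of vector data
before predicate data inside data[] (which is what the layout and the
perf report dump suggest). As in the existing perf ABI, no register
values follow when abi is 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed flag value, for illustration only. */
#define EXAMPLE_REGS_ABI_SIMD (UINT64_C(1) << 2)

struct regs_block_view {
	uint64_t abi;
	const uint64_t *gprs;		/* nr_gprs values */
	size_t nr_gprs;
	uint64_t nr_vectors, vector_qwords;
	uint64_t nr_pred, pred_qwords;
	const uint64_t *vec_data;	/* nr_vectors * vector_qwords values */
	const uint64_t *pred_data;	/* nr_pred * pred_qwords values */
};

/* weight(mask): number of set bits. */
static size_t weight64(uint64_t v)
{
	size_t n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/*
 * Parse one regs block of len u64 words. Returns the number of words
 * consumed, or 0 if the block is truncated.
 */
static size_t parse_regs_block(const uint64_t *buf, size_t len,
			       uint64_t gpr_mask, struct regs_block_view *out)
{
	size_t pos = 0, rem;

	assert(buf && out);
	memset(out, 0, sizeof(*out));
	if (len < 1)
		return 0;
	out->abi = buf[pos++];
	if (!out->abi)			/* ABI none: nothing else follows */
		return pos;

	out->nr_gprs = weight64(gpr_mask);
	if (len - pos < out->nr_gprs)
		return 0;
	out->gprs = buf + pos;
	pos += out->nr_gprs;

	if (!(out->abi & EXAMPLE_REGS_ABI_SIMD))
		return pos;

	if (len - pos < 4)
		return 0;
	out->nr_vectors = buf[pos++];
	out->vector_qwords = buf[pos++];
	out->nr_pred = buf[pos++];
	out->pred_qwords = buf[pos++];

	/* Bounds checks written to avoid overflow in the products. */
	rem = len - pos;
	if (out->vector_qwords && out->nr_vectors > rem / out->vector_qwords)
		return 0;
	rem -= (size_t)(out->nr_vectors * out->vector_qwords);
	if (out->pred_qwords && out->nr_pred > rem / out->pred_qwords)
		return 0;

	out->vec_data = buf + pos;
	pos += (size_t)(out->nr_vectors * out->vector_qwords);
	out->pred_data = buf + pos;
	pos += (size_t)(out->nr_pred * out->pred_qwords);
	return pos;
}
```

The parser takes the counts from the sample itself rather than from the
requested masks, because nr_vectors and nr_pred may be smaller than the
weight of the requested masks when only a subset of the registers is
available.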

The metadata fields are encoded as u64 to keep perf tool parsing and
cross-endian support straightforward.
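
One way to read the cross-endian point: since every field in the block,
metadata included, is a u64, a tool reading a file recorded on a machine
of the opposite endianness can byte-swap the whole block word by word
without knowing where the header ends and the data begins. The sketch
below illustrates that idea only; swap64() and swap_regs_block() are
hypothetical names, and this is not claimed to be how the perf tool
implements it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 64-bit word. */
static uint64_t swap64(uint64_t v)
{
	v = ((v & UINT64_C(0x00ff00ff00ff00ff)) << 8) |
	    ((v >> 8) & UINT64_C(0x00ff00ff00ff00ff));
	v = ((v & UINT64_C(0x0000ffff0000ffff)) << 16) |
	    ((v >> 16) & UINT64_C(0x0000ffff0000ffff));
	return (v << 32) | (v >> 32);
}

/*
 * Swap a whole regs block in place. No knowledge of the layout is
 * needed because abi, regs[], the four counters and data[] are all u64.
 */
static void swap_regs_block(uint64_t *words, size_t nr_words)
{
	assert(words || nr_words == 0);
	for (size_t i = 0; i < nr_words; i++)
		words[i] = swap64(words[i]);
}
```

Had the counters been narrower than u64, the swap would have needed to
know the width of each field before touching it.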

Example
-------

  $ perf record -I?
  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27
  R28 R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7

  $ perf record --user-regs=?
  available registers: AX BX CX DX SI DI BP SP IP FLAGS CS SS R8 R9 R10
  R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27
  R28 R29 R30 R31 SSP XMM0-15 YMM0-15 ZMM0-31 OPMASK0-7

  $ perf record -e branches:p \
        -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask \
        -c 100000 ./test
  $ perf report -D

  ...
  14027761992115 0xcf30 [0x8a8]: PERF_RECORD_SAMPLE(IP, 0x1): 29964/29964:
  0xffffffff9f085e24 period: 100000 addr: 0
  ... intr regs: mask 0x18001010003 ABI 64-bit
  .... AX    0xdffffc0000000000
  .... BX    0xffff8882297685e8
  .... R8    0x0000000000000000
  .... R16   0x0000000000000000
  .... R31   0x0000000000000000
  .... SSP   0x0000000000000000
  ... SIMD ABI nr_vectors 32 vector_qwords 8 nr_pred 8 pred_qwords 1
  .... ZMM[0][0] 0x616c2f656d6f682f
  .... ZMM[0][1] 0x696c2f7265737562
  ...
  .... ZMM[31][7] 0x0000000000000000
  .... OPMASK[0] 0x00000000fffffe00
  ....
  .... OPMASK[7] 0x0000000000000000
  ...
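
As a cross-check of the dump above, the intr regs mask 0x18001010003 can
be decoded into bit indices. The helper below is a hypothetical sketch
(mask_to_indices() is not a perf function). It finds six set bits, at
indices 0, 1, 16, 24, 39 and 40, which matches the six GPR values
printed (AX, BX, R8, R16, R31, SSP). Pairing each index with a register
name in that order is a reading of the dump, not something taken from
the uapi header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Store the indices of the set bits of mask in idx[] in ascending
 * order and return how many there are, i.e. weight(mask).
 */
static size_t mask_to_indices(uint64_t mask, unsigned int idx[64])
{
	size_t n = 0;

	assert(idx != NULL);
	for (unsigned int bit = 0; bit < 64; bit++)
		if (mask & (UINT64_C(1) << bit))
			idx[n++] = bit;
	return n;
}
```

The SIMD header in the same sample (nr_vectors 32, vector_qwords 8,
nr_pred 8, pred_qwords 1) implies 32 * 8 + 8 * 1 = 264 u64 values in
data[].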

Testing
-------

The following intr-regs, user-regs, and combined sampling tests were run
on DMR and NVL. The sampled register data was reported correctly and no
issues were observed.

  $ ./perf record -e branches:p \
        -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask -b -c 10000 sleep 1

  $ ./perf record -e branches \
        -Iax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask -b -c 10000 sleep 1

  $ ./perf record -e branches:p \
        --user-regs=ax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask \
        -b -c 10000 sleep 1

  $ ./perf record -e branches \
        --user-regs=ax,bx,r8,r16,r31,ssp,xmm,ymm,zmm,opmask \
        -b -c 10000 sleep 1

  $ ./perf record -e branches:p \
        -Ixmm,ymm,zmm,opmask \
        --user-regs=ax,bx,r8,r16,r31,ssp \
        -b -c 10000 sleep 1

  $ ./perf record -e branches:p \
        --user-regs=xmm,ymm,zmm,opmask \
        -Iax,bx,r8,r16,r31,ssp \
        -b -c 10000 sleep 1

  $ ./perf record -e branches:p \
        -Iax,bx,r9,r17,r30,ssp \
        --user-regs=ax,bx,r8,r16,r31,ssp \
        -b -c 10000 sleep 1

  $ ./perf record -e branches:p \
        -Ixmm,opmask --user-regs=zmm \
        -b -c 10000 taskset -c 0 sleep 1


History:
  v7: https://lore.kernel.org/all/20260324004118.3772171-1-dapeng1.mi@linux.intel.com/
  v6: https://lore.kernel.org/all/20260209072047.2180332-1-dapeng1.mi@linux.intel.com/
  v5: https://lore.kernel.org/all/20251203065500.2597594-1-dapeng1.mi@linux.intel.com/
  v4: https://lore.kernel.org/all/20250925061213.178796-1-dapeng1.mi@linux.intel.com/
  v3: https://lore.kernel.org/lkml/20250815213435.1702022-1-kan.liang@linux.intel.com/
  v2: https://lore.kernel.org/lkml/20250626195610.405379-1-kan.liang@linux.intel.com/
  v1: https://lore.kernel.org/lkml/20250613134943.3186517-1-kan.liang@linux.intel.com/

Dapeng Mi (19):
  perf/x86/intel: Validate return value of intel_pmu_init_hybrid()
  perf/x86: Move hybrid PMU initialization before x86_pmu_starting_cpu()
  perf/x86/intel: Enable large PEBS sampling for XMMs
  perf/x86/intel: Convert x86_perf_regs to per-cpu variables
  perf: Eliminate duplicate arch-specific functions definations
  perf/x86: Use x86_perf_regs in the x86 nmi handlers
  x86/fpu: Ensure TIF_NEED_FPU_LOAD is set after saving FPU state
  perf/x86: Enable XMM Register Sampling for Non-PEBS Events
  perf/x86: Enable XMM register sampling for REGS_USER case
  perf/x86: Support XMM sampling using sample_simd_vec_reg_* fields
  perf/x86: Support YMM sampling using sample_simd_vec_reg_* fields
  perf/x86: Support ZMM sampling using sample_simd_vec_reg_* fields
  perf/x86: Support OPMASK sampling using sample_simd_pred_reg_* fields
  perf: Enhance perf_reg_validate() with simd_enabled argument
  perf/x86: Support eGPRs sampling using sample_regs_* fields
  perf/x86: Support SSP sampling using sample_regs_* fields
  perf/x86/intel: Support arch-PEBS based SIMD/eGPRs/SSP sampling
  perf/x86: Activate back-to-back NMI detection for arch-PEBS induced
    NMIs
  perf/x86/intel: Add sanity check for PEBS fragment size

Kan Liang (4):
  x86/fpu/xstate: Add xsaves_nmi() helper
  perf: Move and enhance has_extended_regs() for arch-specific use
  perf: Add sampling support for SIMD registers
  perf/x86/intel: Enable PERF_PMU_CAP_SIMD_REGS capability

 arch/arm/kernel/perf_regs.c           |   8 +-
 arch/arm64/kernel/perf_regs.c         |   8 +-
 arch/csky/kernel/perf_regs.c          |   8 +-
 arch/loongarch/kernel/perf_regs.c     |   8 +-
 arch/mips/kernel/perf_regs.c          |   8 +-
 arch/parisc/kernel/perf_regs.c        |   8 +-
 arch/powerpc/perf/perf_regs.c         |   2 +-
 arch/riscv/kernel/perf_regs.c         |   8 +-
 arch/s390/kernel/perf_regs.c          |   2 +-
 arch/x86/events/core.c                | 415 +++++++++++++++++++++++++-
 arch/x86/events/intel/core.c          | 232 ++++++++++++--
 arch/x86/events/intel/ds.c            | 235 +++++++++++----
 arch/x86/events/perf_event.h          |  85 +++++-
 arch/x86/include/asm/fpu/sched.h      |   5 +-
 arch/x86/include/asm/fpu/xstate.h     |   3 +
 arch/x86/include/asm/msr-index.h      |   7 +
 arch/x86/include/asm/perf_event.h     |  35 ++-
 arch/x86/include/uapi/asm/perf_regs.h |  51 ++++
 arch/x86/kernel/fpu/core.c            |  27 +-
 arch/x86/kernel/fpu/xstate.c          |  25 +-
 arch/x86/kernel/perf_regs.c           | 163 ++++++++--
 arch/x86/xen/pmu.c                    |   5 +-
 include/linux/perf_event.h            |  19 ++
 include/linux/perf_regs.h             |  38 +--
 include/uapi/linux/perf_event.h       |  49 ++-
 kernel/events/core.c                  | 189 ++++++++++--
 26 files changed, 1418 insertions(+), 225 deletions(-)


base-commit: 66cc29745f2f5815482587bb9fbc1e8a3e6fcf00
-- 
2.34.1
Re: [Patch v8 00/23] Support SIMD/eGPRs/SSP registers sampling for perf
Posted by Mi, Dapeng 1 week, 3 days ago
The corresponding perf tools support is here.
https://lore.kernel.org/all/20260529082451.591783-1-dapeng1.mi@linux.intel.com/

Thanks.


On 5/29/2026 3:56 PM, Dapeng Mi wrote:
> [...]