[RFC PATCH 00/12] Support vector and more extended registers in perf

kan.liang@linux.intel.com posted 12 patches 10 months, 1 week ago
There is a newer version of this series
[RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by kan.liang@linux.intel.com 10 months, 1 week ago
From: Kan Liang <kan.liang@linux.intel.com>

Starting with Intel Ice Lake, the XMM registers can be collected in
a PEBS record. More registers, e.g., YMM, ZMM, OPMASK, SSP and APX, will
be added in the upcoming Architecture PEBS as well. But that requires
hardware support.

The patch set provides a software solution to mitigate the hardware
requirement. It utilizes the XSAVES instruction to retrieve the requested
registers in the overflow handler. The feature isn't limited to PEBS
events or specific platforms anymore.
The hardware solution (if available) is still preferred, since it has
lower overhead (especially with large PEBS) and is more accurate.

In theory, the solution should work for all x86 platforms. But I only
have newer Intel platforms to test on, so the patch set only enables the
feature for Intel Ice Lake and later platforms.

Open:
The new registers include YMM, ZMM, OPMASK, SSP, and APX.
The sample_regs_user/intr bitmaps have run out of bits. A new field in
struct perf_event_attr is required for the new registers.
There are several options, listed below, for the new field.

- Follow a similar format to XSAVES. Introduce the below fields to store
  the bitmap of the registers.
  struct perf_event_attr {
        ...
        __u64   sample_ext_regs_intr[2];
        __u64   sample_ext_regs_user[2];
        ...
  }
  Includes YMMH (16 bits), APX (16 bits), OPMASK (8 bits),
           ZMMH0-15 (16 bits), H16ZMM (16 bits), SSP
  For example, if a user wants YMM8, the perf tool needs to set the
  corresponding bits of XMM8 and YMMH8, and reconstruct the result.
  The method is similar to the existing method for
  sample_regs_user/intr, and matches the XSAVES format.
  The kernel doesn't need to do extra configuration or reconstruction.
  This is what the patch set implements.

- Similar to the above method, but the fields are bitmaps of the
  complete registers, e.g., YMM (16 bits), APX (16 bits),
  OPMASK (8 bits), ZMM (32 bits), SSP.
  The kernel needs to do extra configuration and reconstruction,
  which may bring extra overhead.

- Combine the XMM, YMM, and ZMM, so that all the registers fit into
  one u64 field.
        ...
        union {
                __u64 sample_ext_regs_intr;   //sample_ext_regs_user is similar
                struct {
                        __u32 vector_bitmap;
                        __u32 vector_type   : 3, //0b001 XMM 0b010 YMM 0b100 ZMM
                              apx_bitmap    : 16,
                              opmask_bitmap : 8,
                              ssp_bitmap    : 1,
                              reserved      : 4;
                };
        };
        ...
  For example, if YMM8-15 are required,
  vector_bitmap: 0x0000ff00
  vector_type: 0x2
  This method saves two __u64 in struct perf_event_attr.
  But it's not straightforward since it mixes the type and the bitmap.
  The kernel also needs to do extra configuration and reconstruction.
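
To make the first and third options more concrete, here is a minimal
user-space sketch. It is not code from the series. The helper names
(pack_ext_regs, ymm_from_halves) and the exact bit positions are
assumptions for illustration: the third option uses C bit-fields, whose
layout is compiler-defined, so the sketch uses explicit shifts and assumes
the fields are allocated from the low bits upward.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Type codes taken from the comment in the third option. */
enum vec_type { VEC_XMM = 0x1, VEC_YMM = 0x2, VEC_ZMM = 0x4 };

/*
 * Third option: pack everything into one u64.
 * Assumed layout (not confirmed by the series):
 *   bits 31:0   vector_bitmap
 *   bits 34:32  vector_type
 *   bits 50:35  apx_bitmap
 *   bits 58:51  opmask_bitmap
 *   bit  59     ssp_bitmap
 *   bits 63:60  reserved
 */
static uint64_t pack_ext_regs(uint32_t vector_bitmap, unsigned vector_type,
                              uint16_t apx_bitmap, uint8_t opmask_bitmap,
                              unsigned ssp)
{
    assert(vector_type <= 0x7 && ssp <= 1);
    return (uint64_t)vector_bitmap |
           ((uint64_t)vector_type << 32) |
           ((uint64_t)apx_bitmap << 35) |
           ((uint64_t)opmask_bitmap << 51) |
           ((uint64_t)ssp << 59);
}

/* A 256-bit register value stored as four u64, lowest qword first. */
struct ymm { uint64_t q[4]; };

/*
 * First option: the tool samples XMMn and YMMHn separately and rebuilds
 * the full YMMn value from the two 128-bit halves.
 */
static struct ymm ymm_from_halves(const uint64_t xmm[2], const uint64_t ymmh[2])
{
    struct ymm v;

    memcpy(&v.q[0], xmm, 2 * sizeof(uint64_t));   /* bits 127:0   */
    memcpy(&v.q[2], ymmh, 2 * sizeof(uint64_t));  /* bits 255:128 */
    return v;
}
```

With these assumptions, pack_ext_regs(0x0000ff00, VEC_YMM, 0, 0, 0)
encodes the YMM8-15 request from the example above.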

Please let me know if there are more ideas.

Thanks,
Kan



Kan Liang (12):
  perf/x86: Use x86_perf_regs in the x86 nmi handler
  perf/x86: Setup the regs data
  x86/fpu/xstate: Add xsaves_nmi
  perf: Move has_extended_regs() to header file
  perf/x86: Support XMM register for non-PEBS and REGS_USER
  perf: Support extension of sample_regs
  perf/x86: Add YMMH in extended regs
  perf/x86: Add APX in extended regs
  perf/x86: Add OPMASK in extended regs
  perf/x86: Add ZMM in extended regs
  perf/x86: Add SSP in extended regs
  perf/x86/intel: Support extended registers

 arch/arm/kernel/perf_regs.c           |   9 +-
 arch/arm64/kernel/perf_regs.c         |   9 +-
 arch/csky/kernel/perf_regs.c          |   9 +-
 arch/loongarch/kernel/perf_regs.c     |   8 +-
 arch/mips/kernel/perf_regs.c          |   9 +-
 arch/powerpc/perf/perf_regs.c         |   9 +-
 arch/riscv/kernel/perf_regs.c         |   8 +-
 arch/s390/kernel/perf_regs.c          |   9 +-
 arch/x86/events/core.c                | 226 ++++++++++++++++++++++++--
 arch/x86/events/intel/core.c          |  49 ++++++
 arch/x86/events/intel/ds.c            |  12 +-
 arch/x86/events/perf_event.h          |  58 +++++++
 arch/x86/include/asm/fpu/xstate.h     |   1 +
 arch/x86/include/asm/perf_event.h     |   6 +
 arch/x86/include/uapi/asm/perf_regs.h | 101 ++++++++++++
 arch/x86/kernel/fpu/xstate.c          |  22 +++
 arch/x86/kernel/perf_regs.c           |  85 +++++++++-
 include/linux/perf_event.h            |  23 +++
 include/linux/perf_regs.h             |  29 +++-
 include/uapi/linux/perf_event.h       |   8 +
 kernel/events/core.c                  |  63 +++++--
 21 files changed, 699 insertions(+), 54 deletions(-)

-- 
2.38.1
Re: [RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by Peter Zijlstra 10 months ago
On Fri, Jun 13, 2025 at 06:49:31AM -0700, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> Starting from the Intel Ice Lake, the XMM registers can be collected in
> a PEBS record. More registers, e.g., YMM, ZMM, OPMASK, SPP and APX, will
> be added in the upcoming Architecture PEBS as well. But it requires the
> hardware support.
> 
> The patch set provides a software solution to mitigate the hardware
> requirement. It utilizes the XSAVES command to retrieve the requested
> registers in the overflow handler. The feature isn't limited to the PEBS
> event or specific platforms anymore.
> The hardware solution (if available) is still preferred, since it has
> low overhead (especially with the large PEBS) and is more accurate.
> 
> In theory, the solution should work for all X86 platforms. But I only
> have newer Inter platforms to test. The patch set only enable the
> feature for Intel Ice Lake and later platforms.
> 
> Open:
> The new registers include YMM, ZMM, OPMASK, SSP, and APX.
> The sample_regs_user/intr has run out. A new field in the
> struct perf_event_attr is required for the registers.
> There could be several options as below for the new field.
> 
> - Follow a similar format to XSAVES. Introduce the below fields to store
>   the bitmap of the registers.
>   struct perf_event_attr {
>         ...
>         __u64   sample_ext_regs_intr[2];
>         __u64   sample_ext_regs_user[2];
>         ...
>   }
>   Includes YMMH (16 bits), APX (16 bits), OPMASK (8 bits),
>            ZMMH0-15 (16 bits), H16ZMM (16 bits), SSP
>   For example, if a user wants YMM8, the perf tool needs to set the
>   corresponding bits of XMM8 and YMMH8, and reconstruct the result.
>   The method is similar to the existing method for
>   sample_regs_user/intr, and match the XSAVES format.
>   The kernel doesn't need to do extra configuration and reconstruction.
>   It's implemented in the patch set.
> 
> - Similar to the above method. But the fields are the bitmap of the
>   complete registers, E.g., YMM (16 bits), APX (16 bits),
>   OPMASK (8 bits), ZMM (32 bits), SSP.
>   The kernel needs to do extra configuration and reconstruction,
>   which may brings extra overhead.
> 
> - Combine the XMM, YMM, and ZMM. So all the registers can be put into
>   one u64 field.
>         ...
>         union {
>                 __u64 sample_ext_regs_intr;   //sample_ext_regs_user is simiar
>                 struct {
>                         __u32 vector_bitmap;
>                         __u32 vector_type   : 3, //0b001 XMM 0b010 YMM 0b100 ZMM
>                               apx_bitmap    : 16,
>                               opmask_bitmap : 8,
>                               ssp_bitmap    : 1,
>                               reserved      : 4,
> 
>                 };
>         ...
>   For example, if the YMM8-15 is required,
>   vector_bitmap: 0x0000ff00
>   vector_type: 0x2
>   This method can save two __u64 in the struct perf_event_attr.
>   But it's not straightforward since it mixes the type and bitmap.
>   The kernel also needs to do extra configuration and reconstruction.
> 
> Please let me know if there are more ideas.

https://lkml.kernel.org/r/20250416155327.GD17910@noisy.programming.kicks-ass.net

comes to mind. Combine that with a rule that reclaims the XMM register
space from perf_event_x86_regs when sample_simd_reg_words != 0, and then
we can put APX and SSP there.
Re: [RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by Liang, Kan 10 months ago

On 2025-06-17 4:24 a.m., Peter Zijlstra wrote:
> On Fri, Jun 13, 2025 at 06:49:31AM -0700, kan.liang@linux.intel.com wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Starting from the Intel Ice Lake, the XMM registers can be collected in
>> a PEBS record. More registers, e.g., YMM, ZMM, OPMASK, SPP and APX, will
>> be added in the upcoming Architecture PEBS as well. But it requires the
>> hardware support.
>>
>> The patch set provides a software solution to mitigate the hardware
>> requirement. It utilizes the XSAVES command to retrieve the requested
>> registers in the overflow handler. The feature isn't limited to the PEBS
>> event or specific platforms anymore.
>> The hardware solution (if available) is still preferred, since it has
>> low overhead (especially with the large PEBS) and is more accurate.
>>
>> In theory, the solution should work for all X86 platforms. But I only
>> have newer Inter platforms to test. The patch set only enable the
>> feature for Intel Ice Lake and later platforms.
>>
>> Open:
>> The new registers include YMM, ZMM, OPMASK, SSP, and APX.
>> The sample_regs_user/intr has run out. A new field in the
>> struct perf_event_attr is required for the registers.
>> There could be several options as below for the new field.
>>
>> - Follow a similar format to XSAVES. Introduce the below fields to store
>>   the bitmap of the registers.
>>   struct perf_event_attr {
>>         ...
>>         __u64   sample_ext_regs_intr[2];
>>         __u64   sample_ext_regs_user[2];
>>         ...
>>   }
>>   Includes YMMH (16 bits), APX (16 bits), OPMASK (8 bits),
>>            ZMMH0-15 (16 bits), H16ZMM (16 bits), SSP
>>   For example, if a user wants YMM8, the perf tool needs to set the
>>   corresponding bits of XMM8 and YMMH8, and reconstruct the result.
>>   The method is similar to the existing method for
>>   sample_regs_user/intr, and match the XSAVES format.
>>   The kernel doesn't need to do extra configuration and reconstruction.
>>   It's implemented in the patch set.
>>
>> - Similar to the above method. But the fields are the bitmap of the
>>   complete registers, E.g., YMM (16 bits), APX (16 bits),
>>   OPMASK (8 bits), ZMM (32 bits), SSP.
>>   The kernel needs to do extra configuration and reconstruction,
>>   which may brings extra overhead.
>>
>> - Combine the XMM, YMM, and ZMM. So all the registers can be put into
>>   one u64 field.
>>         ...
>>         union {
>>                 __u64 sample_ext_regs_intr;   //sample_ext_regs_user is simiar
>>                 struct {
>>                         __u32 vector_bitmap;
>>                         __u32 vector_type   : 3, //0b001 XMM 0b010 YMM 0b100 ZMM
>>                               apx_bitmap    : 16,
>>                               opmask_bitmap : 8,
>>                               ssp_bitmap    : 1,
>>                               reserved      : 4,
>>
>>                 };
>>         ...
>>   For example, if the YMM8-15 is required,
>>   vector_bitmap: 0x0000ff00
>>   vector_type: 0x2
>>   This method can save two __u64 in the struct perf_event_attr.
>>   But it's not straightforward since it mixes the type and bitmap.
>>   The kernel also needs to do extra configuration and reconstruction.
>>
>> Please let me know if there are more ideas.
> 
> https://lkml.kernel.org/r/20250416155327.GD17910@noisy.programming.kicks-ass.net
>

It's similar to the third method, but uses the word count in place of the
type. There is also more space left in case we add more SIMD registers in
the future. I will implement it in V2.
> comes to mind. Combine that with a rule that reclaims the XMM register
> space from perf_event_x86_regs when sample_simd_reg_words != 0, and then
> we can put APX and SPP there.

OK. So sample_simd_reg_words actually has another meaning now. It's
used as a flag to tell whether to use the old format.

If so, I think it may be better to have a dedicated sample_simd_reg_flags
field.

For example,

#define SAMPLE_SIMD_FLAGS_FORMAT_LEGACY		0x0
#define SAMPLE_SIMD_FLAGS_FORMAT_WORDS		0x1

	__u8 sample_simd_reg_flags;
	__u8 sample_simd_reg_words;
	__u64 sample_simd_reg_intr;
	__u64 sample_simd_reg_user;

If sample_simd_reg_flags != 0, the XMM space is reclaimed for APX and SSP.

Does it make sense?

Thanks,
Kan
Re: [RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by Peter Zijlstra 10 months ago
On Tue, Jun 17, 2025 at 09:52:12AM -0400, Liang, Kan wrote:

> OK. So the sample_simd_reg_words actually has another meaning now.

Well, any simd field being non-zero means userspace knows about it. Sort
of an implicit flag.

> It's used as a flag to tell whether utilizing the old format.
> 
> If so, I think it may be better to have a dedicate sample_simd_reg_flag
> field.
> 
> For example,
> 
> #define SAMPLE_SIMD_FLAGS_FORMAT_LEGACY		0x0
> #define SAMPLE_SIMD_FLAGS_FORMAT_WORDS		0x1
> 
> 	__u8 sample_simd_reg_flags;
> 	__u8 sample_simd_reg_words;
> 	__u64 sample_simd_reg_intr;
> 	__u64 sample_simd_reg_user;
> 
> If (sample_simd_reg_flags != 0) reclaims the XMM space for APX and SPP.
> 
> Does it make sense?

Not sure, it eats up a whole byte. Dapeng seemed to favour separate
intr/user vector width (although I'm not quite sure what the use would
be).

If you want an explicit bit, we might as well use one from __reserved_1,
we still have some left.
Re: [RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by Liang, Kan 10 months ago

On 2025-06-17 10:29 a.m., Peter Zijlstra wrote:
> On Tue, Jun 17, 2025 at 09:52:12AM -0400, Liang, Kan wrote:
> 
>> OK. So the sample_simd_reg_words actually has another meaning now.
> 
> Well, any simd field being non-zero means userspace knows about it. Sort
> of an implicit flag.

Yes, but the tool probably wouldn't touch any simd fields if the user
doesn't ask for simd registers.

> 
>> It's used as a flag to tell whether utilizing the old format.
>>
>> If so, I think it may be better to have a dedicate sample_simd_reg_flag
>> field.
>>
>> For example,
>>
>> #define SAMPLE_SIMD_FLAGS_FORMAT_LEGACY		0x0
>> #define SAMPLE_SIMD_FLAGS_FORMAT_WORDS		0x1
>>
>> 	__u8 sample_simd_reg_flags;
>> 	__u8 sample_simd_reg_words;
>> 	__u64 sample_simd_reg_intr;
>> 	__u64 sample_simd_reg_user;
>>
>> If (sample_simd_reg_flags != 0) reclaims the XMM space for APX and SPP.
>>
>> Does it make sense?
> 
> Not sure, it eats up a whole byte. Dapeng seemed to favour separate
> intr/user vector width (although I'm not quite sure what the use would
> be).
> 
> If you want an explicit bit, we might as well use one from __reserved_1,
> we still have some left.

OK. I may add a sample_simd_reg : 1 bit to explicitly tell the kernel to
use the sample_simd_reg_XXX fields.

Thanks,
Kan
Re: [RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by Mi, Dapeng 10 months ago
On 6/17/2025 11:23 PM, Liang, Kan wrote:
>
> On 2025-06-17 10:29 a.m., Peter Zijlstra wrote:
>> On Tue, Jun 17, 2025 at 09:52:12AM -0400, Liang, Kan wrote:
>>
>>> OK. So the sample_simd_reg_words actually has another meaning now.
>> Well, any simd field being non-zero means userspace knows about it. Sort
>> of an implicit flag.
> Yes, but the tool probably wouldn't to touch any simd fields if user
> doesn't ask for simd registers
>
>>> It's used as a flag to tell whether utilizing the old format.
>>>
>>> If so, I think it may be better to have a dedicate sample_simd_reg_flag
>>> field.
>>>
>>> For example,
>>>
>>> #define SAMPLE_SIMD_FLAGS_FORMAT_LEGACY		0x0
>>> #define SAMPLE_SIMD_FLAGS_FORMAT_WORDS		0x1
>>>
>>> 	__u8 sample_simd_reg_flags;
>>> 	__u8 sample_simd_reg_words;
>>> 	__u64 sample_simd_reg_intr;
>>> 	__u64 sample_simd_reg_user;
>>>
>>> If (sample_simd_reg_flags != 0) reclaims the XMM space for APX and SPP.
>>>
>>> Does it make sense?

Not sure if I missed some discussion, but are these fields only intended
for SIMD regs? What about the APX extended GPRs? Suppose the APX eGPRs
reuse the legacy XMM bitmap in sample_regs_user/intr[47:32]; then we need
an extra flag to distinguish whether those bits mean XMM regs or APX eGPRs.
Maybe add an extra bit sample_egpr_reg : 1 in sample_simd_reg_words, but
then the *simd* word in the name would become ambiguous.


>> Not sure, it eats up a whole byte. Dapeng seemed to favour separate
>> intr/user vector width (although I'm not quite sure what the use would
>> be).

The reason I prefer adding 2 separate "words" items is that, in theory, a
user could sample interrupt and user space SIMD regs simultaneously but
with different bit-widths, like "--intr-regs=YMM0, --user-regs=XMM0".


>>
>> If you want an explicit bit, we might as well use one from __reserved_1,
>> we still have some left.
> OK. I may add a sample_simd_reg : 1 to explicitly tell kernel to utilize
> the sample_simd_reg_XXX.
>
> Thanks,
> Kan
Re: [RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by Liang, Kan 10 months ago

On 2025-06-17 8:57 p.m., Mi, Dapeng wrote:
> 
> On 6/17/2025 11:23 PM, Liang, Kan wrote:
>>
>> On 2025-06-17 10:29 a.m., Peter Zijlstra wrote:
>>> On Tue, Jun 17, 2025 at 09:52:12AM -0400, Liang, Kan wrote:
>>>
>>>> OK. So the sample_simd_reg_words actually has another meaning now.
>>> Well, any simd field being non-zero means userspace knows about it. Sort
>>> of an implicit flag.
>> Yes, but the tool probably wouldn't to touch any simd fields if user
>> doesn't ask for simd registers
>>
>>>> It's used as a flag to tell whether utilizing the old format.
>>>>
>>>> If so, I think it may be better to have a dedicate sample_simd_reg_flag
>>>> field.
>>>>
>>>> For example,
>>>>
>>>> #define SAMPLE_SIMD_FLAGS_FORMAT_LEGACY		0x0
>>>> #define SAMPLE_SIMD_FLAGS_FORMAT_WORDS		0x1
>>>>
>>>> 	__u8 sample_simd_reg_flags;
>>>> 	__u8 sample_simd_reg_words;
>>>> 	__u64 sample_simd_reg_intr;
>>>> 	__u64 sample_simd_reg_user;
>>>>
>>>> If (sample_simd_reg_flags != 0) reclaims the XMM space for APX and SPP.
>>>>
>>>> Does it make sense?
> 
> Not sure if I missed some discussion, but are these fields only intended
> for SIMD regs? What about the APX extended GPRs? Suppose APX eGPRs can
> reuse the legacy XMM bitmaps in sample_regs_user/intr[47:32], but we need
> an extra flag to distinguish it's XMM regs or APX eGPRs, maybe add an extra
> bit sample_egpr_reg : 1 in sample_simd_reg_words, but the *simd* word in
> the name would become ambiguous.

It can be used to explicitly tell the kernel that a new format is
expected. The new format means
- Put APX and SSP into sample_regs_user/intr[47:32]
- Use the sample_simd_reg_*

Alternatively, as Peter suggested, we can use the sample_simd_reg_words
to imply the new format.
If so, I will make it a union, for example.
	union {
		__u16 sample_reg_flags;
		__u16 sample_simd_reg_words;
	};

The first thing the tool should do is set sample_reg_flags = 1,
regardless of whether simd is requested.

> 
> 
>>> Not sure, it eats up a whole byte. Dapeng seemed to favour separate
>>> intr/user vector width (although I'm not quite sure what the use would
>>> be).
> 
> The reason that I prefer to add 2 separate "words" item is that user could
> sample interrupt and user space SIMD regs (but with different bit-width)
> simultaneously in theory, like "--intr-regs=YMM0, --user-regs=XMM0".

I'm not sure why the user would want a different bit-width.
"--user-regs=XMM0" doesn't seem to provide more useful information.

Anyway, I believe the tool can handle this case. The tool can always ask
for YMM0 for both --intr-regs and --user-regs, but only output XMM0 for
--user-regs. The only drawback is that the kernel may dump extra
information for --user-regs. I don't think it's a big problem.
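
The tool-side handling described above could look roughly like the sketch
below: the sample carries the wider registers and the tool copies out only
the low qwords it wants to show. The function name narrow_vectors and the
buffer layout (registers stored back to back, lowest qword first) are
assumptions for illustration, since the sample format was still under
discussion at this point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy the low want_qwords of each sampled vector register into out.
 * data holds nr_vectors registers of have_qwords u64 each, back to back
 * (assumed layout). Returns the number of u64 written to out.
 */
static size_t narrow_vectors(const uint64_t *data, size_t nr_vectors,
                             size_t have_qwords, size_t want_qwords,
                             uint64_t *out)
{
    assert(want_qwords <= have_qwords);
    for (size_t i = 0; i < nr_vectors; i++)
        memcpy(out + i * want_qwords, data + i * have_qwords,
               want_qwords * sizeof(uint64_t));
    return nr_vectors * want_qwords;
}
```

For the YMM0/XMM0 case, have_qwords would be 4 and want_qwords 2, so only
the low 128 bits of each register are kept for display.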

Thanks,
Kan 
> 
> 
>>>
>>> If you want an explicit bit, we might as well use one from __reserved_1,
>>> we still have some left.
>> OK. I may add a sample_simd_reg : 1 to explicitly tell kernel to utilize
>> the sample_simd_reg_XXX.
>>
>> Thanks,
>> Kan

Re: [RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by Mi, Dapeng 10 months ago
On 6/18/2025 6:47 PM, Liang, Kan wrote:
>
> On 2025-06-17 8:57 p.m., Mi, Dapeng wrote:
>> On 6/17/2025 11:23 PM, Liang, Kan wrote:
>>> On 2025-06-17 10:29 a.m., Peter Zijlstra wrote:
>>>> On Tue, Jun 17, 2025 at 09:52:12AM -0400, Liang, Kan wrote:
>>>>
>>>>> OK. So the sample_simd_reg_words actually has another meaning now.
>>>> Well, any simd field being non-zero means userspace knows about it. Sort
>>>> of an implicit flag.
>>> Yes, but the tool probably wouldn't to touch any simd fields if user
>>> doesn't ask for simd registers
>>>
>>>>> It's used as a flag to tell whether utilizing the old format.
>>>>>
>>>>> If so, I think it may be better to have a dedicate sample_simd_reg_flag
>>>>> field.
>>>>>
>>>>> For example,
>>>>>
>>>>> #define SAMPLE_SIMD_FLAGS_FORMAT_LEGACY		0x0
>>>>> #define SAMPLE_SIMD_FLAGS_FORMAT_WORDS		0x1
>>>>>
>>>>> 	__u8 sample_simd_reg_flags;
>>>>> 	__u8 sample_simd_reg_words;
>>>>> 	__u64 sample_simd_reg_intr;
>>>>> 	__u64 sample_simd_reg_user;
>>>>>
>>>>> If (sample_simd_reg_flags != 0) reclaims the XMM space for APX and SPP.
>>>>>
>>>>> Does it make sense?
>> Not sure if I missed some discussion, but are these fields only intended
>> for SIMD regs? What about the APX extended GPRs? Suppose APX eGPRs can
>> reuse the legacy XMM bitmaps in sample_regs_user/intr[47:32], but we need
>> an extra flag to distinguish it's XMM regs or APX eGPRs, maybe add an extra
>> bit sample_egpr_reg : 1 in sample_simd_reg_words, but the *simd* word in
>> the name would become ambiguous.
> It can be used to explicitly tell the kernel that a new format is
> expected. The new format means
> - Put APX and SPP into sample_regs_user/intr[47:32]
> - Use the sample_simd_reg_*
>
> Alternatively, as Peter suggested, we can use the sample_simd_reg_words
> to imply the new format.
> If so, I will make it an union, for example.
> 	union {
> 		__u16 sample_reg_flags;
> 		__u16 sample_simd_reg_words;
> 	};
>
> The first thing the tool does should be to set sample_reg_flags = 1,
> regardless of whether simd is requested.

So just to double-check: as long as sample_reg_flags
(sample_simd_reg_words) > 0, the new format below would be used.

    sample_regs_user/intr[31:0] bits are unchanged and still represent the
    original GPRs.

    sample_regs_user/intr[47:32] bits represent APX eGPRs R31 - R16.

    The SIMD regs, including the XMM regs, are represented by the
    dedicated SIMD regs structure (or regs bitmap and regs word length).

If sample_reg_flags (sample_simd_reg_words) == 0, then it falls back to
the current format.

    sample_regs_user/intr[31:0] bits represent the original GPRs.

    sample_regs_user/intr[63:32] bits represent XMM regs.

If so, I think it's fine. The new format looks more reasonable than the
current one.
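
The two interpretations of the bitmap can be written down as a small
decoder. This is only a sketch of the summary above, not kernel code. The
names (reg_class, classify_reg_bit, egpr_number) are made up for
illustration. The mapping of bit 32 to R16 and the treatment of bits 63:48
in the new format are assumptions, because the summary does not say where
those bits (or SSP) end up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum reg_class { REG_GPR, REG_XMM, REG_APX_EGPR, REG_UNSPECIFIED };

/*
 * Classify bit `bit` (0..63) of sample_regs_user/intr.
 * new_format corresponds to sample_reg_flags (sample_simd_reg_words) > 0.
 */
static enum reg_class classify_reg_bit(unsigned bit, bool new_format)
{
    assert(bit < 64);
    if (bit < 32)
        return REG_GPR;          /* same meaning in both formats */
    if (!new_format)
        return REG_XMM;          /* current format: bits 63:32 are XMM */
    if (bit < 48)
        return REG_APX_EGPR;     /* new format: bits 47:32 are R16-R31 */
    return REG_UNSPECIFIED;      /* bits 63:48: not covered by the summary */
}

/* Assumed mapping: bit 32 is R16 and bit 47 is R31. */
static unsigned egpr_number(unsigned bit)
{
    assert(bit >= 32 && bit < 48);
    return 16 + (bit - 32);
}
```

The point of the sketch is that the same bit (say, bit 40) names an XMM
qword in the current format and an eGPR in the new one, which is why the
kernel needs an unambiguous signal for which format userspace means.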


>
>>
>>>> Not sure, it eats up a whole byte. Dapeng seemed to favour separate
>>>> intr/user vector width (although I'm not quite sure what the use would
>>>> be).
>> The reason that I prefer to add 2 separate "words" item is that user could
>> sample interrupt and user space SIMD regs (but with different bit-width)
>> simultaneously in theory, like "--intr-regs=YMM0, --user-regs=XMM0".
> I'm not sure why the user wants a different bit-width. The
> --user-regs=XMM0" doesn't seem to provide more useful information.
>
> Anyway, I believe the tool can handle this case. The tool can always ask
> YMM0 for both --intr-regs and --user-regs, but only output the XMM0 for
> --user-regs. The only drawback is that the kernel may dump extra
> information for the --user-regs. I don't think it's a big problem.
If we intend to handle it in the user space tool, I'm not sure whether the
tool can easily know which records are from user space and filter out the
SIMD regs from kernel space, or how complicated the change would be. IMO,
adding an extra u16 "words" would be much easier and won't consume too much
memory.


>
> Thanks,
> Kan 
>>
>>>> If you want an explicit bit, we might as well use one from __reserved_1,
>>>> we still have some left.
>>> OK. I may add a sample_simd_reg : 1 to explicitly tell kernel to utilize
>>> the sample_simd_reg_XXX.
>>>
>>> Thanks,
>>> Kan
Re: [RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by Liang, Kan 10 months ago

On 2025-06-18 8:28 a.m., Mi, Dapeng wrote:
>>>>> Not sure, it eats up a whole byte. Dapeng seemed to favour separate
>>>>> intr/user vector width (although I'm not quite sure what the use would
>>>>> be).
>>> The reason that I prefer to add 2 separate "words" item is that user could
>>> sample interrupt and user space SIMD regs (but with different bit-width)
>>> simultaneously in theory, like "--intr-regs=YMM0, --user-regs=XMM0".
>> I'm not sure why the user wants a different bit-width. The
>> --user-regs=XMM0" doesn't seem to provide more useful information.
>>
>> Anyway, I believe the tool can handle this case. The tool can always ask
>> YMM0 for both --intr-regs and --user-regs, but only output the XMM0 for
>> --user-regs. The only drawback is that the kernel may dump extra
>> information for the --user-regs. I don't think it's a big problem.
> If we intent to handle it in user space tools, I'm not sure if user space
> tool can easily know which records are from user space and filter out the
> SIMD regs from kernel space and how complicated would the change be. IMO,
> adding an extra u16 "words" would be much easier and won't consume too much
> memory.

The filtering for --user-regs is always done in the kernel. The only
difference is that the YMM registers (after filtering) will be dumped to
perf.data. The tool just shows the XMM registers to the end user.

Thanks,
Kan
Re: [RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by Mi, Dapeng 10 months ago
On 6/18/2025 9:15 PM, Liang, Kan wrote:
>
> On 2025-06-18 8:28 a.m., Mi, Dapeng wrote:
>>>>>> Not sure, it eats up a whole byte. Dapeng seemed to favour separate
>>>>>> intr/user vector width (although I'm not quite sure what the use would
>>>>>> be).
>>>> The reason that I prefer to add 2 separate "words" item is that user could
>>>> sample interrupt and user space SIMD regs (but with different bit-width)
>>>> simultaneously in theory, like "--intr-regs=YMM0, --user-regs=XMM0".
>>> I'm not sure why the user wants a different bit-width. The
>>> --user-regs=XMM0" doesn't seem to provide more useful information.
>>>
>>> Anyway, I believe the tool can handle this case. The tool can always ask
>>> YMM0 for both --intr-regs and --user-regs, but only output the XMM0 for
>>> --user-regs. The only drawback is that the kernel may dump extra
>>> information for the --user-regs. I don't think it's a big problem.
>> If we intent to handle it in user space tools, I'm not sure if user space
>> tool can easily know which records are from user space and filter out the
>> SIMD regs from kernel space and how complicated would the change be. IMO,
>> adding an extra u16 "words" would be much easier and won't consume too much
>> memory.
> The filter is always done in kernel for --user-regs. The only difference
> is that the YMM (after filter) will be dumped to the perf.data. The tool
> just show the XMM registers to the end user.

Ok. But there could be another case: a user may want to sample some APX
eGPRs in user space and SIMD regs in interrupt context, like
"--intr-regs=YMM0, --user-regs=R16"; then we would have to define 2
separate "words" fields.

Anyway, it looks like we would define a SIMD_REGS structure like the one
below, and I suppose we would create 2 instances, one for interrupt and the
other for user space. That's enough.

PERF_SAMPLE_SIMD_REGS := {
	u16 nr_vectors;
	u16 vector_length;
	u16 nr_pred;
	u16 pred_length;
	u64 data[];
}


>
> Thanks,
> Kan
>
>
Re: [RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by Liang, Kan 10 months ago

On 2025-06-18 8:41 p.m., Mi, Dapeng wrote:
> 
> On 6/18/2025 9:15 PM, Liang, Kan wrote:
>>
>> On 2025-06-18 8:28 a.m., Mi, Dapeng wrote:
>>>>>>> Not sure, it eats up a whole byte. Dapeng seemed to favour separate
>>>>>>> intr/user vector width (although I'm not quite sure what the use would
>>>>>>> be).
>>>>> The reason that I prefer to add 2 separate "words" item is that user could
>>>>> sample interrupt and user space SIMD regs (but with different bit-width)
>>>>> simultaneously in theory, like "--intr-regs=YMM0, --user-regs=XMM0".
>>>> I'm not sure why the user wants a different bit-width. The
>>>> --user-regs=XMM0" doesn't seem to provide more useful information.
>>>>
>>>> Anyway, I believe the tool can handle this case. The tool can always ask
>>>> YMM0 for both --intr-regs and --user-regs, but only output the XMM0 for
>>>> --user-regs. The only drawback is that the kernel may dump extra
>>>> information for the --user-regs. I don't think it's a big problem.
>>> If we intent to handle it in user space tools, I'm not sure if user space
>>> tool can easily know which records are from user space and filter out the
>>> SIMD regs from kernel space and how complicated would the change be. IMO,
>>> adding an extra u16 "words" would be much easier and won't consume too much
>>> memory.
>> The filter is always done in kernel for --user-regs. The only difference
>> is that the YMM (after filter) will be dumped to the perf.data. The tool
>> just show the XMM registers to the end user.
> 
> Ok. But there could be another case, user may want to sample some APX eGPRs
> in user space and sample SIMD regs in interrupt, like "--intr-regs=YMM0,
> --user-regs=R16", then we have to define 2 separate "words" fields.
> 

Not for eGPRs. They use the regular GP regs space, which implies u64 for
a 64b kernel. The "words" field is only for vector and predicate registers.

I've started working on V2. The new interface would be as below.

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 78a362b80027..f7b8971fa99d 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -382,6 +382,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER6			120	/* Add: aux_sample_size */
 #define PERF_ATTR_SIZE_VER7			128	/* Add: sig_data */
 #define PERF_ATTR_SIZE_VER8			136	/* Add: config3 */
+#define PERF_ATTR_SIZE_VER9			184	/* Add: sample_simd_regs */

 /*
  * 'struct perf_event_attr' contains various attributes that define
@@ -543,6 +544,24 @@ struct perf_event_attr {
 	__u64	sig_data;

 	__u64	config3; /* extension of config2 */
+
+
+	/*
+	 * Defines set of SIMD registers to dump on samples.
+	 * The sample_simd_reg_enabled !=0 implies the
+	 * set of SIMD registers is used to config all SIMD registers.
+	 * If !sample_simd_reg_enabled, sample_regs_XXX may be used to
+	 * config some SIMD registers on X86.
+	 */
+	union {
+		__u16 sample_simd_reg_enabled;
+		__u16 sample_simd_pred_reg_qwords;
+	};
+	__u16 sample_simd_pred_reg_intr;
+	__u16 sample_simd_pred_reg_user;
+	__u16 sample_simd_reg_qwords;
+	__u64 sample_simd_reg_intr;
+	__u64 sample_simd_reg_user;
 };

 /*
@@ -1016,7 +1035,15 @@ enum perf_event_type {
 	 *      } && PERF_SAMPLE_BRANCH_STACK
 	 *
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *	  	u16 nr_vectors;
+	 *	  	u16 vector_qwords;
+	 *	  	u16 nr_pred;
+	 *	  	u16 pred_qwords;
+	 *	  	u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && sample_simd_reg_enabled
+	 *	} && PERF_SAMPLE_REGS_USER
 	 *
 	 *	{ u64			size;
 	 *	  char			data[size];
@@ -1043,7 +1070,15 @@ enum perf_event_type {
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
 	 *	{ u64			abi; # enum perf_sample_regs_abi
-	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
+	 *	  u64			regs[weight(mask)];
+	 *	  struct {
+	 *	  	u16 nr_vectors;
+	 *	  	u16 vector_qwords;
+	 *	  	u16 nr_pred;
+	 *	  	u16 pred_qwords;
+	 *	  	u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
+	 *	  } && sample_simd_reg_enabled
+	 *	} && PERF_SAMPLE_REGS_INTR
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
 	 *	{ u64			cgroup;} && PERF_SAMPLE_CGROUP
 	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE


Thanks,
Kan
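
For illustration only, and not part of the posted patch: a minimal C sketch
of how a tool might walk the SIMD block described in the sample format
above (four u16 header fields followed by the vector qwords and then the
predicate qwords). The struct simd_regs_view type and the parse_simd_regs()
helper are hypothetical names, and the sketch assumes the block starts on an
8-byte boundary like the rest of the sample record.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of the proposed SIMD block; not a kernel or perf type. */
struct simd_regs_view {
	uint16_t nr_vectors;
	uint16_t vector_qwords;
	uint16_t nr_pred;
	uint16_t pred_qwords;
	const uint64_t *vec_data;	/* nr_vectors * vector_qwords qwords */
	const uint64_t *pred_data;	/* nr_pred * pred_qwords qwords */
};

/*
 * Parse the block that would follow regs[weight(mask)].
 * 'p' points at the u16 header (assumed 8-byte aligned), 'avail' is the
 * number of bytes left in the record.
 * Returns the number of bytes consumed, or 0 if the record is too short.
 */
static size_t parse_simd_regs(const void *p, size_t avail,
			      struct simd_regs_view *v)
{
	uint16_t hdr[4];
	uint64_t vec_qwords, pred_qwords, total;

	if (avail < sizeof(hdr))
		return 0;
	memcpy(hdr, p, sizeof(hdr));

	vec_qwords = (uint64_t)hdr[0] * hdr[1];
	pred_qwords = (uint64_t)hdr[2] * hdr[3];
	total = sizeof(hdr) + (vec_qwords + pred_qwords) * sizeof(uint64_t);
	if (avail < total)
		return 0;

	v->nr_vectors = hdr[0];
	v->vector_qwords = hdr[1];
	v->nr_pred = hdr[2];
	v->pred_qwords = hdr[3];
	/* The header is exactly one qword, so the data stays qword aligned. */
	v->vec_data = (const uint64_t *)((const char *)p + sizeof(hdr));
	v->pred_data = v->vec_data + vec_qwords;
	return (size_t)total;
}
```

With this layout, register i of the vector set would start at
vec_data[i * vector_qwords], so a tool does not need to know the register
width in advance.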
Re: [RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by Peter Zijlstra 10 months ago
On Thu, Jun 19, 2025 at 07:11:23AM -0400, Liang, Kan wrote:

> @@ -543,6 +544,24 @@ struct perf_event_attr {
>  	__u64	sig_data;
> 
>  	__u64	config3; /* extension of config2 */
> +
> +
> +	/*
> +	 * Defines set of SIMD registers to dump on samples.
> +	 * The sample_simd_req_enabled !=0 implies the
> +	 * set of SIMD registers is used to config all SIMD registers.
> +	 * If !sample_simd_req_enabled, sample_regs_XXX may be used to
> +	 * config some SIMD registers on X86.
> +	 */
> +	union {
> +		__u16 sample_simd_reg_enabled;
> +		__u16 sample_simd_pred_reg_qwords;
> +	};
> +	__u16 sample_simd_pred_reg_intr;
> +	__u16 sample_simd_pred_reg_user;

This limits things to max 16 predicate registers. ARM will fully fill
that with present hardware.

> +	__u16 sample_simd_reg_qwords;
> +	__u64 sample_simd_reg_intr;
> +	__u64 sample_simd_reg_user;

I would perhaps make this vec_reg.

>  };
> 
>  /*
> @@ -1016,7 +1035,15 @@ enum perf_event_type {
>  	 *      } && PERF_SAMPLE_BRANCH_STACK
>  	 *
>  	 *	{ u64			abi; # enum perf_sample_regs_abi
> -	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> +	 *	  u64			regs[weight(mask)];
> +	 *	  struct {
> +	 *	  	u16 nr_vectors;
> +	 *	  	u16 vector_qwords;
> +	 *	  	u16 nr_pred;
> +	 *	  	u16 pred_qwords;
> +	 *	  	u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> +	 *	  } && sample_simd_reg_enabled

Instead of using sample_simd_reg_enabled here I would perhaps extend
perf_sample_regs_abi. The current abi word is woefully underused.

Also, realistically, what you want to look at here is:

  sample_simd_{pred,vec}_reg_user;

If those are empty, there will be no registers.

> +	 *	} && PERF_SAMPLE_REGS_USER
>  	 *
>  	 *	{ u64			size;
>  	 *	  char			data[size];
Re: [RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by Liang, Kan 10 months ago

On 2025-06-19 9:38 a.m., Peter Zijlstra wrote:
> On Thu, Jun 19, 2025 at 07:11:23AM -0400, Liang, Kan wrote:
> 
>> @@ -543,6 +544,24 @@ struct perf_event_attr {
>>  	__u64	sig_data;
>>
>>  	__u64	config3; /* extension of config2 */
>> +
>> +
>> +	/*
>> +	 * Defines set of SIMD registers to dump on samples.
>> +	 * The sample_simd_req_enabled !=0 implies the
>> +	 * set of SIMD registers is used to config all SIMD registers.
>> +	 * If !sample_simd_req_enabled, sample_regs_XXX may be used to
>> +	 * config some SIMD registers on X86.
>> +	 */
>> +	union {
>> +		__u16 sample_simd_reg_enabled;
>> +		__u16 sample_simd_pred_reg_qwords;
>> +	};
>> +	__u16 sample_simd_pred_reg_intr;
>> +	__u16 sample_simd_pred_reg_user;
> 
> This limits things to max 16 predicate registers. ARM will fully fill
> that with present hardware.

I think I can use __u32 for predicate registers.
It means we need one more u64 for the qwords. It should not be a problem.

> 
>> +	__u16 sample_simd_reg_qwords;
>> +	__u64 sample_simd_reg_intr;
>> +	__u64 sample_simd_reg_user;
> 
> I would perhaps make this vec_reg.

Sure.

> 
>>  };
>>
>>  /*
>> @@ -1016,7 +1035,15 @@ enum perf_event_type {
>>  	 *      } && PERF_SAMPLE_BRANCH_STACK
>>  	 *
>>  	 *	{ u64			abi; # enum perf_sample_regs_abi
>> -	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
>> +	 *	  u64			regs[weight(mask)];
>> +	 *	  struct {
>> +	 *	  	u16 nr_vectors;
>> +	 *	  	u16 vector_qwords;
>> +	 *	  	u16 nr_pred;
>> +	 *	  	u16 pred_qwords;
>> +	 *	  	u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
>> +	 *	  } && sample_simd_reg_enabled
> 
> Instead of using sample_simd_reg_enabled here I would perhaps extend
> perf_sample_regs_abi. The current abi word is woefully underused.
> 

Yes. Now I think the abi is used like a version number. I guess I can
add PERF_SAMPLE_REGS_ABI_SIMD and change it to a bitmap.
There should be no impact on the existing tool, since version and bitmap
are the same for 1 and 2.
 enum perf_sample_regs_abi {
-       PERF_SAMPLE_REGS_ABI_NONE               = 0,
-       PERF_SAMPLE_REGS_ABI_32                 = 1,
-       PERF_SAMPLE_REGS_ABI_64                 = 2,
+       PERF_SAMPLE_REGS_ABI_NONE               = 0x0,
+       PERF_SAMPLE_REGS_ABI_32                 = 0x1,
+       PERF_SAMPLE_REGS_ABI_64                 = 0x2,
+       PERF_SAMPLE_REGS_ABI_SIMD               = 0x4,
 };
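
As a rough sketch of what the bitmap interpretation could look like on the
tool side (illustration only; the SKETCH_* names and both helpers are
hypothetical, and the SIMD bit is only the value proposed above, not a
released ABI):

```c
#include <stdbool.h>
#include <stdint.h>

/* Values mirror the proposed enum; names are local to this sketch. */
enum sketch_regs_abi {
	SKETCH_REGS_ABI_NONE = 0x0,
	SKETCH_REGS_ABI_32   = 0x1,
	SKETCH_REGS_ABI_64   = 0x2,
	SKETCH_REGS_ABI_SIMD = 0x4,
};

/* Width part of the abi word: what existing tools compare against 1 or 2. */
static uint64_t abi_width(uint64_t abi)
{
	return abi & (SKETCH_REGS_ABI_32 | SKETCH_REGS_ABI_64);
}

/* True if the sample would carry the new SIMD block after regs[]. */
static bool abi_has_simd(uint64_t abi)
{
	return (abi & SKETCH_REGS_ABI_SIMD) != 0;
}
```

An abi word of 1 or 2 reads the same under either interpretation. A value
such as 0x6 (64-bit plus SIMD) would presumably only appear when the new
attr fields were set, which an existing tool would not do.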

> Also, realistically, what you want to look at here is:
> 
>   sample_simd_{pred,vec}_reg_user;
> 
> If those are empty, there will be no registers.

Sure. But I will still keep the sample_simd_reg_enabled, since it can
explicitly tell if the new format is used.

Thanks,
Kan

> 
>> +	 *	} && PERF_SAMPLE_REGS_USER
>>  	 *
>>  	 *	{ u64			size;
>>  	 *	  char			data[size];
>
Re: [RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by Mi, Dapeng 10 months ago
On 6/19/2025 7:11 PM, Liang, Kan wrote:
>
> On 2025-06-18 8:41 p.m., Mi, Dapeng wrote:
>> On 6/18/2025 9:15 PM, Liang, Kan wrote:
>>> On 2025-06-18 8:28 a.m., Mi, Dapeng wrote:
>>>>>>>> Not sure, it eats up a whole byte. Dapeng seemed to favour separate
>>>>>>>> intr/user vector width (although I'm not quite sure what the use would
>>>>>>>> be).
>>>>>> The reason that I prefer to add 2 separate "words" item is that user could
>>>>>> sample interrupt and user space SIMD regs (but with different bit-width)
>>>>>> simultaneously in theory, like "--intr-regs=YMM0, --user-regs=XMM0".
>>>>> I'm not sure why the user wants a different bit-width. The
>>>>> "--user-regs=XMM0" doesn't seem to provide more useful information.
>>>>>
>>>>> Anyway, I believe the tool can handle this case. The tool can always ask
>>>>> YMM0 for both --intr-regs and --user-regs, but only output the XMM0 for
>>>>> --user-regs. The only drawback is that the kernel may dump extra
>>>>> information for the --user-regs. I don't think it's a big problem.
>>>> If we intend to handle it in user space tools, I'm not sure whether a user
>>>> space tool can easily know which records are from user space and filter out
>>>> the SIMD regs from kernel space, or how complicated the change would be. IMO,
>>>> adding an extra u16 "words" would be much easier and won't consume too much
>>>> memory.
>>> The filter is always done in kernel for --user-regs. The only difference
>>> is that the YMM (after filter) will be dumped to the perf.data. The tool
>>> just shows the XMM registers to the end user.
>> Ok. But there could be another case, user may want to sample some APX eGPRs
>> in user space and sample SIMD regs in interrupt, like "--intr-regs=YMM0,
>> --user-regs=R16", then we have to define 2 separate "words" fields.
>>
> Not for eGPRs. It uses the regular GP regs space, which implies u64 for
> a 64b kernel. The "words" field is only for vector and predicate registers.
>
> I've started working on the V2. The new interface would be as below.
>
> diff --git a/include/uapi/linux/perf_event.h
> b/include/uapi/linux/perf_event.h
> index 78a362b80027..f7b8971fa99d 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -382,6 +382,7 @@ enum perf_event_read_format {
>  #define PERF_ATTR_SIZE_VER6			120	/* Add: aux_sample_size */
>  #define PERF_ATTR_SIZE_VER7			128	/* Add: sig_data */
>  #define PERF_ATTR_SIZE_VER8			136	/* Add: config3 */
> +#define PERF_ATTR_SIZE_VER9			184	/* Add: sample_simd_regs */
>
>  /*
>   * 'struct perf_event_attr' contains various attributes that define
> @@ -543,6 +544,24 @@ struct perf_event_attr {
>  	__u64	sig_data;
>
>  	__u64	config3; /* extension of config2 */
> +
> +
> +	/*
> +	 * Defines set of SIMD registers to dump on samples.
> +	 * The sample_simd_req_enabled !=0 implies the
> +	 * set of SIMD registers is used to config all SIMD registers.
> +	 * If !sample_simd_req_enabled, sample_regs_XXX may be used to
> +	 * config some SIMD registers on X86.
> +	 */
> +	union {
> +		__u16 sample_simd_reg_enabled;
> +		__u16 sample_simd_pred_reg_qwords;
> +	};
> +	__u16 sample_simd_pred_reg_intr;
> +	__u16 sample_simd_pred_reg_user;

This is still a bitmap, right? Is it enough for ARM?


> +	__u16 sample_simd_reg_qwords;
> +	__u64 sample_simd_reg_intr;
> +	__u64 sample_simd_reg_user;
>  };
>
>  /*
> @@ -1016,7 +1035,15 @@ enum perf_event_type {
>  	 *      } && PERF_SAMPLE_BRANCH_STACK
>  	 *
>  	 *	{ u64			abi; # enum perf_sample_regs_abi
> -	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER
> +	 *	  u64			regs[weight(mask)];
> +	 *	  struct {
> +	 *	  	u16 nr_vectors;
> +	 *	  	u16 vector_qwords;
> +	 *	  	u16 nr_pred;
> +	 *	  	u16 pred_qwords;
> +	 *	  	u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> +	 *	  } && sample_simd_reg_enabled
> +	 *	} && PERF_SAMPLE_REGS_USER
>  	 *
>  	 *	{ u64			size;
>  	 *	  char			data[size];
> @@ -1043,7 +1070,15 @@ enum perf_event_type {
>  	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
>  	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
>  	 *	{ u64			abi; # enum perf_sample_regs_abi
> -	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
> +	 *	  u64			regs[weight(mask)];
> +	 *	  struct {
> +	 *	  	u16 nr_vectors;
> +	 *	  	u16 vector_qwords;
> +	 *	  	u16 nr_pred;
> +	 *	  	u16 pred_qwords;
> +	 *	  	u64 data[nr_vectors * vector_qwords + nr_pred * pred_qwords];
> +	 *	  } && sample_simd_reg_enabled
> +	 *	} && PERF_SAMPLE_REGS_INTR
>  	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>  	 *	{ u64			cgroup;} && PERF_SAMPLE_CGROUP
>  	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
>
>
> Thanks,
> Kan
Re: [RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by Peter Zijlstra 10 months ago
On Tue, Jun 17, 2025 at 11:23:10AM -0400, Liang, Kan wrote:
> 
> 
> On 2025-06-17 10:29 a.m., Peter Zijlstra wrote:
> > On Tue, Jun 17, 2025 at 09:52:12AM -0400, Liang, Kan wrote:
> > 
> >> OK. So the sample_simd_reg_words actually has another meaning now.
> > 
> > Well, any simd field being non-zero means userspace knows about it. Sort
> > of an implicit flag.
> 
> Yes, but the tool probably wouldn't touch any simd fields if the user
> doesn't ask for simd registers

Trivial enough to have the tool unconditionally write a simd_words size
if the attr thing is big enough. But sure, whatever :-)
Re: [RFC PATCH 00/12] Support vector and more extended registers in perf
Posted by Mi, Dapeng 10 months ago
On 6/13/2025 9:49 PM, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Starting from the Intel Ice Lake, the XMM registers can be collected in
> a PEBS record. More registers, e.g., YMM, ZMM, OPMASK, SSP and APX, will
> be added in the upcoming Architecture PEBS as well. But it requires the
> hardware support.
>
> The patch set provides a software solution to mitigate the hardware
> requirement. It utilizes the XSAVES command to retrieve the requested
> registers in the overflow handler. The feature isn't limited to the PEBS
> event or specific platforms anymore.
> The hardware solution (if available) is still preferred, since it has
> low overhead (especially with the large PEBS) and is more accurate.
>
> In theory, the solution should work for all X86 platforms. But I only
> have newer Intel platforms to test. The patch set only enables the
> feature for Intel Ice Lake and later platforms.
>
> Open:
> The new registers include YMM, ZMM, OPMASK, SSP, and APX.
> The sample_regs_user/intr has run out. A new field in the
> struct perf_event_attr is required for the registers.
> There could be several options as below for the new field.
>
> - Follow a similar format to XSAVES. Introduce the below fields to store
>   the bitmap of the registers.
>   struct perf_event_attr {
>         ...
>         __u64   sample_ext_regs_intr[2];
>         __u64   sample_ext_regs_user[2];
>         ...
>   }
>   Includes YMMH (16 bits), APX (16 bits), OPMASK (8 bits),
>            ZMMH0-15 (16 bits), H16ZMM (16 bits), SSP
>   For example, if a user wants YMM8, the perf tool needs to set the
>   corresponding bits of XMM8 and YMMH8, and reconstruct the result.
>   The method is similar to the existing method for
>   sample_regs_user/intr, and match the XSAVES format.
>   The kernel doesn't need to do extra configuration and reconstruction.
>   It's implemented in the patch set.
>
> - Similar to the above method. But the fields are the bitmap of the
>   complete registers, E.g., YMM (16 bits), APX (16 bits),
>   OPMASK (8 bits), ZMM (32 bits), SSP.
>   The kernel needs to do extra configuration and reconstruction,
>   which may bring extra overhead.
>
> - Combine the XMM, YMM, and ZMM. So all the registers can be put into
>   one u64 field.
>         ...
>         union {
>                 __u64 sample_ext_regs_intr;   //sample_ext_regs_user is simiar
>                 struct {
>                         __u32 vector_bitmap;
>                         __u32 vector_type   : 3, //0b001 XMM 0b010 YMM 0b100 ZMM
>                               apx_bitmap    : 16,
>                               opmask_bitmap : 8,
>                               ssp_bitmap    : 1,
>                               reserved      : 4,
>
>                 };
>         ...
>   For example, if the YMM8-15 is required,
>   vector_bitmap: 0x0000ff00
>   vector_type: 0x2
>   This method can save two __u64 in the struct perf_event_attr.
>   But it's not straightforward since it mixes the type and bitmap.
>   The kernel also needs to do extra configuration and reconstruction.
>
> Please let me know if there are more ideas.

+1 for method 1 or 2, with method 2 preferred.

Method 1 doesn't need to reconstruct the YMM/ZMM regs in kernel space, but it
offloads the reconstruction to user space: every user space perf-related
tool has to reconstruct them by itself. I'm not 100% sure, but I suppose
this needs a big change in the perf tools to reconstruct and show the YMM/ZMM
regs.

The con of method 2 is that it could need extra memory space and memory
copies if users intend to sample these overlapping regs simultaneously, like
XMM0/YMM0/ZMM0. But I suppose we can add an extra check in the perf tools to
tell users that these regs overlap and just force sampling of the reg with the
largest bit-width.
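
To make the method 1 reconstruction concrete, here is a minimal C sketch of
what a tool would have to do for one YMM register: glue the XMM half (low
128 bits) and the YMMH half (high 128 bits) back together. This is an
illustration only; ymm_from_halves() is a hypothetical helper, and it
assumes each half arrives as two u64 values with qword 0 least significant,
as in the XSAVE area.

```c
#include <stdint.h>
#include <string.h>

#define XMM_QWORDS 2	/* 128 bits */
#define YMM_QWORDS 4	/* 256 bits */

/*
 * Rebuild a full YMM value from its two sampled halves.
 * xmm:  low 128 bits, as sampled via the XMM bit
 * ymmh: high 128 bits, as sampled via the YMMH bit
 * ymm:  output, qword 0 is the least significant
 */
static void ymm_from_halves(const uint64_t xmm[XMM_QWORDS],
			    const uint64_t ymmh[XMM_QWORDS],
			    uint64_t ymm[YMM_QWORDS])
{
	memcpy(&ymm[0], xmm, XMM_QWORDS * sizeof(uint64_t));
	memcpy(&ymm[XMM_QWORDS], ymmh, XMM_QWORDS * sizeof(uint64_t));
}
```

The copy itself is trivial. The cost of method 1 is more that every consumer
of the samples needs to know which pieces to request and how to pair them
up, and a ZMM register would add further pieces on top.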


>
> Thanks,
> Kan
>
>
>
> Kan Liang (12):
>   perf/x86: Use x86_perf_regs in the x86 nmi handler
>   perf/x86: Setup the regs data
>   x86/fpu/xstate: Add xsaves_nmi
>   perf: Move has_extended_regs() to header file
>   perf/x86: Support XMM register for non-PEBS and REGS_USER
>   perf: Support extension of sample_regs
>   perf/x86: Add YMMH in extended regs
>   perf/x86: Add APX in extended regs
>   perf/x86: Add OPMASK in extended regs
>   perf/x86: Add ZMM in extended regs
>   perf/x86: Add SSP in extended regs
>   perf/x86/intel: Support extended registers
>
>  arch/arm/kernel/perf_regs.c           |   9 +-
>  arch/arm64/kernel/perf_regs.c         |   9 +-
>  arch/csky/kernel/perf_regs.c          |   9 +-
>  arch/loongarch/kernel/perf_regs.c     |   8 +-
>  arch/mips/kernel/perf_regs.c          |   9 +-
>  arch/powerpc/perf/perf_regs.c         |   9 +-
>  arch/riscv/kernel/perf_regs.c         |   8 +-
>  arch/s390/kernel/perf_regs.c          |   9 +-
>  arch/x86/events/core.c                | 226 ++++++++++++++++++++++++--
>  arch/x86/events/intel/core.c          |  49 ++++++
>  arch/x86/events/intel/ds.c            |  12 +-
>  arch/x86/events/perf_event.h          |  58 +++++++
>  arch/x86/include/asm/fpu/xstate.h     |   1 +
>  arch/x86/include/asm/perf_event.h     |   6 +
>  arch/x86/include/uapi/asm/perf_regs.h | 101 ++++++++++++
>  arch/x86/kernel/fpu/xstate.c          |  22 +++
>  arch/x86/kernel/perf_regs.c           |  85 +++++++++-
>  include/linux/perf_event.h            |  23 +++
>  include/linux/perf_regs.h             |  29 +++-
>  include/uapi/linux/perf_event.h       |   8 +
>  kernel/events/core.c                  |  63 +++++--
>  21 files changed, 699 insertions(+), 54 deletions(-)
>