perf: arm_spe: Add support for SPE VM interface

[PATCH 0/3] perf: arm_spe: Add support for SPE VM interface

Posted by James Clark 3 months, 1 week ago

SPE can be used from within a guest as long as the driver adheres to the
new VM interface spec [1]. Because the driver should behave correctly
whether it's running in a guest or not, the first patches are marked as
a fix. Furthermore, in future versions of the architecture the PE will
be allowed to behave in the same way.

The last patch adds new behavior to make it easier for guests to be
able to reserve large buffers. It's not strictly necessary, so it's not
marked as a fix.

[1]: https://developer.arm.com/documentation/den0154/latest/

Signed-off-by: James Clark <james.clark@linaro.org>
---
James Clark (3):
      perf: arm_spe: Add barrier before enabling profiling buffer
      perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
      perf: arm_spe: Add support for SPE VM interface

 arch/arm64/include/asm/sysreg.h |  1 +
 arch/arm64/tools/sysreg         |  6 ++++-
 drivers/perf/arm_spe_pmu.c      | 60 ++++++++++++++++++++++++++++++++---------
 3 files changed, 54 insertions(+), 13 deletions(-)
---
base-commit: d0b3b7b22dfa1f4b515fd3a295b3fd958f9e81af
change-id: 20250609-james-spe-vm-interface-2bb41e238072

Best regards,
-- 
James Clark <james.clark@linaro.org>

Re: [PATCH 0/3] perf: arm_spe: Add support for SPE VM interface

Posted by Alexandru Elisei 2 months, 1 week ago

Hi,

On Tue, Jul 01, 2025 at 04:31:56PM +0100, James Clark wrote:
> SPE can be used from within a guest as long as the driver adheres to the
> new VM interface spec [1]. Because the driver should behave correctly
> whether it's running in a guest or not, the first patches are marked as
> a fix. Furthermore, in future versions of the architecture the PE will
> be allowed to behave in the same way.
> 
> The last patch adds new behavior to make it easier for guests to be
> able to reserve large buffers. It's not strictly necessary, so it's not
> marked as a fix.

I had a look at the patches, and they all look ok to me, so for the series:

Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>

I also tested the series by hacking SPE virtualization support in KVM:

- without these changes, the SPE driver gets into an infinite loop because it
  clears PMBSR_EL1.S before clearing PMBLIMITR_EL1.E, and the hypervisor is
  allowed to ignore the write to PMBSR_EL1.

- with these changes, that doesn't happen.

- ran perf for about a day in a loop in a virtual machine and didn't notice
  anything out of the ordinary.

- ran perf for about a day in a loop on baremetal and similarly everything
  looked alright.

- checked that the SPE driver correctly decodes the maximum buffer size for
  sizes 4M, 2M (2M is right at the boundary between the two encoding schemes)
  and 1M; that's also correctly reflected in
  /sys/devices/platform/<spe>/arm_spe_0/caps/max_buffer_size.

- checked that perf is not allowed to use a buffer larger than the maximum.

- checked that the SPE driver correctly detects a buffer size management event.

So:

Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
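
For reference, the ordering problem from the first test item can be
illustrated with a toy model. This is only a sketch: the struct and function
names are made up for illustration, nothing here touches real system
registers, and the rule that the modelled hypervisor drops PMBSR_EL1 writes
while PMBLIMITR_EL1.E is set is an assumption based on the behaviour described
above, not a quote from DEN0154.

```c
#include <stdbool.h>

/* Toy model of the two bits discussed above; not real sysreg access. */
struct spe_model {
	bool pmblimitr_e;	/* PMBLIMITR_EL1.E: profiling buffer enabled */
	bool pmbsr_s;		/* PMBSR_EL1.S: buffer management event pending */
};

/* Assumption: the modelled hypervisor ignores PMBSR_EL1 writes while E is set. */
static void model_write_pmbsr_s(struct spe_model *m, bool s)
{
	if (m->pmblimitr_e)
		return;
	m->pmbsr_s = s;
}

static void model_write_pmblimitr_e(struct spe_model *m, bool e)
{
	m->pmblimitr_e = e;
}

/* Clear S first, then disable the buffer. Returns true if S is still pending. */
static bool ack_status_first(struct spe_model *m)
{
	model_write_pmbsr_s(m, false);
	model_write_pmblimitr_e(m, false);
	return m->pmbsr_s;
}

/* Disable the buffer first, then clear S. Returns true if S is still pending. */
static bool ack_disable_first(struct spe_model *m)
{
	model_write_pmblimitr_e(m, false);
	model_write_pmbsr_s(m, false);
	return m->pmbsr_s;
}
```

Starting with both bits set, ack_status_first() leaves S pending in this
model, so the interrupt would keep firing, while ack_disable_first() clears it.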

While testing I noticed two things:

1. When perf tries to use a buffer larger than the maximum, the error is EINVAL
(22):

# cat /sys/devices/platform/spe/arm_spe_0/caps/max_buff_size
4194304
# perf record -ae arm_spe// -m,16M -- sleep 10
failed to mmap with 22 (Invalid argument)

(used 16M as the buffer size because what the driver ends up programming is half
that).

I would have expected to get back ENOMEM (12); that seems less ambiguous to me.
I had to hack the driver to print an error message to dmesg when the max buffer
size is exceeded, to make sure that was why I was seeing the error message in
perf and not something else. I get that that's because .setup_aux() can only
return NULL on error, but it feels like there's room for improvement here.

2. A hypervisor is allowed to inject a buffer size event even though the buffer
set by the guest is smaller than the maximum advertised. For example, this can
happen if there isn't enough memory to pin the buffer, or if the limit on pinned
memory is exceeded in the hypervisor (implementation specific behaviour, not
mandated in DEN0154, of course).

In this situation, when the SPE driver gets a buffer size management event
injected by the hypervisor, there is no way for the driver to communicate it to
the perf instance, and the profiled process continues executing even though
profiling has stopped.

That's not different from what happens today with buffer management events, but
unlike the other events, which aren't under the control of userspace, the buffer
size event is potentially recoverable if userspace restarts perf with a smaller
buffer.

Do you think there's something that can be done to improve this situation?

Thanks,
Alex

> 
> [1]: https://developer.arm.com/documentation/den0154/latest/
> 
> Signed-off-by: James Clark <james.clark@linaro.org>
> ---
> James Clark (3):
>       perf: arm_spe: Add barrier before enabling profiling buffer
>       perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
>       perf: arm_spe: Add support for SPE VM interface
> 
>  arch/arm64/include/asm/sysreg.h |  1 +
>  arch/arm64/tools/sysreg         |  6 ++++-
>  drivers/perf/arm_spe_pmu.c      | 60 ++++++++++++++++++++++++++++++++---------
>  3 files changed, 54 insertions(+), 13 deletions(-)
> ---
> base-commit: d0b3b7b22dfa1f4b515fd3a295b3fd958f9e81af
> change-id: 20250609-james-spe-vm-interface-2bb41e238072
> 
> Best regards,
> -- 
> James Clark <james.clark@linaro.org>
>

Re: [PATCH 0/3] perf: arm_spe: Add support for SPE VM interface

Posted by James Clark 2 months ago

On 01/08/2025 2:28 pm, Alexandru Elisei wrote:
> Hi,
> 
> On Tue, Jul 01, 2025 at 04:31:56PM +0100, James Clark wrote:
>> SPE can be used from within a guest as long as the driver adheres to the
>> new VM interface spec [1]. Because the driver should behave correctly
>> whether it's running in a guest or not, the first patches are marked as
>> a fix. Furthermore, in future versions of the architecture the PE will
>> be allowed to behave in the same way.
>>
>> The last patch adds new behavior to make it easier for guests to be
>> able to reserve large buffers. It's not strictly necessary, so it's not
>> marked as a fix.
> 
> I had a look at the patches, and they all look ok to me, so for the series:
> 
> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
> 
> I also tested the series by hacking SPE virtualization support in KVM:
> 
> - without these changes, the SPE driver gets into an infinite loop because it
>    clears PMBSR_EL1.S before clearing PMBLIMITR_EL1.E, and the hypervisor is
>    allowed to ignore the write to PMBSR_EL1.
> 
> - with these changes, that doesn't happen.
> 
> - ran perf for about a day in a loop in a virtual machine and didn't notice
>    anything out of the ordinary.
> 
> - ran perf for about a day in a loop on baremetal and similarly everything looked
>    alright.
> 
> - checked that the SPE driver correctly decodes the maximum buffer size for
>    sizes 4M, 2M (2M is right at the boundary between the two encoding schemes)
>    and 1M; that's also correctly reflected in
>    /sys/devices/platform/<spe>/arm_spe_0/caps/max_buffer_size.
> 
> - checked that perf is not allowed to use a buffer larger than the maximum.
> 
> - checked that the SPE driver correctly detects a buffer size management event.
> 
> So:
> 
> Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
> 
> While testing I noticed two things:
> 
> 1. When perf tries to use a buffer larger than the maximum, the error is EINVAL
> (22):
> 
> # cat /sys/devices/platform/spe/arm_spe_0/caps/max_buff_size
> 4194304
> # perf record -ae arm_spe// -m,16M -- sleep 10
> failed to mmap with 22 (Invalid argument)
> 
> (used 16M as the buffer size because what the driver ends up programming is half
> that).
> 
> I would have expected to get back ENOMEM (12), that seems less ambiguous to me.
> I had to hack the driver to print an error message to dmesg when the max buffer
> size is exceeded to make sure that's why I was seeing the error message in perf,
> and it wasn't because of something else. I get that that's because .setup_aux()
> can only return NULL on error, but feels like there's room for improvement here.
> 

We could add an error code: rb_alloc_aux() already returns one, and that is
what calls setup_aux(). But the scenarios would be either EINVAL or ENOMEM,
which wouldn't give the user the exact reason ("need > 2 pages", "need
even number of pages", etc). So I'm not sure it would be enough of an
improvement over returning NULL to be worth it.
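
To illustrate the point, here is a rough sketch of what an errno-returning
check might look like. The helper name, its parameters, and the exact
thresholds are hypothetical (the real setup_aux() callback returns a pointer,
with NULL on error); the conditions are just the examples mentioned above plus
the max buffer size case, and treating max_pages == 0 as "no limit" is an
assumption.

```c
#include <errno.h>

/*
 * Hypothetical validation helper, not the driver's code.
 * Returns 0 if the request is acceptable, or a negative errno.
 */
static int spe_check_aux_request(unsigned long nr_pages, unsigned long max_pages)
{
	/* Too few pages (threshold is an assumption for this sketch). */
	if (nr_pages < 2)
		return -EINVAL;

	/* Odd number of pages. */
	if (nr_pages & 1)
		return -EINVAL;

	/* Larger than the advertised maximum; 0 is assumed to mean "no limit". */
	if (max_pages && nr_pages > max_pages)
		return -ENOMEM;

	return 0;
}
```

Two different causes still collapse into -EINVAL here, so the user would not
learn which constraint was violated.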

However, I will add a warning into Perf if the user asks for more than
caps/max_buffer_size. That would be a useful message, and Perf can do it
itself, so it doesn't need to be in the driver changes.
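
Roughly, the Perf-side check could look like the sketch below. The function
names are hypothetical and this is not the actual Perf change; the halving
follows Alexandru's observation that the driver ends up programming half of
the requested size, and treating 0 as "no limit" is an assumption.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse the decimal byte count as printed by the caps file.
 * Returns 0 on success, -1 if no number could be parsed.
 */
static int parse_max_buff_size(const char *text, uint64_t *out)
{
	char *end;
	unsigned long long v;

	errno = 0;
	v = strtoull(text, &end, 10);
	if (end == text || errno)
		return -1;

	*out = v;
	return 0;
}

/* Returns 1 if the requested AUX size would exceed the maximum, else 0. */
static int aux_request_exceeds_max(uint64_t aux_bytes, uint64_t max_buff_size)
{
	/* Assumption: a maximum of 0 means no limit is advertised. */
	if (max_buff_size == 0)
		return 0;

	/* Per the observation above, the driver programs half the request. */
	return aux_bytes / 2 > max_buff_size;
}
```

With the values from the example above (a maximum of 4194304 and -m,16M), the
check reports that the request is too large, while an 8M request passes.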

> 2. A hypervisor is allowed to inject a buffer size event even though the buffer
> set by the guest is smaller than the maximum advertised. For example, this can
> happen if there isn't enough memory to pin the buffer, or if the limit on pinned
> memory is exceeded in the hypervisor (implementation specific behaviour, not
> mandated in DEN0154, of course).
> 
> In this situation, when the SPE driver gets a buffer size management event
> injected by the hypervisor, there is no way for the driver to communicate it to
> the perf instance, and the profiled process continues executing even though
> profiling has stopped.
> 
> That's not different from what happens today with buffer management events, but
> unlike the other events, which aren't under the control of userspace, the buffer
> size event is potentially recoverable if userspace restarts perf with a smaller
> buffer.
> 
> Do you think there's something that can be done to improve this situation?
> 
> Thanks,
> Alex
> 

It doesn't look like there's currently anything that can stop an event 
or signal to Perf that the event has gone bad.

We could add something like "__u32 error" to struct 
perf_event_mmap_page. But I'm not sure what you'd do with it. If Perf is 
the parent of the process you wouldn't want to kill it in case anything 
bad happens. So you're left with leaving it running anyway. If it's just 
an error message that you want then there's already one in dmesg for 
buffer management errors, and that string is a lot better than a single 
code. Unless these new codes were detailed PMU specific ones? Actually 
it's a whole page so why not make it a string...

It's not a case of the samples ending randomly somewhere though; you'll
either get all of them or none of them. So it will be quite obvious to
the user that something has gone wrong. Also, I think the scenario of
not being able to pin memory when asking for less than the limit would 
be very rare. It's probably fine to leave it like this for now and we 
can always add something later, maybe if people start to run into it for 
real.

James

Re: [PATCH 0/3] perf: arm_spe: Add support for SPE VM interface

Posted by Suzuki K Poulose 2 months ago

On 04/08/2025 17:00, James Clark wrote:
> 
> 
> On 01/08/2025 2:28 pm, Alexandru Elisei wrote:
>> Hi,
>>
>> On Tue, Jul 01, 2025 at 04:31:56PM +0100, James Clark wrote:
>>> SPE can be used from within a guest as long as the driver adheres to the
>>> new VM interface spec [1]. Because the driver should behave correctly
>>> whether it's running in a guest or not, the first patches are marked as
>>> a fix. Furthermore, in future versions of the architecture the PE will
>>> be allowed to behave in the same way.
>>>
>>> The last patch adds new behavior to make it easier for guests to be
>>> able to reserve large buffers. It's not strictly necessary, so it's not
>>> marked as a fix.
>>
>> I had a look at the patches, and they all look ok to me, so for the 
>> series:
>>
>> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
>>
>> I also tested the series by hacking SPE virtualization support in KVM:
>>
>> - without these changes, the SPE driver gets into an infinite loop 
>> because it
>>    clears PMBSR_EL1.S before clearing PMBLIMITR_EL1.E, and the
>> hypervisor is
>>    allowed to ignore the write to PMBSR_EL1.
>>
>> - with these changes, that doesn't happen.
>>
>> - ran perf for about a day in a loop in a virtual machine and didn't 
>> notice
>>    anything out of the ordinary.
>>
>> - ran perf for about a day in a loop on baremetal and similarly
>> everything looked
>>    alright.
>>
>> - checked that the SPE driver correctly decodes the maximum buffer 
>> size for
>>    sizes 4M, 2M (2M is right at the boundary between the two encoding 
>> schemes)
>>    and 1M; that's also correctly reflected in
>>    /sys/devices/platform/<spe>/arm_spe_0/caps/max_buffer_size.
>>
>> - checked that perf is not allowed to use a buffer larger than the 
>> maximum.
>>
>> - checked that the SPE driver correctly detects a buffer size 
>> management event.
>>
>> So:
>>
>> Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
>>
>> While testing I noticed two things:
>>
>> 1. When perf tries to use a buffer larger than the maximum, the error 
>> is EINVAL
>> (22):
>>
>> # cat /sys/devices/platform/spe/arm_spe_0/caps/max_buff_size
>> 4194304
>> # perf record -ae arm_spe// -m,16M -- sleep 10
>> failed to mmap with 22 (Invalid argument)
>>
>> (used 16M as the buffer size because what the driver ends up 
>> programming is half
>> that).
>>
>> I would have expected to get back ENOMEM (12), that seems less 
>> ambiguous to me.
>> I had to hack the driver to print an error message to dmesg when the 
>> max buffer
>> size is exceeded to make sure that's why I was seeing the error message
>> in perf,
>> and it wasn't because of something else. I get that that's 
>> because .setup_aux()
>> can only return NULL on error, but feels like there's room for 
>> improvement here.
>>
> 
> We could add an error code, rb_alloc_aux() already returns one and that 
> calls setup_aux(). But the scenarios would be either EINVAL or ENOMEM 
> and wouldn't give the user the exact reason ("need > 2 pages", "need 
> even number of pages", etc). So I'm not sure it would be enough of an 
> improvement over returning NULL to be worth it.
> 
> However I will add a warning into Perf if the user asks for more than 
> caps/max_buffer_size. That would be a useful message and Perf can do it 
> itself so it doesn't need to be in the driver changes.
> 
>> 2. A hypervisor is allowed to inject a buffer size event even though 
>> the buffer
>> set by the guest is smaller than the maximum advertised. For example, 
>> this can
>> happen if there isn't enough memory to pin the buffer, or if the limit 
>> on pinned
>> memory is exceeded in the hypervisor (implementation specific 
>> behaviour, not
>> mandated in DEN0154, of course).
>>
>> In this situation, when the SPE driver gets a buffer size management 
>> event
>> injected by the hypervisor, there is no way for the driver to 
>> communicate it to
>> the perf instance, and the profiled process continues executing even 
>> though
>> profiling has stopped.
>>
>> That's not different from what happens today with buffer management 
>> events, but
>> unlike the other events, which aren't under the control of userspace, 
>> the buffer
>> size event is potentially recoverable if userspace restarts perf with 
>> a smaller
>> buffer.
>>
>> Do you think there's something that can be done to improve this 
>> situation?
>>
>> Thanks,
>> Alex
>>
> 
> It doesn't look like there's currently anything that can stop an event 
> or signal to Perf that the event has gone bad.
> 
> We could add something like "__u32 error" to struct 
> perf_event_mmap_page. But I'm not sure what you'd do with it. If Perf is 
> the parent of the process you wouldn't want to kill it in case anything 
> bad happens. So you're left with leaving it running anyway. If it's just 
> an error message that you want then there's already one in dmesg for 
> buffer management errors, and that string is a lot better than a single 
> code. Unless these new codes were detailed PMU specific ones? Actually 
> it's a whole page so why not make it a string...
> 
> It's not a case of the samples ending randomly somewhere though, you'll 
> either get all of them or none of them. So it will be quite obvious to 
> the user that something has gone wrong. Secondly I think the scenario of 
> not being able to pin memory when asking for less than the limit would 
> be very rare. It's probably fine to leave it like this for now and we 
> can always add something later, maybe if people start to run into it for 
> real.

Do we get EMPTY AUX records in this case where there was no profiling
data? If so, why not convey the error via AUX record header flags?
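
As a rough illustration of the consumer side, the sketch below checks an AUX
record for the "no data plus error flag" combination. The flag values mirror
PERF_AUX_FLAG_TRUNCATED and PERF_AUX_FLAG_PARTIAL from
include/uapi/linux/perf_event.h and are redefined locally so the example
stands alone; the struct and helper names are made up, and whether the SPE
driver emits such a record in this case is exactly the open question.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the uapi flag values, so no kernel headers are needed. */
#define PERF_AUX_FLAG_TRUNCATED	0x01
#define PERF_AUX_FLAG_PARTIAL	0x04

/* Fields of a PERF_RECORD_AUX that follow the perf_event_header. */
struct aux_record_body {
	uint64_t aux_offset;
	uint64_t aux_size;
	uint64_t flags;
};

/* True if the record carries no data but has an error-like flag set. */
static bool aux_record_is_empty_with_error(const struct aux_record_body *r)
{
	uint64_t error_flags = PERF_AUX_FLAG_TRUNCATED | PERF_AUX_FLAG_PARTIAL;

	return r->aux_size == 0 && (r->flags & error_flags) != 0;
}
```

A tool could use a check like this to print a hint instead of silently ending
up with no samples.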

Suzuki

> 
> James
> 
>