I've long wanted to get stacktraces when profiling Xen; otherwise all
you see is e.g. the address of memcpy, but without knowing which
function called it you can't optimize it.
Once you have stacktraces, even a simple low (prime) frequency
timer-based profile can show hotspots that are optimization
candidates, aka flamegraphs. (Even if the samples don't always hit
within the same function and individually they'd be too small to be
noticeable, they should hit in one of the parents if it is a
bottleneck.)
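The aggregation idea can be illustrated by folding sampled stacks into
the collapsed "stack count" format that flamegraph.pl consumes (a
minimal sketch; the function names and samples are made up for
illustration):

```python
from collections import Counter

def collapse(samples):
    """Fold raw stack samples (outermost-frame-first lists) into the
    'semicolon-joined-stack count' lines used by flamegraph.pl."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Hypothetical samples: the two leaf frames under do_domctl differ, but
# the shared parent frame accumulates weight, so it still stands out in
# the flamegraph even when no single leaf does.
samples = [
    ["start_xen", "do_domctl", "memcpy"],
    ["start_xen", "do_domctl", "copy_from_guest"],
    ["start_xen", "idle_loop"],
]
for line in collapse(samples):
    print(line)
```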
Example flamegraph produced using these patches:
* workload: an otherwise idle VM migrated on localhost by XAPI in a loop:
https://cdn.jsdelivr.net/gh/edwintorok/xen@pmustack-coverletter/docs/tmp/migrate-localhost.svg?x=473.2&y=2197&s=null
* workload: VM migrated between 2 hosts by XAPI (NFS storage):
https://cdn.jsdelivr.net/gh/edwintorok/xen@pmustack-coverletter/docs/tmp/migrate-send.svg?x=950.6&y=2197
https://cdn.jsdelivr.net/gh/edwintorok/xen@pmustack-coverletter/docs/tmp/migrate-receive.svg?x=906.6&y=869
There might be other approaches that could be tried in the future,
e.g. Last Branch Record (LBR), but:
* although both Intel and AMD support it, AFAIK Xen doesn't support it
on AMD yet
* there is a hardware limit to how deep it can be (~32 entries?)
* LBR may need some additional configuration to enable it to trace the
hypervisor
* the Intel PMU is completely broken on the system I tried it on, so I
would've had to fix that first
This is some very early experimental work; I thought I'd share it to
get feedback on:
* the desired ABI additions in pmu.h and arch-x86/pmu.h
* any bugs you may spot
* if anyone wants to port the python symbol lookup to perf itself
(actually latest perf ships a flamegraph.py too)
It also starts to become useful enough to spot performance hotspots in
Xen, e.g. the rwlock.c scaling issue with large CONFIG_NR_CPUS, or
unexpected page faults in 'unmap_page_range' (spotted by Andrew).
It builds on top of:
* the existing VPMU support, documented by Boris Ostrovsky in this thread: https://lists.xenproject.org/archives/html/xen-devel/2016-08/msg03244.html
* a python script by Andriy to post-process the perf output
Steps to enable:
1. Ensure that you've got a build of Xen with CONFIG_FRAME_POINTER=y.
Debug builds have it, but for performance testing a release build
with frame pointers enabled is recommended.
2. Apply both the Linux and Xen patches.
I tested on top of ~6.6.22, and Xen 4.21+ (5c798ac8854af3528a78ca5a622c9ea68399809b)
3. Ensure that VPMU is enabled in Xen, e.g. a GRUB line like:
```
multiboot2 /boot/xen.efi dom0_mem=4288M,max:4288M crashkernel=256M,below=4G console=vga vga=mode-0x0311 watchdog=0 vpmu=on dom0_vcpus_pin
```
On a XenServer system that can be achieved by:
```
/opt/xensource/libexec/xen-cmdline --set-xen watchdog=0
/opt/xensource/libexec/xen-cmdline --set-xen vpmu=on
/opt/xensource/libexec/xen-cmdline --delete-xen dom0_max_vcpus=1-16
/opt/xensource/libexec/xen-cmdline --set-xen dom0_vcpus_pin
reboot
```
4. On every boot, enable the desired vPMU features:
```
echo 9 >/sys/hypervisor/pmu/pmu_features
echo all >/sys/hypervisor/pmu/pmu_mode
```
5. Record a trace, e.g. a simple timer-based stacktrace, useful for an
initial investigation with a flamegraph:
```
perf kvm --host --guest record -a -F 97 -g
```
Or if you also want to trace userspace:
```
perf kvm --host --guest record -a -F 97 --call-graph=dwarf
```
6. Look at the report: perf kvm --host --guest report.
This will contain raw hex addresses for now, but a script can be used
to resolve them.
7. Use the provided python script and look at the symbolized output.
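The symbolization step can be sketched as follows (a hypothetical
simplification, not the actual pyperf.py: it assumes a sorted
nm/System.map-style list of (address, name) pairs for xen-syms, and
the addresses below are made up):

```python
import bisect

def load_symbols(pairs):
    """Sort (address, name) pairs for bisection; in practice the pairs
    would come from e.g. 'nm -n xen-syms'."""
    table = sorted(pairs)
    return [a for a, _ in table], [n for _, n in table]

def resolve(addrs, names, ip):
    """Map an instruction pointer to 'symbol+offset', similar to how
    perf annotates kernel addresses."""
    i = bisect.bisect_right(addrs, ip) - 1
    if i < 0:
        return hex(ip)  # below the first known symbol: leave it raw
    return f"{names[i]}+{ip - addrs[i]:#x}"

addrs, names = load_symbols([
    (0xffff82d040200000, "start_xen"),   # made-up addresses
    (0xffff82d040240000, "do_domctl"),
])
print(resolve(addrs, names, 0xffff82d040240010))  # do_domctl+0x10
```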
Caveats:
* x86-only for now
* only tested on AMD EPYC 8124P
* Xen PMU support was broken to begin with on Xeon Silver 4514Y, so I
wasn't able to test there ('perf top' fails to parse samples). I'll
try to figure out what is wrong there separately
* for now I edit the release config in xen.spec to enable frame
pointers. Eventually it might be useful to have a 3rd build variant:
release-fp. Or teach Xen to produce/parse ORC or SFrame formats without
requiring frame pointers.
* perf produces raw hex addresses, and a python script is used to
post-process it and obtain symbols. Eventually perf should be updated
to do this processing itself (there was an old patch for Linux 3.12 by Borislav Petkov)
* I've only tested capturing Dom0 stack traces. Linux doesn't support
guest stacktraces yet (it can only lookup the guest RIP)
* the Linux patch will need to be forward-ported to master before submission
* All the caveats for using regular VPMU apply, except for the lack of
stacktraces, which is fixed here!
* Dom0 must run hard pinned on all host CPUs
* Watchdog must be disabled
* not security supported
* x86 only
* secureboot needs to be disabled
Edwin Török (10):
pmu.h: add a BUILD_BUG_ON to ensure it fits within one page
arch-x86/pmu.h: document current memory layout for VPMU
arch-x86/pmu.h: convert ascii art drawing to Unicode
vpmu.c: factor out register conversion
pmu.h: introduce a stacktrace area
arch-x86/pmu.h: convert ascii art diagram to Unicode
arch-x86/vpmu.c: store guest registers when domain_id == DOMID_XEN
pmu.h: expose a hypervisor stacktrace feature
vpmu.c hypervisor stacktrace support in vPMU
xen/tools/pyperf.py: example script to parse perf output
xen/arch/x86/cpu/vpmu.c | 130 ++++++++++++++++++++------
xen/arch/x86/cpu/vpmu_amd.c | 2 +-
xen/arch/x86/cpu/vpmu_intel.c | 2 +-
xen/arch/x86/include/asm/vpmu.h | 1 +
xen/include/public/arch-arm.h | 1 +
xen/include/public/arch-ppc.h | 1 +
xen/include/public/arch-riscv.h | 1 +
xen/include/public/arch-x86/pmu.h | 101 ++++++++++++++++++++-
xen/include/public/pmu.h | 41 ++++++++-
xen/tools/pyperf.py | 146 ++++++++++++++++++++++++++++++
10 files changed, 395 insertions(+), 31 deletions(-)
create mode 100644 xen/tools/pyperf.py
--
2.47.1
On 7/25/25 11:06, Edwin Török wrote:
> Caveats:
> * x86-only for now
> * only tested on AMD EPYC 8124P
> * Xen PMU support was broken to begin with on Xeon Silver 4514Y, so I
> wasn't able to test there ('perf top' fails to parse samples). I'll
> try to figure out what is wrong there separately
> * for now I edit the release config in xen.spec to enable frame
> pointers. Eventually it might be useful to have a 3rd build variant:
> release-fp. Or teach Xen to produce/parse ORC or SFrame formats without
> requiring frame pointers.
That would definitely be nice.
> * perf produces raw hex addresses, and a python script is used to
> post-process it and obtain symbols. Eventually perf should be updated
> to do this processing itself (there was an old patch for Linux 3.12 by Borislav Petkov)
> * I've only tested capturing Dom0 stack traces. Linux doesn't support
> guest stacktraces yet (it can only lookup the guest RIP)
What would be needed to fix this? Capturing guest stacktraces from the host
or Xen seems like a really bad idea, but it might make sense to interrupt the
guest and allow it to provide a (strictly validated) stack trace for use by
the host. This would need to be done asynchronously, as Linux is moving
towards generating stack traces outside of the NMI handler.
> * the Linux patch will need to be forwarded ported to master before submission
> * All the caveats for using regular VPMU apply, except for the lack of
> stacktraces, that is fixed here!
> * Dom0 must run hard pinned on all host CPUs
> * Watchdog must be disabled
> * not security supported
> * x86 only
> * secureboot needs to be disabled
What would be needed to fix these limitations? With them it isn't really
possible to do profiling on production systems, only on dedicated development
boxes. That works great if you have a dev box and can create a realistic
workload with non-sensitive data, but less great if you have a problem that
you can't reproduce on a non-production system. It's also not usable
for real-time monitoring of production environments.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
On Fri, Jul 25, 2025 at 11:26 PM Demi Marie Obenour
<demiobenour@gmail.com> wrote:
>
> On 7/25/25 11:06, Edwin Török wrote:
> > Caveats:
> > * x86-only for now
> > * only tested on AMD EPYC 8124P
> > * Xen PMU support was broken to begin with on Xeon Silver 4514Y, so I
> > wasn't able to test there ('perf top' fails to parse samples). I'll
> > try to figure out what is wrong there separately
> > * for now I edit the release config in xen.spec to enable frame
> > pointers. Eventually it might be useful to have a 3rd build variant:
> > release-fp. Or teach Xen to produce/parse ORC or SFrame formats without
> > requiring frame pointers.
>
> That would definitely be nice.
>
> > * perf produces raw hex addresses, and a python script is used to
> > post-process it and obtain symbols. Eventually perf should be updated
> > to do this processing itself (there was an old patch for Linux 3.12 by Borislav Petkov)
> > * I've only tested capturing Dom0 stack traces. Linux doesn't support
> > guest stacktraces yet (it can only lookup the guest RIP)
>
> What would be needed to fix this? Capturing guest stacktraces from the host
> or Xen seems like a really bad idea, but it might make sense to interrupt the
> guest and allow it to provide a (strictly validated) stack trace for use by
> the host. This would need to be done asynchronously, as Linux is moving
> towards generating stack traces outside of the NMI handler.
The way perf captures stacktraces for userspace is that it either
walks the stack by following frame pointers, copying memory from
userspace as it goes, or it takes a copy of the entire userspace stack
(up to a limit of ~64KiB) and lets perf userspace reconstruct a
stacktrace from that (for --call-graph=dwarf).
I'd expect copying from userspace to be a lot faster than copying from
a guest, because for a guest you'd also need to map the page first,
which would be an additional cost (and you'd have to be careful not to
recurse infinitely if you get another interrupt while mapping), unless
you keep the entire guest address space mapped, or keep a cache of
mapped stack pages.
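The frame-pointer walk described above can be sketched abstractly (a
toy model, not perf's actual code: a dict stands in for safe reads
from the sampled context, and the frame layout assumed is the x86-64
convention of saved frame pointer at [fp] and return address at
[fp+8]):

```python
def walk_stack(read_word, fp, max_depth=64):
    """Follow a frame-pointer chain. Stop on a failed read, a frame
    pointer that doesn't move up the stack, a loop, or the depth cap,
    since a sampled stack may be arbitrarily corrupt."""
    trace = []
    seen = set()
    while fp and len(trace) < max_depth:
        if fp in seen:                 # guard against cycles
            break
        seen.add(fp)
        next_fp = read_word(fp)        # caller's saved frame pointer
        ret = read_word(fp + 8)        # return address in the caller
        if ret is None:
            break
        trace.append(ret)
        if next_fp is None or next_fp <= fp:  # frames must move upward
            break
        fp = next_fp
    return trace

# Toy "memory": two frames, innermost at 0x1000, chain ends at frame 2.
mem = {0x1000: 0x2000, 0x1008: 0xAAA,
       0x2000: 0x0,    0x2008: 0xBBB}
print([hex(r) for r in walk_stack(mem.get, 0x1000)])  # ['0xaaa', '0xbbb']
```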
You can let a guest profile itself though, in which case it can
process its own stacktrace, but exposing Xen's stacktrace to untrusted
guests is probably not a good idea.
You could try to also do what I've done with Xen here: have the guest
provide the stacktrace to the hypervisor, which provides it to Dom0.
But then you'd need to run some code inside the guest, and that may
not be possible if you are currently handling something on behalf of
the guest in Xen.
I'd first wait to see whether KVM implements this, and then implement
something similar for Xen. AFAICT KVM doesn't support this either.
>
> > * the Linux patch will need to be forwarded ported to master before submission
> > * All the caveats for using regular VPMU apply, except for the lack of
> > stacktraces, that is fixed here!
> What would be needed to fix these limitations?
See below for my answers to each one, although others on this mailing
list might be able to provide a more correct answer.
> > * Dom0 must run hard pinned on all host CPUs
Not sure. I think Dom0 needs to be able to run some code whenever the
NMI arrives, and that needs to run on the CPU it arrived on, unless
you define a way for one CPU to also receive and process interrupts
for CPUs that Dom0 doesn't run on.
The pinning requirement could be lifted if everything is correctly
context switched.
> > * Watchdog must be disabled
IIUC the Xen watchdog and the profiling interrupt both use NMIs, so
you can only have one of them active.
In fact, even on bare-metal Linux the NMI watchdog sometimes needs to
be disabled for certain perf counters to work, although basic
timer-based profiling and most counters work with it enabled. If
needed, 'perf' prints a message telling you to disable the Linux NMI
watchdog, but if you follow those instructions literally the host will
panic and reboot 20s later because the soft lockup detector will no
longer work (so that too would need to be disabled).
> > * not security supported
See https://xenbits.xen.org/xsa/advisory-163.html
Also even if you ignore security support, using vPMU on production
systems currently is probably not a good idea, there are probably lots
of pre-existing bugs to fix, and the bugs might be micro-architecture
specific.
E.g. with vPMU enabled running 'perf stat -ddd' in Dom0 caused one of
my (older) hosts to completely freeze (all vCPUs except one stuck in a
spinlock, and the last one not running anywhere), whereas it ran
perfectly fine on other (newer) hosts. I haven't debugged yet what is
causing it (could also be a bug in Linux, or the Linux Xen PMU driver
and not Xen).
There is a way to restrict what performance counters are exposed to
guests, and e.g. I think EC2 used to expose some of these to guests.
Initially temperatures/turbo boost could be measured from guests, but
that got disabled following an XSA:
https://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html.
Later a restricted set of PMCs got exposed (vpmu=ipc, or vpmu=arch),
which then got enabled for EC2 guests (don't know whether they still
expose these): https://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html
If that is enabled, the stacktrace is already suitably restricted to
Dom0-only, so should be safe to use, i.e. even if you can't use
`vpmu=on`, you might be able to use `vpmu=ipc`.
Currently neither of these is security supported though.
> > * x86 only
This one should be fixable, all it needs is a way to do a stacktrace,
which should already be present in the arch-specific traps.c (although
AFAICT only X86 and ARM implement stack traces currently), although
that of course assumes that other arches would have a PMU
implementation to begin with.
AFAICT xenpmu_op is only implemented on x86:
```
#ifdef CONFIG_X86
xenpmu_op(unsigned int op, xen_pmu_params_t *arg)
#endif
```
> > * secureboot needs to be disabled
>
This is because to enable vpmu you need to modify the Xen cmdline, and
that is restricted under secure boot.
If you enable vpmu at build time then it might work, but see above
about no security support.
> With them it isn't really
> possible to do profiling on production systems, only on dedicated development
> boxes.
I'd like to be able to do profiling on production too. But I'm taking
it one step at a time, at least now I'll have a way to do profiling on
development/test boxes.
For production use a different approach might be needed, e.g. LBR, or
a dedicated way to get just a hypervisor stacktrace on a timer,
without involving the (v)PMU at all.
That would require some new integration with `perf` too.
> That works great if you have a dev box and can create a realistic
> workload with non-sensitive data, but less great if you have a problem that
> you can't reproduce on a non-production system. It's also not usable
> for real-time monitoring of production environments.
Best regards,
--Edwin
> --
> Sincerely,
> Demi Marie Obenour (she/her/hers)
On Fri, Jul 25, 2025 at 4:07 PM Edwin Török <edwin.torok@cloud.com> wrote:
> [...]
>
> Edwin Török (10):
>   pmu.h: add a BUILD_BUG_ON to ensure it fits within one page
>   arch-x86/pmu.h: document current memory layout for VPMU
>   arch-x86/pmu.h: convert ascii art drawing to Unicode
>   vpmu.c: factor out register conversion
>   pmu.h: introduce a stacktrace area
>   arch-x86/pmu.h: convert ascii art diagram to Unicode
>   arch-x86/vpmu.c: store guest registers when domain_id == DOMID_XEN
>   pmu.h: expose a hypervisor stacktrace feature
>   vpmu.c hypervisor stacktrace support in vPMU
>   xen/tools/pyperf.py: example script to parse perf output
>
>  xen/arch/x86/cpu/vpmu.c           | 130 ++++++++++++++++++++------
>  xen/arch/x86/cpu/vpmu_amd.c       |   2 +-
>  xen/arch/x86/cpu/vpmu_intel.c     |   2 +-
>  xen/arch/x86/include/asm/vpmu.h   |   1 +
>  xen/include/public/arch-arm.h     |   1 +
>  xen/include/public/arch-ppc.h     |   1 +
>  xen/include/public/arch-riscv.h   |   1 +
>  xen/include/public/arch-x86/pmu.h | 101 ++++++++++++++++++++-
>  xen/include/public/pmu.h          |  41 ++++++++-
>  xen/tools/pyperf.py               | 146 ++++++++++++++++++++++++++++++
>  10 files changed, 395 insertions(+), 31 deletions(-)
>  create mode 100644 xen/tools/pyperf.py
>

For convenience this is also available as a git repository here:
https://gitlab.com/xen-project/people/edwintorok/xen/-/commits/pmustack?ref_type=heads
https://github.com/edwintorok/linux-stable/commits/pmustack/

Best regards,
--Edwin