Hi there,
This series of patches enables callchains for guests (used by perf kvm),
which holds the top spot on the perf wiki TODO list [1]. It allows users
to perform callchain and performance analysis of the guest OS from the
host, using PMU events.
The event processing flow is as follows (shown as backtrace):
#0 kvm_arch_vcpu_get_frame_pointer / kvm_arch_vcpu_read_virt (per arch)
#1 kvm_guest_get_frame_pointer / kvm_guest_read_virt
<callback function pointers in `struct perf_guest_info_callbacks`>
#2 perf_guest_get_frame_pointer / perf_guest_read_virt
#3 perf_callchain_guest
#4 get_perf_callchain
#5 perf_callchain
Between #0 and #1 is the interface between KVM and the arch-specific
implementation, while between #1 and #2 is the interface between perf and
KVM. The 1st patch implements #0. The 2nd patch extends the interfaces
between #1 and #2, while the 3rd patch implements #1. The 4th patch
implements #3 and modifies #4 and #5. The last patch is for the userspace
tools.
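As a rough illustration of this layering, here is a hedged userspace
sketch: the struct name and registration function match the kernel's
existing `perf_guest_info_callbacks` machinery, but the members and
bodies below are simplified stand-ins for what this series proposes,
not the kernel's actual definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the callback table perf uses to reach KVM.
 * The real struct perf_guest_info_callbacks has different members;
 * these two mirror the interfaces proposed in this series (#1/#2). */
struct perf_guest_info_callbacks {
	uint64_t (*get_frame_pointer)(void);
	int (*read_virt)(uint64_t gva, void *buf, size_t len);
};

static struct perf_guest_info_callbacks *guest_cbs;

/* #2: perf-side wrapper; tolerates no callbacks being registered. */
static uint64_t perf_guest_get_frame_pointer(void)
{
	return guest_cbs ? guest_cbs->get_frame_pointer() : 0;
}

/* #1: KVM-side implementation, which would forward to the per-arch
 * kvm_arch_vcpu_get_frame_pointer() hook (#0); dummy value here. */
static uint64_t kvm_guest_get_frame_pointer(void)
{
	return 0x7fff0000ULL;
}

static struct perf_guest_info_callbacks kvm_guest_cbs = {
	.get_frame_pointer = kvm_guest_get_frame_pointer,
	.read_virt = NULL,	/* omitted in this sketch */
};

/* Registration, as perf_register_guest_info_callbacks() does. */
static void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
{
	guest_cbs = cbs;
}
```

The point of the indirection is that perf core never calls KVM (or the
arch code) directly; it only sees whatever table was registered, and
degrades gracefully when none is.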
Since arm64 does not yet provide some of the foundational infrastructure
(an interface for reading from a guest virtual address), the arm64
implementation is stubbed for now; it is a bit complex and will be
implemented later.
Tested with both 32-bit and 64-bit guest operating systems / unikernels;
`perf script` correctly shows the expected callchains.
FlameGraphs can also be generated with this series of patches and [2].
Any feedback will be greatly appreciated.
[1] https://perf.wiki.kernel.org/index.php/Todo
[2] https://github.com/brendangregg/FlameGraph
v1:
https://lore.kernel.org/kvm/SYYP282MB108686A73C0F896D90D246569DE5A@SYYP282MB1086.AUSP282.PROD.OUTLOOK.COM/
Changes since v1:
- v1 only includes partial KVM modifications, while v2 is a complete
implementation. Also updated based on Sean's feedback.
Tianyi Liu (5):
KVM: Add arch specific interfaces for sampling guest callchains
perf kvm: Introduce guest interfaces for sampling callchains
KVM: implement new perf interfaces
perf kvm: Support sampling guest callchains
perf tools: Support PERF_CONTEXT_GUEST_* flags
arch/arm64/kvm/arm.c | 17 +++++++++
arch/x86/events/core.c | 56 +++++++++++++++++++++++------
arch/x86/kvm/x86.c | 18 ++++++++++
include/linux/kvm_host.h | 4 +++
include/linux/perf_event.h | 18 +++++++++-
kernel/bpf/stackmap.c | 8 ++---
kernel/events/callchain.c | 27 +++++++++++++-
kernel/events/core.c | 17 ++++++++-
tools/perf/builtin-timechart.c | 6 ++++
tools/perf/util/data-convert-json.c | 6 ++++
tools/perf/util/machine.c | 6 ++++
virt/kvm/kvm_main.c | 25 +++++++++++++
12 files changed, 191 insertions(+), 17 deletions(-)
base-commit: 8a749fd1a8720d4619c91c8b6e7528c0a355c0aa
--
2.42.0
On Sun, 08 Oct 2023 15:48:17 +0100, Tianyi Liu <i.pear@outlook.com> wrote:
>
> Hi there,
>
> This series of patches enables callchains for guests (used by perf kvm),
> which holds the top spot on the perf wiki TODO list [1]. This allows users
> to perform guest OS callchain or performance analysis from external
> using PMU events.
>
> The event processing flow is as follows (shown as backtrace):
> #0 kvm_arch_vcpu_get_frame_pointer / kvm_arch_vcpu_read_virt (per arch)
> #1 kvm_guest_get_frame_pointer / kvm_guest_read_virt
>    <callback function pointers in `struct perf_guest_info_callbacks`>
> #2 perf_guest_get_frame_pointer / perf_guest_read_virt
> #3 perf_callchain_guest
> #4 get_perf_callchain
> #5 perf_callchain
>
> Between #0 and #1 is the interface between KVM and the arch-specific
> impl, while between #1 and #2 is the interface between Perf and KVM.
> The 1st patch implements #0. The 2nd patch extends interfaces between #1
> and #2, while the 3rd patch implements #1. The 4th patch implements #3
> and modifies #4 #5. The last patch is for userspace utils.
>
> Since arm64 hasn't provided some foundational infrastructure (interface
> for reading from a virtual address of guest), the arm64 implementation
> is stubbed for now because it's a bit complex, and will be implemented
> later.

I hope you realise that such an "interface" would be, by definition,
fragile and very likely to break in a subtle way. The only existing
case where we walk the guest's page tables is for NV, and even that is
extremely fragile.

Given that, I really wonder why this needs to happen in the kernel.
Userspace has all the required information to interrupt a vcpu and
walk its current context, without any additional kernel support. What
are the bits here that cannot be implemented anywhere else?

> Tested with both 32-bit and 64-bit guest operating systems / unikernels,
> that `perf script` could correctly show the certain callchains.
> FlameGraphs can also be generated with this series of patches and [2].
>
> Any feedback will be greatly appreciated.
>
> [1] https://perf.wiki.kernel.org/index.php/Todo
> [2] https://github.com/brendangregg/FlameGraph
>
> v1:
> https://lore.kernel.org/kvm/SYYP282MB108686A73C0F896D90D246569DE5A@SYYP282MB1086.AUSP282.PROD.OUTLOOK.COM/
>
> Changes since v1:
> - v1 only includes partial KVM modifications, while v2 is a complete
>   implementation. Also updated based on Sean's feedback.
>
> Tianyi Liu (5):
>   KVM: Add arch specific interfaces for sampling guest callchains
>   perf kvm: Introduce guest interfaces for sampling callchains
>   KVM: implement new perf interfaces
>   perf kvm: Support sampling guest callchains
>   perf tools: Support PERF_CONTEXT_GUEST_* flags
>
>  arch/arm64/kvm/arm.c | 17 +++++++++

Given that there is more to KVM than just arm64 and x86, I suggest
that you move the lack of support for this feature into the main KVM
code.

Thanks,

	M.

--
Without deviation from the norm, progress is not possible.
Hi Marc,

On Sun, 11 Oct 2023 16:45:17 +0000, Marc Zyngier wrote:
> > The event processing flow is as follows (shown as backtrace):
> > #0 kvm_arch_vcpu_get_frame_pointer / kvm_arch_vcpu_read_virt (per arch)
> > #1 kvm_guest_get_frame_pointer / kvm_guest_read_virt
> >    <callback function pointers in `struct perf_guest_info_callbacks`>
> > #2 perf_guest_get_frame_pointer / perf_guest_read_virt
> > #3 perf_callchain_guest
> > #4 get_perf_callchain
> > #5 perf_callchain
> >
> > Between #0 and #1 is the interface between KVM and the arch-specific
> > impl, while between #1 and #2 is the interface between Perf and KVM.
> > The 1st patch implements #0. The 2nd patch extends interfaces between #1
> > and #2, while the 3rd patch implements #1. The 4th patch implements #3
> > and modifies #4 #5. The last patch is for userspace utils.
> >
> > Since arm64 hasn't provided some foundational infrastructure (interface
> > for reading from a virtual address of guest), the arm64 implementation
> > is stubbed for now because it's a bit complex, and will be implemented
> > later.
>
> I hope you realise that such an "interface" would be, by definition,
> fragile and very likely to break in a subtle way. The only existing
> case where we walk the guest's page tables is for NV, and even that is
> extremely fragile.
>

For walking the guest's page tables, yes, there're only very few
use cases. Most of them are used in nested virtualization and XEN.

> Given that, I really wonder why this needs to happen in the kernel.
> Userspace has all the required information to interrupt a vcpu and
> walk its current context, without any additional kernel support. What
> are the bits here that cannot be implemented anywhere else?
>

Thanks for pointing this out, I agree with your opinion.
Whether it's walking guest's contexts or performing an unwind,
user space can indeed accomplish these tasks.
The only reasons I see for implementing them in the kernel are performance
and the access to a broader range of PMU events.

Consider if I were to implement these functionalities in userspace:
I could have `perf kvm` periodically access the guest through the KVM API
to retrieve the necessary information. However, interrupting a VCPU
through the KVM API from user space might introduce higher latency
(not tested specifically), and the overhead of syscalls could also
limit the sampling frequency.

Additionally, it seems that user space can only interrupt the VCPU
at a certain frequency, without harnessing the richness of the PMU's
performance events. And if we incorporate the logic into the kernel,
`perf kvm` can bind to various PMU events and sample with a faster
performance in PMU interrupts.

So, it appears to be a tradeoff -- whether it's necessary to introduce
more complexity in the kernel to gain access to a broader range and more
precise performance data with less overhead. In my current use case,
I just require simple periodic sampling, which is sufficient for me,
so I'm open to both approaches.

> > Tianyi Liu (5):
> >   KVM: Add arch specific interfaces for sampling guest callchains
> >   perf kvm: Introduce guest interfaces for sampling callchains
> >   KVM: implement new perf interfaces
> >   perf kvm: Support sampling guest callchains
> >   perf tools: Support PERF_CONTEXT_GUEST_* flags
> >
> >  arch/arm64/kvm/arm.c | 17 +++++++++
>
> Given that there is more to KVM than just arm64 and x86, I suggest
> that you move the lack of support for this feature into the main KVM
> code.

Currently, sampling for KVM guests is only available for the guest's
instruction pointer, and even the support is limited, it is available
on only two architectures (x86 and arm64). This functionality relies on
a kernel configuration option called `CONFIG_GUEST_PERF_EVENTS`,
which will only be enabled on x86 and arm64.
Within the main KVM code, these interfaces are enclosed within
`#ifdef CONFIG_GUEST_PERF_EVENTS`. Do you think these are enough?

Best regards,
Tianyi Liu
On Thu, Oct 12, 2023 at 02:35:42PM +0800, Tianyi Liu wrote:
> Hi Marc,
>
> On Sun, 11 Oct 2023 16:45:17 +0000, Marc Zyngier wrote:
> > > The event processing flow is as follows (shown as backtrace):
> > > #0 kvm_arch_vcpu_get_frame_pointer / kvm_arch_vcpu_read_virt (per arch)
> > > #1 kvm_guest_get_frame_pointer / kvm_guest_read_virt
> > > <callback function pointers in `struct perf_guest_info_callbacks`>
> > > #2 perf_guest_get_frame_pointer / perf_guest_read_virt
> > > #3 perf_callchain_guest
> > > #4 get_perf_callchain
> > > #5 perf_callchain
> > >
> > > Between #0 and #1 is the interface between KVM and the arch-specific
> > > impl, while between #1 and #2 is the interface between Perf and KVM.
> > > The 1st patch implements #0. The 2nd patch extends interfaces between #1
> > > and #2, while the 3rd patch implements #1. The 4th patch implements #3
> > > and modifies #4 #5. The last patch is for userspace utils.
> > >
> > > Since arm64 hasn't provided some foundational infrastructure (interface
> > > for reading from a virtual address of guest), the arm64 implementation
> > > is stubbed for now because it's a bit complex, and will be implemented
> > > later.
> >
> > I hope you realise that such an "interface" would be, by definition,
> > fragile and very likely to break in a subtle way. The only existing
> > case where we walk the guest's page tables is for NV, and even that is
> > extremely fragile.
>
> For walking the guest's page tables, yes, there're only very few
> use cases. Most of them are used in nested virtualization and XEN.
The key point isn't the lack of use cases; the key point is that *this is
fragile*.
Consider that walking guest page tables is only safe because:
(a) The walks happen in the guest-physical / intermediate-physical address
space of the guest, and so are not themselves subject to translation via
the guest's page tables.
(b) Special traps were added to the architecture (e.g. for TLB invalidation)
which allow the host to avoid race conditions when the guest modifies page
tables.
For unwind we'd have to walk structures in the guest's virtual address space,
which can change under our feet at any time the guest is running, and handling
that requires much more care.
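To put it concretely, even the minimal safe discipline -- copy each
descriptor from guest memory exactly once and validate only the local
copy -- looks something like the following. This is a userspace
simulation with made-up helper names, not KVM code; guest-physical
memory is stood in for by a flat buffer.

```c
#include <stdint.h>
#include <string.h>

/* "Guest-physical" memory simulated by a flat buffer; reads outside
 * it fail, standing in for an unmapped GPA. */
#define GPA_SIZE 4096
static uint8_t gpa_space[GPA_SIZE];

static int read_guest_phys(uint64_t gpa, void *dst, size_t len)
{
	if (gpa + len > GPA_SIZE)
		return 0;
	memcpy(dst, &gpa_space[gpa], len);
	return 1;
}

/* One step of a software page-table walk. The descriptor is copied
 * from guest memory exactly once; both the validity check and the
 * next-level address come from the local copy, so a concurrent guest
 * write can only make the walk fail, never feed us an address we did
 * not validate. Re-reading the descriptor would reopen that window. */
static int walk_level(uint64_t table, unsigned int idx, uint64_t *next)
{
	uint64_t desc;

	if (!read_guest_phys(table + idx * sizeof(desc), &desc, sizeof(desc)))
		return 0;
	if (!(desc & 1))	/* valid bit clear: stop quietly, no fault */
		return 0;
	*next = desc & ~0xfffULL;
	return 1;
}
```

Even with this single-copy pattern, the result can describe a mapping
the guest has already torn down by the time it is reported -- which is
exactly the kind of subtlety that needs to be spelled out.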
I think this needs a stronger justification, and an explanation of how you
handle such races.
Mark.
> > Given that, I really wonder why this needs to happen in the kernel.
> > Userspace has all the required information to interrupt a vcpu and
> > walk its current context, without any additional kernel support. What
> > are the bits here that cannot be implemented anywhere else?
>
> Thanks for pointing this out, I agree with your opinion.
> Whether it's walking guest's contexts or performing an unwind,
> user space can indeed accomplish these tasks.
> The only reasons I see for implementing them in the kernel are performance
> and the access to a broader range of PMU events.
>
> Consider if I were to implement these functionalities in userspace:
> I could have `perf kvm` periodically access the guest through the KVM API
> to retrieve the necessary information. However, interrupting a VCPU
> through the KVM API from user space might introduce higher latency
> (not tested specifically), and the overhead of syscalls could also
> limit the sampling frequency.
>
> Additionally, it seems that user space can only interrupt the VCPU
> at a certain frequency, without harnessing the richness of the PMU's
> performance events. And if we incorporate the logic into the kernel,
> `perf kvm` can bind to various PMU events and sample with a faster
> performance in PMU interrupts.
>
> So, it appears to be a tradeoff -- whether it's necessary to introduce
> more complexity in the kernel to gain access to a broader range and more
> precise performance data with less overhead. In my current use case,
> I just require simple periodic sampling, which is sufficient for me,
> so I'm open to both approaches.
>
> > > Tianyi Liu (5):
> > > KVM: Add arch specific interfaces for sampling guest callchains
> > > perf kvm: Introduce guest interfaces for sampling callchains
> > > KVM: implement new perf interfaces
> > > perf kvm: Support sampling guest callchains
> > > perf tools: Support PERF_CONTEXT_GUEST_* flags
> > >
> > > arch/arm64/kvm/arm.c | 17 +++++++++
> >
> > Given that there is more to KVM than just arm64 and x86, I suggest
> > that you move the lack of support for this feature into the main KVM
> > code.
>
> Currently, sampling for KVM guests is only available for the guest's
> instruction pointer, and even the support is limited, it is available
> on only two architectures (x86 and arm64). This functionality relies on
> a kernel configuration option called `CONFIG_GUEST_PERF_EVENTS`,
> which will only be enabled on x86 and arm64.
> Within the main KVM code, these interfaces are enclosed within
> `#ifdef CONFIG_GUEST_PERF_EVENTS`. Do you think these are enough?
>
> Best regards,
> Tianyi Liu
Hi Mark,

On Fri, 13 Oct 2023 15:01:22 +0100, Mark Rutland wrote:
> > > > The event processing flow is as follows (shown as backtrace):
> > > > #0 kvm_arch_vcpu_get_frame_pointer / kvm_arch_vcpu_read_virt (per arch)
> > > > #1 kvm_guest_get_frame_pointer / kvm_guest_read_virt
> > > >    <callback function pointers in `struct perf_guest_info_callbacks`>
> > > > #2 perf_guest_get_frame_pointer / perf_guest_read_virt
> > > > #3 perf_callchain_guest
> > > > #4 get_perf_callchain
> > > > #5 perf_callchain
> > > >
> > > > Between #0 and #1 is the interface between KVM and the arch-specific
> > > > impl, while between #1 and #2 is the interface between Perf and KVM.
> > > > The 1st patch implements #0. The 2nd patch extends interfaces between #1
> > > > and #2, while the 3rd patch implements #1. The 4th patch implements #3
> > > > and modifies #4 #5. The last patch is for userspace utils.
> > > >
> > > > Since arm64 hasn't provided some foundational infrastructure (interface
> > > > for reading from a virtual address of guest), the arm64 implementation
> > > > is stubbed for now because it's a bit complex, and will be implemented
> > > > later.
> > >
> > > I hope you realise that such an "interface" would be, by definition,
> > > fragile and very likely to break in a subtle way. The only existing
> > > case where we walk the guest's page tables is for NV, and even that is
> > > extremely fragile.
> >
> > For walking the guest's page tables, yes, there're only very few
> > use cases. Most of them are used in nested virtualization and XEN.
>
> The key point isn't the lack of use cases; the key point is that *this is
> fragile*.
>
> Consider that walking guest page tables is only safe because:
>
> (a) The walks happen in the guest-physical / intermediate-physical address
>     space of the guest, and so are not themselves subject to translation via
>     the guest's page tables.
>
> (b) Special traps were added to the architecture (e.g. for TLB invalidation)
>     which allow the host to avoid race conditions when the guest modifies page
>     tables.
>
> For unwind we'd have to walk structures in the guest's virtual address space,
> which can change under our feet at any time the guest is running, and handling
> that requires much more care.
>
> I think this needs a stronger justification, and an explanation of how you
> handle such races.

Yes, guests can modify the page tables at any time, so the page table we
obtain may be corrupted. We may not be able to complete the traversal of
the page table, or may read incorrect data. However, these are not
critical issues, because incorrect stack unwinding results are common in
practice anyway. In fact, we assume here that the guest OS/program keeps
stack frames (i.e. is compiled with `-fno-omit-frame-pointer`), but many
programs do not adhere to this assumption, which often leads to invalid
results. This is almost unavoidable, especially when the guest OS is
running third-party programs.

The unwind results we record may therefore be incorrect; if the unwind
cannot continue, we only record the results collected so far. Addresses
that cannot be resolved to symbols will later be marked as `[unknown]`
by `perf kvm`, and this is very common.

Our unwind strategy is conservative, to ensure safety, and does its best
in a read-only fashion. If the guest page table is broken, or the address
to be read is for some reason not present in the guest page table, we do
not inject a page fault but simply stop the unwind. The function that
walks the page table is implemented entirely in software and is
read-only, so it has no additional impact on the guest. Some results may
still be incorrect, but for profiling it is sufficient that most of the
records are correct.

Do you think this addresses your concerns?

Thanks,
Tianyi Liu
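P.S. For concreteness, the conservative stop-on-failure unwind described
above can be sketched in plain userspace C as follows. All names are
hypothetical and guest memory is simulated by a flat buffer; this is an
illustration of the policy, not the patch's code.

```c
#include <stdint.h>
#include <string.h>

/* Guest virtual memory simulated by a flat buffer; a failed read
 * stands in for an unmapped address or a racing guest update. */
#define GUEST_SIZE 4096
static uint8_t guest[GUEST_SIZE];

static int read_virt(uint64_t gva, void *dst, size_t len)
{
	if (gva + len > GUEST_SIZE)
		return 0;
	memcpy(dst, &guest[gva], len);
	return 1;
}

/* Conservative frame-pointer unwind: stop on a failed read, a
 * non-advancing frame pointer, or the depth limit, and keep whatever
 * partial callchain was collected -- never fault the guest. */
static int sample_callchain(uint64_t fp, uint64_t *ips, int max)
{
	uint64_t rec[2];	/* { caller's fp, return address } */
	int n = 0;

	while (n < max) {
		if (!read_virt(fp, rec, sizeof(rec)))
			break;		/* partial result is still useful */
		ips[n++] = rec[1];
		if (rec[0] <= fp)	/* stack grows down: fp must increase */
			break;
		fp = rec[0];
	}
	return n;
}
```

A truncated chain from a racing or frame-pointer-less guest simply shows
up as a shorter sample, which `perf kvm` then symbolizes as far as it can.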