[PATCH 0/3] hvf x86 correctness and efficiency improvements

Phil Dennis-Jordan posted 3 patches 7 months, 1 week ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20230922140914.13906-1-phil@philjordan.eu
Maintainers: Cameron Esfahani <dirty@apple.com>, Roman Bolshakov <rbolshakov@ddn.com>
[PATCH 0/3] hvf x86 correctness and efficiency improvements
Posted by Phil Dennis-Jordan 7 months, 1 week ago
This is a series of semi-related patches for the x86 macOS Hypervisor.framework
(hvf) accelerator backend. The intention is to make VMs run slightly more
efficiently on macOS host machines. They have been subject to some months of
CI workloads with macOS guest VMs without issues, and they seem to give a few
percent performance improvement. (Though this varies greatly with the type of
workload.)

Patch 1 enables the INVTSC CPUID bit when running with hvf. This can enable
some optimisations in the guest OS, and I've not found any reason it shouldn't
be allowed for hvf based hosts.
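
The change itself is essentially a one-liner in hvf's supported-CPUID
filtering, along these lines (sketch; see patch 1 for the real thing, the
exact masking may differ):

case 0x80000007:
    edx &= CPUID_APM_INVTSC; /* only expose the invariant TSC bit */
    break;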

Patch 2 fixes hvf_kick_vcpu_thread so it actually forces a VM exit instead of
doing nothing. I guess this previously didn't cause any huge issues because
hvf's hv_vcpu_run() would exit so extremely frequently of its own accord. The
temp variable is needed because the pointer expected by the hv_vcpu_interrupt()
call doesn't match the fd field's type in the hvf accel's struct AccelCPUState.
I'm unsure if it would be better to change that struct field to the relevant
architecture's handle types, hv_vcpuid_t (x86, unsigned int) and hv_vcpu_t
(aarch64, uint64_t), perhaps via an intermediate typedef?
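
Roughly this (sketch; see patch 2 for the real thing - the temporary
satisfies hv_vcpu_interrupt()'s hv_vcpuid_t * parameter):

static void hvf_kick_vcpu_thread(CPUState *cpu)
{
    cpus_kick_thread(cpu);           /* keep the existing signal-based kick */
    hv_vcpuid_t fd = cpu->accel->fd; /* fd's declared type differs, hence the temp */
    hv_vcpu_interrupt(&fd, 1);       /* force the vCPU out of guest mode */
}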

Patch 3 replaces the call to hv_vcpu_run() with the more modern
hv_vcpu_run_until() for running the guest vCPU. The newer API is available
from macOS 10.15 host systems onwards. This call causes significantly fewer
VM exits, which also means we really need that exit-forcing interrupt from
patch 2. The reduction in VM exits means less overhead from exits and less
contention on the BQL. Using hv_vcpu_run_until() is also a prerequisite for
using certain newer hvf features, though this patchset doesn't use any.
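
The core of the change is along these lines (sketch; HV_DEADLINE_FOREVER is
the macOS 11+ spelling, on 10.15 an explicit mach_absolute_time()-based
deadline would be passed instead):

/* before */
hv_return_t r = hv_vcpu_run(cpu->accel->fd);

/* after: stay in guest mode until a real exit or an explicit kick */
hv_return_t r = hv_vcpu_run_until(cpu->accel->fd, HV_DEADLINE_FOREVER);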

Patches 2 & 3 must therefore be applied in that order; patch 1 is independent.

This work has been sponsored by Sauce Labs Inc.

Phil Dennis-Jordan (3):
  i386: hvf: Adds support for INVTSC cpuid bit
  i386: hvf: In kick_vcpu use hv_vcpu_interrupt to force exit
  i386: hvf: Updates API usage to use modern vCPU run function

 target/i386/hvf/hvf.c       | 26 +++++++++++++++++++++++++-
 target/i386/hvf/x86_cpuid.c |  4 ++++
 2 files changed, 29 insertions(+), 1 deletion(-)

-- 
2.36.1


Re: [PATCH 0/3] hvf x86 correctness and efficiency improvements
Posted by Paolo Bonzini 6 months, 2 weeks ago
On 9/22/23 16:09, Phil Dennis-Jordan wrote:
> Patch 1 enables the INVTSC CPUID bit when running with hvf. This can 
> enable some optimisations in the guest OS, and I've not found any reason 
> it shouldn't be allowed for hvf based hosts.

It can be enabled, but it should include a migration blocker.  In fact, 
probably HVF itself should include a migration blocker because QEMU 
doesn't support TSC scaling.
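
Something along these lines, modulo the exact signature in the current
tree:

Error *blocker = NULL;
error_setg(&blocker, "Migration is not supported with hvf");
migrate_add_blocker(blocker, &error_fatal);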

> Patch 2 fixes hvf_kick_vcpu_thread so it actually forces a VM exit instead of
> doing nothing. I guess this previously didn't cause any huge issues because
> hvf's hv_vcpu_run() would exit so extremely frequently of its own accord.

No, it's because signal delivery should already kick the vCPU out of 
guest mode.  I guess it does with hv_vcpu_run(), but not with 
hv_vcpu_run_until()?

hv_vcpu_interrupt() is a problematic API, in that it is not clear how it 
handles races with hv_vcpu_run().  In particular, whether this causes an 
immediate vmexit or not:

            thread 1                         thread 2
            -----------------------          -----------------------
                                             <CPU not in guest mode>
            hv_vcpu_interrupt()
                                             hv_vcpu_run() (or run_until)

Not that the current code is any better; there is no guarantee that the 
signal is delivered before hv_vcpu_run() is called.  However, if as you 
said hv_vcpu_run() will exit often of its own accord but 
hv_vcpu_run_until() does not, then this could cause difficult to 
reproduce bugs by switching to the latter.

> The temp variable is needed because the pointer expected by the hv_vcpu_interrupt()
> call doesn't match the fd field's type in the hvf accel's struct AccelCPUState.
> I'm unsure if it would be better to change that struct field to the relevant
> architecture's handle types, hv_vcpuid_t (x86, unsigned int) and hv_vcpu_t
> (aarch64, uint64_t), perhaps via an intermediate typedef?

I think fd and the HVF type should be placed in an anonymous union.

Thanks,

Paolo
Re: [PATCH 0/3] hvf x86 correctness and efficiency improvements
Posted by Phil Dennis-Jordan 6 months, 2 weeks ago
Hi Paolo,


On Mon, 16 Oct 2023 at 16:39, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 9/22/23 16:09, Phil Dennis-Jordan wrote:
> > Patch 1 enables the INVTSC CPUID bit when running with hvf. This can
> > enable some optimisations in the guest OS, and I've not found any reason
> > it shouldn't be allowed for hvf based hosts.
>
> It can be enabled, but it should include a migration blocker.  In fact,
> probably HVF itself should include a migration blocker because QEMU
> doesn't support TSC scaling.

I didn't think Qemu's HVF backend supported migration in any form at this
point anyway? Or do you mean machine model versioning of the default
setting?

> > Patch 2 fixes hvf_kick_vcpu_thread so it actually forces a VM exit instead of
> > doing nothing. I guess this previously didn't cause any huge issues because
> > hvf's hv_vcpu_run() would exit so extremely frequently of its own accord.
>
> No, it's because signal delivery should already kick the vCPU out of
> guest mode.  I guess it does with hv_vcpu_run(), but not with
> hv_vcpu_run_until()?

It's possible! At any rate, hv_vcpu_run() only seems to run for VERY short
time slices in practice. I'll try gathering some data on exit reasons/kick
delivery latencies with the various combinations of kick methods and vcpu
run APIs if that helps.

> hv_vcpu_interrupt() is a problematic API, in that it is not clear how it
> handles races with hv_vcpu_run().  In particular, whether this causes an
> immediate vmexit or not:
>
>             thread 1                         thread 2
>             -----------------------          -----------------------
>                                              <CPU not in guest mode>
>             hv_vcpu_interrupt()
>                                              hv_vcpu_run() (or run_until)

Roman brought me up to speed on this issue in the thread here:
https://lists.gnu.org/archive/html/qemu-devel/2023-10/msg02156.html
including a patch and discussion on the subject from 2020. I've now updated
Roman's old patch, which addressed this (with your feedback at the time) but
never got merged. (Judging by the list archives, v4 just never got reviewed.)
Details on my latest progress here:

https://lists.gnu.org/archive/html/qemu-devel/2023-10/msg04752.html


> Not that the current code is any better; there is no guarantee that the
> signal is delivered before hv_vcpu_run() is called.  However, if as you
> said hv_vcpu_run() will exit often of its own accord but
> hv_vcpu_run_until() does not, then this could cause difficult to
> reproduce bugs by switching to the latter.

At least with Roman's updated kick patch, according to tracing points I
inserted locally, I'm not seeing any slow or undelivered kicks with
hv_vcpu_run_until. However, I'm still observing the (S)VGA issues to which
Roman drew my attention at the same time, so I'm wondering if the issues
we're seeing with hv_vcpu_run_until aren't (all) related to interrupt
delivery. (To be clear, switching to hv_vcpu_run_until() WITHOUT
hv_vcpu_interrupt() causes some very obvious problems where the vCPU simply
doesn't exit at all for long periods.)

From my point of view, there are at least 2 strong incentives for switching
to hv_vcpu_run_until():

1. hv_vcpu_run() exits very frequently, and often there is actually nothing
for the VMM to do except call hv_vcpu_run() again. With Qemu's current hvf
backend, each exit causes a BQL acquisition, and VMs with a bunch of vCPUs
rapidly become limited by BQL contention according to my profiling.

2. The HVF in macOS 12 and newer contains an in-kernel APIC implementation,
in a similar vein to KVM’s kernel irqchip. This should further reduce the
number of VM exits visible to userspace, but using it is conditional on
using hv_vcpu_run_until(); calling hv_vcpu_run() errors out with it enabled.
Integrating that into Qemu is still a work in progress, but I’ve got pretty
far. (Most of the APIC and IOAPIC suites in kvm-unit-tests pass.)

For those 2 reasons I’m pretty motivated to make things work with
hv_vcpu_run_until() one way or another.
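
To put point 1 in code, the current backend is schematically shaped like
this (simplified; should_stop and handle_exit stand in for the real loop
condition and exit dispatch):

while (!should_stop) {
    qemu_mutex_unlock_iothread(); /* BQL dropped around guest entry */
    hv_vcpu_run(cpu->accel->fd);  /* exits very frequently */
    qemu_mutex_lock_iothread();   /* every exit re-takes the BQL... */
    handle_exit(cpu);             /* ...even when there's nothing to do */
}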

> > The temp variable is needed because the pointer expected by the hv_vcpu_interrupt()
> > call doesn't match the fd field's type in the hvf accel's struct AccelCPUState.
> > I'm unsure if it would be better to change that struct field to the relevant
> > architecture's handle types, hv_vcpuid_t (x86, unsigned int) and hv_vcpu_t
> > (aarch64, uint64_t), perhaps via an intermediate typedef?
>
> I think fd and the HVF type should be placed in an anonymous union.

That’s a good idea, I’ll put a patch together doing exactly that.

Thanks,
Phil
Re: [PATCH 0/3] hvf x86 correctness and efficiency improvements
Posted by Paolo Bonzini 6 months, 2 weeks ago
On Mon, Oct 16, 2023 at 6:45 PM Phil Dennis-Jordan <lists@philjordan.eu> wrote:
>
> Hi Paolo,
>
>
> On Mon, 16 Oct 2023 at 16:39, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 9/22/23 16:09, Phil Dennis-Jordan wrote:
> > > Patch 1 enables the INVTSC CPUID bit when running with hvf. This can
> > > enable some optimisations in the guest OS, and I've not found any reason
> > > it shouldn't be allowed for hvf based hosts.
> >
> > It can be enabled, but it should include a migration blocker.  In fact,
> > probably HVF itself should include a migration blocker because QEMU
> > doesn't support TSC scaling.
>
> I didn't think Qemu's HVF backend supported migration in any form at this point anyway? Or do you mean machine model versioning of the default setting?

If it doesn't support migration, it needs to register a migration blocker.

> switching to hv_vcpu_run_until() WITHOUT hv_vcpu_interrupt()
> causes some very obvious problems where the vCPU simply
> doesn't exit at all for long periods.)

Yes, that makes sense. It looks like hv_vcpu_run_until() has an
equivalent of a "do ... while (errno == EINTR)" loop inside it.
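
I.e., speculating, something of this shape inside the framework
(hypothetical names):

do {
    ret = enter_guest(vcpu);          /* hypothetical internal call */
} while (ret != 0 && errno == EINTR); /* signal-induced exits are swallowed */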

> 1. hv_vcpu_run() exits very frequently, and often there is actually
> nothing for the VMM to do except call hv_vcpu_run() again. With
> Qemu's current hvf backend, each exit causes a BQL acquisition,
> and VMs with a bunch of vCPUs rapidly become limited by BQL
> contention according to my profiling.

Yes, that should be fixed anyway, but I agree it is a separate issue.

Paolo
Re: [PATCH 0/3] hvf x86 correctness and efficiency improvements
Posted by Phil Dennis-Jordan 6 months, 2 weeks ago
On Mon, 16 Oct 2023 at 18:49, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > switching to hv_vcpu_run_until() WITHOUT hv_vcpu_interrupt()
> > causes some very obvious problems where the vCPU simply
> > doesn't exit at all for long periods.)
>
> Yes, that makes sense. It looks like hv_vcpu_run_until() has an
> equivalent of a "do ... while (errno == EINTR)" loop inside it.

Thinking about this some more, I wonder if it's worth putting together
some test code to check empirically how signals and hv_vcpu_interrupt
interact with each of the 2 vcpu_run implementations. It should be
pretty straightforward to establish whether calling hv_vcpu_interrupt
*before* hv_vcpu_run*() causes an instant exit when it is called, and
whether the signal causes hv_vcpu_run() to exit. (Based on observed
behaviour, I'm already pretty confident hv_vcpu_run_until() ignores
signals until it exits for other reasons.)
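
A rough sketch of what I have in mind (untested; guest setup and error
handling elided):

#include <Hypervisor/hv.h>
#include <pthread.h>
#include <stdio.h>
#include <mach/mach_time.h>

static hv_vcpuid_t vcpu;

static void *interrupter(void *arg)
{
    /* Arm the interrupt while the vCPU is NOT in guest mode yet */
    hv_vcpu_interrupt(&vcpu, 1);
    return NULL;
}

int main(void)
{
    pthread_t t;
    hv_vm_create(HV_VM_DEFAULT);
    hv_vcpu_create(&vcpu, HV_VCPU_DEFAULT);
    /* ... set up a guest that just spins ... */

    pthread_create(&t, NULL, interrupter, NULL);
    pthread_join(t, NULL);

    /* Does the pre-armed interrupt make the run call return immediately? */
    uint64_t t0 = mach_absolute_time();
    hv_vcpu_run(vcpu); /* repeat the experiment with hv_vcpu_run_until() */
    printf("run call returned after %llu ticks\n",
           (unsigned long long)(mach_absolute_time() - t0));

    hv_vcpu_destroy(vcpu);
    hv_vm_destroy();
    return 0;
}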

> > 1. hv_vcpu_run() exits very frequently, and often there is actually
> > nothing for the VMM to do except call hv_vcpu_run() again. With
> > Qemu's current hvf backend, each exit causes a BQL acquisition,
> > and VMs with a bunch of vCPUs rapidly become limited by BQL
> > contention according to my profiling.
>
> Yes, that should be fixed anyway, but I agree it is a separate issue.

I've locally got a bunch of very messy patches for reducing BQL
acquisition in the x86 HVF vCPU loop. However, I found it difficult to
make much of a real difference, and the code gets a lot harder to
follow.

- Decoding instructions that need emulating can be done outside the
lock (see the sketch below). Actually running the instruction typically
ends up causing an action that needs it though, so the win isn't that
big.
- With hv_vcpu_run() there are a bunch of (EPT fault) exits that don't
really require anything in particular to be done. Instead of detecting
those outside the lock, you can switch to hv_vcpu_run_until(), which
avoids those exits altogether.
- KVM's vCPU loop handles MMIO faults without holding the BQL, though
I think the handlers acquire it internally?
- Going from xAPIC to x2APIC reduces the number of exits, and using
MSRs is a bit quicker than walking the memory hierarchy, so that
definitely helps too, but the APIC implementation still needs the BQL
held, and untangling that is probably harder than switching to an
in-kernel APIC.
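
For the first point, the shape is roughly this (sketch against the
existing hvf decoder/emulator helpers; the locking placement is the
experimental part):

struct x86_decode decode;
decode_instruction(env, &decode); /* decoding itself doesn't need the BQL */
qemu_mutex_lock_iothread();
exec_instruction(env, &decode);   /* device/APIC accesses still need it */
qemu_mutex_unlock_iothread();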

Beyond that: there's a good chance that turning the BQL into a
read-write lock would help, but that's a much bigger undertaking than
I'm currently willing to entertain!


Re hvf fd:
> I think fd and the HVF type should be placed in an anonymous union.

Taking a look at the code and thinking through the implications, I'm
not actually sure what the intention of the union is? IIRC Qemu is
built with strict aliasing rules disabled, but there seems little
point in actively using union aliasing here? We can just as easily
modify this block at the top of hvf_int.h:

#ifdef __aarch64__
#include <Hypervisor/Hypervisor.h>
#else
#include <Hypervisor/hv.h>
#endif

to something like:

#ifdef __aarch64__
#include <Hypervisor/Hypervisor.h>
typedef hv_vcpu_t hvf_vcpu_id;
#else
#include <Hypervisor/hv.h>
typedef hv_vcpuid_t hvf_vcpu_id;
#endif

And then:

struct AccelCPUState {
    hvf_vcpu_id fd;
    void *exit;
    …

Or, if we want AccelCPUState to have consistent size/alignment
properties across architectures, we can use a union after all and
simply never use the fd_padding field:

struct AccelCPUState {
    union {
        hvf_vcpu_id fd;
        uint64_t fd_padding;
    };
    void *exit;
    …

(Or is that what you were thinking of with the union idea in the first place?)


Thanks,
Phil
Re: [PATCH 0/3] hvf x86 correctness and efficiency improvements
Posted by Paolo Bonzini 6 months, 2 weeks ago
On Mon, Oct 16, 2023 at 10:05 PM Phil Dennis-Jordan <lists@philjordan.eu> wrote:
> > I think fd and the HVF type should be placed in an anonymous union.
>
> Taking a look at the code and thinking through the implications, I'm
> not actually sure what the intention of the union is

Never mind, I had missed that AccelCPUState depends on what the
actual accelerator is. So yeah, you can just change the type of "fd";
it was an int only because in the past the field was defined for all
accelerators.

Paolo
Re: [PATCH 0/3] hvf x86 correctness and efficiency improvements
Posted by Phil Dennis-Jordan 6 months, 4 weeks ago
Ping - let me know if there's anything particularly controversial,
unclear, etc. about these patches or if I can do anything to make
reviewing easier.

Thanks!


On Fri, 22 Sept 2023 at 16:09, Phil Dennis-Jordan <phil@philjordan.eu> wrote:
>
> This is a series of semi-related patches for the x86 macOS Hypervisor.framework
> (hvf) accelerator backend. The intention is to make VMs run slightly more
> efficiently on macOS host machines. They have been subject to some months of
> CI workloads with macOS guest VMs without issues, and they seem to give a few
> percent performance improvement. (Though this varies greatly with the type of
> workload.)
>
> Patch 1 enables the INVTSC CPUID bit when running with hvf. This can enable
> some optimisations in the guest OS, and I've not found any reason it shouldn't
> be allowed for hvf based hosts.
>
> Patch 2 fixes hvf_kick_vcpu_thread so it actually forces a VM exit instead of
> doing nothing. I guess this previously didn't cause any huge issues because
> hvf's hv_vcpu_run() would exit so extremely frequently of its own accord. The
> temp variable is needed because the pointer expected by the hv_vcpu_interrupt()
> call doesn't match the fd field's type in the hvf accel's struct AccelCPUState.
> I'm unsure if it would be better to change that struct field to the relevant
> architecture's handle types, hv_vcpuid_t (x86, unsigned int) and hv_vcpu_t
> (aarch64, uint64_t), perhaps via an intermediate typedef?
>
> Patch 3 replaces the call to hv_vcpu_run() with the more modern
> hv_vcpu_run_until() for running the guest vCPU. The newer API is available
> from macOS 10.15 host systems onwards. This call causes significantly fewer
> VM exits, which also means we really need that exit-forcing interrupt from
> patch 2. The reduction in VM exits means less overhead from exits and less
> contention on the BQL. Using hv_vcpu_run_until() is also a prerequisite for
> using certain newer hvf features, though this patchset doesn't use any.
>
> Patches 2 & 3 must therefore be applied in that order; patch 1 is independent.
>
> This work has been sponsored by Sauce Labs Inc.
>
> Phil Dennis-Jordan (3):
>   i386: hvf: Adds support for INVTSC cpuid bit
>   i386: hvf: In kick_vcpu use hv_vcpu_interrupt to force exit
>   i386: hvf: Updates API usage to use modern vCPU run function
>
>  target/i386/hvf/hvf.c       | 26 +++++++++++++++++++++++++-
>  target/i386/hvf/x86_cpuid.c |  4 ++++
>  2 files changed, 29 insertions(+), 1 deletion(-)
>
> --
> 2.36.1
>