[v1] hvf: Implement Apple Silicon Support

[PATCH 2/8] hvf: Move common code out

Posted by Alexander Graf 4 years, 3 months ago

Until now, Hypervisor.framework has only been available on x86_64 systems.
With Apple Silicon shipping now, it extends its reach to aarch64. To
prepare for support for multiple architectures, let's move common code out
into its own accel directory.

Signed-off-by: Alexander Graf <agraf@csgraf.de>
---
 MAINTAINERS                 |   9 +-
 accel/hvf/hvf-all.c         |  56 +++++
 accel/hvf/hvf-cpus.c        | 468 ++++++++++++++++++++++++++++++++++++
 accel/hvf/meson.build       |   7 +
 accel/meson.build           |   1 +
 include/sysemu/hvf_int.h    |  69 ++++++
 target/i386/hvf/hvf-cpus.c  | 131 ----------
 target/i386/hvf/hvf-cpus.h  |  25 --
 target/i386/hvf/hvf-i386.h  |  48 +---
 target/i386/hvf/hvf.c       | 360 +--------------------------
 target/i386/hvf/meson.build |   1 -
 target/i386/hvf/x86hvf.c    |  11 +-
 target/i386/hvf/x86hvf.h    |   2 -
 13 files changed, 619 insertions(+), 569 deletions(-)
 create mode 100644 accel/hvf/hvf-all.c
 create mode 100644 accel/hvf/hvf-cpus.c
 create mode 100644 accel/hvf/meson.build
 create mode 100644 include/sysemu/hvf_int.h
 delete mode 100644 target/i386/hvf/hvf-cpus.c
 delete mode 100644 target/i386/hvf/hvf-cpus.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 68bc160f41..ca4b6d9279 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
 M: Roman Bolshakov <r.bolshakov@yadro.com>
 W: https://wiki.qemu.org/Features/HVF
 S: Maintained
-F: accel/stubs/hvf-stub.c
 F: target/i386/hvf/
+
+HVF
+M: Cameron Esfahani <dirty@apple.com>
+M: Roman Bolshakov <r.bolshakov@yadro.com>
+W: https://wiki.qemu.org/Features/HVF
+S: Maintained
+F: accel/hvf/
 F: include/sysemu/hvf.h
+F: include/sysemu/hvf_int.h
 
 WHPX CPUs
 M: Sunil Muthuswamy <sunilmut@microsoft.com>
diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
new file mode 100644
index 0000000000..47d77a472a
--- /dev/null
+++ b/accel/hvf/hvf-all.c
@@ -0,0 +1,56 @@
+/*
+ * QEMU Hypervisor.framework support
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Contributions after 2012-01-13 are licensed under the terms of the
+ * GNU GPL, version 2 or (at your option) any later version.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "sysemu/hvf.h"
+#include "sysemu/hvf_int.h"
+#include "sysemu/runstate.h"
+
+#include "qemu/main-loop.h"
+#include "sysemu/accel.h"
+
+#include <Hypervisor/Hypervisor.h>
+
+bool hvf_allowed;
+HVFState *hvf_state;
+
+void assert_hvf_ok(hv_return_t ret)
+{
+    if (ret == HV_SUCCESS) {
+        return;
+    }
+
+    switch (ret) {
+    case HV_ERROR:
+        error_report("Error: HV_ERROR");
+        break;
+    case HV_BUSY:
+        error_report("Error: HV_BUSY");
+        break;
+    case HV_BAD_ARGUMENT:
+        error_report("Error: HV_BAD_ARGUMENT");
+        break;
+    case HV_NO_RESOURCES:
+        error_report("Error: HV_NO_RESOURCES");
+        break;
+    case HV_NO_DEVICE:
+        error_report("Error: HV_NO_DEVICE");
+        break;
+    case HV_UNSUPPORTED:
+        error_report("Error: HV_UNSUPPORTED");
+        break;
+    default:
+        error_report("Unknown Error");
+    }
+
+    abort();
+}
diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
new file mode 100644
index 0000000000..f9bb5502b7
--- /dev/null
+++ b/accel/hvf/hvf-cpus.c
@@ -0,0 +1,468 @@
+/*
+ * Copyright 2008 IBM Corporation
+ *           2008 Red Hat, Inc.
+ * Copyright 2011 Intel Corporation
+ * Copyright 2016 Veertu, Inc.
+ * Copyright 2017 The Android Open Source Project
+ *
+ * QEMU Hypervisor.framework support
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ *
+ * This file contain code under public domain from the hvdos project:
+ * https://github.com/mist64/hvdos
+ *
+ * Parts Copyright (c) 2011 NetApp, Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "qemu/main-loop.h"
+#include "exec/address-spaces.h"
+#include "exec/exec-all.h"
+#include "sysemu/cpus.h"
+#include "sysemu/hvf.h"
+#include "sysemu/hvf_int.h"
+#include "sysemu/runstate.h"
+#include "qemu/guest-random.h"
+
+#include <Hypervisor/Hypervisor.h>
+
+/* Memory slots */
+
+struct mac_slot {
+    int present;
+    uint64_t size;
+    uint64_t gpa_start;
+    uint64_t gva;
+};
+
+hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
+{
+    hvf_slot *slot;
+    int x;
+    for (x = 0; x < hvf_state->num_slots; ++x) {
+        slot = &hvf_state->slots[x];
+        if (slot->size && start < (slot->start + slot->size) &&
+            (start + size) > slot->start) {
+            return slot;
+        }
+    }
+    return NULL;
+}
+
+struct mac_slot mac_slots[32];
+
+static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
+{
+    struct mac_slot *macslot;
+    hv_return_t ret;
+
+    macslot = &mac_slots[slot->slot_id];
+
+    if (macslot->present) {
+        if (macslot->size != slot->size) {
+            macslot->present = 0;
+            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
+            assert_hvf_ok(ret);
+        }
+    }
+
+    if (!slot->size) {
+        return 0;
+    }
+
+    macslot->present = 1;
+    macslot->gpa_start = slot->start;
+    macslot->size = slot->size;
+    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
+    assert_hvf_ok(ret);
+    return 0;
+}
+
+static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
+{
+    hvf_slot *mem;
+    MemoryRegion *area = section->mr;
+    bool writeable = !area->readonly && !area->rom_device;
+    hv_memory_flags_t flags;
+
+    if (!memory_region_is_ram(area)) {
+        if (writeable) {
+            return;
+        } else if (!memory_region_is_romd(area)) {
+            /*
+             * If the memory device is not in romd_mode, then we actually want
+             * to remove the hvf memory slot so all accesses will trap.
+             */
+             add = false;
+        }
+    }
+
+    mem = hvf_find_overlap_slot(
+            section->offset_within_address_space,
+            int128_get64(section->size));
+
+    if (mem && add) {
+        if (mem->size == int128_get64(section->size) &&
+            mem->start == section->offset_within_address_space &&
+            mem->mem == (memory_region_get_ram_ptr(area) +
+            section->offset_within_region)) {
+            return; /* Same region was attempted to register, go away. */
+        }
+    }
+
+    /* Region needs to be reset. set the size to 0 and remap it. */
+    if (mem) {
+        mem->size = 0;
+        if (do_hvf_set_memory(mem, 0)) {
+            error_report("Failed to reset overlapping slot");
+            abort();
+        }
+    }
+
+    if (!add) {
+        return;
+    }
+
+    if (area->readonly ||
+        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
+        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
+    } else {
+        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
+    }
+
+    /* Now make a new slot. */
+    int x;
+
+    for (x = 0; x < hvf_state->num_slots; ++x) {
+        mem = &hvf_state->slots[x];
+        if (!mem->size) {
+            break;
+        }
+    }
+
+    if (x == hvf_state->num_slots) {
+        error_report("No free slots");
+        abort();
+    }
+
+    mem->size = int128_get64(section->size);
+    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
+    mem->start = section->offset_within_address_space;
+    mem->region = area;
+
+    if (do_hvf_set_memory(mem, flags)) {
+        error_report("Error registering new memory slot");
+        abort();
+    }
+}
+
+static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
+{
+    hvf_slot *slot;
+
+    slot = hvf_find_overlap_slot(
+            section->offset_within_address_space,
+            int128_get64(section->size));
+
+    /* protect region against writes; begin tracking it */
+    if (on) {
+        slot->flags |= HVF_SLOT_LOG;
+        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
+                      HV_MEMORY_READ);
+    /* stop tracking region*/
+    } else {
+        slot->flags &= ~HVF_SLOT_LOG;
+        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
+                      HV_MEMORY_READ | HV_MEMORY_WRITE);
+    }
+}
+
+static void hvf_log_start(MemoryListener *listener,
+                          MemoryRegionSection *section, int old, int new)
+{
+    if (old != 0) {
+        return;
+    }
+
+    hvf_set_dirty_tracking(section, 1);
+}
+
+static void hvf_log_stop(MemoryListener *listener,
+                         MemoryRegionSection *section, int old, int new)
+{
+    if (new != 0) {
+        return;
+    }
+
+    hvf_set_dirty_tracking(section, 0);
+}
+
+static void hvf_log_sync(MemoryListener *listener,
+                         MemoryRegionSection *section)
+{
+    /*
+     * sync of dirty pages is handled elsewhere; just make sure we keep
+     * tracking the region.
+     */
+    hvf_set_dirty_tracking(section, 1);
+}
+
+static void hvf_region_add(MemoryListener *listener,
+                           MemoryRegionSection *section)
+{
+    hvf_set_phys_mem(section, true);
+}
+
+static void hvf_region_del(MemoryListener *listener,
+                           MemoryRegionSection *section)
+{
+    hvf_set_phys_mem(section, false);
+}
+
+static MemoryListener hvf_memory_listener = {
+    .priority = 10,
+    .region_add = hvf_region_add,
+    .region_del = hvf_region_del,
+    .log_start = hvf_log_start,
+    .log_stop = hvf_log_stop,
+    .log_sync = hvf_log_sync,
+};
+
+static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
+{
+    if (!cpu->vcpu_dirty) {
+        hvf_get_registers(cpu);
+        cpu->vcpu_dirty = true;
+    }
+}
+
+static void hvf_cpu_synchronize_state(CPUState *cpu)
+{
+    if (!cpu->vcpu_dirty) {
+        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
+    }
+}
+
+static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
+                                              run_on_cpu_data arg)
+{
+    hvf_put_registers(cpu);
+    cpu->vcpu_dirty = false;
+}
+
+static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
+{
+    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
+}
+
+static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
+                                             run_on_cpu_data arg)
+{
+    hvf_put_registers(cpu);
+    cpu->vcpu_dirty = false;
+}
+
+static void hvf_cpu_synchronize_post_init(CPUState *cpu)
+{
+    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
+}
+
+static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
+                                              run_on_cpu_data arg)
+{
+    cpu->vcpu_dirty = true;
+}
+
+static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
+{
+    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
+}
+
+static void hvf_vcpu_destroy(CPUState *cpu)
+{
+    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
+    assert_hvf_ok(ret);
+
+    hvf_arch_vcpu_destroy(cpu);
+}
+
+static void dummy_signal(int sig)
+{
+}
+
+static int hvf_init_vcpu(CPUState *cpu)
+{
+    int r;
+
+    /* init cpu signals */
+    sigset_t set;
+    struct sigaction sigact;
+
+    memset(&sigact, 0, sizeof(sigact));
+    sigact.sa_handler = dummy_signal;
+    sigaction(SIG_IPI, &sigact, NULL);
+
+    pthread_sigmask(SIG_BLOCK, NULL, &set);
+    sigdelset(&set, SIG_IPI);
+
+#ifdef __aarch64__
+    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t **)&cpu->hvf_exit, NULL);
+#else
+    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
+#endif
+    cpu->vcpu_dirty = 1;
+    assert_hvf_ok(r);
+
+    return hvf_arch_init_vcpu(cpu);
+}
+
+/*
+ * The HVF-specific vCPU thread function. This one should only run when the host
+ * CPU supports the VMX "unrestricted guest" feature.
+ */
+static void *hvf_cpu_thread_fn(void *arg)
+{
+    CPUState *cpu = arg;
+
+    int r;
+
+    assert(hvf_enabled());
+
+    rcu_register_thread();
+
+    qemu_mutex_lock_iothread();
+    qemu_thread_get_self(cpu->thread);
+
+    cpu->thread_id = qemu_get_thread_id();
+    cpu->can_do_io = 1;
+    current_cpu = cpu;
+
+    hvf_init_vcpu(cpu);
+
+    /* signal CPU creation */
+    cpu_thread_signal_created(cpu);
+    qemu_guest_random_seed_thread_part2(cpu->random_seed);
+
+    do {
+        if (cpu_can_run(cpu)) {
+            r = hvf_vcpu_exec(cpu);
+            if (r == EXCP_DEBUG) {
+                cpu_handle_guest_debug(cpu);
+            }
+        }
+        qemu_wait_io_event(cpu);
+    } while (!cpu->unplug || cpu_can_run(cpu));
+
+    hvf_vcpu_destroy(cpu);
+    cpu_thread_signal_destroyed(cpu);
+    qemu_mutex_unlock_iothread();
+    rcu_unregister_thread();
+    return NULL;
+}
+
+static void hvf_start_vcpu_thread(CPUState *cpu)
+{
+    char thread_name[VCPU_THREAD_NAME_SIZE];
+
+    /*
+     * HVF currently does not support TCG, and only runs in
+     * unrestricted-guest mode.
+     */
+    assert(hvf_enabled());
+
+    cpu->thread = g_malloc0(sizeof(QemuThread));
+    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
+    qemu_cond_init(cpu->halt_cond);
+
+    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
+             cpu->cpu_index);
+    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
+                       cpu, QEMU_THREAD_JOINABLE);
+}
+
+static const CpusAccel hvf_cpus = {
+    .create_vcpu_thread = hvf_start_vcpu_thread,
+
+    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
+    .synchronize_post_init = hvf_cpu_synchronize_post_init,
+    .synchronize_state = hvf_cpu_synchronize_state,
+    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
+};
+
+static int hvf_accel_init(MachineState *ms)
+{
+    int x;
+    hv_return_t ret;
+    HVFState *s;
+
+    ret = hv_vm_create(HV_VM_DEFAULT);
+    assert_hvf_ok(ret);
+
+    s = g_new0(HVFState, 1);
+
+    s->num_slots = 32;
+    for (x = 0; x < s->num_slots; ++x) {
+        s->slots[x].size = 0;
+        s->slots[x].slot_id = x;
+    }
+
+    hvf_state = s;
+    memory_listener_register(&hvf_memory_listener, &address_space_memory);
+    cpus_register_accel(&hvf_cpus);
+    return 0;
+}
+
+static void hvf_accel_class_init(ObjectClass *oc, void *data)
+{
+    AccelClass *ac = ACCEL_CLASS(oc);
+    ac->name = "HVF";
+    ac->init_machine = hvf_accel_init;
+    ac->allowed = &hvf_allowed;
+}
+
+static const TypeInfo hvf_accel_type = {
+    .name = TYPE_HVF_ACCEL,
+    .parent = TYPE_ACCEL,
+    .class_init = hvf_accel_class_init,
+};
+
+static void hvf_type_init(void)
+{
+    type_register_static(&hvf_accel_type);
+}
+
+type_init(hvf_type_init);
diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
new file mode 100644
index 0000000000..dfd6b68dc7
--- /dev/null
+++ b/accel/hvf/meson.build
@@ -0,0 +1,7 @@
+hvf_ss = ss.source_set()
+hvf_ss.add(files(
+  'hvf-all.c',
+  'hvf-cpus.c',
+))
+
+specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
diff --git a/accel/meson.build b/accel/meson.build
index b26cca227a..6de12ce5d5 100644
--- a/accel/meson.build
+++ b/accel/meson.build
@@ -1,5 +1,6 @@
 softmmu_ss.add(files('accel.c'))
 
+subdir('hvf')
 subdir('qtest')
 subdir('kvm')
 subdir('tcg')
diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
new file mode 100644
index 0000000000..de9bad23a8
--- /dev/null
+++ b/include/sysemu/hvf_int.h
@@ -0,0 +1,69 @@
+/*
+ * QEMU Hypervisor.framework (HVF) support
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+/* header to be included in HVF-specific code */
+
+#ifndef HVF_INT_H
+#define HVF_INT_H
+
+#include <Hypervisor/Hypervisor.h>
+
+#define HVF_MAX_VCPU 0x10
+
+extern struct hvf_state hvf_global;
+
+struct hvf_vm {
+    int id;
+    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
+};
+
+struct hvf_state {
+    uint32_t version;
+    struct hvf_vm *vm;
+    uint64_t mem_quota;
+};
+
+/* hvf_slot flags */
+#define HVF_SLOT_LOG (1 << 0)
+
+typedef struct hvf_slot {
+    uint64_t start;
+    uint64_t size;
+    uint8_t *mem;
+    int slot_id;
+    uint32_t flags;
+    MemoryRegion *region;
+} hvf_slot;
+
+typedef struct hvf_vcpu_caps {
+    uint64_t vmx_cap_pinbased;
+    uint64_t vmx_cap_procbased;
+    uint64_t vmx_cap_procbased2;
+    uint64_t vmx_cap_entry;
+    uint64_t vmx_cap_exit;
+    uint64_t vmx_cap_preemption_timer;
+} hvf_vcpu_caps;
+
+struct HVFState {
+    AccelState parent;
+    hvf_slot slots[32];
+    int num_slots;
+
+    hvf_vcpu_caps *hvf_caps;
+};
+extern HVFState *hvf_state;
+
+void assert_hvf_ok(hv_return_t ret);
+int hvf_get_registers(CPUState *cpu);
+int hvf_put_registers(CPUState *cpu);
+int hvf_arch_init_vcpu(CPUState *cpu);
+void hvf_arch_vcpu_destroy(CPUState *cpu);
+int hvf_vcpu_exec(CPUState *cpu);
+hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
+
+#endif
diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
deleted file mode 100644
index 817b3d7452..0000000000
--- a/target/i386/hvf/hvf-cpus.c
+++ /dev/null
@@ -1,131 +0,0 @@
-/*
- * Copyright 2008 IBM Corporation
- *           2008 Red Hat, Inc.
- * Copyright 2011 Intel Corporation
- * Copyright 2016 Veertu, Inc.
- * Copyright 2017 The Android Open Source Project
- *
- * QEMU Hypervisor.framework support
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of version 2 of the GNU General Public
- * License as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, see <http://www.gnu.org/licenses/>.
- *
- * This file contain code under public domain from the hvdos project:
- * https://github.com/mist64/hvdos
- *
- * Parts Copyright (c) 2011 NetApp, Inc.
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- * 1. Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in the
- *    documentation and/or other materials provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- */
-
-#include "qemu/osdep.h"
-#include "qemu/error-report.h"
-#include "qemu/main-loop.h"
-#include "sysemu/hvf.h"
-#include "sysemu/runstate.h"
-#include "target/i386/cpu.h"
-#include "qemu/guest-random.h"
-
-#include "hvf-cpus.h"
-
-/*
- * The HVF-specific vCPU thread function. This one should only run when the host
- * CPU supports the VMX "unrestricted guest" feature.
- */
-static void *hvf_cpu_thread_fn(void *arg)
-{
-    CPUState *cpu = arg;
-
-    int r;
-
-    assert(hvf_enabled());
-
-    rcu_register_thread();
-
-    qemu_mutex_lock_iothread();
-    qemu_thread_get_self(cpu->thread);
-
-    cpu->thread_id = qemu_get_thread_id();
-    cpu->can_do_io = 1;
-    current_cpu = cpu;
-
-    hvf_init_vcpu(cpu);
-
-    /* signal CPU creation */
-    cpu_thread_signal_created(cpu);
-    qemu_guest_random_seed_thread_part2(cpu->random_seed);
-
-    do {
-        if (cpu_can_run(cpu)) {
-            r = hvf_vcpu_exec(cpu);
-            if (r == EXCP_DEBUG) {
-                cpu_handle_guest_debug(cpu);
-            }
-        }
-        qemu_wait_io_event(cpu);
-    } while (!cpu->unplug || cpu_can_run(cpu));
-
-    hvf_vcpu_destroy(cpu);
-    cpu_thread_signal_destroyed(cpu);
-    qemu_mutex_unlock_iothread();
-    rcu_unregister_thread();
-    return NULL;
-}
-
-static void hvf_start_vcpu_thread(CPUState *cpu)
-{
-    char thread_name[VCPU_THREAD_NAME_SIZE];
-
-    /*
-     * HVF currently does not support TCG, and only runs in
-     * unrestricted-guest mode.
-     */
-    assert(hvf_enabled());
-
-    cpu->thread = g_malloc0(sizeof(QemuThread));
-    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
-    qemu_cond_init(cpu->halt_cond);
-
-    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
-             cpu->cpu_index);
-    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
-                       cpu, QEMU_THREAD_JOINABLE);
-}
-
-const CpusAccel hvf_cpus = {
-    .create_vcpu_thread = hvf_start_vcpu_thread,
-
-    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
-    .synchronize_post_init = hvf_cpu_synchronize_post_init,
-    .synchronize_state = hvf_cpu_synchronize_state,
-    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
-};
diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
deleted file mode 100644
index ced31b82c0..0000000000
--- a/target/i386/hvf/hvf-cpus.h
+++ /dev/null
@@ -1,25 +0,0 @@
-/*
- * Accelerator CPUS Interface
- *
- * Copyright 2020 SUSE LLC
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- */
-
-#ifndef HVF_CPUS_H
-#define HVF_CPUS_H
-
-#include "sysemu/cpus.h"
-
-extern const CpusAccel hvf_cpus;
-
-int hvf_init_vcpu(CPUState *);
-int hvf_vcpu_exec(CPUState *);
-void hvf_cpu_synchronize_state(CPUState *);
-void hvf_cpu_synchronize_post_reset(CPUState *);
-void hvf_cpu_synchronize_post_init(CPUState *);
-void hvf_cpu_synchronize_pre_loadvm(CPUState *);
-void hvf_vcpu_destroy(CPUState *);
-
-#endif /* HVF_CPUS_H */
diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
index e0edffd077..6d56f8f6bb 100644
--- a/target/i386/hvf/hvf-i386.h
+++ b/target/i386/hvf/hvf-i386.h
@@ -18,57 +18,11 @@
 
 #include "sysemu/accel.h"
 #include "sysemu/hvf.h"
+#include "sysemu/hvf_int.h"
 #include "cpu.h"
 #include "x86.h"
 
-#define HVF_MAX_VCPU 0x10
-
-extern struct hvf_state hvf_global;
-
-struct hvf_vm {
-    int id;
-    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
-};
-
-struct hvf_state {
-    uint32_t version;
-    struct hvf_vm *vm;
-    uint64_t mem_quota;
-};
-
-/* hvf_slot flags */
-#define HVF_SLOT_LOG (1 << 0)
-
-typedef struct hvf_slot {
-    uint64_t start;
-    uint64_t size;
-    uint8_t *mem;
-    int slot_id;
-    uint32_t flags;
-    MemoryRegion *region;
-} hvf_slot;
-
-typedef struct hvf_vcpu_caps {
-    uint64_t vmx_cap_pinbased;
-    uint64_t vmx_cap_procbased;
-    uint64_t vmx_cap_procbased2;
-    uint64_t vmx_cap_entry;
-    uint64_t vmx_cap_exit;
-    uint64_t vmx_cap_preemption_timer;
-} hvf_vcpu_caps;
-
-struct HVFState {
-    AccelState parent;
-    hvf_slot slots[32];
-    int num_slots;
-
-    hvf_vcpu_caps *hvf_caps;
-};
-extern HVFState *hvf_state;
-
-void hvf_set_phys_mem(MemoryRegionSection *, bool);
 void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
-hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
 
 #ifdef NEED_CPU_H
 /* Functions exported to host specific mode */
diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
index ed9356565c..8b96ecd619 100644
--- a/target/i386/hvf/hvf.c
+++ b/target/i386/hvf/hvf.c
@@ -51,6 +51,7 @@
 #include "qemu/error-report.h"
 
 #include "sysemu/hvf.h"
+#include "sysemu/hvf_int.h"
 #include "sysemu/runstate.h"
 #include "hvf-i386.h"
 #include "vmcs.h"
@@ -72,171 +73,6 @@
 #include "sysemu/accel.h"
 #include "target/i386/cpu.h"
 
-#include "hvf-cpus.h"
-
-HVFState *hvf_state;
-
-static void assert_hvf_ok(hv_return_t ret)
-{
-    if (ret == HV_SUCCESS) {
-        return;
-    }
-
-    switch (ret) {
-    case HV_ERROR:
-        error_report("Error: HV_ERROR");
-        break;
-    case HV_BUSY:
-        error_report("Error: HV_BUSY");
-        break;
-    case HV_BAD_ARGUMENT:
-        error_report("Error: HV_BAD_ARGUMENT");
-        break;
-    case HV_NO_RESOURCES:
-        error_report("Error: HV_NO_RESOURCES");
-        break;
-    case HV_NO_DEVICE:
-        error_report("Error: HV_NO_DEVICE");
-        break;
-    case HV_UNSUPPORTED:
-        error_report("Error: HV_UNSUPPORTED");
-        break;
-    default:
-        error_report("Unknown Error");
-    }
-
-    abort();
-}
-
-/* Memory slots */
-hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
-{
-    hvf_slot *slot;
-    int x;
-    for (x = 0; x < hvf_state->num_slots; ++x) {
-        slot = &hvf_state->slots[x];
-        if (slot->size && start < (slot->start + slot->size) &&
-            (start + size) > slot->start) {
-            return slot;
-        }
-    }
-    return NULL;
-}
-
-struct mac_slot {
-    int present;
-    uint64_t size;
-    uint64_t gpa_start;
-    uint64_t gva;
-};
-
-struct mac_slot mac_slots[32];
-
-static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
-{
-    struct mac_slot *macslot;
-    hv_return_t ret;
-
-    macslot = &mac_slots[slot->slot_id];
-
-    if (macslot->present) {
-        if (macslot->size != slot->size) {
-            macslot->present = 0;
-            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
-            assert_hvf_ok(ret);
-        }
-    }
-
-    if (!slot->size) {
-        return 0;
-    }
-
-    macslot->present = 1;
-    macslot->gpa_start = slot->start;
-    macslot->size = slot->size;
-    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size, flags);
-    assert_hvf_ok(ret);
-    return 0;
-}
-
-void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
-{
-    hvf_slot *mem;
-    MemoryRegion *area = section->mr;
-    bool writeable = !area->readonly && !area->rom_device;
-    hv_memory_flags_t flags;
-
-    if (!memory_region_is_ram(area)) {
-        if (writeable) {
-            return;
-        } else if (!memory_region_is_romd(area)) {
-            /*
-             * If the memory device is not in romd_mode, then we actually want
-             * to remove the hvf memory slot so all accesses will trap.
-             */
-             add = false;
-        }
-    }
-
-    mem = hvf_find_overlap_slot(
-            section->offset_within_address_space,
-            int128_get64(section->size));
-
-    if (mem && add) {
-        if (mem->size == int128_get64(section->size) &&
-            mem->start == section->offset_within_address_space &&
-            mem->mem == (memory_region_get_ram_ptr(area) +
-            section->offset_within_region)) {
-            return; /* Same region was attempted to register, go away. */
-        }
-    }
-
-    /* Region needs to be reset. set the size to 0 and remap it. */
-    if (mem) {
-        mem->size = 0;
-        if (do_hvf_set_memory(mem, 0)) {
-            error_report("Failed to reset overlapping slot");
-            abort();
-        }
-    }
-
-    if (!add) {
-        return;
-    }
-
-    if (area->readonly ||
-        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
-        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
-    } else {
-        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
-    }
-
-    /* Now make a new slot. */
-    int x;
-
-    for (x = 0; x < hvf_state->num_slots; ++x) {
-        mem = &hvf_state->slots[x];
-        if (!mem->size) {
-            break;
-        }
-    }
-
-    if (x == hvf_state->num_slots) {
-        error_report("No free slots");
-        abort();
-    }
-
-    mem->size = int128_get64(section->size);
-    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
-    mem->start = section->offset_within_address_space;
-    mem->region = area;
-
-    if (do_hvf_set_memory(mem, flags)) {
-        error_report("Error registering new memory slot");
-        abort();
-    }
-}
-
 void vmx_update_tpr(CPUState *cpu)
 {
     /* TODO: need integrate APIC handling */
@@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t port, void *buffer,
     }
 }
 
-static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
-{
-    if (!cpu->vcpu_dirty) {
-        hvf_get_registers(cpu);
-        cpu->vcpu_dirty = true;
-    }
-}
-
-void hvf_cpu_synchronize_state(CPUState *cpu)
-{
-    if (!cpu->vcpu_dirty) {
-        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
-    }
-}
-
-static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
-                                              run_on_cpu_data arg)
-{
-    hvf_put_registers(cpu);
-    cpu->vcpu_dirty = false;
-}
-
-void hvf_cpu_synchronize_post_reset(CPUState *cpu)
-{
-    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
-}
-
-static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
-                                             run_on_cpu_data arg)
-{
-    hvf_put_registers(cpu);
-    cpu->vcpu_dirty = false;
-}
-
-void hvf_cpu_synchronize_post_init(CPUState *cpu)
-{
-    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
-}
-
-static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
-                                              run_on_cpu_data arg)
-{
-    cpu->vcpu_dirty = true;
-}
-
-void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
-{
-    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
-}
-
 static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
 {
     int read, write;
@@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
     return false;
 }
 
-static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
-{
-    hvf_slot *slot;
-
-    slot = hvf_find_overlap_slot(
-            section->offset_within_address_space,
-            int128_get64(section->size));
-
-    /* protect region against writes; begin tracking it */
-    if (on) {
-        slot->flags |= HVF_SLOT_LOG;
-        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
-                      HV_MEMORY_READ);
-    /* stop tracking region*/
-    } else {
-        slot->flags &= ~HVF_SLOT_LOG;
-        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
-                      HV_MEMORY_READ | HV_MEMORY_WRITE);
-    }
-}
-
-static void hvf_log_start(MemoryListener *listener,
-                          MemoryRegionSection *section, int old, int new)
-{
-    if (old != 0) {
-        return;
-    }
-
-    hvf_set_dirty_tracking(section, 1);
-}
-
-static void hvf_log_stop(MemoryListener *listener,
-                         MemoryRegionSection *section, int old, int new)
-{
-    if (new != 0) {
-        return;
-    }
-
-    hvf_set_dirty_tracking(section, 0);
-}
-
-static void hvf_log_sync(MemoryListener *listener,
-                         MemoryRegionSection *section)
-{
-    /*
-     * sync of dirty pages is handled elsewhere; just make sure we keep
-     * tracking the region.
-     */
-    hvf_set_dirty_tracking(section, 1);
-}
-
-static void hvf_region_add(MemoryListener *listener,
-                           MemoryRegionSection *section)
-{
-    hvf_set_phys_mem(section, true);
-}
-
-static void hvf_region_del(MemoryListener *listener,
-                           MemoryRegionSection *section)
-{
-    hvf_set_phys_mem(section, false);
-}
-
-static MemoryListener hvf_memory_listener = {
-    .priority = 10,
-    .region_add = hvf_region_add,
-    .region_del = hvf_region_del,
-    .log_start = hvf_log_start,
-    .log_stop = hvf_log_stop,
-    .log_sync = hvf_log_sync,
-};
-
-void hvf_vcpu_destroy(CPUState *cpu)
+void hvf_arch_vcpu_destroy(CPUState *cpu)
 {
     X86CPU *x86_cpu = X86_CPU(cpu);
     CPUX86State *env = &x86_cpu->env;
 
-    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
     g_free(env->hvf_mmio_buf);
-    assert_hvf_ok(ret);
-}
-
-static void dummy_signal(int sig)
-{
 }
 
-int hvf_init_vcpu(CPUState *cpu)
+int hvf_arch_init_vcpu(CPUState *cpu)
 {
 
     X86CPU *x86cpu = X86_CPU(cpu);
     CPUX86State *env = &x86cpu->env;
-    int r;
-
-    /* init cpu signals */
-    sigset_t set;
-    struct sigaction sigact;
-
-    memset(&sigact, 0, sizeof(sigact));
-    sigact.sa_handler = dummy_signal;
-    sigaction(SIG_IPI, &sigact, NULL);
-
-    pthread_sigmask(SIG_BLOCK, NULL, &set);
-    sigdelset(&set, SIG_IPI);
 
     init_emu();
     init_decoder();
@@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
     hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
     env->hvf_mmio_buf = g_new(char, 4096);
 
-    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
-    cpu->vcpu_dirty = 1;
-    assert_hvf_ok(r);
-
     if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
         &hvf_state->hvf_caps->vmx_cap_pinbased)) {
         abort();
@@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
 
     return ret;
 }
-
-bool hvf_allowed;
-
-static int hvf_accel_init(MachineState *ms)
-{
-    int x;
-    hv_return_t ret;
-    HVFState *s;
-
-    ret = hv_vm_create(HV_VM_DEFAULT);
-    assert_hvf_ok(ret);
-
-    s = g_new0(HVFState, 1);
- 
-    s->num_slots = 32;
-    for (x = 0; x < s->num_slots; ++x) {
-        s->slots[x].size = 0;
-        s->slots[x].slot_id = x;
-    }
-  
-    hvf_state = s;
-    memory_listener_register(&hvf_memory_listener, &address_space_memory);
-    cpus_register_accel(&hvf_cpus);
-    return 0;
-}
-
-static void hvf_accel_class_init(ObjectClass *oc, void *data)
-{
-    AccelClass *ac = ACCEL_CLASS(oc);
-    ac->name = "HVF";
-    ac->init_machine = hvf_accel_init;
-    ac->allowed = &hvf_allowed;
-}
-
-static const TypeInfo hvf_accel_type = {
-    .name = TYPE_HVF_ACCEL,
-    .parent = TYPE_ACCEL,
-    .class_init = hvf_accel_class_init,
-};
-
-static void hvf_type_init(void)
-{
-    type_register_static(&hvf_accel_type);
-}
-
-type_init(hvf_type_init);
diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
index 409c9a3f14..c8a43717ee 100644
--- a/target/i386/hvf/meson.build
+++ b/target/i386/hvf/meson.build
@@ -1,6 +1,5 @@
 i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
   'hvf.c',
-  'hvf-cpus.c',
   'x86.c',
   'x86_cpuid.c',
   'x86_decode.c',
diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
index bbec412b6c..89b8e9d87a 100644
--- a/target/i386/hvf/x86hvf.c
+++ b/target/i386/hvf/x86hvf.c
@@ -20,6 +20,9 @@
 #include "qemu/osdep.h"
 
 #include "qemu-common.h"
+#include "sysemu/hvf.h"
+#include "sysemu/hvf_int.h"
+#include "sysemu/hw_accel.h"
 #include "x86hvf.h"
 #include "vmx.h"
 #include "vmcs.h"
@@ -32,8 +35,6 @@
 #include <Hypervisor/hv.h>
 #include <Hypervisor/hv_vmx.h>
 
-#include "hvf-cpus.h"
-
 void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
                      SegmentCache *qseg, bool is_tr)
 {
@@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
     env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
 
     if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
-        hvf_cpu_synchronize_state(cpu_state);
+        cpu_synchronize_state(cpu_state);
         do_cpu_init(cpu);
     }
 
@@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
         cpu_state->halted = 0;
     }
     if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
-        hvf_cpu_synchronize_state(cpu_state);
+        cpu_synchronize_state(cpu_state);
         do_cpu_sipi(cpu);
     }
     if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
         cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
-        hvf_cpu_synchronize_state(cpu_state);
+        cpu_synchronize_state(cpu_state);
         apic_handle_tpr_access_report(cpu->apic_state, env->eip,
                                       env->tpr_access_type);
     }
diff --git a/target/i386/hvf/x86hvf.h b/target/i386/hvf/x86hvf.h
index 635ab0f34e..99ed8d608d 100644
--- a/target/i386/hvf/x86hvf.h
+++ b/target/i386/hvf/x86hvf.h
@@ -21,8 +21,6 @@
 #include "x86_descr.h"
 
 int hvf_process_events(CPUState *);
-int hvf_put_registers(CPUState *);
-int hvf_get_registers(CPUState *);
 bool hvf_inject_interrupts(CPUState *);
 void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
                      SegmentCache *qseg, bool is_tr);
-- 
2.24.3 (Apple Git-128)

Re: [PATCH 2/8] hvf: Move common code out

Posted by Roman Bolshakov 4 years, 3 months ago

On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
> Until now, Hypervisor.framework has only been available on x86_64 systems.
> With Apple Silicon shipping now, it extends its reach to aarch64. To
> prepare for support for multiple architectures, let's move common code out
> into its own accel directory.
> 
> Signed-off-by: Alexander Graf <agraf@csgraf.de>
> ---
>  MAINTAINERS                 |   9 +-
>  accel/hvf/hvf-all.c         |  56 +++++
>  accel/hvf/hvf-cpus.c        | 468 ++++++++++++++++++++++++++++++++++++
>  accel/hvf/meson.build       |   7 +
>  accel/meson.build           |   1 +
>  include/sysemu/hvf_int.h    |  69 ++++++
>  target/i386/hvf/hvf-cpus.c  | 131 ----------
>  target/i386/hvf/hvf-cpus.h  |  25 --
>  target/i386/hvf/hvf-i386.h  |  48 +---
>  target/i386/hvf/hvf.c       | 360 +--------------------------
>  target/i386/hvf/meson.build |   1 -
>  target/i386/hvf/x86hvf.c    |  11 +-
>  target/i386/hvf/x86hvf.h    |   2 -
>  13 files changed, 619 insertions(+), 569 deletions(-)
>  create mode 100644 accel/hvf/hvf-all.c
>  create mode 100644 accel/hvf/hvf-cpus.c
>  create mode 100644 accel/hvf/meson.build
>  create mode 100644 include/sysemu/hvf_int.h
>  delete mode 100644 target/i386/hvf/hvf-cpus.c
>  delete mode 100644 target/i386/hvf/hvf-cpus.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 68bc160f41..ca4b6d9279 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
>  M: Roman Bolshakov <r.bolshakov@yadro.com>
>  W: https://wiki.qemu.org/Features/HVF
>  S: Maintained
> -F: accel/stubs/hvf-stub.c

There was a patch for that in the RFC series from Claudio.

>  F: target/i386/hvf/
> +
> +HVF
> +M: Cameron Esfahani <dirty@apple.com>
> +M: Roman Bolshakov <r.bolshakov@yadro.com>
> +W: https://wiki.qemu.org/Features/HVF
> +S: Maintained
> +F: accel/hvf/
>  F: include/sysemu/hvf.h
> +F: include/sysemu/hvf_int.h
>  
>  WHPX CPUs
>  M: Sunil Muthuswamy <sunilmut@microsoft.com>
> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
> new file mode 100644
> index 0000000000..47d77a472a
> --- /dev/null
> +++ b/accel/hvf/hvf-all.c
> @@ -0,0 +1,56 @@
> +/*
> + * QEMU Hypervisor.framework support
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Contributions after 2012-01-13 are licensed under the terms of the
> + * GNU GPL, version 2 or (at your option) any later version.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include "qemu/error-report.h"
> +#include "sysemu/hvf.h"
> +#include "sysemu/hvf_int.h"
> +#include "sysemu/runstate.h"
> +
> +#include "qemu/main-loop.h"
> +#include "sysemu/accel.h"
> +
> +#include <Hypervisor/Hypervisor.h>
> +
> +bool hvf_allowed;
> +HVFState *hvf_state;
> +
> +void assert_hvf_ok(hv_return_t ret)
> +{
> +    if (ret == HV_SUCCESS) {
> +        return;
> +    }
> +
> +    switch (ret) {
> +    case HV_ERROR:
> +        error_report("Error: HV_ERROR");
> +        break;
> +    case HV_BUSY:
> +        error_report("Error: HV_BUSY");
> +        break;
> +    case HV_BAD_ARGUMENT:
> +        error_report("Error: HV_BAD_ARGUMENT");
> +        break;
> +    case HV_NO_RESOURCES:
> +        error_report("Error: HV_NO_RESOURCES");
> +        break;
> +    case HV_NO_DEVICE:
> +        error_report("Error: HV_NO_DEVICE");
> +        break;
> +    case HV_UNSUPPORTED:
> +        error_report("Error: HV_UNSUPPORTED");
> +        break;
> +    default:
> +        error_report("Unknown Error");
> +    }
> +
> +    abort();
> +}
> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> new file mode 100644
> index 0000000000..f9bb5502b7
> --- /dev/null
> +++ b/accel/hvf/hvf-cpus.c
> @@ -0,0 +1,468 @@
> +/*
> + * Copyright 2008 IBM Corporation
> + *           2008 Red Hat, Inc.
> + * Copyright 2011 Intel Corporation
> + * Copyright 2016 Veertu, Inc.
> + * Copyright 2017 The Android Open Source Project
> + *
> + * QEMU Hypervisor.framework support
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
> + *
> + * This file contain code under public domain from the hvdos project:
> + * https://github.com/mist64/hvdos
> + *
> + * Parts Copyright (c) 2011 NetApp, Inc.
> + * All rights reserved.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "qemu/main-loop.h"
> +#include "exec/address-spaces.h"
> +#include "exec/exec-all.h"
> +#include "sysemu/cpus.h"
> +#include "sysemu/hvf.h"
> +#include "sysemu/hvf_int.h"
> +#include "sysemu/runstate.h"
> +#include "qemu/guest-random.h"
> +
> +#include <Hypervisor/Hypervisor.h>
> +
> +/* Memory slots */
> +
> +struct mac_slot {
> +    int present;
> +    uint64_t size;
> +    uint64_t gpa_start;
> +    uint64_t gva;
> +};
> +
> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
> +{
> +    hvf_slot *slot;
> +    int x;
> +    for (x = 0; x < hvf_state->num_slots; ++x) {
> +        slot = &hvf_state->slots[x];
> +        if (slot->size && start < (slot->start + slot->size) &&
> +            (start + size) > slot->start) {
> +            return slot;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +struct mac_slot mac_slots[32];
> +
> +static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
> +{
> +    struct mac_slot *macslot;
> +    hv_return_t ret;
> +
> +    macslot = &mac_slots[slot->slot_id];
> +
> +    if (macslot->present) {
> +        if (macslot->size != slot->size) {
> +            macslot->present = 0;
> +            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
> +            assert_hvf_ok(ret);
> +        }
> +    }
> +
> +    if (!slot->size) {
> +        return 0;
> +    }
> +
> +    macslot->present = 1;
> +    macslot->gpa_start = slot->start;
> +    macslot->size = slot->size;
> +    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
> +    assert_hvf_ok(ret);
> +    return 0;
> +}
> +
> +static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
> +{
> +    hvf_slot *mem;
> +    MemoryRegion *area = section->mr;
> +    bool writeable = !area->readonly && !area->rom_device;
> +    hv_memory_flags_t flags;
> +
> +    if (!memory_region_is_ram(area)) {
> +        if (writeable) {
> +            return;
> +        } else if (!memory_region_is_romd(area)) {
> +            /*
> +             * If the memory device is not in romd_mode, then we actually want
> +             * to remove the hvf memory slot so all accesses will trap.
> +             */
> +             add = false;
> +        }
> +    }
> +
> +    mem = hvf_find_overlap_slot(
> +            section->offset_within_address_space,
> +            int128_get64(section->size));
> +
> +    if (mem && add) {
> +        if (mem->size == int128_get64(section->size) &&
> +            mem->start == section->offset_within_address_space &&
> +            mem->mem == (memory_region_get_ram_ptr(area) +
> +            section->offset_within_region)) {
> +            return; /* Same region was attempted to register, go away. */
> +        }
> +    }
> +
> +    /* Region needs to be reset. set the size to 0 and remap it. */
> +    if (mem) {
> +        mem->size = 0;
> +        if (do_hvf_set_memory(mem, 0)) {
> +            error_report("Failed to reset overlapping slot");
> +            abort();
> +        }
> +    }
> +
> +    if (!add) {
> +        return;
> +    }
> +
> +    if (area->readonly ||
> +        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
> +    } else {
> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
> +    }
> +
> +    /* Now make a new slot. */
> +    int x;
> +
> +    for (x = 0; x < hvf_state->num_slots; ++x) {
> +        mem = &hvf_state->slots[x];
> +        if (!mem->size) {
> +            break;
> +        }
> +    }
> +
> +    if (x == hvf_state->num_slots) {
> +        error_report("No free slots");
> +        abort();
> +    }
> +
> +    mem->size = int128_get64(section->size);
> +    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
> +    mem->start = section->offset_within_address_space;
> +    mem->region = area;
> +
> +    if (do_hvf_set_memory(mem, flags)) {
> +        error_report("Error registering new memory slot");
> +        abort();
> +    }
> +}
> +
> +static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
> +{
> +    hvf_slot *slot;
> +
> +    slot = hvf_find_overlap_slot(
> +            section->offset_within_address_space,
> +            int128_get64(section->size));
> +
> +    /* protect region against writes; begin tracking it */
> +    if (on) {
> +        slot->flags |= HVF_SLOT_LOG;
> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
> +                      HV_MEMORY_READ);
> +    /* stop tracking region*/
> +    } else {
> +        slot->flags &= ~HVF_SLOT_LOG;
> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
> +    }
> +}
> +
> +static void hvf_log_start(MemoryListener *listener,
> +                          MemoryRegionSection *section, int old, int new)
> +{
> +    if (old != 0) {
> +        return;
> +    }
> +
> +    hvf_set_dirty_tracking(section, 1);
> +}
> +
> +static void hvf_log_stop(MemoryListener *listener,
> +                         MemoryRegionSection *section, int old, int new)
> +{
> +    if (new != 0) {
> +        return;
> +    }
> +
> +    hvf_set_dirty_tracking(section, 0);
> +}
> +
> +static void hvf_log_sync(MemoryListener *listener,
> +                         MemoryRegionSection *section)
> +{
> +    /*
> +     * sync of dirty pages is handled elsewhere; just make sure we keep
> +     * tracking the region.
> +     */
> +    hvf_set_dirty_tracking(section, 1);
> +}
> +
> +static void hvf_region_add(MemoryListener *listener,
> +                           MemoryRegionSection *section)
> +{
> +    hvf_set_phys_mem(section, true);
> +}
> +
> +static void hvf_region_del(MemoryListener *listener,
> +                           MemoryRegionSection *section)
> +{
> +    hvf_set_phys_mem(section, false);
> +}
> +
> +static MemoryListener hvf_memory_listener = {
> +    .priority = 10,
> +    .region_add = hvf_region_add,
> +    .region_del = hvf_region_del,
> +    .log_start = hvf_log_start,
> +    .log_stop = hvf_log_stop,
> +    .log_sync = hvf_log_sync,
> +};
> +
> +static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
> +{
> +    if (!cpu->vcpu_dirty) {
> +        hvf_get_registers(cpu);
> +        cpu->vcpu_dirty = true;
> +    }
> +}
> +
> +static void hvf_cpu_synchronize_state(CPUState *cpu)
> +{
> +    if (!cpu->vcpu_dirty) {
> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
> +    }
> +}
> +
> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
> +                                              run_on_cpu_data arg)
> +{
> +    hvf_put_registers(cpu);
> +    cpu->vcpu_dirty = false;
> +}
> +
> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
> +{
> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
> +}
> +
> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
> +                                             run_on_cpu_data arg)
> +{
> +    hvf_put_registers(cpu);
> +    cpu->vcpu_dirty = false;
> +}
> +
> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
> +{
> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
> +}
> +
> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
> +                                              run_on_cpu_data arg)
> +{
> +    cpu->vcpu_dirty = true;
> +}
> +
> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
> +{
> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
> +}
> +
> +static void hvf_vcpu_destroy(CPUState *cpu)
> +{
> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
> +    assert_hvf_ok(ret);
> +
> +    hvf_arch_vcpu_destroy(cpu);
> +}
> +
> +static void dummy_signal(int sig)
> +{
> +}
> +
> +static int hvf_init_vcpu(CPUState *cpu)
> +{
> +    int r;
> +
> +    /* init cpu signals */
> +    sigset_t set;
> +    struct sigaction sigact;
> +
> +    memset(&sigact, 0, sizeof(sigact));
> +    sigact.sa_handler = dummy_signal;
> +    sigaction(SIG_IPI, &sigact, NULL);
> +
> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
> +    sigdelset(&set, SIG_IPI);
> +
> +#ifdef __aarch64__
> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t **)&cpu->hvf_exit, NULL);
> +#else
> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
> +#endif

I think the first __aarch64__ bit fits better to arm part of the series.

> +    cpu->vcpu_dirty = 1;
> +    assert_hvf_ok(r);
> +
> +    return hvf_arch_init_vcpu(cpu);
> +}
> +
> +/*
> + * The HVF-specific vCPU thread function. This one should only run when the host
> + * CPU supports the VMX "unrestricted guest" feature.
> + */
> +static void *hvf_cpu_thread_fn(void *arg)
> +{
> +    CPUState *cpu = arg;
> +
> +    int r;
> +
> +    assert(hvf_enabled());
> +
> +    rcu_register_thread();
> +
> +    qemu_mutex_lock_iothread();
> +    qemu_thread_get_self(cpu->thread);
> +
> +    cpu->thread_id = qemu_get_thread_id();
> +    cpu->can_do_io = 1;
> +    current_cpu = cpu;
> +
> +    hvf_init_vcpu(cpu);
> +
> +    /* signal CPU creation */
> +    cpu_thread_signal_created(cpu);
> +    qemu_guest_random_seed_thread_part2(cpu->random_seed);
> +
> +    do {
> +        if (cpu_can_run(cpu)) {
> +            r = hvf_vcpu_exec(cpu);
> +            if (r == EXCP_DEBUG) {
> +                cpu_handle_guest_debug(cpu);
> +            }
> +        }
> +        qemu_wait_io_event(cpu);
> +    } while (!cpu->unplug || cpu_can_run(cpu));
> +
> +    hvf_vcpu_destroy(cpu);
> +    cpu_thread_signal_destroyed(cpu);
> +    qemu_mutex_unlock_iothread();
> +    rcu_unregister_thread();
> +    return NULL;
> +}
> +
> +static void hvf_start_vcpu_thread(CPUState *cpu)
> +{
> +    char thread_name[VCPU_THREAD_NAME_SIZE];
> +
> +    /*
> +     * HVF currently does not support TCG, and only runs in
> +     * unrestricted-guest mode.
> +     */
> +    assert(hvf_enabled());
> +
> +    cpu->thread = g_malloc0(sizeof(QemuThread));
> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
> +    qemu_cond_init(cpu->halt_cond);
> +
> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
> +             cpu->cpu_index);
> +    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
> +                       cpu, QEMU_THREAD_JOINABLE);
> +}
> +
> +static const CpusAccel hvf_cpus = {
> +    .create_vcpu_thread = hvf_start_vcpu_thread,
> +
> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
> +    .synchronize_state = hvf_cpu_synchronize_state,
> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
> +};
> +
> +static int hvf_accel_init(MachineState *ms)
> +{
> +    int x;
> +    hv_return_t ret;
> +    HVFState *s;
> +
> +    ret = hv_vm_create(HV_VM_DEFAULT);
> +    assert_hvf_ok(ret);
> +
> +    s = g_new0(HVFState, 1);
> +
> +    s->num_slots = 32;
> +    for (x = 0; x < s->num_slots; ++x) {
> +        s->slots[x].size = 0;
> +        s->slots[x].slot_id = x;
> +    }
> +
> +    hvf_state = s;
> +    memory_listener_register(&hvf_memory_listener, &address_space_memory);
> +    cpus_register_accel(&hvf_cpus);
> +    return 0;
> +}
> +
> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
> +{
> +    AccelClass *ac = ACCEL_CLASS(oc);
> +    ac->name = "HVF";
> +    ac->init_machine = hvf_accel_init;
> +    ac->allowed = &hvf_allowed;
> +}
> +
> +static const TypeInfo hvf_accel_type = {
> +    .name = TYPE_HVF_ACCEL,
> +    .parent = TYPE_ACCEL,
> +    .class_init = hvf_accel_class_init,
> +};
> +
> +static void hvf_type_init(void)
> +{
> +    type_register_static(&hvf_accel_type);
> +}
> +
> +type_init(hvf_type_init);
> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
> new file mode 100644
> index 0000000000..dfd6b68dc7
> --- /dev/null
> +++ b/accel/hvf/meson.build
> @@ -0,0 +1,7 @@
> +hvf_ss = ss.source_set()
> +hvf_ss.add(files(
> +  'hvf-all.c',
> +  'hvf-cpus.c',
> +))
> +
> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
> diff --git a/accel/meson.build b/accel/meson.build
> index b26cca227a..6de12ce5d5 100644
> --- a/accel/meson.build
> +++ b/accel/meson.build
> @@ -1,5 +1,6 @@
>  softmmu_ss.add(files('accel.c'))
>  
> +subdir('hvf')
>  subdir('qtest')
>  subdir('kvm')
>  subdir('tcg')
> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> new file mode 100644
> index 0000000000..de9bad23a8
> --- /dev/null
> +++ b/include/sysemu/hvf_int.h
> @@ -0,0 +1,69 @@
> +/*
> + * QEMU Hypervisor.framework (HVF) support
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +/* header to be included in HVF-specific code */
> +
> +#ifndef HVF_INT_H
> +#define HVF_INT_H
> +
> +#include <Hypervisor/Hypervisor.h>
> +
> +#define HVF_MAX_VCPU 0x10
> +
> +extern struct hvf_state hvf_global;
> +
> +struct hvf_vm {
> +    int id;
> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
> +};
> +
> +struct hvf_state {
> +    uint32_t version;
> +    struct hvf_vm *vm;
> +    uint64_t mem_quota;
> +};
> +
> +/* hvf_slot flags */
> +#define HVF_SLOT_LOG (1 << 0)
> +
> +typedef struct hvf_slot {
> +    uint64_t start;
> +    uint64_t size;
> +    uint8_t *mem;
> +    int slot_id;
> +    uint32_t flags;
> +    MemoryRegion *region;
> +} hvf_slot;
> +
> +typedef struct hvf_vcpu_caps {
> +    uint64_t vmx_cap_pinbased;
> +    uint64_t vmx_cap_procbased;
> +    uint64_t vmx_cap_procbased2;
> +    uint64_t vmx_cap_entry;
> +    uint64_t vmx_cap_exit;
> +    uint64_t vmx_cap_preemption_timer;
> +} hvf_vcpu_caps;
> +
> +struct HVFState {
> +    AccelState parent;
> +    hvf_slot slots[32];
> +    int num_slots;
> +
> +    hvf_vcpu_caps *hvf_caps;
> +};
> +extern HVFState *hvf_state;
> +
> +void assert_hvf_ok(hv_return_t ret);
> +int hvf_get_registers(CPUState *cpu);
> +int hvf_put_registers(CPUState *cpu);
> +int hvf_arch_init_vcpu(CPUState *cpu);
> +void hvf_arch_vcpu_destroy(CPUState *cpu);
> +int hvf_vcpu_exec(CPUState *cpu);
> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
> +
> +#endif
> diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
> deleted file mode 100644
> index 817b3d7452..0000000000
> --- a/target/i386/hvf/hvf-cpus.c
> +++ /dev/null
> @@ -1,131 +0,0 @@
> -/*
> - * Copyright 2008 IBM Corporation
> - *           2008 Red Hat, Inc.
> - * Copyright 2011 Intel Corporation
> - * Copyright 2016 Veertu, Inc.
> - * Copyright 2017 The Android Open Source Project
> - *
> - * QEMU Hypervisor.framework support
> - *
> - * This program is free software; you can redistribute it and/or
> - * modify it under the terms of version 2 of the GNU General Public
> - * License as published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> - * General Public License for more details.
> - *
> - * You should have received a copy of the GNU General Public License
> - * along with this program; if not, see <http://www.gnu.org/licenses/>.
> - *
> - * This file contain code under public domain from the hvdos project:
> - * https://github.com/mist64/hvdos
> - *
> - * Parts Copyright (c) 2011 NetApp, Inc.
> - * All rights reserved.
> - *
> - * Redistribution and use in source and binary forms, with or without
> - * modification, are permitted provided that the following conditions
> - * are met:
> - * 1. Redistributions of source code must retain the above copyright
> - *    notice, this list of conditions and the following disclaimer.
> - * 2. Redistributions in binary form must reproduce the above copyright
> - *    notice, this list of conditions and the following disclaimer in the
> - *    documentation and/or other materials provided with the distribution.
> - *
> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> - * SUCH DAMAGE.
> - */
> -
> -#include "qemu/osdep.h"
> -#include "qemu/error-report.h"
> -#include "qemu/main-loop.h"
> -#include "sysemu/hvf.h"
> -#include "sysemu/runstate.h"
> -#include "target/i386/cpu.h"
> -#include "qemu/guest-random.h"
> -
> -#include "hvf-cpus.h"
> -
> -/*
> - * The HVF-specific vCPU thread function. This one should only run when the host
> - * CPU supports the VMX "unrestricted guest" feature.
> - */
> -static void *hvf_cpu_thread_fn(void *arg)
> -{
> -    CPUState *cpu = arg;
> -
> -    int r;
> -
> -    assert(hvf_enabled());
> -
> -    rcu_register_thread();
> -
> -    qemu_mutex_lock_iothread();
> -    qemu_thread_get_self(cpu->thread);
> -
> -    cpu->thread_id = qemu_get_thread_id();
> -    cpu->can_do_io = 1;
> -    current_cpu = cpu;
> -
> -    hvf_init_vcpu(cpu);
> -
> -    /* signal CPU creation */
> -    cpu_thread_signal_created(cpu);
> -    qemu_guest_random_seed_thread_part2(cpu->random_seed);
> -
> -    do {
> -        if (cpu_can_run(cpu)) {
> -            r = hvf_vcpu_exec(cpu);
> -            if (r == EXCP_DEBUG) {
> -                cpu_handle_guest_debug(cpu);
> -            }
> -        }
> -        qemu_wait_io_event(cpu);
> -    } while (!cpu->unplug || cpu_can_run(cpu));
> -
> -    hvf_vcpu_destroy(cpu);
> -    cpu_thread_signal_destroyed(cpu);
> -    qemu_mutex_unlock_iothread();
> -    rcu_unregister_thread();
> -    return NULL;
> -}
> -
> -static void hvf_start_vcpu_thread(CPUState *cpu)
> -{
> -    char thread_name[VCPU_THREAD_NAME_SIZE];
> -
> -    /*
> -     * HVF currently does not support TCG, and only runs in
> -     * unrestricted-guest mode.
> -     */
> -    assert(hvf_enabled());
> -
> -    cpu->thread = g_malloc0(sizeof(QemuThread));
> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
> -    qemu_cond_init(cpu->halt_cond);
> -
> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
> -             cpu->cpu_index);
> -    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
> -                       cpu, QEMU_THREAD_JOINABLE);
> -}
> -
> -const CpusAccel hvf_cpus = {
> -    .create_vcpu_thread = hvf_start_vcpu_thread,
> -
> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
> -    .synchronize_state = hvf_cpu_synchronize_state,
> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
> -};
> diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
> deleted file mode 100644
> index ced31b82c0..0000000000
> --- a/target/i386/hvf/hvf-cpus.h
> +++ /dev/null
> @@ -1,25 +0,0 @@
> -/*
> - * Accelerator CPUS Interface
> - *
> - * Copyright 2020 SUSE LLC
> - *
> - * This work is licensed under the terms of the GNU GPL, version 2 or later.
> - * See the COPYING file in the top-level directory.
> - */
> -
> -#ifndef HVF_CPUS_H
> -#define HVF_CPUS_H
> -
> -#include "sysemu/cpus.h"
> -
> -extern const CpusAccel hvf_cpus;
> -
> -int hvf_init_vcpu(CPUState *);
> -int hvf_vcpu_exec(CPUState *);
> -void hvf_cpu_synchronize_state(CPUState *);
> -void hvf_cpu_synchronize_post_reset(CPUState *);
> -void hvf_cpu_synchronize_post_init(CPUState *);
> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
> -void hvf_vcpu_destroy(CPUState *);
> -
> -#endif /* HVF_CPUS_H */
> diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
> index e0edffd077..6d56f8f6bb 100644
> --- a/target/i386/hvf/hvf-i386.h
> +++ b/target/i386/hvf/hvf-i386.h
> @@ -18,57 +18,11 @@
>  
>  #include "sysemu/accel.h"
>  #include "sysemu/hvf.h"
> +#include "sysemu/hvf_int.h"
>  #include "cpu.h"
>  #include "x86.h"
>  
> -#define HVF_MAX_VCPU 0x10
> -
> -extern struct hvf_state hvf_global;
> -
> -struct hvf_vm {
> -    int id;
> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
> -};
> -
> -struct hvf_state {
> -    uint32_t version;
> -    struct hvf_vm *vm;
> -    uint64_t mem_quota;
> -};
> -
> -/* hvf_slot flags */
> -#define HVF_SLOT_LOG (1 << 0)
> -
> -typedef struct hvf_slot {
> -    uint64_t start;
> -    uint64_t size;
> -    uint8_t *mem;
> -    int slot_id;
> -    uint32_t flags;
> -    MemoryRegion *region;
> -} hvf_slot;
> -
> -typedef struct hvf_vcpu_caps {
> -    uint64_t vmx_cap_pinbased;
> -    uint64_t vmx_cap_procbased;
> -    uint64_t vmx_cap_procbased2;
> -    uint64_t vmx_cap_entry;
> -    uint64_t vmx_cap_exit;
> -    uint64_t vmx_cap_preemption_timer;
> -} hvf_vcpu_caps;
> -
> -struct HVFState {
> -    AccelState parent;
> -    hvf_slot slots[32];
> -    int num_slots;
> -
> -    hvf_vcpu_caps *hvf_caps;
> -};
> -extern HVFState *hvf_state;
> -
> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
>  void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>  
>  #ifdef NEED_CPU_H
>  /* Functions exported to host specific mode */
> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
> index ed9356565c..8b96ecd619 100644
> --- a/target/i386/hvf/hvf.c
> +++ b/target/i386/hvf/hvf.c
> @@ -51,6 +51,7 @@
>  #include "qemu/error-report.h"
>  
>  #include "sysemu/hvf.h"
> +#include "sysemu/hvf_int.h"
>  #include "sysemu/runstate.h"
>  #include "hvf-i386.h"
>  #include "vmcs.h"
> @@ -72,171 +73,6 @@
>  #include "sysemu/accel.h"
>  #include "target/i386/cpu.h"
>  
> -#include "hvf-cpus.h"
> -
> -HVFState *hvf_state;
> -
> -static void assert_hvf_ok(hv_return_t ret)
> -{
> -    if (ret == HV_SUCCESS) {
> -        return;
> -    }
> -
> -    switch (ret) {
> -    case HV_ERROR:
> -        error_report("Error: HV_ERROR");
> -        break;
> -    case HV_BUSY:
> -        error_report("Error: HV_BUSY");
> -        break;
> -    case HV_BAD_ARGUMENT:
> -        error_report("Error: HV_BAD_ARGUMENT");
> -        break;
> -    case HV_NO_RESOURCES:
> -        error_report("Error: HV_NO_RESOURCES");
> -        break;
> -    case HV_NO_DEVICE:
> -        error_report("Error: HV_NO_DEVICE");
> -        break;
> -    case HV_UNSUPPORTED:
> -        error_report("Error: HV_UNSUPPORTED");
> -        break;
> -    default:
> -        error_report("Unknown Error");
> -    }
> -
> -    abort();
> -}
> -
> -/* Memory slots */
> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
> -{
> -    hvf_slot *slot;
> -    int x;
> -    for (x = 0; x < hvf_state->num_slots; ++x) {
> -        slot = &hvf_state->slots[x];
> -        if (slot->size && start < (slot->start + slot->size) &&
> -            (start + size) > slot->start) {
> -            return slot;
> -        }
> -    }
> -    return NULL;
> -}
> -
> -struct mac_slot {
> -    int present;
> -    uint64_t size;
> -    uint64_t gpa_start;
> -    uint64_t gva;
> -};
> -
> -struct mac_slot mac_slots[32];
> -
> -static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
> -{
> -    struct mac_slot *macslot;
> -    hv_return_t ret;
> -
> -    macslot = &mac_slots[slot->slot_id];
> -
> -    if (macslot->present) {
> -        if (macslot->size != slot->size) {
> -            macslot->present = 0;
> -            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
> -            assert_hvf_ok(ret);
> -        }
> -    }
> -
> -    if (!slot->size) {
> -        return 0;
> -    }
> -
> -    macslot->present = 1;
> -    macslot->gpa_start = slot->start;
> -    macslot->size = slot->size;
> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size, flags);
> -    assert_hvf_ok(ret);
> -    return 0;
> -}
> -
> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
> -{
> -    hvf_slot *mem;
> -    MemoryRegion *area = section->mr;
> -    bool writeable = !area->readonly && !area->rom_device;
> -    hv_memory_flags_t flags;
> -
> -    if (!memory_region_is_ram(area)) {
> -        if (writeable) {
> -            return;
> -        } else if (!memory_region_is_romd(area)) {
> -            /*
> -             * If the memory device is not in romd_mode, then we actually want
> -             * to remove the hvf memory slot so all accesses will trap.
> -             */
> -             add = false;
> -        }
> -    }
> -
> -    mem = hvf_find_overlap_slot(
> -            section->offset_within_address_space,
> -            int128_get64(section->size));
> -
> -    if (mem && add) {
> -        if (mem->size == int128_get64(section->size) &&
> -            mem->start == section->offset_within_address_space &&
> -            mem->mem == (memory_region_get_ram_ptr(area) +
> -            section->offset_within_region)) {
> -            return; /* Same region was attempted to register, go away. */
> -        }
> -    }
> -
> -    /* Region needs to be reset. set the size to 0 and remap it. */
> -    if (mem) {
> -        mem->size = 0;
> -        if (do_hvf_set_memory(mem, 0)) {
> -            error_report("Failed to reset overlapping slot");
> -            abort();
> -        }
> -    }
> -
> -    if (!add) {
> -        return;
> -    }
> -
> -    if (area->readonly ||
> -        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
> -    } else {
> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
> -    }
> -
> -    /* Now make a new slot. */
> -    int x;
> -
> -    for (x = 0; x < hvf_state->num_slots; ++x) {
> -        mem = &hvf_state->slots[x];
> -        if (!mem->size) {
> -            break;
> -        }
> -    }
> -
> -    if (x == hvf_state->num_slots) {
> -        error_report("No free slots");
> -        abort();
> -    }
> -
> -    mem->size = int128_get64(section->size);
> -    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
> -    mem->start = section->offset_within_address_space;
> -    mem->region = area;
> -
> -    if (do_hvf_set_memory(mem, flags)) {
> -        error_report("Error registering new memory slot");
> -        abort();
> -    }
> -}
> -
>  void vmx_update_tpr(CPUState *cpu)
>  {
>      /* TODO: need integrate APIC handling */
> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t port, void *buffer,
>      }
>  }
>  
> -static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
> -{
> -    if (!cpu->vcpu_dirty) {
> -        hvf_get_registers(cpu);
> -        cpu->vcpu_dirty = true;
> -    }
> -}
> -
> -void hvf_cpu_synchronize_state(CPUState *cpu)
> -{
> -    if (!cpu->vcpu_dirty) {
> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
> -    }
> -}
> -
> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
> -                                              run_on_cpu_data arg)
> -{
> -    hvf_put_registers(cpu);
> -    cpu->vcpu_dirty = false;
> -}
> -
> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
> -{
> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
> -}
> -
> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
> -                                             run_on_cpu_data arg)
> -{
> -    hvf_put_registers(cpu);
> -    cpu->vcpu_dirty = false;
> -}
> -
> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
> -{
> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
> -}
> -
> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
> -                                              run_on_cpu_data arg)
> -{
> -    cpu->vcpu_dirty = true;
> -}
> -
> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
> -{
> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
> -}
> -
>  static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>  {
>      int read, write;
> @@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>      return false;
>  }
>  
> -static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
> -{
> -    hvf_slot *slot;
> -
> -    slot = hvf_find_overlap_slot(
> -            section->offset_within_address_space,
> -            int128_get64(section->size));
> -
> -    /* protect region against writes; begin tracking it */
> -    if (on) {
> -        slot->flags |= HVF_SLOT_LOG;
> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
> -                      HV_MEMORY_READ);
> -    /* stop tracking region*/
> -    } else {
> -        slot->flags &= ~HVF_SLOT_LOG;
> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
> -    }
> -}
> -
> -static void hvf_log_start(MemoryListener *listener,
> -                          MemoryRegionSection *section, int old, int new)
> -{
> -    if (old != 0) {
> -        return;
> -    }
> -
> -    hvf_set_dirty_tracking(section, 1);
> -}
> -
> -static void hvf_log_stop(MemoryListener *listener,
> -                         MemoryRegionSection *section, int old, int new)
> -{
> -    if (new != 0) {
> -        return;
> -    }
> -
> -    hvf_set_dirty_tracking(section, 0);
> -}
> -
> -static void hvf_log_sync(MemoryListener *listener,
> -                         MemoryRegionSection *section)
> -{
> -    /*
> -     * sync of dirty pages is handled elsewhere; just make sure we keep
> -     * tracking the region.
> -     */
> -    hvf_set_dirty_tracking(section, 1);
> -}
> -
> -static void hvf_region_add(MemoryListener *listener,
> -                           MemoryRegionSection *section)
> -{
> -    hvf_set_phys_mem(section, true);
> -}
> -
> -static void hvf_region_del(MemoryListener *listener,
> -                           MemoryRegionSection *section)
> -{
> -    hvf_set_phys_mem(section, false);
> -}
> -
> -static MemoryListener hvf_memory_listener = {
> -    .priority = 10,
> -    .region_add = hvf_region_add,
> -    .region_del = hvf_region_del,
> -    .log_start = hvf_log_start,
> -    .log_stop = hvf_log_stop,
> -    .log_sync = hvf_log_sync,
> -};
> -
> -void hvf_vcpu_destroy(CPUState *cpu)
> +void hvf_arch_vcpu_destroy(CPUState *cpu)
>  {
>      X86CPU *x86_cpu = X86_CPU(cpu);
>      CPUX86State *env = &x86_cpu->env;
>  
> -    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
>      g_free(env->hvf_mmio_buf);
> -    assert_hvf_ok(ret);
> -}
> -
> -static void dummy_signal(int sig)
> -{
>  }
>  
> -int hvf_init_vcpu(CPUState *cpu)
> +int hvf_arch_init_vcpu(CPUState *cpu)
>  {
>  
>      X86CPU *x86cpu = X86_CPU(cpu);
>      CPUX86State *env = &x86cpu->env;
> -    int r;
> -
> -    /* init cpu signals */
> -    sigset_t set;
> -    struct sigaction sigact;
> -
> -    memset(&sigact, 0, sizeof(sigact));
> -    sigact.sa_handler = dummy_signal;
> -    sigaction(SIG_IPI, &sigact, NULL);
> -
> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> -    sigdelset(&set, SIG_IPI);
>  
>      init_emu();
>      init_decoder();
> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
>      hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
>      env->hvf_mmio_buf = g_new(char, 4096);
>  
> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
> -    cpu->vcpu_dirty = 1;
> -    assert_hvf_ok(r);
> -
>      if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
>          &hvf_state->hvf_caps->vmx_cap_pinbased)) {
>          abort();
> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
>  
>      return ret;
>  }
> -
> -bool hvf_allowed;
> -
> -static int hvf_accel_init(MachineState *ms)
> -{
> -    int x;
> -    hv_return_t ret;
> -    HVFState *s;
> -
> -    ret = hv_vm_create(HV_VM_DEFAULT);
> -    assert_hvf_ok(ret);
> -
> -    s = g_new0(HVFState, 1);
> - 
> -    s->num_slots = 32;
> -    for (x = 0; x < s->num_slots; ++x) {
> -        s->slots[x].size = 0;
> -        s->slots[x].slot_id = x;
> -    }
> -  
> -    hvf_state = s;
> -    memory_listener_register(&hvf_memory_listener, &address_space_memory);
> -    cpus_register_accel(&hvf_cpus);
> -    return 0;
> -}
> -
> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
> -{
> -    AccelClass *ac = ACCEL_CLASS(oc);
> -    ac->name = "HVF";
> -    ac->init_machine = hvf_accel_init;
> -    ac->allowed = &hvf_allowed;
> -}
> -
> -static const TypeInfo hvf_accel_type = {
> -    .name = TYPE_HVF_ACCEL,
> -    .parent = TYPE_ACCEL,
> -    .class_init = hvf_accel_class_init,
> -};
> -
> -static void hvf_type_init(void)
> -{
> -    type_register_static(&hvf_accel_type);
> -}
> -
> -type_init(hvf_type_init);
> diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
> index 409c9a3f14..c8a43717ee 100644
> --- a/target/i386/hvf/meson.build
> +++ b/target/i386/hvf/meson.build
> @@ -1,6 +1,5 @@
>  i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
>    'hvf.c',
> -  'hvf-cpus.c',
>    'x86.c',
>    'x86_cpuid.c',
>    'x86_decode.c',
> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
> index bbec412b6c..89b8e9d87a 100644
> --- a/target/i386/hvf/x86hvf.c
> +++ b/target/i386/hvf/x86hvf.c
> @@ -20,6 +20,9 @@
>  #include "qemu/osdep.h"
>  
>  #include "qemu-common.h"
> +#include "sysemu/hvf.h"
> +#include "sysemu/hvf_int.h"
> +#include "sysemu/hw_accel.h"
>  #include "x86hvf.h"
>  #include "vmx.h"
>  #include "vmcs.h"
> @@ -32,8 +35,6 @@
>  #include <Hypervisor/hv.h>
>  #include <Hypervisor/hv_vmx.h>
>  
> -#include "hvf-cpus.h"
> -
>  void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
>                       SegmentCache *qseg, bool is_tr)
>  {
> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
>      env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
>  
>      if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
> -        hvf_cpu_synchronize_state(cpu_state);
> +        cpu_synchronize_state(cpu_state);
>          do_cpu_init(cpu);
>      }
>  
> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
>          cpu_state->halted = 0;
>      }
>      if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
> -        hvf_cpu_synchronize_state(cpu_state);
> +        cpu_synchronize_state(cpu_state);
>          do_cpu_sipi(cpu);
>      }
>      if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
>          cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
> -        hvf_cpu_synchronize_state(cpu_state);
> +        cpu_synchronize_state(cpu_state);

The changes from hvf_cpu_*() to cpu_*() are cleanup and perhaps should
be a separate patch. It follows cpu/accel cleanups Claudio was doing the
summer.

Phillipe raised the idea that the patch might go ahead of ARM-specific
part (which might involve some discussions) and I agree with that.

Some sync between Claudio series (CC'd him) and the patch might be need.

Thanks,
Roman

>          apic_handle_tpr_access_report(cpu->apic_state, env->eip,
>                                        env->tpr_access_type);
>      }
> diff --git a/target/i386/hvf/x86hvf.h b/target/i386/hvf/x86hvf.h
> index 635ab0f34e..99ed8d608d 100644
> --- a/target/i386/hvf/x86hvf.h
> +++ b/target/i386/hvf/x86hvf.h
> @@ -21,8 +21,6 @@
>  #include "x86_descr.h"
>  
>  int hvf_process_events(CPUState *);
> -int hvf_put_registers(CPUState *);
> -int hvf_get_registers(CPUState *);
>  bool hvf_inject_interrupts(CPUState *);
>  void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
>                       SegmentCache *qseg, bool is_tr);
> -- 
> 2.24.3 (Apple Git-128)
> 
> 
>

Re: [PATCH 2/8] hvf: Move common code out

Posted by Alexander Graf 4 years, 3 months ago

On 27.11.20 21:00, Roman Bolshakov wrote:
> On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
>> Until now, Hypervisor.framework has only been available on x86_64 systems.
>> With Apple Silicon shipping now, it extends its reach to aarch64. To
>> prepare for support for multiple architectures, let's move common code out
>> into its own accel directory.
>>
>> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>> ---
>>   MAINTAINERS                 |   9 +-
>>   accel/hvf/hvf-all.c         |  56 +++++
>>   accel/hvf/hvf-cpus.c        | 468 ++++++++++++++++++++++++++++++++++++
>>   accel/hvf/meson.build       |   7 +
>>   accel/meson.build           |   1 +
>>   include/sysemu/hvf_int.h    |  69 ++++++
>>   target/i386/hvf/hvf-cpus.c  | 131 ----------
>>   target/i386/hvf/hvf-cpus.h  |  25 --
>>   target/i386/hvf/hvf-i386.h  |  48 +---
>>   target/i386/hvf/hvf.c       | 360 +--------------------------
>>   target/i386/hvf/meson.build |   1 -
>>   target/i386/hvf/x86hvf.c    |  11 +-
>>   target/i386/hvf/x86hvf.h    |   2 -
>>   13 files changed, 619 insertions(+), 569 deletions(-)
>>   create mode 100644 accel/hvf/hvf-all.c
>>   create mode 100644 accel/hvf/hvf-cpus.c
>>   create mode 100644 accel/hvf/meson.build
>>   create mode 100644 include/sysemu/hvf_int.h
>>   delete mode 100644 target/i386/hvf/hvf-cpus.c
>>   delete mode 100644 target/i386/hvf/hvf-cpus.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 68bc160f41..ca4b6d9279 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
>>   M: Roman Bolshakov <r.bolshakov@yadro.com>
>>   W: https://wiki.qemu.org/Features/HVF
>>   S: Maintained
>> -F: accel/stubs/hvf-stub.c
> There was a patch for that in the RFC series from Claudio.


Yeah, I'm not worried about this hunk :).


>
>>   F: target/i386/hvf/
>> +
>> +HVF
>> +M: Cameron Esfahani <dirty@apple.com>
>> +M: Roman Bolshakov <r.bolshakov@yadro.com>
>> +W: https://wiki.qemu.org/Features/HVF
>> +S: Maintained
>> +F: accel/hvf/
>>   F: include/sysemu/hvf.h
>> +F: include/sysemu/hvf_int.h
>>   
>>   WHPX CPUs
>>   M: Sunil Muthuswamy <sunilmut@microsoft.com>
>> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
>> new file mode 100644
>> index 0000000000..47d77a472a
>> --- /dev/null
>> +++ b/accel/hvf/hvf-all.c
>> @@ -0,0 +1,56 @@
>> +/*
>> + * QEMU Hypervisor.framework support
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>> + * the COPYING file in the top-level directory.
>> + *
>> + * Contributions after 2012-01-13 are licensed under the terms of the
>> + * GNU GPL, version 2 or (at your option) any later version.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu-common.h"
>> +#include "qemu/error-report.h"
>> +#include "sysemu/hvf.h"
>> +#include "sysemu/hvf_int.h"
>> +#include "sysemu/runstate.h"
>> +
>> +#include "qemu/main-loop.h"
>> +#include "sysemu/accel.h"
>> +
>> +#include <Hypervisor/Hypervisor.h>
>> +
>> +bool hvf_allowed;
>> +HVFState *hvf_state;
>> +
>> +void assert_hvf_ok(hv_return_t ret)
>> +{
>> +    if (ret == HV_SUCCESS) {
>> +        return;
>> +    }
>> +
>> +    switch (ret) {
>> +    case HV_ERROR:
>> +        error_report("Error: HV_ERROR");
>> +        break;
>> +    case HV_BUSY:
>> +        error_report("Error: HV_BUSY");
>> +        break;
>> +    case HV_BAD_ARGUMENT:
>> +        error_report("Error: HV_BAD_ARGUMENT");
>> +        break;
>> +    case HV_NO_RESOURCES:
>> +        error_report("Error: HV_NO_RESOURCES");
>> +        break;
>> +    case HV_NO_DEVICE:
>> +        error_report("Error: HV_NO_DEVICE");
>> +        break;
>> +    case HV_UNSUPPORTED:
>> +        error_report("Error: HV_UNSUPPORTED");
>> +        break;
>> +    default:
>> +        error_report("Unknown Error");
>> +    }
>> +
>> +    abort();
>> +}
>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>> new file mode 100644
>> index 0000000000..f9bb5502b7
>> --- /dev/null
>> +++ b/accel/hvf/hvf-cpus.c
>> @@ -0,0 +1,468 @@
>> +/*
>> + * Copyright 2008 IBM Corporation
>> + *           2008 Red Hat, Inc.
>> + * Copyright 2011 Intel Corporation
>> + * Copyright 2016 Veertu, Inc.
>> + * Copyright 2017 The Android Open Source Project
>> + *
>> + * QEMU Hypervisor.framework support
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of version 2 of the GNU General Public
>> + * License as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + *
>> + * This file contain code under public domain from the hvdos project:
>> + * https://github.com/mist64/hvdos
>> + *
>> + * Parts Copyright (c) 2011 NetApp, Inc.
>> + * All rights reserved.
>> + *
>> + * Redistribution and use in source and binary forms, with or without
>> + * modification, are permitted provided that the following conditions
>> + * are met:
>> + * 1. Redistributions of source code must retain the above copyright
>> + *    notice, this list of conditions and the following disclaimer.
>> + * 2. Redistributions in binary form must reproduce the above copyright
>> + *    notice, this list of conditions and the following disclaimer in the
>> + *    documentation and/or other materials provided with the distribution.
>> + *
>> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
>> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
>> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
>> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
>> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>> + * SUCH DAMAGE.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/error-report.h"
>> +#include "qemu/main-loop.h"
>> +#include "exec/address-spaces.h"
>> +#include "exec/exec-all.h"
>> +#include "sysemu/cpus.h"
>> +#include "sysemu/hvf.h"
>> +#include "sysemu/hvf_int.h"
>> +#include "sysemu/runstate.h"
>> +#include "qemu/guest-random.h"
>> +
>> +#include <Hypervisor/Hypervisor.h>
>> +
>> +/* Memory slots */
>> +
>> +struct mac_slot {
>> +    int present;
>> +    uint64_t size;
>> +    uint64_t gpa_start;
>> +    uint64_t gva;
>> +};
>> +
>> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>> +{
>> +    hvf_slot *slot;
>> +    int x;
>> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>> +        slot = &hvf_state->slots[x];
>> +        if (slot->size && start < (slot->start + slot->size) &&
>> +            (start + size) > slot->start) {
>> +            return slot;
>> +        }
>> +    }
>> +    return NULL;
>> +}
>> +
>> +struct mac_slot mac_slots[32];
>> +
>> +static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>> +{
>> +    struct mac_slot *macslot;
>> +    hv_return_t ret;
>> +
>> +    macslot = &mac_slots[slot->slot_id];
>> +
>> +    if (macslot->present) {
>> +        if (macslot->size != slot->size) {
>> +            macslot->present = 0;
>> +            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>> +            assert_hvf_ok(ret);
>> +        }
>> +    }
>> +
>> +    if (!slot->size) {
>> +        return 0;
>> +    }
>> +
>> +    macslot->present = 1;
>> +    macslot->gpa_start = slot->start;
>> +    macslot->size = slot->size;
>> +    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
>> +    assert_hvf_ok(ret);
>> +    return 0;
>> +}
>> +
>> +static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>> +{
>> +    hvf_slot *mem;
>> +    MemoryRegion *area = section->mr;
>> +    bool writeable = !area->readonly && !area->rom_device;
>> +    hv_memory_flags_t flags;
>> +
>> +    if (!memory_region_is_ram(area)) {
>> +        if (writeable) {
>> +            return;
>> +        } else if (!memory_region_is_romd(area)) {
>> +            /*
>> +             * If the memory device is not in romd_mode, then we actually want
>> +             * to remove the hvf memory slot so all accesses will trap.
>> +             */
>> +             add = false;
>> +        }
>> +    }
>> +
>> +    mem = hvf_find_overlap_slot(
>> +            section->offset_within_address_space,
>> +            int128_get64(section->size));
>> +
>> +    if (mem && add) {
>> +        if (mem->size == int128_get64(section->size) &&
>> +            mem->start == section->offset_within_address_space &&
>> +            mem->mem == (memory_region_get_ram_ptr(area) +
>> +            section->offset_within_region)) {
>> +            return; /* Same region was attempted to register, go away. */
>> +        }
>> +    }
>> +
>> +    /* Region needs to be reset. set the size to 0 and remap it. */
>> +    if (mem) {
>> +        mem->size = 0;
>> +        if (do_hvf_set_memory(mem, 0)) {
>> +            error_report("Failed to reset overlapping slot");
>> +            abort();
>> +        }
>> +    }
>> +
>> +    if (!add) {
>> +        return;
>> +    }
>> +
>> +    if (area->readonly ||
>> +        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
>> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>> +    } else {
>> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>> +    }
>> +
>> +    /* Now make a new slot. */
>> +    int x;
>> +
>> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>> +        mem = &hvf_state->slots[x];
>> +        if (!mem->size) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    if (x == hvf_state->num_slots) {
>> +        error_report("No free slots");
>> +        abort();
>> +    }
>> +
>> +    mem->size = int128_get64(section->size);
>> +    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
>> +    mem->start = section->offset_within_address_space;
>> +    mem->region = area;
>> +
>> +    if (do_hvf_set_memory(mem, flags)) {
>> +        error_report("Error registering new memory slot");
>> +        abort();
>> +    }
>> +}
>> +
>> +static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
>> +{
>> +    hvf_slot *slot;
>> +
>> +    slot = hvf_find_overlap_slot(
>> +            section->offset_within_address_space,
>> +            int128_get64(section->size));
>> +
>> +    /* protect region against writes; begin tracking it */
>> +    if (on) {
>> +        slot->flags |= HVF_SLOT_LOG;
>> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>> +                      HV_MEMORY_READ);
>> +    /* stop tracking region*/
>> +    } else {
>> +        slot->flags &= ~HVF_SLOT_LOG;
>> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>> +    }
>> +}
>> +
>> +static void hvf_log_start(MemoryListener *listener,
>> +                          MemoryRegionSection *section, int old, int new)
>> +{
>> +    if (old != 0) {
>> +        return;
>> +    }
>> +
>> +    hvf_set_dirty_tracking(section, 1);
>> +}
>> +
>> +static void hvf_log_stop(MemoryListener *listener,
>> +                         MemoryRegionSection *section, int old, int new)
>> +{
>> +    if (new != 0) {
>> +        return;
>> +    }
>> +
>> +    hvf_set_dirty_tracking(section, 0);
>> +}
>> +
>> +static void hvf_log_sync(MemoryListener *listener,
>> +                         MemoryRegionSection *section)
>> +{
>> +    /*
>> +     * sync of dirty pages is handled elsewhere; just make sure we keep
>> +     * tracking the region.
>> +     */
>> +    hvf_set_dirty_tracking(section, 1);
>> +}
>> +
>> +static void hvf_region_add(MemoryListener *listener,
>> +                           MemoryRegionSection *section)
>> +{
>> +    hvf_set_phys_mem(section, true);
>> +}
>> +
>> +static void hvf_region_del(MemoryListener *listener,
>> +                           MemoryRegionSection *section)
>> +{
>> +    hvf_set_phys_mem(section, false);
>> +}
>> +
>> +static MemoryListener hvf_memory_listener = {
>> +    .priority = 10,
>> +    .region_add = hvf_region_add,
>> +    .region_del = hvf_region_del,
>> +    .log_start = hvf_log_start,
>> +    .log_stop = hvf_log_stop,
>> +    .log_sync = hvf_log_sync,
>> +};
>> +
>> +static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
>> +{
>> +    if (!cpu->vcpu_dirty) {
>> +        hvf_get_registers(cpu);
>> +        cpu->vcpu_dirty = true;
>> +    }
>> +}
>> +
>> +static void hvf_cpu_synchronize_state(CPUState *cpu)
>> +{
>> +    if (!cpu->vcpu_dirty) {
>> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
>> +    }
>> +}
>> +
>> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>> +                                              run_on_cpu_data arg)
>> +{
>> +    hvf_put_registers(cpu);
>> +    cpu->vcpu_dirty = false;
>> +}
>> +
>> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>> +{
>> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
>> +}
>> +
>> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>> +                                             run_on_cpu_data arg)
>> +{
>> +    hvf_put_registers(cpu);
>> +    cpu->vcpu_dirty = false;
>> +}
>> +
>> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
>> +{
>> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
>> +}
>> +
>> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>> +                                              run_on_cpu_data arg)
>> +{
>> +    cpu->vcpu_dirty = true;
>> +}
>> +
>> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>> +{
>> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
>> +}
>> +
>> +static void hvf_vcpu_destroy(CPUState *cpu)
>> +{
>> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
>> +    assert_hvf_ok(ret);
>> +
>> +    hvf_arch_vcpu_destroy(cpu);
>> +}
>> +
>> +static void dummy_signal(int sig)
>> +{
>> +}
>> +
>> +static int hvf_init_vcpu(CPUState *cpu)
>> +{
>> +    int r;
>> +
>> +    /* init cpu signals */
>> +    sigset_t set;
>> +    struct sigaction sigact;
>> +
>> +    memset(&sigact, 0, sizeof(sigact));
>> +    sigact.sa_handler = dummy_signal;
>> +    sigaction(SIG_IPI, &sigact, NULL);
>> +
>> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
>> +    sigdelset(&set, SIG_IPI);
>> +
>> +#ifdef __aarch64__
>> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t **)&cpu->hvf_exit, NULL);
>> +#else
>> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>> +#endif
> I think the first __aarch64__ bit fits better to arm part of the series.


Oops. Thanks for catching it! Yes, absolutely. It should be part of the 
ARM enablement.


>
>> +    cpu->vcpu_dirty = 1;
>> +    assert_hvf_ok(r);
>> +
>> +    return hvf_arch_init_vcpu(cpu);
>> +}
>> +
>> +/*
>> + * The HVF-specific vCPU thread function. This one should only run when the host
>> + * CPU supports the VMX "unrestricted guest" feature.
>> + */
>> +static void *hvf_cpu_thread_fn(void *arg)
>> +{
>> +    CPUState *cpu = arg;
>> +
>> +    int r;
>> +
>> +    assert(hvf_enabled());
>> +
>> +    rcu_register_thread();
>> +
>> +    qemu_mutex_lock_iothread();
>> +    qemu_thread_get_self(cpu->thread);
>> +
>> +    cpu->thread_id = qemu_get_thread_id();
>> +    cpu->can_do_io = 1;
>> +    current_cpu = cpu;
>> +
>> +    hvf_init_vcpu(cpu);
>> +
>> +    /* signal CPU creation */
>> +    cpu_thread_signal_created(cpu);
>> +    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>> +
>> +    do {
>> +        if (cpu_can_run(cpu)) {
>> +            r = hvf_vcpu_exec(cpu);
>> +            if (r == EXCP_DEBUG) {
>> +                cpu_handle_guest_debug(cpu);
>> +            }
>> +        }
>> +        qemu_wait_io_event(cpu);
>> +    } while (!cpu->unplug || cpu_can_run(cpu));
>> +
>> +    hvf_vcpu_destroy(cpu);
>> +    cpu_thread_signal_destroyed(cpu);
>> +    qemu_mutex_unlock_iothread();
>> +    rcu_unregister_thread();
>> +    return NULL;
>> +}
>> +
>> +static void hvf_start_vcpu_thread(CPUState *cpu)
>> +{
>> +    char thread_name[VCPU_THREAD_NAME_SIZE];
>> +
>> +    /*
>> +     * HVF currently does not support TCG, and only runs in
>> +     * unrestricted-guest mode.
>> +     */
>> +    assert(hvf_enabled());
>> +
>> +    cpu->thread = g_malloc0(sizeof(QemuThread));
>> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>> +    qemu_cond_init(cpu->halt_cond);
>> +
>> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>> +             cpu->cpu_index);
>> +    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>> +                       cpu, QEMU_THREAD_JOINABLE);
>> +}
>> +
>> +static const CpusAccel hvf_cpus = {
>> +    .create_vcpu_thread = hvf_start_vcpu_thread,
>> +
>> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>> +    .synchronize_state = hvf_cpu_synchronize_state,
>> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>> +};
>> +
>> +static int hvf_accel_init(MachineState *ms)
>> +{
>> +    int x;
>> +    hv_return_t ret;
>> +    HVFState *s;
>> +
>> +    ret = hv_vm_create(HV_VM_DEFAULT);
>> +    assert_hvf_ok(ret);
>> +
>> +    s = g_new0(HVFState, 1);
>> +
>> +    s->num_slots = 32;
>> +    for (x = 0; x < s->num_slots; ++x) {
>> +        s->slots[x].size = 0;
>> +        s->slots[x].slot_id = x;
>> +    }
>> +
>> +    hvf_state = s;
>> +    memory_listener_register(&hvf_memory_listener, &address_space_memory);
>> +    cpus_register_accel(&hvf_cpus);
>> +    return 0;
>> +}
>> +
>> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
>> +{
>> +    AccelClass *ac = ACCEL_CLASS(oc);
>> +    ac->name = "HVF";
>> +    ac->init_machine = hvf_accel_init;
>> +    ac->allowed = &hvf_allowed;
>> +}
>> +
>> +static const TypeInfo hvf_accel_type = {
>> +    .name = TYPE_HVF_ACCEL,
>> +    .parent = TYPE_ACCEL,
>> +    .class_init = hvf_accel_class_init,
>> +};
>> +
>> +static void hvf_type_init(void)
>> +{
>> +    type_register_static(&hvf_accel_type);
>> +}
>> +
>> +type_init(hvf_type_init);
>> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
>> new file mode 100644
>> index 0000000000..dfd6b68dc7
>> --- /dev/null
>> +++ b/accel/hvf/meson.build
>> @@ -0,0 +1,7 @@
>> +hvf_ss = ss.source_set()
>> +hvf_ss.add(files(
>> +  'hvf-all.c',
>> +  'hvf-cpus.c',
>> +))
>> +
>> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
>> diff --git a/accel/meson.build b/accel/meson.build
>> index b26cca227a..6de12ce5d5 100644
>> --- a/accel/meson.build
>> +++ b/accel/meson.build
>> @@ -1,5 +1,6 @@
>>   softmmu_ss.add(files('accel.c'))
>>   
>> +subdir('hvf')
>>   subdir('qtest')
>>   subdir('kvm')
>>   subdir('tcg')
>> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>> new file mode 100644
>> index 0000000000..de9bad23a8
>> --- /dev/null
>> +++ b/include/sysemu/hvf_int.h
>> @@ -0,0 +1,69 @@
>> +/*
>> + * QEMU Hypervisor.framework (HVF) support
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + *
>> + */
>> +
>> +/* header to be included in HVF-specific code */
>> +
>> +#ifndef HVF_INT_H
>> +#define HVF_INT_H
>> +
>> +#include <Hypervisor/Hypervisor.h>
>> +
>> +#define HVF_MAX_VCPU 0x10
>> +
>> +extern struct hvf_state hvf_global;
>> +
>> +struct hvf_vm {
>> +    int id;
>> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>> +};
>> +
>> +struct hvf_state {
>> +    uint32_t version;
>> +    struct hvf_vm *vm;
>> +    uint64_t mem_quota;
>> +};
>> +
>> +/* hvf_slot flags */
>> +#define HVF_SLOT_LOG (1 << 0)
>> +
>> +typedef struct hvf_slot {
>> +    uint64_t start;
>> +    uint64_t size;
>> +    uint8_t *mem;
>> +    int slot_id;
>> +    uint32_t flags;
>> +    MemoryRegion *region;
>> +} hvf_slot;
>> +
>> +typedef struct hvf_vcpu_caps {
>> +    uint64_t vmx_cap_pinbased;
>> +    uint64_t vmx_cap_procbased;
>> +    uint64_t vmx_cap_procbased2;
>> +    uint64_t vmx_cap_entry;
>> +    uint64_t vmx_cap_exit;
>> +    uint64_t vmx_cap_preemption_timer;
>> +} hvf_vcpu_caps;
>> +
>> +struct HVFState {
>> +    AccelState parent;
>> +    hvf_slot slots[32];
>> +    int num_slots;
>> +
>> +    hvf_vcpu_caps *hvf_caps;
>> +};
>> +extern HVFState *hvf_state;
>> +
>> +void assert_hvf_ok(hv_return_t ret);
>> +int hvf_get_registers(CPUState *cpu);
>> +int hvf_put_registers(CPUState *cpu);
>> +int hvf_arch_init_vcpu(CPUState *cpu);
>> +void hvf_arch_vcpu_destroy(CPUState *cpu);
>> +int hvf_vcpu_exec(CPUState *cpu);
>> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>> +
>> +#endif
>> diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
>> deleted file mode 100644
>> index 817b3d7452..0000000000
>> --- a/target/i386/hvf/hvf-cpus.c
>> +++ /dev/null
>> @@ -1,131 +0,0 @@
>> -/*
>> - * Copyright 2008 IBM Corporation
>> - *           2008 Red Hat, Inc.
>> - * Copyright 2011 Intel Corporation
>> - * Copyright 2016 Veertu, Inc.
>> - * Copyright 2017 The Android Open Source Project
>> - *
>> - * QEMU Hypervisor.framework support
>> - *
>> - * This program is free software; you can redistribute it and/or
>> - * modify it under the terms of version 2 of the GNU General Public
>> - * License as published by the Free Software Foundation.
>> - *
>> - * This program is distributed in the hope that it will be useful,
>> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> - * General Public License for more details.
>> - *
>> - * You should have received a copy of the GNU General Public License
>> - * along with this program; if not, see <http://www.gnu.org/licenses/>.
>> - *
>> - * This file contain code under public domain from the hvdos project:
>> - * https://github.com/mist64/hvdos
>> - *
>> - * Parts Copyright (c) 2011 NetApp, Inc.
>> - * All rights reserved.
>> - *
>> - * Redistribution and use in source and binary forms, with or without
>> - * modification, are permitted provided that the following conditions
>> - * are met:
>> - * 1. Redistributions of source code must retain the above copyright
>> - *    notice, this list of conditions and the following disclaimer.
>> - * 2. Redistributions in binary form must reproduce the above copyright
>> - *    notice, this list of conditions and the following disclaimer in the
>> - *    documentation and/or other materials provided with the distribution.
>> - *
>> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
>> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
>> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
>> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
>> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>> - * SUCH DAMAGE.
>> - */
>> -
>> -#include "qemu/osdep.h"
>> -#include "qemu/error-report.h"
>> -#include "qemu/main-loop.h"
>> -#include "sysemu/hvf.h"
>> -#include "sysemu/runstate.h"
>> -#include "target/i386/cpu.h"
>> -#include "qemu/guest-random.h"
>> -
>> -#include "hvf-cpus.h"
>> -
>> -/*
>> - * The HVF-specific vCPU thread function. This one should only run when the host
>> - * CPU supports the VMX "unrestricted guest" feature.
>> - */
>> -static void *hvf_cpu_thread_fn(void *arg)
>> -{
>> -    CPUState *cpu = arg;
>> -
>> -    int r;
>> -
>> -    assert(hvf_enabled());
>> -
>> -    rcu_register_thread();
>> -
>> -    qemu_mutex_lock_iothread();
>> -    qemu_thread_get_self(cpu->thread);
>> -
>> -    cpu->thread_id = qemu_get_thread_id();
>> -    cpu->can_do_io = 1;
>> -    current_cpu = cpu;
>> -
>> -    hvf_init_vcpu(cpu);
>> -
>> -    /* signal CPU creation */
>> -    cpu_thread_signal_created(cpu);
>> -    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>> -
>> -    do {
>> -        if (cpu_can_run(cpu)) {
>> -            r = hvf_vcpu_exec(cpu);
>> -            if (r == EXCP_DEBUG) {
>> -                cpu_handle_guest_debug(cpu);
>> -            }
>> -        }
>> -        qemu_wait_io_event(cpu);
>> -    } while (!cpu->unplug || cpu_can_run(cpu));
>> -
>> -    hvf_vcpu_destroy(cpu);
>> -    cpu_thread_signal_destroyed(cpu);
>> -    qemu_mutex_unlock_iothread();
>> -    rcu_unregister_thread();
>> -    return NULL;
>> -}
>> -
>> -static void hvf_start_vcpu_thread(CPUState *cpu)
>> -{
>> -    char thread_name[VCPU_THREAD_NAME_SIZE];
>> -
>> -    /*
>> -     * HVF currently does not support TCG, and only runs in
>> -     * unrestricted-guest mode.
>> -     */
>> -    assert(hvf_enabled());
>> -
>> -    cpu->thread = g_malloc0(sizeof(QemuThread));
>> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>> -    qemu_cond_init(cpu->halt_cond);
>> -
>> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>> -             cpu->cpu_index);
>> -    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>> -                       cpu, QEMU_THREAD_JOINABLE);
>> -}
>> -
>> -const CpusAccel hvf_cpus = {
>> -    .create_vcpu_thread = hvf_start_vcpu_thread,
>> -
>> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>> -    .synchronize_state = hvf_cpu_synchronize_state,
>> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>> -};
>> diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
>> deleted file mode 100644
>> index ced31b82c0..0000000000
>> --- a/target/i386/hvf/hvf-cpus.h
>> +++ /dev/null
>> @@ -1,25 +0,0 @@
>> -/*
>> - * Accelerator CPUS Interface
>> - *
>> - * Copyright 2020 SUSE LLC
>> - *
>> - * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> - * See the COPYING file in the top-level directory.
>> - */
>> -
>> -#ifndef HVF_CPUS_H
>> -#define HVF_CPUS_H
>> -
>> -#include "sysemu/cpus.h"
>> -
>> -extern const CpusAccel hvf_cpus;
>> -
>> -int hvf_init_vcpu(CPUState *);
>> -int hvf_vcpu_exec(CPUState *);
>> -void hvf_cpu_synchronize_state(CPUState *);
>> -void hvf_cpu_synchronize_post_reset(CPUState *);
>> -void hvf_cpu_synchronize_post_init(CPUState *);
>> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
>> -void hvf_vcpu_destroy(CPUState *);
>> -
>> -#endif /* HVF_CPUS_H */
>> diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
>> index e0edffd077..6d56f8f6bb 100644
>> --- a/target/i386/hvf/hvf-i386.h
>> +++ b/target/i386/hvf/hvf-i386.h
>> @@ -18,57 +18,11 @@
>>   
>>   #include "sysemu/accel.h"
>>   #include "sysemu/hvf.h"
>> +#include "sysemu/hvf_int.h"
>>   #include "cpu.h"
>>   #include "x86.h"
>>   
>> -#define HVF_MAX_VCPU 0x10
>> -
>> -extern struct hvf_state hvf_global;
>> -
>> -struct hvf_vm {
>> -    int id;
>> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>> -};
>> -
>> -struct hvf_state {
>> -    uint32_t version;
>> -    struct hvf_vm *vm;
>> -    uint64_t mem_quota;
>> -};
>> -
>> -/* hvf_slot flags */
>> -#define HVF_SLOT_LOG (1 << 0)
>> -
>> -typedef struct hvf_slot {
>> -    uint64_t start;
>> -    uint64_t size;
>> -    uint8_t *mem;
>> -    int slot_id;
>> -    uint32_t flags;
>> -    MemoryRegion *region;
>> -} hvf_slot;
>> -
>> -typedef struct hvf_vcpu_caps {
>> -    uint64_t vmx_cap_pinbased;
>> -    uint64_t vmx_cap_procbased;
>> -    uint64_t vmx_cap_procbased2;
>> -    uint64_t vmx_cap_entry;
>> -    uint64_t vmx_cap_exit;
>> -    uint64_t vmx_cap_preemption_timer;
>> -} hvf_vcpu_caps;
>> -
>> -struct HVFState {
>> -    AccelState parent;
>> -    hvf_slot slots[32];
>> -    int num_slots;
>> -
>> -    hvf_vcpu_caps *hvf_caps;
>> -};
>> -extern HVFState *hvf_state;
>> -
>> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
>>   void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
>> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>>   
>>   #ifdef NEED_CPU_H
>>   /* Functions exported to host specific mode */
>> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
>> index ed9356565c..8b96ecd619 100644
>> --- a/target/i386/hvf/hvf.c
>> +++ b/target/i386/hvf/hvf.c
>> @@ -51,6 +51,7 @@
>>   #include "qemu/error-report.h"
>>   
>>   #include "sysemu/hvf.h"
>> +#include "sysemu/hvf_int.h"
>>   #include "sysemu/runstate.h"
>>   #include "hvf-i386.h"
>>   #include "vmcs.h"
>> @@ -72,171 +73,6 @@
>>   #include "sysemu/accel.h"
>>   #include "target/i386/cpu.h"
>>   
>> -#include "hvf-cpus.h"
>> -
>> -HVFState *hvf_state;
>> -
>> -static void assert_hvf_ok(hv_return_t ret)
>> -{
>> -    if (ret == HV_SUCCESS) {
>> -        return;
>> -    }
>> -
>> -    switch (ret) {
>> -    case HV_ERROR:
>> -        error_report("Error: HV_ERROR");
>> -        break;
>> -    case HV_BUSY:
>> -        error_report("Error: HV_BUSY");
>> -        break;
>> -    case HV_BAD_ARGUMENT:
>> -        error_report("Error: HV_BAD_ARGUMENT");
>> -        break;
>> -    case HV_NO_RESOURCES:
>> -        error_report("Error: HV_NO_RESOURCES");
>> -        break;
>> -    case HV_NO_DEVICE:
>> -        error_report("Error: HV_NO_DEVICE");
>> -        break;
>> -    case HV_UNSUPPORTED:
>> -        error_report("Error: HV_UNSUPPORTED");
>> -        break;
>> -    default:
>> -        error_report("Unknown Error");
>> -    }
>> -
>> -    abort();
>> -}
>> -
>> -/* Memory slots */
>> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>> -{
>> -    hvf_slot *slot;
>> -    int x;
>> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>> -        slot = &hvf_state->slots[x];
>> -        if (slot->size && start < (slot->start + slot->size) &&
>> -            (start + size) > slot->start) {
>> -            return slot;
>> -        }
>> -    }
>> -    return NULL;
>> -}
>> -
>> -struct mac_slot {
>> -    int present;
>> -    uint64_t size;
>> -    uint64_t gpa_start;
>> -    uint64_t gva;
>> -};
>> -
>> -struct mac_slot mac_slots[32];
>> -
>> -static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>> -{
>> -    struct mac_slot *macslot;
>> -    hv_return_t ret;
>> -
>> -    macslot = &mac_slots[slot->slot_id];
>> -
>> -    if (macslot->present) {
>> -        if (macslot->size != slot->size) {
>> -            macslot->present = 0;
>> -            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>> -            assert_hvf_ok(ret);
>> -        }
>> -    }
>> -
>> -    if (!slot->size) {
>> -        return 0;
>> -    }
>> -
>> -    macslot->present = 1;
>> -    macslot->gpa_start = slot->start;
>> -    macslot->size = slot->size;
>> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size, flags);
>> -    assert_hvf_ok(ret);
>> -    return 0;
>> -}
>> -
>> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>> -{
>> -    hvf_slot *mem;
>> -    MemoryRegion *area = section->mr;
>> -    bool writeable = !area->readonly && !area->rom_device;
>> -    hv_memory_flags_t flags;
>> -
>> -    if (!memory_region_is_ram(area)) {
>> -        if (writeable) {
>> -            return;
>> -        } else if (!memory_region_is_romd(area)) {
>> -            /*
>> -             * If the memory device is not in romd_mode, then we actually want
>> -             * to remove the hvf memory slot so all accesses will trap.
>> -             */
>> -             add = false;
>> -        }
>> -    }
>> -
>> -    mem = hvf_find_overlap_slot(
>> -            section->offset_within_address_space,
>> -            int128_get64(section->size));
>> -
>> -    if (mem && add) {
>> -        if (mem->size == int128_get64(section->size) &&
>> -            mem->start == section->offset_within_address_space &&
>> -            mem->mem == (memory_region_get_ram_ptr(area) +
>> -            section->offset_within_region)) {
>> -            return; /* Same region was attempted to register, go away. */
>> -        }
>> -    }
>> -
>> -    /* Region needs to be reset. set the size to 0 and remap it. */
>> -    if (mem) {
>> -        mem->size = 0;
>> -        if (do_hvf_set_memory(mem, 0)) {
>> -            error_report("Failed to reset overlapping slot");
>> -            abort();
>> -        }
>> -    }
>> -
>> -    if (!add) {
>> -        return;
>> -    }
>> -
>> -    if (area->readonly ||
>> -        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
>> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>> -    } else {
>> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>> -    }
>> -
>> -    /* Now make a new slot. */
>> -    int x;
>> -
>> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>> -        mem = &hvf_state->slots[x];
>> -        if (!mem->size) {
>> -            break;
>> -        }
>> -    }
>> -
>> -    if (x == hvf_state->num_slots) {
>> -        error_report("No free slots");
>> -        abort();
>> -    }
>> -
>> -    mem->size = int128_get64(section->size);
>> -    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
>> -    mem->start = section->offset_within_address_space;
>> -    mem->region = area;
>> -
>> -    if (do_hvf_set_memory(mem, flags)) {
>> -        error_report("Error registering new memory slot");
>> -        abort();
>> -    }
>> -}
>> -
>>   void vmx_update_tpr(CPUState *cpu)
>>   {
>>       /* TODO: need integrate APIC handling */
>> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t port, void *buffer,
>>       }
>>   }
>>   
>> -static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
>> -{
>> -    if (!cpu->vcpu_dirty) {
>> -        hvf_get_registers(cpu);
>> -        cpu->vcpu_dirty = true;
>> -    }
>> -}
>> -
>> -void hvf_cpu_synchronize_state(CPUState *cpu)
>> -{
>> -    if (!cpu->vcpu_dirty) {
>> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
>> -    }
>> -}
>> -
>> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>> -                                              run_on_cpu_data arg)
>> -{
>> -    hvf_put_registers(cpu);
>> -    cpu->vcpu_dirty = false;
>> -}
>> -
>> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>> -{
>> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
>> -}
>> -
>> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>> -                                             run_on_cpu_data arg)
>> -{
>> -    hvf_put_registers(cpu);
>> -    cpu->vcpu_dirty = false;
>> -}
>> -
>> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
>> -{
>> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
>> -}
>> -
>> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>> -                                              run_on_cpu_data arg)
>> -{
>> -    cpu->vcpu_dirty = true;
>> -}
>> -
>> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>> -{
>> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
>> -}
>> -
>>   static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>>   {
>>       int read, write;
>> @@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>>       return false;
>>   }
>>   
>> -static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
>> -{
>> -    hvf_slot *slot;
>> -
>> -    slot = hvf_find_overlap_slot(
>> -            section->offset_within_address_space,
>> -            int128_get64(section->size));
>> -
>> -    /* protect region against writes; begin tracking it */
>> -    if (on) {
>> -        slot->flags |= HVF_SLOT_LOG;
>> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>> -                      HV_MEMORY_READ);
>> -    /* stop tracking region*/
>> -    } else {
>> -        slot->flags &= ~HVF_SLOT_LOG;
>> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>> -    }
>> -}
>> -
>> -static void hvf_log_start(MemoryListener *listener,
>> -                          MemoryRegionSection *section, int old, int new)
>> -{
>> -    if (old != 0) {
>> -        return;
>> -    }
>> -
>> -    hvf_set_dirty_tracking(section, 1);
>> -}
>> -
>> -static void hvf_log_stop(MemoryListener *listener,
>> -                         MemoryRegionSection *section, int old, int new)
>> -{
>> -    if (new != 0) {
>> -        return;
>> -    }
>> -
>> -    hvf_set_dirty_tracking(section, 0);
>> -}
>> -
>> -static void hvf_log_sync(MemoryListener *listener,
>> -                         MemoryRegionSection *section)
>> -{
>> -    /*
>> -     * sync of dirty pages is handled elsewhere; just make sure we keep
>> -     * tracking the region.
>> -     */
>> -    hvf_set_dirty_tracking(section, 1);
>> -}
>> -
>> -static void hvf_region_add(MemoryListener *listener,
>> -                           MemoryRegionSection *section)
>> -{
>> -    hvf_set_phys_mem(section, true);
>> -}
>> -
>> -static void hvf_region_del(MemoryListener *listener,
>> -                           MemoryRegionSection *section)
>> -{
>> -    hvf_set_phys_mem(section, false);
>> -}
>> -
>> -static MemoryListener hvf_memory_listener = {
>> -    .priority = 10,
>> -    .region_add = hvf_region_add,
>> -    .region_del = hvf_region_del,
>> -    .log_start = hvf_log_start,
>> -    .log_stop = hvf_log_stop,
>> -    .log_sync = hvf_log_sync,
>> -};
>> -
>> -void hvf_vcpu_destroy(CPUState *cpu)
>> +void hvf_arch_vcpu_destroy(CPUState *cpu)
>>   {
>>       X86CPU *x86_cpu = X86_CPU(cpu);
>>       CPUX86State *env = &x86_cpu->env;
>>   
>> -    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
>>       g_free(env->hvf_mmio_buf);
>> -    assert_hvf_ok(ret);
>> -}
>> -
>> -static void dummy_signal(int sig)
>> -{
>>   }
>>   
>> -int hvf_init_vcpu(CPUState *cpu)
>> +int hvf_arch_init_vcpu(CPUState *cpu)
>>   {
>>   
>>       X86CPU *x86cpu = X86_CPU(cpu);
>>       CPUX86State *env = &x86cpu->env;
>> -    int r;
>> -
>> -    /* init cpu signals */
>> -    sigset_t set;
>> -    struct sigaction sigact;
>> -
>> -    memset(&sigact, 0, sizeof(sigact));
>> -    sigact.sa_handler = dummy_signal;
>> -    sigaction(SIG_IPI, &sigact, NULL);
>> -
>> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>> -    sigdelset(&set, SIG_IPI);
>>   
>>       init_emu();
>>       init_decoder();
>> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
>>       hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
>>       env->hvf_mmio_buf = g_new(char, 4096);
>>   
>> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>> -    cpu->vcpu_dirty = 1;
>> -    assert_hvf_ok(r);
>> -
>>       if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
>>           &hvf_state->hvf_caps->vmx_cap_pinbased)) {
>>           abort();
>> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
>>   
>>       return ret;
>>   }
>> -
>> -bool hvf_allowed;
>> -
>> -static int hvf_accel_init(MachineState *ms)
>> -{
>> -    int x;
>> -    hv_return_t ret;
>> -    HVFState *s;
>> -
>> -    ret = hv_vm_create(HV_VM_DEFAULT);
>> -    assert_hvf_ok(ret);
>> -
>> -    s = g_new0(HVFState, 1);
>> -
>> -    s->num_slots = 32;
>> -    for (x = 0; x < s->num_slots; ++x) {
>> -        s->slots[x].size = 0;
>> -        s->slots[x].slot_id = x;
>> -    }
>> -
>> -    hvf_state = s;
>> -    memory_listener_register(&hvf_memory_listener, &address_space_memory);
>> -    cpus_register_accel(&hvf_cpus);
>> -    return 0;
>> -}
>> -
>> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
>> -{
>> -    AccelClass *ac = ACCEL_CLASS(oc);
>> -    ac->name = "HVF";
>> -    ac->init_machine = hvf_accel_init;
>> -    ac->allowed = &hvf_allowed;
>> -}
>> -
>> -static const TypeInfo hvf_accel_type = {
>> -    .name = TYPE_HVF_ACCEL,
>> -    .parent = TYPE_ACCEL,
>> -    .class_init = hvf_accel_class_init,
>> -};
>> -
>> -static void hvf_type_init(void)
>> -{
>> -    type_register_static(&hvf_accel_type);
>> -}
>> -
>> -type_init(hvf_type_init);
>> diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
>> index 409c9a3f14..c8a43717ee 100644
>> --- a/target/i386/hvf/meson.build
>> +++ b/target/i386/hvf/meson.build
>> @@ -1,6 +1,5 @@
>>   i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
>>     'hvf.c',
>> -  'hvf-cpus.c',
>>     'x86.c',
>>     'x86_cpuid.c',
>>     'x86_decode.c',
>> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
>> index bbec412b6c..89b8e9d87a 100644
>> --- a/target/i386/hvf/x86hvf.c
>> +++ b/target/i386/hvf/x86hvf.c
>> @@ -20,6 +20,9 @@
>>   #include "qemu/osdep.h"
>>   
>>   #include "qemu-common.h"
>> +#include "sysemu/hvf.h"
>> +#include "sysemu/hvf_int.h"
>> +#include "sysemu/hw_accel.h"
>>   #include "x86hvf.h"
>>   #include "vmx.h"
>>   #include "vmcs.h"
>> @@ -32,8 +35,6 @@
>>   #include <Hypervisor/hv.h>
>>   #include <Hypervisor/hv_vmx.h>
>>   
>> -#include "hvf-cpus.h"
>> -
>>   void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
>>                        SegmentCache *qseg, bool is_tr)
>>   {
>> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
>>       env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
>>   
>>       if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
>> -        hvf_cpu_synchronize_state(cpu_state);
>> +        cpu_synchronize_state(cpu_state);
>>           do_cpu_init(cpu);
>>       }
>>   
>> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
>>           cpu_state->halted = 0;
>>       }
>>       if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
>> -        hvf_cpu_synchronize_state(cpu_state);
>> +        cpu_synchronize_state(cpu_state);
>>           do_cpu_sipi(cpu);
>>       }
>>       if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
>>           cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
>> -        hvf_cpu_synchronize_state(cpu_state);
>> +        cpu_synchronize_state(cpu_state);
> The changes from hvf_cpu_*() to cpu_*() are cleanup and perhaps should
> be a separate patch. It follows cpu/accel cleanups Claudio was doing the
> summer.


The only reason they're in here is because we no longer have access to 
the hvf_ functions from the file. I am perfectly happy to rebase the 
patch on top of Claudio's if his goes in first. I'm sure it'll be 
trivial for him to rebase on top of this too if my series goes in first.


>
> Phillipe raised the idea that the patch might go ahead of ARM-specific
> part (which might involve some discussions) and I agree with that.
>
> Some sync between Claudio series (CC'd him) and the patch might be need.


I would prefer not to hold back because of the sync. Claudio's cleanup 
is trivial enough to adjust for if it gets merged ahead of this.


Alex

Re: [PATCH 2/8] hvf: Move common code out

Posted by Frank Yang 4 years, 3 months ago

Hi all,

+Peter Collingbourne <pcc@google.com>

I'm a developer on the Android Emulator, which is in a fork of QEMU.

Peter and I have been working on an HVF Apple Silicon backend with an eye
toward Android guests.

We have gotten things to basically switch to Android userspace already
(logcat/shell and graphics available at least)

Our strategy so far has been to import logic from the KVM implementation
and hook into QEMU's software devices that previously assumed to only work
with TCG, or have KVM-specific paths.

Thanks to Alexander for the tip on the 36-bit address space limitation btw;
our way of addressing this is to still allow highmem but not put pci high
mmio so high.

Also, note we have a sleep/signal based mechanism to deal with WFx, which
might be worth looking into in Alexander's implementation as well:

https://android-review.googlesource.com/c/platform/external/qemu/+/1512551

Patches so far, FYI:

https://android-review.googlesource.com/c/platform/external/qemu/+/1513429/1
https://android-review.googlesource.com/c/platform/external/qemu/+/1512554/3
https://android-review.googlesource.com/c/platform/external/qemu/+/1512553/3
https://android-review.googlesource.com/c/platform/external/qemu/+/1512552/3
https://android-review.googlesource.com/c/platform/external/qemu/+/1512551/3

https://android.googlesource.com/platform/external/qemu/+/c17eb6a3ffd50047e9646aff6640b710cb8ff48a
https://android.googlesource.com/platform/external/qemu/+/74bed16de1afb41b7a7ab8da1d1861226c9db63b
https://android.googlesource.com/platform/external/qemu/+/eccd9e47ab2ccb9003455e3bb721f57f9ebc3c01
https://android.googlesource.com/platform/external/qemu/+/54fe3d67ed4698e85826537a4f49b2b9074b2228
https://android.googlesource.com/platform/external/qemu/+/82ef91a6fede1d1000f36be037ad4d58fbe0d102
https://android.googlesource.com/platform/external/qemu/+/c28147aa7c74d98b858e99623d2fe46e74a379f6

Peter's also noticed that there are extra steps needed for M1's to allow
TCG to work, as it involves JIT:

https://android.googlesource.com/platform/external/qemu/+/740e3fe47f88926c6bda9abb22ee6eae1bc254a9

We'd appreciate any feedback/comments :)

Best,

Frank

On Fri, Nov 27, 2020 at 1:57 PM Alexander Graf <agraf@csgraf.de> wrote:

>
> On 27.11.20 21:00, Roman Bolshakov wrote:
> > On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
> >> Until now, Hypervisor.framework has only been available on x86_64
> systems.
> >> With Apple Silicon shipping now, it extends its reach to aarch64. To
> >> prepare for support for multiple architectures, let's move common code
> out
> >> into its own accel directory.
> >>
> >> Signed-off-by: Alexander Graf <agraf@csgraf.de>
> >> ---
> >>   MAINTAINERS                 |   9 +-
> >>   accel/hvf/hvf-all.c         |  56 +++++
> >>   accel/hvf/hvf-cpus.c        | 468 ++++++++++++++++++++++++++++++++++++
> >>   accel/hvf/meson.build       |   7 +
> >>   accel/meson.build           |   1 +
> >>   include/sysemu/hvf_int.h    |  69 ++++++
> >>   target/i386/hvf/hvf-cpus.c  | 131 ----------
> >>   target/i386/hvf/hvf-cpus.h  |  25 --
> >>   target/i386/hvf/hvf-i386.h  |  48 +---
> >>   target/i386/hvf/hvf.c       | 360 +--------------------------
> >>   target/i386/hvf/meson.build |   1 -
> >>   target/i386/hvf/x86hvf.c    |  11 +-
> >>   target/i386/hvf/x86hvf.h    |   2 -
> >>   13 files changed, 619 insertions(+), 569 deletions(-)
> >>   create mode 100644 accel/hvf/hvf-all.c
> >>   create mode 100644 accel/hvf/hvf-cpus.c
> >>   create mode 100644 accel/hvf/meson.build
> >>   create mode 100644 include/sysemu/hvf_int.h
> >>   delete mode 100644 target/i386/hvf/hvf-cpus.c
> >>   delete mode 100644 target/i386/hvf/hvf-cpus.h
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index 68bc160f41..ca4b6d9279 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
> >>   M: Roman Bolshakov <r.bolshakov@yadro.com>
> >>   W: https://wiki.qemu.org/Features/HVF
> >>   S: Maintained
> >> -F: accel/stubs/hvf-stub.c
> > There was a patch for that in the RFC series from Claudio.
>
>
> Yeah, I'm not worried about this hunk :).
>
>
> >
> >>   F: target/i386/hvf/
> >> +
> >> +HVF
> >> +M: Cameron Esfahani <dirty@apple.com>
> >> +M: Roman Bolshakov <r.bolshakov@yadro.com>
> >> +W: https://wiki.qemu.org/Features/HVF
> >> +S: Maintained
> >> +F: accel/hvf/
> >>   F: include/sysemu/hvf.h
> >> +F: include/sysemu/hvf_int.h
> >>
> >>   WHPX CPUs
> >>   M: Sunil Muthuswamy <sunilmut@microsoft.com>
> >> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
> >> new file mode 100644
> >> index 0000000000..47d77a472a
> >> --- /dev/null
> >> +++ b/accel/hvf/hvf-all.c
> >> @@ -0,0 +1,56 @@
> >> +/*
> >> + * QEMU Hypervisor.framework support
> >> + *
> >> + * This work is licensed under the terms of the GNU GPL, version 2.
> See
> >> + * the COPYING file in the top-level directory.
> >> + *
> >> + * Contributions after 2012-01-13 are licensed under the terms of the
> >> + * GNU GPL, version 2 or (at your option) any later version.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include "qemu-common.h"
> >> +#include "qemu/error-report.h"
> >> +#include "sysemu/hvf.h"
> >> +#include "sysemu/hvf_int.h"
> >> +#include "sysemu/runstate.h"
> >> +
> >> +#include "qemu/main-loop.h"
> >> +#include "sysemu/accel.h"
> >> +
> >> +#include <Hypervisor/Hypervisor.h>
> >> +
> >> +bool hvf_allowed;
> >> +HVFState *hvf_state;
> >> +
> >> +void assert_hvf_ok(hv_return_t ret)
> >> +{
> >> +    if (ret == HV_SUCCESS) {
> >> +        return;
> >> +    }
> >> +
> >> +    switch (ret) {
> >> +    case HV_ERROR:
> >> +        error_report("Error: HV_ERROR");
> >> +        break;
> >> +    case HV_BUSY:
> >> +        error_report("Error: HV_BUSY");
> >> +        break;
> >> +    case HV_BAD_ARGUMENT:
> >> +        error_report("Error: HV_BAD_ARGUMENT");
> >> +        break;
> >> +    case HV_NO_RESOURCES:
> >> +        error_report("Error: HV_NO_RESOURCES");
> >> +        break;
> >> +    case HV_NO_DEVICE:
> >> +        error_report("Error: HV_NO_DEVICE");
> >> +        break;
> >> +    case HV_UNSUPPORTED:
> >> +        error_report("Error: HV_UNSUPPORTED");
> >> +        break;
> >> +    default:
> >> +        error_report("Unknown Error");
> >> +    }
> >> +
> >> +    abort();
> >> +}
> >> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> >> new file mode 100644
> >> index 0000000000..f9bb5502b7
> >> --- /dev/null
> >> +++ b/accel/hvf/hvf-cpus.c
> >> @@ -0,0 +1,468 @@
> >> +/*
> >> + * Copyright 2008 IBM Corporation
> >> + *           2008 Red Hat, Inc.
> >> + * Copyright 2011 Intel Corporation
> >> + * Copyright 2016 Veertu, Inc.
> >> + * Copyright 2017 The Android Open Source Project
> >> + *
> >> + * QEMU Hypervisor.framework support
> >> + *
> >> + * This program is free software; you can redistribute it and/or
> >> + * modify it under the terms of version 2 of the GNU General Public
> >> + * License as published by the Free Software Foundation.
> >> + *
> >> + * This program is distributed in the hope that it will be useful,
> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> + * General Public License for more details.
> >> + *
> >> + * You should have received a copy of the GNU General Public License
> >> + * along with this program; if not, see <http://www.gnu.org/licenses/
> >.
> >> + *
> >> + * This file contain code under public domain from the hvdos project:
> >> + * https://github.com/mist64/hvdos
> >> + *
> >> + * Parts Copyright (c) 2011 NetApp, Inc.
> >> + * All rights reserved.
> >> + *
> >> + * Redistribution and use in source and binary forms, with or without
> >> + * modification, are permitted provided that the following conditions
> >> + * are met:
> >> + * 1. Redistributions of source code must retain the above copyright
> >> + *    notice, this list of conditions and the following disclaimer.
> >> + * 2. Redistributions in binary form must reproduce the above copyright
> >> + *    notice, this list of conditions and the following disclaimer in
> the
> >> + *    documentation and/or other materials provided with the
> distribution.
> >> + *
> >> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
> >> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
> THE
> >> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
> PURPOSE
> >> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE
> LIABLE
> >> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> CONSEQUENTIAL
> >> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
> GOODS
> >> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
> INTERRUPTION)
> >> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> STRICT
> >> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
> ANY WAY
> >> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
> OF
> >> + * SUCH DAMAGE.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include "qemu/error-report.h"
> >> +#include "qemu/main-loop.h"
> >> +#include "exec/address-spaces.h"
> >> +#include "exec/exec-all.h"
> >> +#include "sysemu/cpus.h"
> >> +#include "sysemu/hvf.h"
> >> +#include "sysemu/hvf_int.h"
> >> +#include "sysemu/runstate.h"
> >> +#include "qemu/guest-random.h"
> >> +
> >> +#include <Hypervisor/Hypervisor.h>
> >> +
> >> +/* Memory slots */
> >> +
> >> +struct mac_slot {
> >> +    int present;
> >> +    uint64_t size;
> >> +    uint64_t gpa_start;
> >> +    uint64_t gva;
> >> +};
> >> +
> >> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
> >> +{
> >> +    hvf_slot *slot;
> >> +    int x;
> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
> >> +        slot = &hvf_state->slots[x];
> >> +        if (slot->size && start < (slot->start + slot->size) &&
> >> +            (start + size) > slot->start) {
> >> +            return slot;
> >> +        }
> >> +    }
> >> +    return NULL;
> >> +}
> >> +
> >> +struct mac_slot mac_slots[32];
> >> +
> >> +static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
> >> +{
> >> +    struct mac_slot *macslot;
> >> +    hv_return_t ret;
> >> +
> >> +    macslot = &mac_slots[slot->slot_id];
> >> +
> >> +    if (macslot->present) {
> >> +        if (macslot->size != slot->size) {
> >> +            macslot->present = 0;
> >> +            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
> >> +            assert_hvf_ok(ret);
> >> +        }
> >> +    }
> >> +
> >> +    if (!slot->size) {
> >> +        return 0;
> >> +    }
> >> +
> >> +    macslot->present = 1;
> >> +    macslot->gpa_start = slot->start;
> >> +    macslot->size = slot->size;
> >> +    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
> >> +    assert_hvf_ok(ret);
> >> +    return 0;
> >> +}
> >> +
> >> +static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
> >> +{
> >> +    hvf_slot *mem;
> >> +    MemoryRegion *area = section->mr;
> >> +    bool writeable = !area->readonly && !area->rom_device;
> >> +    hv_memory_flags_t flags;
> >> +
> >> +    if (!memory_region_is_ram(area)) {
> >> +        if (writeable) {
> >> +            return;
> >> +        } else if (!memory_region_is_romd(area)) {
> >> +            /*
> >> +             * If the memory device is not in romd_mode, then we
> actually want
> >> +             * to remove the hvf memory slot so all accesses will trap.
> >> +             */
> >> +             add = false;
> >> +        }
> >> +    }
> >> +
> >> +    mem = hvf_find_overlap_slot(
> >> +            section->offset_within_address_space,
> >> +            int128_get64(section->size));
> >> +
> >> +    if (mem && add) {
> >> +        if (mem->size == int128_get64(section->size) &&
> >> +            mem->start == section->offset_within_address_space &&
> >> +            mem->mem == (memory_region_get_ram_ptr(area) +
> >> +            section->offset_within_region)) {
> >> +            return; /* Same region was attempted to register, go away.
> */
> >> +        }
> >> +    }
> >> +
> >> +    /* Region needs to be reset. set the size to 0 and remap it. */
> >> +    if (mem) {
> >> +        mem->size = 0;
> >> +        if (do_hvf_set_memory(mem, 0)) {
> >> +            error_report("Failed to reset overlapping slot");
> >> +            abort();
> >> +        }
> >> +    }
> >> +
> >> +    if (!add) {
> >> +        return;
> >> +    }
> >> +
> >> +    if (area->readonly ||
> >> +        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
> >> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
> >> +    } else {
> >> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
> >> +    }
> >> +
> >> +    /* Now make a new slot. */
> >> +    int x;
> >> +
> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
> >> +        mem = &hvf_state->slots[x];
> >> +        if (!mem->size) {
> >> +            break;
> >> +        }
> >> +    }
> >> +
> >> +    if (x == hvf_state->num_slots) {
> >> +        error_report("No free slots");
> >> +        abort();
> >> +    }
> >> +
> >> +    mem->size = int128_get64(section->size);
> >> +    mem->mem = memory_region_get_ram_ptr(area) +
> section->offset_within_region;
> >> +    mem->start = section->offset_within_address_space;
> >> +    mem->region = area;
> >> +
> >> +    if (do_hvf_set_memory(mem, flags)) {
> >> +        error_report("Error registering new memory slot");
> >> +        abort();
> >> +    }
> >> +}
> >> +
> >> +static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool
> on)
> >> +{
> >> +    hvf_slot *slot;
> >> +
> >> +    slot = hvf_find_overlap_slot(
> >> +            section->offset_within_address_space,
> >> +            int128_get64(section->size));
> >> +
> >> +    /* protect region against writes; begin tracking it */
> >> +    if (on) {
> >> +        slot->flags |= HVF_SLOT_LOG;
> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
> >> +                      HV_MEMORY_READ);
> >> +    /* stop tracking region*/
> >> +    } else {
> >> +        slot->flags &= ~HVF_SLOT_LOG;
> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
> >> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
> >> +    }
> >> +}
> >> +
> >> +static void hvf_log_start(MemoryListener *listener,
> >> +                          MemoryRegionSection *section, int old, int
> new)
> >> +{
> >> +    if (old != 0) {
> >> +        return;
> >> +    }
> >> +
> >> +    hvf_set_dirty_tracking(section, 1);
> >> +}
> >> +
> >> +static void hvf_log_stop(MemoryListener *listener,
> >> +                         MemoryRegionSection *section, int old, int
> new)
> >> +{
> >> +    if (new != 0) {
> >> +        return;
> >> +    }
> >> +
> >> +    hvf_set_dirty_tracking(section, 0);
> >> +}
> >> +
> >> +static void hvf_log_sync(MemoryListener *listener,
> >> +                         MemoryRegionSection *section)
> >> +{
> >> +    /*
> >> +     * sync of dirty pages is handled elsewhere; just make sure we keep
> >> +     * tracking the region.
> >> +     */
> >> +    hvf_set_dirty_tracking(section, 1);
> >> +}
> >> +
> >> +static void hvf_region_add(MemoryListener *listener,
> >> +                           MemoryRegionSection *section)
> >> +{
> >> +    hvf_set_phys_mem(section, true);
> >> +}
> >> +
> >> +static void hvf_region_del(MemoryListener *listener,
> >> +                           MemoryRegionSection *section)
> >> +{
> >> +    hvf_set_phys_mem(section, false);
> >> +}
> >> +
> >> +static MemoryListener hvf_memory_listener = {
> >> +    .priority = 10,
> >> +    .region_add = hvf_region_add,
> >> +    .region_del = hvf_region_del,
> >> +    .log_start = hvf_log_start,
> >> +    .log_stop = hvf_log_stop,
> >> +    .log_sync = hvf_log_sync,
> >> +};
> >> +
> >> +static void do_hvf_cpu_synchronize_state(CPUState *cpu,
> run_on_cpu_data arg)
> >> +{
> >> +    if (!cpu->vcpu_dirty) {
> >> +        hvf_get_registers(cpu);
> >> +        cpu->vcpu_dirty = true;
> >> +    }
> >> +}
> >> +
> >> +static void hvf_cpu_synchronize_state(CPUState *cpu)
> >> +{
> >> +    if (!cpu->vcpu_dirty) {
> >> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
> >> +    }
> >> +}
> >> +
> >> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
> >> +                                              run_on_cpu_data arg)
> >> +{
> >> +    hvf_put_registers(cpu);
> >> +    cpu->vcpu_dirty = false;
> >> +}
> >> +
> >> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
> >> +{
> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset,
> RUN_ON_CPU_NULL);
> >> +}
> >> +
> >> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
> >> +                                             run_on_cpu_data arg)
> >> +{
> >> +    hvf_put_registers(cpu);
> >> +    cpu->vcpu_dirty = false;
> >> +}
> >> +
> >> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
> >> +{
> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
> >> +}
> >> +
> >> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
> >> +                                              run_on_cpu_data arg)
> >> +{
> >> +    cpu->vcpu_dirty = true;
> >> +}
> >> +
> >> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
> >> +{
> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm,
> RUN_ON_CPU_NULL);
> >> +}
> >> +
> >> +static void hvf_vcpu_destroy(CPUState *cpu)
> >> +{
> >> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
> >> +    assert_hvf_ok(ret);
> >> +
> >> +    hvf_arch_vcpu_destroy(cpu);
> >> +}
> >> +
> >> +static void dummy_signal(int sig)
> >> +{
> >> +}
> >> +
> >> +static int hvf_init_vcpu(CPUState *cpu)
> >> +{
> >> +    int r;
> >> +
> >> +    /* init cpu signals */
> >> +    sigset_t set;
> >> +    struct sigaction sigact;
> >> +
> >> +    memset(&sigact, 0, sizeof(sigact));
> >> +    sigact.sa_handler = dummy_signal;
> >> +    sigaction(SIG_IPI, &sigact, NULL);
> >> +
> >> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
> >> +    sigdelset(&set, SIG_IPI);
> >> +
> >> +#ifdef __aarch64__
> >> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t
> **)&cpu->hvf_exit, NULL);
> >> +#else
> >> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
> >> +#endif
> > I think the first __aarch64__ bit fits better to arm part of the series.
>
>
> Oops. Thanks for catching it! Yes, absolutely. It should be part of the
> ARM enablement.
>
>
> >
> >> +    cpu->vcpu_dirty = 1;
> >> +    assert_hvf_ok(r);
> >> +
> >> +    return hvf_arch_init_vcpu(cpu);
> >> +}
> >> +
> >> +/*
> >> + * The HVF-specific vCPU thread function. This one should only run
> when the host
> >> + * CPU supports the VMX "unrestricted guest" feature.
> >> + */
> >> +static void *hvf_cpu_thread_fn(void *arg)
> >> +{
> >> +    CPUState *cpu = arg;
> >> +
> >> +    int r;
> >> +
> >> +    assert(hvf_enabled());
> >> +
> >> +    rcu_register_thread();
> >> +
> >> +    qemu_mutex_lock_iothread();
> >> +    qemu_thread_get_self(cpu->thread);
> >> +
> >> +    cpu->thread_id = qemu_get_thread_id();
> >> +    cpu->can_do_io = 1;
> >> +    current_cpu = cpu;
> >> +
> >> +    hvf_init_vcpu(cpu);
> >> +
> >> +    /* signal CPU creation */
> >> +    cpu_thread_signal_created(cpu);
> >> +    qemu_guest_random_seed_thread_part2(cpu->random_seed);
> >> +
> >> +    do {
> >> +        if (cpu_can_run(cpu)) {
> >> +            r = hvf_vcpu_exec(cpu);
> >> +            if (r == EXCP_DEBUG) {
> >> +                cpu_handle_guest_debug(cpu);
> >> +            }
> >> +        }
> >> +        qemu_wait_io_event(cpu);
> >> +    } while (!cpu->unplug || cpu_can_run(cpu));
> >> +
> >> +    hvf_vcpu_destroy(cpu);
> >> +    cpu_thread_signal_destroyed(cpu);
> >> +    qemu_mutex_unlock_iothread();
> >> +    rcu_unregister_thread();
> >> +    return NULL;
> >> +}
> >> +
> >> +static void hvf_start_vcpu_thread(CPUState *cpu)
> >> +{
> >> +    char thread_name[VCPU_THREAD_NAME_SIZE];
> >> +
> >> +    /*
> >> +     * HVF currently does not support TCG, and only runs in
> >> +     * unrestricted-guest mode.
> >> +     */
> >> +    assert(hvf_enabled());
> >> +
> >> +    cpu->thread = g_malloc0(sizeof(QemuThread));
> >> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
> >> +    qemu_cond_init(cpu->halt_cond);
> >> +
> >> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
> >> +             cpu->cpu_index);
> >> +    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
> >> +                       cpu, QEMU_THREAD_JOINABLE);
> >> +}
> >> +
> >> +static const CpusAccel hvf_cpus = {
> >> +    .create_vcpu_thread = hvf_start_vcpu_thread,
> >> +
> >> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
> >> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
> >> +    .synchronize_state = hvf_cpu_synchronize_state,
> >> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
> >> +};
> >> +
> >> +static int hvf_accel_init(MachineState *ms)
> >> +{
> >> +    int x;
> >> +    hv_return_t ret;
> >> +    HVFState *s;
> >> +
> >> +    ret = hv_vm_create(HV_VM_DEFAULT);
> >> +    assert_hvf_ok(ret);
> >> +
> >> +    s = g_new0(HVFState, 1);
> >> +
> >> +    s->num_slots = 32;
> >> +    for (x = 0; x < s->num_slots; ++x) {
> >> +        s->slots[x].size = 0;
> >> +        s->slots[x].slot_id = x;
> >> +    }
> >> +
> >> +    hvf_state = s;
> >> +    memory_listener_register(&hvf_memory_listener,
> &address_space_memory);
> >> +    cpus_register_accel(&hvf_cpus);
> >> +    return 0;
> >> +}
> >> +
> >> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
> >> +{
> >> +    AccelClass *ac = ACCEL_CLASS(oc);
> >> +    ac->name = "HVF";
> >> +    ac->init_machine = hvf_accel_init;
> >> +    ac->allowed = &hvf_allowed;
> >> +}
> >> +
> >> +static const TypeInfo hvf_accel_type = {
> >> +    .name = TYPE_HVF_ACCEL,
> >> +    .parent = TYPE_ACCEL,
> >> +    .class_init = hvf_accel_class_init,
> >> +};
> >> +
> >> +static void hvf_type_init(void)
> >> +{
> >> +    type_register_static(&hvf_accel_type);
> >> +}
> >> +
> >> +type_init(hvf_type_init);
> >> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
> >> new file mode 100644
> >> index 0000000000..dfd6b68dc7
> >> --- /dev/null
> >> +++ b/accel/hvf/meson.build
> >> @@ -0,0 +1,7 @@
> >> +hvf_ss = ss.source_set()
> >> +hvf_ss.add(files(
> >> +  'hvf-all.c',
> >> +  'hvf-cpus.c',
> >> +))
> >> +
> >> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
> >> diff --git a/accel/meson.build b/accel/meson.build
> >> index b26cca227a..6de12ce5d5 100644
> >> --- a/accel/meson.build
> >> +++ b/accel/meson.build
> >> @@ -1,5 +1,6 @@
> >>   softmmu_ss.add(files('accel.c'))
> >>
> >> +subdir('hvf')
> >>   subdir('qtest')
> >>   subdir('kvm')
> >>   subdir('tcg')
> >> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> >> new file mode 100644
> >> index 0000000000..de9bad23a8
> >> --- /dev/null
> >> +++ b/include/sysemu/hvf_int.h
> >> @@ -0,0 +1,69 @@
> >> +/*
> >> + * QEMU Hypervisor.framework (HVF) support
> >> + *
> >> + * This work is licensed under the terms of the GNU GPL, version 2 or
> later.
> >> + * See the COPYING file in the top-level directory.
> >> + *
> >> + */
> >> +
> >> +/* header to be included in HVF-specific code */
> >> +
> >> +#ifndef HVF_INT_H
> >> +#define HVF_INT_H
> >> +
> >> +#include <Hypervisor/Hypervisor.h>
> >> +
> >> +#define HVF_MAX_VCPU 0x10
> >> +
> >> +extern struct hvf_state hvf_global;
> >> +
> >> +struct hvf_vm {
> >> +    int id;
> >> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
> >> +};
> >> +
> >> +struct hvf_state {
> >> +    uint32_t version;
> >> +    struct hvf_vm *vm;
> >> +    uint64_t mem_quota;
> >> +};
> >> +
> >> +/* hvf_slot flags */
> >> +#define HVF_SLOT_LOG (1 << 0)
> >> +
> >> +typedef struct hvf_slot {
> >> +    uint64_t start;
> >> +    uint64_t size;
> >> +    uint8_t *mem;
> >> +    int slot_id;
> >> +    uint32_t flags;
> >> +    MemoryRegion *region;
> >> +} hvf_slot;
> >> +
> >> +typedef struct hvf_vcpu_caps {
> >> +    uint64_t vmx_cap_pinbased;
> >> +    uint64_t vmx_cap_procbased;
> >> +    uint64_t vmx_cap_procbased2;
> >> +    uint64_t vmx_cap_entry;
> >> +    uint64_t vmx_cap_exit;
> >> +    uint64_t vmx_cap_preemption_timer;
> >> +} hvf_vcpu_caps;
> >> +
> >> +struct HVFState {
> >> +    AccelState parent;
> >> +    hvf_slot slots[32];
> >> +    int num_slots;
> >> +
> >> +    hvf_vcpu_caps *hvf_caps;
> >> +};
> >> +extern HVFState *hvf_state;
> >> +
> >> +void assert_hvf_ok(hv_return_t ret);
> >> +int hvf_get_registers(CPUState *cpu);
> >> +int hvf_put_registers(CPUState *cpu);
> >> +int hvf_arch_init_vcpu(CPUState *cpu);
> >> +void hvf_arch_vcpu_destroy(CPUState *cpu);
> >> +int hvf_vcpu_exec(CPUState *cpu);
> >> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
> >> +
> >> +#endif
> >> diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
> >> deleted file mode 100644
> >> index 817b3d7452..0000000000
> >> --- a/target/i386/hvf/hvf-cpus.c
> >> +++ /dev/null
> >> @@ -1,131 +0,0 @@
> >> -/*
> >> - * Copyright 2008 IBM Corporation
> >> - *           2008 Red Hat, Inc.
> >> - * Copyright 2011 Intel Corporation
> >> - * Copyright 2016 Veertu, Inc.
> >> - * Copyright 2017 The Android Open Source Project
> >> - *
> >> - * QEMU Hypervisor.framework support
> >> - *
> >> - * This program is free software; you can redistribute it and/or
> >> - * modify it under the terms of version 2 of the GNU General Public
> >> - * License as published by the Free Software Foundation.
> >> - *
> >> - * This program is distributed in the hope that it will be useful,
> >> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> - * General Public License for more details.
> >> - *
> >> - * You should have received a copy of the GNU General Public License
> >> - * along with this program; if not, see <http://www.gnu.org/licenses/
> >.
> >> - *
> >> - * This file contain code under public domain from the hvdos project:
> >> - * https://github.com/mist64/hvdos
> >> - *
> >> - * Parts Copyright (c) 2011 NetApp, Inc.
> >> - * All rights reserved.
> >> - *
> >> - * Redistribution and use in source and binary forms, with or without
> >> - * modification, are permitted provided that the following conditions
> >> - * are met:
> >> - * 1. Redistributions of source code must retain the above copyright
> >> - *    notice, this list of conditions and the following disclaimer.
> >> - * 2. Redistributions in binary form must reproduce the above copyright
> >> - *    notice, this list of conditions and the following disclaimer in
> the
> >> - *    documentation and/or other materials provided with the
> distribution.
> >> - *
> >> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
> >> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
> THE
> >> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
> PURPOSE
> >> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE
> LIABLE
> >> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> CONSEQUENTIAL
> >> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
> GOODS
> >> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
> INTERRUPTION)
> >> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> STRICT
> >> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
> ANY WAY
> >> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
> OF
> >> - * SUCH DAMAGE.
> >> - */
> >> -
> >> -#include "qemu/osdep.h"
> >> -#include "qemu/error-report.h"
> >> -#include "qemu/main-loop.h"
> >> -#include "sysemu/hvf.h"
> >> -#include "sysemu/runstate.h"
> >> -#include "target/i386/cpu.h"
> >> -#include "qemu/guest-random.h"
> >> -
> >> -#include "hvf-cpus.h"
> >> -
> >> -/*
> >> - * The HVF-specific vCPU thread function. This one should only run
> when the host
> >> - * CPU supports the VMX "unrestricted guest" feature.
> >> - */
> >> -static void *hvf_cpu_thread_fn(void *arg)
> >> -{
> >> -    CPUState *cpu = arg;
> >> -
> >> -    int r;
> >> -
> >> -    assert(hvf_enabled());
> >> -
> >> -    rcu_register_thread();
> >> -
> >> -    qemu_mutex_lock_iothread();
> >> -    qemu_thread_get_self(cpu->thread);
> >> -
> >> -    cpu->thread_id = qemu_get_thread_id();
> >> -    cpu->can_do_io = 1;
> >> -    current_cpu = cpu;
> >> -
> >> -    hvf_init_vcpu(cpu);
> >> -
> >> -    /* signal CPU creation */
> >> -    cpu_thread_signal_created(cpu);
> >> -    qemu_guest_random_seed_thread_part2(cpu->random_seed);
> >> -
> >> -    do {
> >> -        if (cpu_can_run(cpu)) {
> >> -            r = hvf_vcpu_exec(cpu);
> >> -            if (r == EXCP_DEBUG) {
> >> -                cpu_handle_guest_debug(cpu);
> >> -            }
> >> -        }
> >> -        qemu_wait_io_event(cpu);
> >> -    } while (!cpu->unplug || cpu_can_run(cpu));
> >> -
> >> -    hvf_vcpu_destroy(cpu);
> >> -    cpu_thread_signal_destroyed(cpu);
> >> -    qemu_mutex_unlock_iothread();
> >> -    rcu_unregister_thread();
> >> -    return NULL;
> >> -}
> >> -
> >> -static void hvf_start_vcpu_thread(CPUState *cpu)
> >> -{
> >> -    char thread_name[VCPU_THREAD_NAME_SIZE];
> >> -
> >> -    /*
> >> -     * HVF currently does not support TCG, and only runs in
> >> -     * unrestricted-guest mode.
> >> -     */
> >> -    assert(hvf_enabled());
> >> -
> >> -    cpu->thread = g_malloc0(sizeof(QemuThread));
> >> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
> >> -    qemu_cond_init(cpu->halt_cond);
> >> -
> >> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
> >> -             cpu->cpu_index);
> >> -    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
> >> -                       cpu, QEMU_THREAD_JOINABLE);
> >> -}
> >> -
> >> -const CpusAccel hvf_cpus = {
> >> -    .create_vcpu_thread = hvf_start_vcpu_thread,
> >> -
> >> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
> >> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
> >> -    .synchronize_state = hvf_cpu_synchronize_state,
> >> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
> >> -};
> >> diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
> >> deleted file mode 100644
> >> index ced31b82c0..0000000000
> >> --- a/target/i386/hvf/hvf-cpus.h
> >> +++ /dev/null
> >> @@ -1,25 +0,0 @@
> >> -/*
> >> - * Accelerator CPUS Interface
> >> - *
> >> - * Copyright 2020 SUSE LLC
> >> - *
> >> - * This work is licensed under the terms of the GNU GPL, version 2 or
> later.
> >> - * See the COPYING file in the top-level directory.
> >> - */
> >> -
> >> -#ifndef HVF_CPUS_H
> >> -#define HVF_CPUS_H
> >> -
> >> -#include "sysemu/cpus.h"
> >> -
> >> -extern const CpusAccel hvf_cpus;
> >> -
> >> -int hvf_init_vcpu(CPUState *);
> >> -int hvf_vcpu_exec(CPUState *);
> >> -void hvf_cpu_synchronize_state(CPUState *);
> >> -void hvf_cpu_synchronize_post_reset(CPUState *);
> >> -void hvf_cpu_synchronize_post_init(CPUState *);
> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
> >> -void hvf_vcpu_destroy(CPUState *);
> >> -
> >> -#endif /* HVF_CPUS_H */
> >> diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
> >> index e0edffd077..6d56f8f6bb 100644
> >> --- a/target/i386/hvf/hvf-i386.h
> >> +++ b/target/i386/hvf/hvf-i386.h
> >> @@ -18,57 +18,11 @@
> >>
> >>   #include "sysemu/accel.h"
> >>   #include "sysemu/hvf.h"
> >> +#include "sysemu/hvf_int.h"
> >>   #include "cpu.h"
> >>   #include "x86.h"
> >>
> >> -#define HVF_MAX_VCPU 0x10
> >> -
> >> -extern struct hvf_state hvf_global;
> >> -
> >> -struct hvf_vm {
> >> -    int id;
> >> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
> >> -};
> >> -
> >> -struct hvf_state {
> >> -    uint32_t version;
> >> -    struct hvf_vm *vm;
> >> -    uint64_t mem_quota;
> >> -};
> >> -
> >> -/* hvf_slot flags */
> >> -#define HVF_SLOT_LOG (1 << 0)
> >> -
> >> -typedef struct hvf_slot {
> >> -    uint64_t start;
> >> -    uint64_t size;
> >> -    uint8_t *mem;
> >> -    int slot_id;
> >> -    uint32_t flags;
> >> -    MemoryRegion *region;
> >> -} hvf_slot;
> >> -
> >> -typedef struct hvf_vcpu_caps {
> >> -    uint64_t vmx_cap_pinbased;
> >> -    uint64_t vmx_cap_procbased;
> >> -    uint64_t vmx_cap_procbased2;
> >> -    uint64_t vmx_cap_entry;
> >> -    uint64_t vmx_cap_exit;
> >> -    uint64_t vmx_cap_preemption_timer;
> >> -} hvf_vcpu_caps;
> >> -
> >> -struct HVFState {
> >> -    AccelState parent;
> >> -    hvf_slot slots[32];
> >> -    int num_slots;
> >> -
> >> -    hvf_vcpu_caps *hvf_caps;
> >> -};
> >> -extern HVFState *hvf_state;
> >> -
> >> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
> >>   void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
> >> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
> >>
> >>   #ifdef NEED_CPU_H
> >>   /* Functions exported to host specific mode */
> >> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
> >> index ed9356565c..8b96ecd619 100644
> >> --- a/target/i386/hvf/hvf.c
> >> +++ b/target/i386/hvf/hvf.c
> >> @@ -51,6 +51,7 @@
> >>   #include "qemu/error-report.h"
> >>
> >>   #include "sysemu/hvf.h"
> >> +#include "sysemu/hvf_int.h"
> >>   #include "sysemu/runstate.h"
> >>   #include "hvf-i386.h"
> >>   #include "vmcs.h"
> >> @@ -72,171 +73,6 @@
> >>   #include "sysemu/accel.h"
> >>   #include "target/i386/cpu.h"
> >>
> >> -#include "hvf-cpus.h"
> >> -
> >> -HVFState *hvf_state;
> >> -
> >> -static void assert_hvf_ok(hv_return_t ret)
> >> -{
> >> -    if (ret == HV_SUCCESS) {
> >> -        return;
> >> -    }
> >> -
> >> -    switch (ret) {
> >> -    case HV_ERROR:
> >> -        error_report("Error: HV_ERROR");
> >> -        break;
> >> -    case HV_BUSY:
> >> -        error_report("Error: HV_BUSY");
> >> -        break;
> >> -    case HV_BAD_ARGUMENT:
> >> -        error_report("Error: HV_BAD_ARGUMENT");
> >> -        break;
> >> -    case HV_NO_RESOURCES:
> >> -        error_report("Error: HV_NO_RESOURCES");
> >> -        break;
> >> -    case HV_NO_DEVICE:
> >> -        error_report("Error: HV_NO_DEVICE");
> >> -        break;
> >> -    case HV_UNSUPPORTED:
> >> -        error_report("Error: HV_UNSUPPORTED");
> >> -        break;
> >> -    default:
> >> -        error_report("Unknown Error");
> >> -    }
> >> -
> >> -    abort();
> >> -}
> >> -
> >> -/* Memory slots */
> >> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
> >> -{
> >> -    hvf_slot *slot;
> >> -    int x;
> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
> >> -        slot = &hvf_state->slots[x];
> >> -        if (slot->size && start < (slot->start + slot->size) &&
> >> -            (start + size) > slot->start) {
> >> -            return slot;
> >> -        }
> >> -    }
> >> -    return NULL;
> >> -}
> >> -
> >> -struct mac_slot {
> >> -    int present;
> >> -    uint64_t size;
> >> -    uint64_t gpa_start;
> >> -    uint64_t gva;
> >> -};
> >> -
> >> -struct mac_slot mac_slots[32];
> >> -
> >> -static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
> >> -{
> >> -    struct mac_slot *macslot;
> >> -    hv_return_t ret;
> >> -
> >> -    macslot = &mac_slots[slot->slot_id];
> >> -
> >> -    if (macslot->present) {
> >> -        if (macslot->size != slot->size) {
> >> -            macslot->present = 0;
> >> -            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
> >> -            assert_hvf_ok(ret);
> >> -        }
> >> -    }
> >> -
> >> -    if (!slot->size) {
> >> -        return 0;
> >> -    }
> >> -
> >> -    macslot->present = 1;
> >> -    macslot->gpa_start = slot->start;
> >> -    macslot->size = slot->size;
> >> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size,
> flags);
> >> -    assert_hvf_ok(ret);
> >> -    return 0;
> >> -}
> >> -
> >> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
> >> -{
> >> -    hvf_slot *mem;
> >> -    MemoryRegion *area = section->mr;
> >> -    bool writeable = !area->readonly && !area->rom_device;
> >> -    hv_memory_flags_t flags;
> >> -
> >> -    if (!memory_region_is_ram(area)) {
> >> -        if (writeable) {
> >> -            return;
> >> -        } else if (!memory_region_is_romd(area)) {
> >> -            /*
> >> -             * If the memory device is not in romd_mode, then we
> actually want
> >> -             * to remove the hvf memory slot so all accesses will trap.
> >> -             */
> >> -             add = false;
> >> -        }
> >> -    }
> >> -
> >> -    mem = hvf_find_overlap_slot(
> >> -            section->offset_within_address_space,
> >> -            int128_get64(section->size));
> >> -
> >> -    if (mem && add) {
> >> -        if (mem->size == int128_get64(section->size) &&
> >> -            mem->start == section->offset_within_address_space &&
> >> -            mem->mem == (memory_region_get_ram_ptr(area) +
> >> -            section->offset_within_region)) {
> >> -            return; /* Same region was attempted to register, go away.
> */
> >> -        }
> >> -    }
> >> -
> >> -    /* Region needs to be reset. set the size to 0 and remap it. */
> >> -    if (mem) {
> >> -        mem->size = 0;
> >> -        if (do_hvf_set_memory(mem, 0)) {
> >> -            error_report("Failed to reset overlapping slot");
> >> -            abort();
> >> -        }
> >> -    }
> >> -
> >> -    if (!add) {
> >> -        return;
> >> -    }
> >> -
> >> -    if (area->readonly ||
> >> -        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
> >> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
> >> -    } else {
> >> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
> >> -    }
> >> -
> >> -    /* Now make a new slot. */
> >> -    int x;
> >> -
> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
> >> -        mem = &hvf_state->slots[x];
> >> -        if (!mem->size) {
> >> -            break;
> >> -        }
> >> -    }
> >> -
> >> -    if (x == hvf_state->num_slots) {
> >> -        error_report("No free slots");
> >> -        abort();
> >> -    }
> >> -
> >> -    mem->size = int128_get64(section->size);
> >> -    mem->mem = memory_region_get_ram_ptr(area) +
> section->offset_within_region;
> >> -    mem->start = section->offset_within_address_space;
> >> -    mem->region = area;
> >> -
> >> -    if (do_hvf_set_memory(mem, flags)) {
> >> -        error_report("Error registering new memory slot");
> >> -        abort();
> >> -    }
> >> -}
> >> -
> >>   void vmx_update_tpr(CPUState *cpu)
> >>   {
> >>       /* TODO: need integrate APIC handling */
> >> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t
> port, void *buffer,
> >>       }
> >>   }
> >>
> >> -static void do_hvf_cpu_synchronize_state(CPUState *cpu,
> run_on_cpu_data arg)
> >> -{
> >> -    if (!cpu->vcpu_dirty) {
> >> -        hvf_get_registers(cpu);
> >> -        cpu->vcpu_dirty = true;
> >> -    }
> >> -}
> >> -
> >> -void hvf_cpu_synchronize_state(CPUState *cpu)
> >> -{
> >> -    if (!cpu->vcpu_dirty) {
> >> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
> >> -    }
> >> -}
> >> -
> >> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
> >> -                                              run_on_cpu_data arg)
> >> -{
> >> -    hvf_put_registers(cpu);
> >> -    cpu->vcpu_dirty = false;
> >> -}
> >> -
> >> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
> >> -{
> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset,
> RUN_ON_CPU_NULL);
> >> -}
> >> -
> >> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
> >> -                                             run_on_cpu_data arg)
> >> -{
> >> -    hvf_put_registers(cpu);
> >> -    cpu->vcpu_dirty = false;
> >> -}
> >> -
> >> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
> >> -{
> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
> >> -}
> >> -
> >> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
> >> -                                              run_on_cpu_data arg)
> >> -{
> >> -    cpu->vcpu_dirty = true;
> >> -}
> >> -
> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
> >> -{
> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm,
> RUN_ON_CPU_NULL);
> >> -}
> >> -
> >>   static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa,
> uint64_t ept_qual)
> >>   {
> >>       int read, write;
> >> @@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot,
> uint64_t gpa, uint64_t ept_qual)
> >>       return false;
> >>   }
> >>
> >> -static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool
> on)
> >> -{
> >> -    hvf_slot *slot;
> >> -
> >> -    slot = hvf_find_overlap_slot(
> >> -            section->offset_within_address_space,
> >> -            int128_get64(section->size));
> >> -
> >> -    /* protect region against writes; begin tracking it */
> >> -    if (on) {
> >> -        slot->flags |= HVF_SLOT_LOG;
> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
> >> -                      HV_MEMORY_READ);
> >> -    /* stop tracking region*/
> >> -    } else {
> >> -        slot->flags &= ~HVF_SLOT_LOG;
> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
> >> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
> >> -    }
> >> -}
> >> -
> >> -static void hvf_log_start(MemoryListener *listener,
> >> -                          MemoryRegionSection *section, int old, int
> new)
> >> -{
> >> -    if (old != 0) {
> >> -        return;
> >> -    }
> >> -
> >> -    hvf_set_dirty_tracking(section, 1);
> >> -}
> >> -
> >> -static void hvf_log_stop(MemoryListener *listener,
> >> -                         MemoryRegionSection *section, int old, int
> new)
> >> -{
> >> -    if (new != 0) {
> >> -        return;
> >> -    }
> >> -
> >> -    hvf_set_dirty_tracking(section, 0);
> >> -}
> >> -
> >> -static void hvf_log_sync(MemoryListener *listener,
> >> -                         MemoryRegionSection *section)
> >> -{
> >> -    /*
> >> -     * sync of dirty pages is handled elsewhere; just make sure we keep
> >> -     * tracking the region.
> >> -     */
> >> -    hvf_set_dirty_tracking(section, 1);
> >> -}
> >> -
> >> -static void hvf_region_add(MemoryListener *listener,
> >> -                           MemoryRegionSection *section)
> >> -{
> >> -    hvf_set_phys_mem(section, true);
> >> -}
> >> -
> >> -static void hvf_region_del(MemoryListener *listener,
> >> -                           MemoryRegionSection *section)
> >> -{
> >> -    hvf_set_phys_mem(section, false);
> >> -}
> >> -
> >> -static MemoryListener hvf_memory_listener = {
> >> -    .priority = 10,
> >> -    .region_add = hvf_region_add,
> >> -    .region_del = hvf_region_del,
> >> -    .log_start = hvf_log_start,
> >> -    .log_stop = hvf_log_stop,
> >> -    .log_sync = hvf_log_sync,
> >> -};
> >> -
> >> -void hvf_vcpu_destroy(CPUState *cpu)
> >> +void hvf_arch_vcpu_destroy(CPUState *cpu)
> >>   {
> >>       X86CPU *x86_cpu = X86_CPU(cpu);
> >>       CPUX86State *env = &x86_cpu->env;
> >>
> >> -    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
> >>       g_free(env->hvf_mmio_buf);
> >> -    assert_hvf_ok(ret);
> >> -}
> >> -
> >> -static void dummy_signal(int sig)
> >> -{
> >>   }
> >>
> >> -int hvf_init_vcpu(CPUState *cpu)
> >> +int hvf_arch_init_vcpu(CPUState *cpu)
> >>   {
> >>
> >>       X86CPU *x86cpu = X86_CPU(cpu);
> >>       CPUX86State *env = &x86cpu->env;
> >> -    int r;
> >> -
> >> -    /* init cpu signals */
> >> -    sigset_t set;
> >> -    struct sigaction sigact;
> >> -
> >> -    memset(&sigact, 0, sizeof(sigact));
> >> -    sigact.sa_handler = dummy_signal;
> >> -    sigaction(SIG_IPI, &sigact, NULL);
> >> -
> >> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> >> -    sigdelset(&set, SIG_IPI);
> >>
> >>       init_emu();
> >>       init_decoder();
> >> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
> >>       hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
> >>       env->hvf_mmio_buf = g_new(char, 4096);
> >>
> >> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
> >> -    cpu->vcpu_dirty = 1;
> >> -    assert_hvf_ok(r);
> >> -
> >>       if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
> >>           &hvf_state->hvf_caps->vmx_cap_pinbased)) {
> >>           abort();
> >> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>
> >>       return ret;
> >>   }
> >> -
> >> -bool hvf_allowed;
> >> -
> >> -static int hvf_accel_init(MachineState *ms)
> >> -{
> >> -    int x;
> >> -    hv_return_t ret;
> >> -    HVFState *s;
> >> -
> >> -    ret = hv_vm_create(HV_VM_DEFAULT);
> >> -    assert_hvf_ok(ret);
> >> -
> >> -    s = g_new0(HVFState, 1);
> >> -
> >> -    s->num_slots = 32;
> >> -    for (x = 0; x < s->num_slots; ++x) {
> >> -        s->slots[x].size = 0;
> >> -        s->slots[x].slot_id = x;
> >> -    }
> >> -
> >> -    hvf_state = s;
> >> -    memory_listener_register(&hvf_memory_listener,
> &address_space_memory);
> >> -    cpus_register_accel(&hvf_cpus);
> >> -    return 0;
> >> -}
> >> -
> >> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
> >> -{
> >> -    AccelClass *ac = ACCEL_CLASS(oc);
> >> -    ac->name = "HVF";
> >> -    ac->init_machine = hvf_accel_init;
> >> -    ac->allowed = &hvf_allowed;
> >> -}
> >> -
> >> -static const TypeInfo hvf_accel_type = {
> >> -    .name = TYPE_HVF_ACCEL,
> >> -    .parent = TYPE_ACCEL,
> >> -    .class_init = hvf_accel_class_init,
> >> -};
> >> -
> >> -static void hvf_type_init(void)
> >> -{
> >> -    type_register_static(&hvf_accel_type);
> >> -}
> >> -
> >> -type_init(hvf_type_init);
> >> diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
> >> index 409c9a3f14..c8a43717ee 100644
> >> --- a/target/i386/hvf/meson.build
> >> +++ b/target/i386/hvf/meson.build
> >> @@ -1,6 +1,5 @@
> >>   i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
> >>     'hvf.c',
> >> -  'hvf-cpus.c',
> >>     'x86.c',
> >>     'x86_cpuid.c',
> >>     'x86_decode.c',
> >> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
> >> index bbec412b6c..89b8e9d87a 100644
> >> --- a/target/i386/hvf/x86hvf.c
> >> +++ b/target/i386/hvf/x86hvf.c
> >> @@ -20,6 +20,9 @@
> >>   #include "qemu/osdep.h"
> >>
> >>   #include "qemu-common.h"
> >> +#include "sysemu/hvf.h"
> >> +#include "sysemu/hvf_int.h"
> >> +#include "sysemu/hw_accel.h"
> >>   #include "x86hvf.h"
> >>   #include "vmx.h"
> >>   #include "vmcs.h"
> >> @@ -32,8 +35,6 @@
> >>   #include <Hypervisor/hv.h>
> >>   #include <Hypervisor/hv_vmx.h>
> >>
> >> -#include "hvf-cpus.h"
> >> -
> >>   void hvf_set_segment(struct CPUState *cpu, struct vmx_segment
> *vmx_seg,
> >>                        SegmentCache *qseg, bool is_tr)
> >>   {
> >> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
> >>       env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
> >>
> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
> >> -        hvf_cpu_synchronize_state(cpu_state);
> >> +        cpu_synchronize_state(cpu_state);
> >>           do_cpu_init(cpu);
> >>       }
> >>
> >> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
> >>           cpu_state->halted = 0;
> >>       }
> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
> >> -        hvf_cpu_synchronize_state(cpu_state);
> >> +        cpu_synchronize_state(cpu_state);
> >>           do_cpu_sipi(cpu);
> >>       }
> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
> >>           cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
> >> -        hvf_cpu_synchronize_state(cpu_state);
> >> +        cpu_synchronize_state(cpu_state);
> > The changes from hvf_cpu_*() to cpu_*() are cleanup and perhaps should
> > be a separate patch. It follows cpu/accel cleanups Claudio was doing the
> > summer.
>
>
> The only reason they're in here is because we no longer have access to
> the hvf_ functions from the file. I am perfectly happy to rebase the
> patch on top of Claudio's if his goes in first. I'm sure it'll be
> trivial for him to rebase on top of this too if my series goes in first.
>
>
> >
> > Phillipe raised the idea that the patch might go ahead of ARM-specific
> > part (which might involve some discussions) and I agree with that.
> >
> > Some sync between Claudio series (CC'd him) and the patch might be need.
>
>
> I would prefer not to hold back because of the sync. Claudio's cleanup
> is trivial enough to adjust for if it gets merged ahead of this.
>
>
> Alex
>
>
>
>

Re: [PATCH 2/8] hvf: Move common code out

Posted by Frank Yang 4 years, 3 months ago

Update: We're not quite sure how to compare the CNTV_CVAL and CNTVCT. But
the high CPU usage seems to be mitigated by having a poll interval (like
KVM does) in handling WFI:

https://android-review.googlesource.com/c/platform/external/qemu/+/1512501

This is loosely inspired by
https://elixir.bootlin.com/linux/v5.10-rc6/source/virt/kvm/kvm_main.c#L2766
which does seem to specify a poll interval.

It would be cool if we could have a lightweight way to enter sleep and
restart the vcpus precisely when CVAL passes, though.

Frank


On Fri, Nov 27, 2020 at 3:30 PM Frank Yang <lfy@google.com> wrote:

> Hi all,
>
> +Peter Collingbourne <pcc@google.com>
>
> I'm a developer on the Android Emulator, which is in a fork of QEMU.
>
> Peter and I have been working on an HVF Apple Silicon backend with an eye
> toward Android guests.
>
> We have gotten things to basically switch to Android userspace already
> (logcat/shell and graphics available at least)
>
> Our strategy so far has been to import logic from the KVM implementation
> and hook into QEMU's software devices that previously assumed to only work
> with TCG, or have KVM-specific paths.
>
> Thanks to Alexander for the tip on the 36-bit address space limitation
> btw; our way of addressing this is to still allow highmem but not put pci
> high mmio so high.
>
> Also, note we have a sleep/signal based mechanism to deal with WFx, which
> might be worth looking into in Alexander's implementation as well:
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551
>
> Patches so far, FYI:
>
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1513429/1
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512554/3
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512553/3
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512552/3
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551/3
>
>
> https://android.googlesource.com/platform/external/qemu/+/c17eb6a3ffd50047e9646aff6640b710cb8ff48a
>
> https://android.googlesource.com/platform/external/qemu/+/74bed16de1afb41b7a7ab8da1d1861226c9db63b
>
> https://android.googlesource.com/platform/external/qemu/+/eccd9e47ab2ccb9003455e3bb721f57f9ebc3c01
>
> https://android.googlesource.com/platform/external/qemu/+/54fe3d67ed4698e85826537a4f49b2b9074b2228
>
> https://android.googlesource.com/platform/external/qemu/+/82ef91a6fede1d1000f36be037ad4d58fbe0d102
>
> https://android.googlesource.com/platform/external/qemu/+/c28147aa7c74d98b858e99623d2fe46e74a379f6
>
> Peter's also noticed that there are extra steps needed for M1's to allow
> TCG to work, as it involves JIT:
>
>
> https://android.googlesource.com/platform/external/qemu/+/740e3fe47f88926c6bda9abb22ee6eae1bc254a9
>
> We'd appreciate any feedback/comments :)
>
> Best,
>
> Frank
>
> On Fri, Nov 27, 2020 at 1:57 PM Alexander Graf <agraf@csgraf.de> wrote:
>
>>
>> On 27.11.20 21:00, Roman Bolshakov wrote:
>> > On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
>> >> Until now, Hypervisor.framework has only been available on x86_64
>> systems.
>> >> With Apple Silicon shipping now, it extends its reach to aarch64. To
>> >> prepare for support for multiple architectures, let's move common code
>> out
>> >> into its own accel directory.
>> >>
>> >> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>> >> ---
>> >>   MAINTAINERS                 |   9 +-
>> >>   accel/hvf/hvf-all.c         |  56 +++++
>> >>   accel/hvf/hvf-cpus.c        | 468
>> ++++++++++++++++++++++++++++++++++++
>> >>   accel/hvf/meson.build       |   7 +
>> >>   accel/meson.build           |   1 +
>> >>   include/sysemu/hvf_int.h    |  69 ++++++
>> >>   target/i386/hvf/hvf-cpus.c  | 131 ----------
>> >>   target/i386/hvf/hvf-cpus.h  |  25 --
>> >>   target/i386/hvf/hvf-i386.h  |  48 +---
>> >>   target/i386/hvf/hvf.c       | 360 +--------------------------
>> >>   target/i386/hvf/meson.build |   1 -
>> >>   target/i386/hvf/x86hvf.c    |  11 +-
>> >>   target/i386/hvf/x86hvf.h    |   2 -
>> >>   13 files changed, 619 insertions(+), 569 deletions(-)
>> >>   create mode 100644 accel/hvf/hvf-all.c
>> >>   create mode 100644 accel/hvf/hvf-cpus.c
>> >>   create mode 100644 accel/hvf/meson.build
>> >>   create mode 100644 include/sysemu/hvf_int.h
>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.c
>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.h
>> >>
>> >> diff --git a/MAINTAINERS b/MAINTAINERS
>> >> index 68bc160f41..ca4b6d9279 100644
>> >> --- a/MAINTAINERS
>> >> +++ b/MAINTAINERS
>> >> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
>> >>   M: Roman Bolshakov <r.bolshakov@yadro.com>
>> >>   W: https://wiki.qemu.org/Features/HVF
>> >>   S: Maintained
>> >> -F: accel/stubs/hvf-stub.c
>> > There was a patch for that in the RFC series from Claudio.
>>
>>
>> Yeah, I'm not worried about this hunk :).
>>
>>
>> >
>> >>   F: target/i386/hvf/
>> >> +
>> >> +HVF
>> >> +M: Cameron Esfahani <dirty@apple.com>
>> >> +M: Roman Bolshakov <r.bolshakov@yadro.com>
>> >> +W: https://wiki.qemu.org/Features/HVF
>> >> +S: Maintained
>> >> +F: accel/hvf/
>> >>   F: include/sysemu/hvf.h
>> >> +F: include/sysemu/hvf_int.h
>> >>
>> >>   WHPX CPUs
>> >>   M: Sunil Muthuswamy <sunilmut@microsoft.com>
>> >> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
>> >> new file mode 100644
>> >> index 0000000000..47d77a472a
>> >> --- /dev/null
>> >> +++ b/accel/hvf/hvf-all.c
>> >> @@ -0,0 +1,56 @@
>> >> +/*
>> >> + * QEMU Hypervisor.framework support
>> >> + *
>> >> + * This work is licensed under the terms of the GNU GPL, version 2.
>> See
>> >> + * the COPYING file in the top-level directory.
>> >> + *
>> >> + * Contributions after 2012-01-13 are licensed under the terms of the
>> >> + * GNU GPL, version 2 or (at your option) any later version.
>> >> + */
>> >> +
>> >> +#include "qemu/osdep.h"
>> >> +#include "qemu-common.h"
>> >> +#include "qemu/error-report.h"
>> >> +#include "sysemu/hvf.h"
>> >> +#include "sysemu/hvf_int.h"
>> >> +#include "sysemu/runstate.h"
>> >> +
>> >> +#include "qemu/main-loop.h"
>> >> +#include "sysemu/accel.h"
>> >> +
>> >> +#include <Hypervisor/Hypervisor.h>
>> >> +
>> >> +bool hvf_allowed;
>> >> +HVFState *hvf_state;
>> >> +
>> >> +void assert_hvf_ok(hv_return_t ret)
>> >> +{
>> >> +    if (ret == HV_SUCCESS) {
>> >> +        return;
>> >> +    }
>> >> +
>> >> +    switch (ret) {
>> >> +    case HV_ERROR:
>> >> +        error_report("Error: HV_ERROR");
>> >> +        break;
>> >> +    case HV_BUSY:
>> >> +        error_report("Error: HV_BUSY");
>> >> +        break;
>> >> +    case HV_BAD_ARGUMENT:
>> >> +        error_report("Error: HV_BAD_ARGUMENT");
>> >> +        break;
>> >> +    case HV_NO_RESOURCES:
>> >> +        error_report("Error: HV_NO_RESOURCES");
>> >> +        break;
>> >> +    case HV_NO_DEVICE:
>> >> +        error_report("Error: HV_NO_DEVICE");
>> >> +        break;
>> >> +    case HV_UNSUPPORTED:
>> >> +        error_report("Error: HV_UNSUPPORTED");
>> >> +        break;
>> >> +    default:
>> >> +        error_report("Unknown Error");
>> >> +    }
>> >> +
>> >> +    abort();
>> >> +}
>> >> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>> >> new file mode 100644
>> >> index 0000000000..f9bb5502b7
>> >> --- /dev/null
>> >> +++ b/accel/hvf/hvf-cpus.c
>> >> @@ -0,0 +1,468 @@
>> >> +/*
>> >> + * Copyright 2008 IBM Corporation
>> >> + *           2008 Red Hat, Inc.
>> >> + * Copyright 2011 Intel Corporation
>> >> + * Copyright 2016 Veertu, Inc.
>> >> + * Copyright 2017 The Android Open Source Project
>> >> + *
>> >> + * QEMU Hypervisor.framework support
>> >> + *
>> >> + * This program is free software; you can redistribute it and/or
>> >> + * modify it under the terms of version 2 of the GNU General Public
>> >> + * License as published by the Free Software Foundation.
>> >> + *
>> >> + * This program is distributed in the hope that it will be useful,
>> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> >> + * General Public License for more details.
>> >> + *
>> >> + * You should have received a copy of the GNU General Public License
>> >> + * along with this program; if not, see <http://www.gnu.org/licenses/
>> >.
>> >> + *
>> >> + * This file contain code under public domain from the hvdos project:
>> >> + * https://github.com/mist64/hvdos
>> >> + *
>> >> + * Parts Copyright (c) 2011 NetApp, Inc.
>> >> + * All rights reserved.
>> >> + *
>> >> + * Redistribution and use in source and binary forms, with or without
>> >> + * modification, are permitted provided that the following conditions
>> >> + * are met:
>> >> + * 1. Redistributions of source code must retain the above copyright
>> >> + *    notice, this list of conditions and the following disclaimer.
>> >> + * 2. Redistributions in binary form must reproduce the above
>> copyright
>> >> + *    notice, this list of conditions and the following disclaimer in
>> the
>> >> + *    documentation and/or other materials provided with the
>> distribution.
>> >> + *
>> >> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>> >> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
>> THE
>> >> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
>> PURPOSE
>> >> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE
>> LIABLE
>> >> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
>> CONSEQUENTIAL
>> >> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
>> GOODS
>> >> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
>> INTERRUPTION)
>> >> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
>> CONTRACT, STRICT
>> >> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
>> ANY WAY
>> >> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>> POSSIBILITY OF
>> >> + * SUCH DAMAGE.
>> >> + */
>> >> +
>> >> +#include "qemu/osdep.h"
>> >> +#include "qemu/error-report.h"
>> >> +#include "qemu/main-loop.h"
>> >> +#include "exec/address-spaces.h"
>> >> +#include "exec/exec-all.h"
>> >> +#include "sysemu/cpus.h"
>> >> +#include "sysemu/hvf.h"
>> >> +#include "sysemu/hvf_int.h"
>> >> +#include "sysemu/runstate.h"
>> >> +#include "qemu/guest-random.h"
>> >> +
>> >> +#include <Hypervisor/Hypervisor.h>
>> >> +
>> >> +/* Memory slots */
>> >> +
>> >> +struct mac_slot {
>> >> +    int present;
>> >> +    uint64_t size;
>> >> +    uint64_t gpa_start;
>> >> +    uint64_t gva;
>> >> +};
>> >> +
>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>> >> +{
>> >> +    hvf_slot *slot;
>> >> +    int x;
>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>> >> +        slot = &hvf_state->slots[x];
>> >> +        if (slot->size && start < (slot->start + slot->size) &&
>> >> +            (start + size) > slot->start) {
>> >> +            return slot;
>> >> +        }
>> >> +    }
>> >> +    return NULL;
>> >> +}
>> >> +
>> >> +struct mac_slot mac_slots[32];
>> >> +
>> >> +static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>> >> +{
>> >> +    struct mac_slot *macslot;
>> >> +    hv_return_t ret;
>> >> +
>> >> +    macslot = &mac_slots[slot->slot_id];
>> >> +
>> >> +    if (macslot->present) {
>> >> +        if (macslot->size != slot->size) {
>> >> +            macslot->present = 0;
>> >> +            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>> >> +            assert_hvf_ok(ret);
>> >> +        }
>> >> +    }
>> >> +
>> >> +    if (!slot->size) {
>> >> +        return 0;
>> >> +    }
>> >> +
>> >> +    macslot->present = 1;
>> >> +    macslot->gpa_start = slot->start;
>> >> +    macslot->size = slot->size;
>> >> +    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
>> >> +    assert_hvf_ok(ret);
>> >> +    return 0;
>> >> +}
>> >> +
>> >> +static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>> >> +{
>> >> +    hvf_slot *mem;
>> >> +    MemoryRegion *area = section->mr;
>> >> +    bool writeable = !area->readonly && !area->rom_device;
>> >> +    hv_memory_flags_t flags;
>> >> +
>> >> +    if (!memory_region_is_ram(area)) {
>> >> +        if (writeable) {
>> >> +            return;
>> >> +        } else if (!memory_region_is_romd(area)) {
>> >> +            /*
>> >> +             * If the memory device is not in romd_mode, then we
>> actually want
>> >> +             * to remove the hvf memory slot so all accesses will
>> trap.
>> >> +             */
>> >> +             add = false;
>> >> +        }
>> >> +    }
>> >> +
>> >> +    mem = hvf_find_overlap_slot(
>> >> +            section->offset_within_address_space,
>> >> +            int128_get64(section->size));
>> >> +
>> >> +    if (mem && add) {
>> >> +        if (mem->size == int128_get64(section->size) &&
>> >> +            mem->start == section->offset_within_address_space &&
>> >> +            mem->mem == (memory_region_get_ram_ptr(area) +
>> >> +            section->offset_within_region)) {
>> >> +            return; /* Same region was attempted to register, go
>> away. */
>> >> +        }
>> >> +    }
>> >> +
>> >> +    /* Region needs to be reset. set the size to 0 and remap it. */
>> >> +    if (mem) {
>> >> +        mem->size = 0;
>> >> +        if (do_hvf_set_memory(mem, 0)) {
>> >> +            error_report("Failed to reset overlapping slot");
>> >> +            abort();
>> >> +        }
>> >> +    }
>> >> +
>> >> +    if (!add) {
>> >> +        return;
>> >> +    }
>> >> +
>> >> +    if (area->readonly ||
>> >> +        (!memory_region_is_ram(area) && memory_region_is_romd(area)))
>> {
>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>> >> +    } else {
>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>> >> +    }
>> >> +
>> >> +    /* Now make a new slot. */
>> >> +    int x;
>> >> +
>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>> >> +        mem = &hvf_state->slots[x];
>> >> +        if (!mem->size) {
>> >> +            break;
>> >> +        }
>> >> +    }
>> >> +
>> >> +    if (x == hvf_state->num_slots) {
>> >> +        error_report("No free slots");
>> >> +        abort();
>> >> +    }
>> >> +
>> >> +    mem->size = int128_get64(section->size);
>> >> +    mem->mem = memory_region_get_ram_ptr(area) +
>> section->offset_within_region;
>> >> +    mem->start = section->offset_within_address_space;
>> >> +    mem->region = area;
>> >> +
>> >> +    if (do_hvf_set_memory(mem, flags)) {
>> >> +        error_report("Error registering new memory slot");
>> >> +        abort();
>> >> +    }
>> >> +}
>> >> +
>> >> +static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool
>> on)
>> >> +{
>> >> +    hvf_slot *slot;
>> >> +
>> >> +    slot = hvf_find_overlap_slot(
>> >> +            section->offset_within_address_space,
>> >> +            int128_get64(section->size));
>> >> +
>> >> +    /* protect region against writes; begin tracking it */
>> >> +    if (on) {
>> >> +        slot->flags |= HVF_SLOT_LOG;
>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>> >> +                      HV_MEMORY_READ);
>> >> +    /* stop tracking region*/
>> >> +    } else {
>> >> +        slot->flags &= ~HVF_SLOT_LOG;
>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>> >> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>> >> +    }
>> >> +}
>> >> +
>> >> +static void hvf_log_start(MemoryListener *listener,
>> >> +                          MemoryRegionSection *section, int old, int
>> new)
>> >> +{
>> >> +    if (old != 0) {
>> >> +        return;
>> >> +    }
>> >> +
>> >> +    hvf_set_dirty_tracking(section, 1);
>> >> +}
>> >> +
>> >> +static void hvf_log_stop(MemoryListener *listener,
>> >> +                         MemoryRegionSection *section, int old, int
>> new)
>> >> +{
>> >> +    if (new != 0) {
>> >> +        return;
>> >> +    }
>> >> +
>> >> +    hvf_set_dirty_tracking(section, 0);
>> >> +}
>> >> +
>> >> +static void hvf_log_sync(MemoryListener *listener,
>> >> +                         MemoryRegionSection *section)
>> >> +{
>> >> +    /*
>> >> +     * sync of dirty pages is handled elsewhere; just make sure we
>> keep
>> >> +     * tracking the region.
>> >> +     */
>> >> +    hvf_set_dirty_tracking(section, 1);
>> >> +}
>> >> +
>> >> +static void hvf_region_add(MemoryListener *listener,
>> >> +                           MemoryRegionSection *section)
>> >> +{
>> >> +    hvf_set_phys_mem(section, true);
>> >> +}
>> >> +
>> >> +static void hvf_region_del(MemoryListener *listener,
>> >> +                           MemoryRegionSection *section)
>> >> +{
>> >> +    hvf_set_phys_mem(section, false);
>> >> +}
>> >> +
>> >> +static MemoryListener hvf_memory_listener = {
>> >> +    .priority = 10,
>> >> +    .region_add = hvf_region_add,
>> >> +    .region_del = hvf_region_del,
>> >> +    .log_start = hvf_log_start,
>> >> +    .log_stop = hvf_log_stop,
>> >> +    .log_sync = hvf_log_sync,
>> >> +};
>> >> +
>> >> +static void do_hvf_cpu_synchronize_state(CPUState *cpu,
>> run_on_cpu_data arg)
>> >> +{
>> >> +    if (!cpu->vcpu_dirty) {
>> >> +        hvf_get_registers(cpu);
>> >> +        cpu->vcpu_dirty = true;
>> >> +    }
>> >> +}
>> >> +
>> >> +static void hvf_cpu_synchronize_state(CPUState *cpu)
>> >> +{
>> >> +    if (!cpu->vcpu_dirty) {
>> >> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state,
>> RUN_ON_CPU_NULL);
>> >> +    }
>> >> +}
>> >> +
>> >> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>> >> +                                              run_on_cpu_data arg)
>> >> +{
>> >> +    hvf_put_registers(cpu);
>> >> +    cpu->vcpu_dirty = false;
>> >> +}
>> >> +
>> >> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>> >> +{
>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset,
>> RUN_ON_CPU_NULL);
>> >> +}
>> >> +
>> >> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>> >> +                                             run_on_cpu_data arg)
>> >> +{
>> >> +    hvf_put_registers(cpu);
>> >> +    cpu->vcpu_dirty = false;
>> >> +}
>> >> +
>> >> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
>> >> +{
>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init,
>> RUN_ON_CPU_NULL);
>> >> +}
>> >> +
>> >> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>> >> +                                              run_on_cpu_data arg)
>> >> +{
>> >> +    cpu->vcpu_dirty = true;
>> >> +}
>> >> +
>> >> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>> >> +{
>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm,
>> RUN_ON_CPU_NULL);
>> >> +}
>> >> +
>> >> +static void hvf_vcpu_destroy(CPUState *cpu)
>> >> +{
>> >> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
>> >> +    assert_hvf_ok(ret);
>> >> +
>> >> +    hvf_arch_vcpu_destroy(cpu);
>> >> +}
>> >> +
>> >> +static void dummy_signal(int sig)
>> >> +{
>> >> +}
>> >> +
>> >> +static int hvf_init_vcpu(CPUState *cpu)
>> >> +{
>> >> +    int r;
>> >> +
>> >> +    /* init cpu signals */
>> >> +    sigset_t set;
>> >> +    struct sigaction sigact;
>> >> +
>> >> +    memset(&sigact, 0, sizeof(sigact));
>> >> +    sigact.sa_handler = dummy_signal;
>> >> +    sigaction(SIG_IPI, &sigact, NULL);
>> >> +
>> >> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
>> >> +    sigdelset(&set, SIG_IPI);
>> >> +
>> >> +#ifdef __aarch64__
>> >> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t
>> **)&cpu->hvf_exit, NULL);
>> >> +#else
>> >> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>> >> +#endif
>> > I think the first __aarch64__ bit fits better to arm part of the series.
>>
>>
>> Oops. Thanks for catching it! Yes, absolutely. It should be part of the
>> ARM enablement.
>>
>>
>> >
>> >> +    cpu->vcpu_dirty = 1;
>> >> +    assert_hvf_ok(r);
>> >> +
>> >> +    return hvf_arch_init_vcpu(cpu);
>> >> +}
>> >> +
>> >> +/*
>> >> + * The HVF-specific vCPU thread function. This one should only run
>> when the host
>> >> + * CPU supports the VMX "unrestricted guest" feature.
>> >> + */
>> >> +static void *hvf_cpu_thread_fn(void *arg)
>> >> +{
>> >> +    CPUState *cpu = arg;
>> >> +
>> >> +    int r;
>> >> +
>> >> +    assert(hvf_enabled());
>> >> +
>> >> +    rcu_register_thread();
>> >> +
>> >> +    qemu_mutex_lock_iothread();
>> >> +    qemu_thread_get_self(cpu->thread);
>> >> +
>> >> +    cpu->thread_id = qemu_get_thread_id();
>> >> +    cpu->can_do_io = 1;
>> >> +    current_cpu = cpu;
>> >> +
>> >> +    hvf_init_vcpu(cpu);
>> >> +
>> >> +    /* signal CPU creation */
>> >> +    cpu_thread_signal_created(cpu);
>> >> +    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>> >> +
>> >> +    do {
>> >> +        if (cpu_can_run(cpu)) {
>> >> +            r = hvf_vcpu_exec(cpu);
>> >> +            if (r == EXCP_DEBUG) {
>> >> +                cpu_handle_guest_debug(cpu);
>> >> +            }
>> >> +        }
>> >> +        qemu_wait_io_event(cpu);
>> >> +    } while (!cpu->unplug || cpu_can_run(cpu));
>> >> +
>> >> +    hvf_vcpu_destroy(cpu);
>> >> +    cpu_thread_signal_destroyed(cpu);
>> >> +    qemu_mutex_unlock_iothread();
>> >> +    rcu_unregister_thread();
>> >> +    return NULL;
>> >> +}
>> >> +
>> >> +static void hvf_start_vcpu_thread(CPUState *cpu)
>> >> +{
>> >> +    char thread_name[VCPU_THREAD_NAME_SIZE];
>> >> +
>> >> +    /*
>> >> +     * HVF currently does not support TCG, and only runs in
>> >> +     * unrestricted-guest mode.
>> >> +     */
>> >> +    assert(hvf_enabled());
>> >> +
>> >> +    cpu->thread = g_malloc0(sizeof(QemuThread));
>> >> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>> >> +    qemu_cond_init(cpu->halt_cond);
>> >> +
>> >> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>> >> +             cpu->cpu_index);
>> >> +    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>> >> +                       cpu, QEMU_THREAD_JOINABLE);
>> >> +}
>> >> +
>> >> +static const CpusAccel hvf_cpus = {
>> >> +    .create_vcpu_thread = hvf_start_vcpu_thread,
>> >> +
>> >> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>> >> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>> >> +    .synchronize_state = hvf_cpu_synchronize_state,
>> >> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>> >> +};
>> >> +
>> >> +static int hvf_accel_init(MachineState *ms)
>> >> +{
>> >> +    int x;
>> >> +    hv_return_t ret;
>> >> +    HVFState *s;
>> >> +
>> >> +    ret = hv_vm_create(HV_VM_DEFAULT);
>> >> +    assert_hvf_ok(ret);
>> >> +
>> >> +    s = g_new0(HVFState, 1);
>> >> +
>> >> +    s->num_slots = 32;
>> >> +    for (x = 0; x < s->num_slots; ++x) {
>> >> +        s->slots[x].size = 0;
>> >> +        s->slots[x].slot_id = x;
>> >> +    }
>> >> +
>> >> +    hvf_state = s;
>> >> +    memory_listener_register(&hvf_memory_listener,
>> &address_space_memory);
>> >> +    cpus_register_accel(&hvf_cpus);
>> >> +    return 0;
>> >> +}
>> >> +
>> >> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
>> >> +{
>> >> +    AccelClass *ac = ACCEL_CLASS(oc);
>> >> +    ac->name = "HVF";
>> >> +    ac->init_machine = hvf_accel_init;
>> >> +    ac->allowed = &hvf_allowed;
>> >> +}
>> >> +
>> >> +static const TypeInfo hvf_accel_type = {
>> >> +    .name = TYPE_HVF_ACCEL,
>> >> +    .parent = TYPE_ACCEL,
>> >> +    .class_init = hvf_accel_class_init,
>> >> +};
>> >> +
>> >> +static void hvf_type_init(void)
>> >> +{
>> >> +    type_register_static(&hvf_accel_type);
>> >> +}
>> >> +
>> >> +type_init(hvf_type_init);
>> >> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
>> >> new file mode 100644
>> >> index 0000000000..dfd6b68dc7
>> >> --- /dev/null
>> >> +++ b/accel/hvf/meson.build
>> >> @@ -0,0 +1,7 @@
>> >> +hvf_ss = ss.source_set()
>> >> +hvf_ss.add(files(
>> >> +  'hvf-all.c',
>> >> +  'hvf-cpus.c',
>> >> +))
>> >> +
>> >> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
>> >> diff --git a/accel/meson.build b/accel/meson.build
>> >> index b26cca227a..6de12ce5d5 100644
>> >> --- a/accel/meson.build
>> >> +++ b/accel/meson.build
>> >> @@ -1,5 +1,6 @@
>> >>   softmmu_ss.add(files('accel.c'))
>> >>
>> >> +subdir('hvf')
>> >>   subdir('qtest')
>> >>   subdir('kvm')
>> >>   subdir('tcg')
>> >> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>> >> new file mode 100644
>> >> index 0000000000..de9bad23a8
>> >> --- /dev/null
>> >> +++ b/include/sysemu/hvf_int.h
>> >> @@ -0,0 +1,69 @@
>> >> +/*
>> >> + * QEMU Hypervisor.framework (HVF) support
>> >> + *
>> >> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> later.
>> >> + * See the COPYING file in the top-level directory.
>> >> + *
>> >> + */
>> >> +
>> >> +/* header to be included in HVF-specific code */
>> >> +
>> >> +#ifndef HVF_INT_H
>> >> +#define HVF_INT_H
>> >> +
>> >> +#include <Hypervisor/Hypervisor.h>
>> >> +
>> >> +#define HVF_MAX_VCPU 0x10
>> >> +
>> >> +extern struct hvf_state hvf_global;
>> >> +
>> >> +struct hvf_vm {
>> >> +    int id;
>> >> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>> >> +};
>> >> +
>> >> +struct hvf_state {
>> >> +    uint32_t version;
>> >> +    struct hvf_vm *vm;
>> >> +    uint64_t mem_quota;
>> >> +};
>> >> +
>> >> +/* hvf_slot flags */
>> >> +#define HVF_SLOT_LOG (1 << 0)
>> >> +
>> >> +typedef struct hvf_slot {
>> >> +    uint64_t start;
>> >> +    uint64_t size;
>> >> +    uint8_t *mem;
>> >> +    int slot_id;
>> >> +    uint32_t flags;
>> >> +    MemoryRegion *region;
>> >> +} hvf_slot;
>> >> +
>> >> +typedef struct hvf_vcpu_caps {
>> >> +    uint64_t vmx_cap_pinbased;
>> >> +    uint64_t vmx_cap_procbased;
>> >> +    uint64_t vmx_cap_procbased2;
>> >> +    uint64_t vmx_cap_entry;
>> >> +    uint64_t vmx_cap_exit;
>> >> +    uint64_t vmx_cap_preemption_timer;
>> >> +} hvf_vcpu_caps;
>> >> +
>> >> +struct HVFState {
>> >> +    AccelState parent;
>> >> +    hvf_slot slots[32];
>> >> +    int num_slots;
>> >> +
>> >> +    hvf_vcpu_caps *hvf_caps;
>> >> +};
>> >> +extern HVFState *hvf_state;
>> >> +
>> >> +void assert_hvf_ok(hv_return_t ret);
>> >> +int hvf_get_registers(CPUState *cpu);
>> >> +int hvf_put_registers(CPUState *cpu);
>> >> +int hvf_arch_init_vcpu(CPUState *cpu);
>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu);
>> >> +int hvf_vcpu_exec(CPUState *cpu);
>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>> >> +
>> >> +#endif
>> >> diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
>> >> deleted file mode 100644
>> >> index 817b3d7452..0000000000
>> >> --- a/target/i386/hvf/hvf-cpus.c
>> >> +++ /dev/null
>> >> @@ -1,131 +0,0 @@
>> >> -/*
>> >> - * Copyright 2008 IBM Corporation
>> >> - *           2008 Red Hat, Inc.
>> >> - * Copyright 2011 Intel Corporation
>> >> - * Copyright 2016 Veertu, Inc.
>> >> - * Copyright 2017 The Android Open Source Project
>> >> - *
>> >> - * QEMU Hypervisor.framework support
>> >> - *
>> >> - * This program is free software; you can redistribute it and/or
>> >> - * modify it under the terms of version 2 of the GNU General Public
>> >> - * License as published by the Free Software Foundation.
>> >> - *
>> >> - * This program is distributed in the hope that it will be useful,
>> >> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> >> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> >> - * General Public License for more details.
>> >> - *
>> >> - * You should have received a copy of the GNU General Public License
>> >> - * along with this program; if not, see <http://www.gnu.org/licenses/
>> >.
>> >> - *
>> >> - * This file contain code under public domain from the hvdos project:
>> >> - * https://github.com/mist64/hvdos
>> >> - *
>> >> - * Parts Copyright (c) 2011 NetApp, Inc.
>> >> - * All rights reserved.
>> >> - *
>> >> - * Redistribution and use in source and binary forms, with or without
>> >> - * modification, are permitted provided that the following conditions
>> >> - * are met:
>> >> - * 1. Redistributions of source code must retain the above copyright
>> >> - *    notice, this list of conditions and the following disclaimer.
>> >> - * 2. Redistributions in binary form must reproduce the above
>> copyright
>> >> - *    notice, this list of conditions and the following disclaimer in
>> the
>> >> - *    documentation and/or other materials provided with the
>> distribution.
>> >> - *
>> >> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>> >> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
>> THE
>> >> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
>> PURPOSE
>> >> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE
>> LIABLE
>> >> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
>> CONSEQUENTIAL
>> >> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
>> GOODS
>> >> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
>> INTERRUPTION)
>> >> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
>> CONTRACT, STRICT
>> >> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
>> ANY WAY
>> >> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>> POSSIBILITY OF
>> >> - * SUCH DAMAGE.
>> >> - */
>> >> -
>> >> -#include "qemu/osdep.h"
>> >> -#include "qemu/error-report.h"
>> >> -#include "qemu/main-loop.h"
>> >> -#include "sysemu/hvf.h"
>> >> -#include "sysemu/runstate.h"
>> >> -#include "target/i386/cpu.h"
>> >> -#include "qemu/guest-random.h"
>> >> -
>> >> -#include "hvf-cpus.h"
>> >> -
>> >> -/*
>> >> - * The HVF-specific vCPU thread function. This one should only run
>> when the host
>> >> - * CPU supports the VMX "unrestricted guest" feature.
>> >> - */
>> >> -static void *hvf_cpu_thread_fn(void *arg)
>> >> -{
>> >> -    CPUState *cpu = arg;
>> >> -
>> >> -    int r;
>> >> -
>> >> -    assert(hvf_enabled());
>> >> -
>> >> -    rcu_register_thread();
>> >> -
>> >> -    qemu_mutex_lock_iothread();
>> >> -    qemu_thread_get_self(cpu->thread);
>> >> -
>> >> -    cpu->thread_id = qemu_get_thread_id();
>> >> -    cpu->can_do_io = 1;
>> >> -    current_cpu = cpu;
>> >> -
>> >> -    hvf_init_vcpu(cpu);
>> >> -
>> >> -    /* signal CPU creation */
>> >> -    cpu_thread_signal_created(cpu);
>> >> -    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>> >> -
>> >> -    do {
>> >> -        if (cpu_can_run(cpu)) {
>> >> -            r = hvf_vcpu_exec(cpu);
>> >> -            if (r == EXCP_DEBUG) {
>> >> -                cpu_handle_guest_debug(cpu);
>> >> -            }
>> >> -        }
>> >> -        qemu_wait_io_event(cpu);
>> >> -    } while (!cpu->unplug || cpu_can_run(cpu));
>> >> -
>> >> -    hvf_vcpu_destroy(cpu);
>> >> -    cpu_thread_signal_destroyed(cpu);
>> >> -    qemu_mutex_unlock_iothread();
>> >> -    rcu_unregister_thread();
>> >> -    return NULL;
>> >> -}
>> >> -
>> >> -static void hvf_start_vcpu_thread(CPUState *cpu)
>> >> -{
>> >> -    char thread_name[VCPU_THREAD_NAME_SIZE];
>> >> -
>> >> -    /*
>> >> -     * HVF currently does not support TCG, and only runs in
>> >> -     * unrestricted-guest mode.
>> >> -     */
>> >> -    assert(hvf_enabled());
>> >> -
>> >> -    cpu->thread = g_malloc0(sizeof(QemuThread));
>> >> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>> >> -    qemu_cond_init(cpu->halt_cond);
>> >> -
>> >> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>> >> -             cpu->cpu_index);
>> >> -    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>> >> -                       cpu, QEMU_THREAD_JOINABLE);
>> >> -}
>> >> -
>> >> -const CpusAccel hvf_cpus = {
>> >> -    .create_vcpu_thread = hvf_start_vcpu_thread,
>> >> -
>> >> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>> >> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>> >> -    .synchronize_state = hvf_cpu_synchronize_state,
>> >> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>> >> -};
>> >> diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
>> >> deleted file mode 100644
>> >> index ced31b82c0..0000000000
>> >> --- a/target/i386/hvf/hvf-cpus.h
>> >> +++ /dev/null
>> >> @@ -1,25 +0,0 @@
>> >> -/*
>> >> - * Accelerator CPUS Interface
>> >> - *
>> >> - * Copyright 2020 SUSE LLC
>> >> - *
>> >> - * This work is licensed under the terms of the GNU GPL, version 2 or
>> later.
>> >> - * See the COPYING file in the top-level directory.
>> >> - */
>> >> -
>> >> -#ifndef HVF_CPUS_H
>> >> -#define HVF_CPUS_H
>> >> -
>> >> -#include "sysemu/cpus.h"
>> >> -
>> >> -extern const CpusAccel hvf_cpus;
>> >> -
>> >> -int hvf_init_vcpu(CPUState *);
>> >> -int hvf_vcpu_exec(CPUState *);
>> >> -void hvf_cpu_synchronize_state(CPUState *);
>> >> -void hvf_cpu_synchronize_post_reset(CPUState *);
>> >> -void hvf_cpu_synchronize_post_init(CPUState *);
>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
>> >> -void hvf_vcpu_destroy(CPUState *);
>> >> -
>> >> -#endif /* HVF_CPUS_H */
>> >> diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
>> >> index e0edffd077..6d56f8f6bb 100644
>> >> --- a/target/i386/hvf/hvf-i386.h
>> >> +++ b/target/i386/hvf/hvf-i386.h
>> >> @@ -18,57 +18,11 @@
>> >>
>> >>   #include "sysemu/accel.h"
>> >>   #include "sysemu/hvf.h"
>> >> +#include "sysemu/hvf_int.h"
>> >>   #include "cpu.h"
>> >>   #include "x86.h"
>> >>
>> >> -#define HVF_MAX_VCPU 0x10
>> >> -
>> >> -extern struct hvf_state hvf_global;
>> >> -
>> >> -struct hvf_vm {
>> >> -    int id;
>> >> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>> >> -};
>> >> -
>> >> -struct hvf_state {
>> >> -    uint32_t version;
>> >> -    struct hvf_vm *vm;
>> >> -    uint64_t mem_quota;
>> >> -};
>> >> -
>> >> -/* hvf_slot flags */
>> >> -#define HVF_SLOT_LOG (1 << 0)
>> >> -
>> >> -typedef struct hvf_slot {
>> >> -    uint64_t start;
>> >> -    uint64_t size;
>> >> -    uint8_t *mem;
>> >> -    int slot_id;
>> >> -    uint32_t flags;
>> >> -    MemoryRegion *region;
>> >> -} hvf_slot;
>> >> -
>> >> -typedef struct hvf_vcpu_caps {
>> >> -    uint64_t vmx_cap_pinbased;
>> >> -    uint64_t vmx_cap_procbased;
>> >> -    uint64_t vmx_cap_procbased2;
>> >> -    uint64_t vmx_cap_entry;
>> >> -    uint64_t vmx_cap_exit;
>> >> -    uint64_t vmx_cap_preemption_timer;
>> >> -} hvf_vcpu_caps;
>> >> -
>> >> -struct HVFState {
>> >> -    AccelState parent;
>> >> -    hvf_slot slots[32];
>> >> -    int num_slots;
>> >> -
>> >> -    hvf_vcpu_caps *hvf_caps;
>> >> -};
>> >> -extern HVFState *hvf_state;
>> >> -
>> >> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
>> >>   void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>> >>
>> >>   #ifdef NEED_CPU_H
>> >>   /* Functions exported to host specific mode */
>> >> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
>> >> index ed9356565c..8b96ecd619 100644
>> >> --- a/target/i386/hvf/hvf.c
>> >> +++ b/target/i386/hvf/hvf.c
>> >> @@ -51,6 +51,7 @@
>> >>   #include "qemu/error-report.h"
>> >>
>> >>   #include "sysemu/hvf.h"
>> >> +#include "sysemu/hvf_int.h"
>> >>   #include "sysemu/runstate.h"
>> >>   #include "hvf-i386.h"
>> >>   #include "vmcs.h"
>> >> @@ -72,171 +73,6 @@
>> >>   #include "sysemu/accel.h"
>> >>   #include "target/i386/cpu.h"
>> >>
>> >> -#include "hvf-cpus.h"
>> >> -
>> >> -HVFState *hvf_state;
>> >> -
>> >> -static void assert_hvf_ok(hv_return_t ret)
>> >> -{
>> >> -    if (ret == HV_SUCCESS) {
>> >> -        return;
>> >> -    }
>> >> -
>> >> -    switch (ret) {
>> >> -    case HV_ERROR:
>> >> -        error_report("Error: HV_ERROR");
>> >> -        break;
>> >> -    case HV_BUSY:
>> >> -        error_report("Error: HV_BUSY");
>> >> -        break;
>> >> -    case HV_BAD_ARGUMENT:
>> >> -        error_report("Error: HV_BAD_ARGUMENT");
>> >> -        break;
>> >> -    case HV_NO_RESOURCES:
>> >> -        error_report("Error: HV_NO_RESOURCES");
>> >> -        break;
>> >> -    case HV_NO_DEVICE:
>> >> -        error_report("Error: HV_NO_DEVICE");
>> >> -        break;
>> >> -    case HV_UNSUPPORTED:
>> >> -        error_report("Error: HV_UNSUPPORTED");
>> >> -        break;
>> >> -    default:
>> >> -        error_report("Unknown Error");
>> >> -    }
>> >> -
>> >> -    abort();
>> >> -}
>> >> -
>> >> -/* Memory slots */
>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>> >> -{
>> >> -    hvf_slot *slot;
>> >> -    int x;
>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>> >> -        slot = &hvf_state->slots[x];
>> >> -        if (slot->size && start < (slot->start + slot->size) &&
>> >> -            (start + size) > slot->start) {
>> >> -            return slot;
>> >> -        }
>> >> -    }
>> >> -    return NULL;
>> >> -}
>> >> -
>> >> -struct mac_slot {
>> >> -    int present;
>> >> -    uint64_t size;
>> >> -    uint64_t gpa_start;
>> >> -    uint64_t gva;
>> >> -};
>> >> -
>> >> -struct mac_slot mac_slots[32];
>> >> -
>> >> -static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>> >> -{
>> >> -    struct mac_slot *macslot;
>> >> -    hv_return_t ret;
>> >> -
>> >> -    macslot = &mac_slots[slot->slot_id];
>> >> -
>> >> -    if (macslot->present) {
>> >> -        if (macslot->size != slot->size) {
>> >> -            macslot->present = 0;
>> >> -            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>> >> -            assert_hvf_ok(ret);
>> >> -        }
>> >> -    }
>> >> -
>> >> -    if (!slot->size) {
>> >> -        return 0;
>> >> -    }
>> >> -
>> >> -    macslot->present = 1;
>> >> -    macslot->gpa_start = slot->start;
>> >> -    macslot->size = slot->size;
>> >> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size,
>> flags);
>> >> -    assert_hvf_ok(ret);
>> >> -    return 0;
>> >> -}
>> >> -
>> >> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>> >> -{
>> >> -    hvf_slot *mem;
>> >> -    MemoryRegion *area = section->mr;
>> >> -    bool writeable = !area->readonly && !area->rom_device;
>> >> -    hv_memory_flags_t flags;
>> >> -
>> >> -    if (!memory_region_is_ram(area)) {
>> >> -        if (writeable) {
>> >> -            return;
>> >> -        } else if (!memory_region_is_romd(area)) {
>> >> -            /*
>> >> -             * If the memory device is not in romd_mode, then we
>> actually want
>> >> -             * to remove the hvf memory slot so all accesses will
>> trap.
>> >> -             */
>> >> -             add = false;
>> >> -        }
>> >> -    }
>> >> -
>> >> -    mem = hvf_find_overlap_slot(
>> >> -            section->offset_within_address_space,
>> >> -            int128_get64(section->size));
>> >> -
>> >> -    if (mem && add) {
>> >> -        if (mem->size == int128_get64(section->size) &&
>> >> -            mem->start == section->offset_within_address_space &&
>> >> -            mem->mem == (memory_region_get_ram_ptr(area) +
>> >> -            section->offset_within_region)) {
>> >> -            return; /* Same region was attempted to register, go
>> away. */
>> >> -        }
>> >> -    }
>> >> -
>> >> -    /* Region needs to be reset. set the size to 0 and remap it. */
>> >> -    if (mem) {
>> >> -        mem->size = 0;
>> >> -        if (do_hvf_set_memory(mem, 0)) {
>> >> -            error_report("Failed to reset overlapping slot");
>> >> -            abort();
>> >> -        }
>> >> -    }
>> >> -
>> >> -    if (!add) {
>> >> -        return;
>> >> -    }
>> >> -
>> >> -    if (area->readonly ||
>> >> -        (!memory_region_is_ram(area) && memory_region_is_romd(area)))
>> {
>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>> >> -    } else {
>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>> >> -    }
>> >> -
>> >> -    /* Now make a new slot. */
>> >> -    int x;
>> >> -
>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>> >> -        mem = &hvf_state->slots[x];
>> >> -        if (!mem->size) {
>> >> -            break;
>> >> -        }
>> >> -    }
>> >> -
>> >> -    if (x == hvf_state->num_slots) {
>> >> -        error_report("No free slots");
>> >> -        abort();
>> >> -    }
>> >> -
>> >> -    mem->size = int128_get64(section->size);
>> >> -    mem->mem = memory_region_get_ram_ptr(area) +
>> section->offset_within_region;
>> >> -    mem->start = section->offset_within_address_space;
>> >> -    mem->region = area;
>> >> -
>> >> -    if (do_hvf_set_memory(mem, flags)) {
>> >> -        error_report("Error registering new memory slot");
>> >> -        abort();
>> >> -    }
>> >> -}
>> >> -
>> >>   void vmx_update_tpr(CPUState *cpu)
>> >>   {
>> >>       /* TODO: need integrate APIC handling */
>> >> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t
>> port, void *buffer,
>> >>       }
>> >>   }
>> >>
>> >> -static void do_hvf_cpu_synchronize_state(CPUState *cpu,
>> run_on_cpu_data arg)
>> >> -{
>> >> -    if (!cpu->vcpu_dirty) {
>> >> -        hvf_get_registers(cpu);
>> >> -        cpu->vcpu_dirty = true;
>> >> -    }
>> >> -}
>> >> -
>> >> -void hvf_cpu_synchronize_state(CPUState *cpu)
>> >> -{
>> >> -    if (!cpu->vcpu_dirty) {
>> >> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state,
>> RUN_ON_CPU_NULL);
>> >> -    }
>> >> -}
>> >> -
>> >> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>> >> -                                              run_on_cpu_data arg)
>> >> -{
>> >> -    hvf_put_registers(cpu);
>> >> -    cpu->vcpu_dirty = false;
>> >> -}
>> >> -
>> >> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>> >> -{
>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset,
>> RUN_ON_CPU_NULL);
>> >> -}
>> >> -
>> >> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>> >> -                                             run_on_cpu_data arg)
>> >> -{
>> >> -    hvf_put_registers(cpu);
>> >> -    cpu->vcpu_dirty = false;
>> >> -}
>> >> -
>> >> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
>> >> -{
>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init,
>> RUN_ON_CPU_NULL);
>> >> -}
>> >> -
>> >> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>> >> -                                              run_on_cpu_data arg)
>> >> -{
>> >> -    cpu->vcpu_dirty = true;
>> >> -}
>> >> -
>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>> >> -{
>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm,
>> RUN_ON_CPU_NULL);
>> >> -}
>> >> -
>> >>   static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa,
>> uint64_t ept_qual)
>> >>   {
>> >>       int read, write;
>> >> @@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot,
>> uint64_t gpa, uint64_t ept_qual)
>> >>       return false;
>> >>   }
>> >>
>> >> -static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool
>> on)
>> >> -{
>> >> -    hvf_slot *slot;
>> >> -
>> >> -    slot = hvf_find_overlap_slot(
>> >> -            section->offset_within_address_space,
>> >> -            int128_get64(section->size));
>> >> -
>> >> -    /* protect region against writes; begin tracking it */
>> >> -    if (on) {
>> >> -        slot->flags |= HVF_SLOT_LOG;
>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>> >> -                      HV_MEMORY_READ);
>> >> -    /* stop tracking region*/
>> >> -    } else {
>> >> -        slot->flags &= ~HVF_SLOT_LOG;
>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>> >> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>> >> -    }
>> >> -}
>> >> -
>> >> -static void hvf_log_start(MemoryListener *listener,
>> >> -                          MemoryRegionSection *section, int old, int
>> new)
>> >> -{
>> >> -    if (old != 0) {
>> >> -        return;
>> >> -    }
>> >> -
>> >> -    hvf_set_dirty_tracking(section, 1);
>> >> -}
>> >> -
>> >> -static void hvf_log_stop(MemoryListener *listener,
>> >> -                         MemoryRegionSection *section, int old, int
>> new)
>> >> -{
>> >> -    if (new != 0) {
>> >> -        return;
>> >> -    }
>> >> -
>> >> -    hvf_set_dirty_tracking(section, 0);
>> >> -}
>> >> -
>> >> -static void hvf_log_sync(MemoryListener *listener,
>> >> -                         MemoryRegionSection *section)
>> >> -{
>> >> -    /*
>> >> -     * sync of dirty pages is handled elsewhere; just make sure we
>> keep
>> >> -     * tracking the region.
>> >> -     */
>> >> -    hvf_set_dirty_tracking(section, 1);
>> >> -}
>> >> -
>> >> -static void hvf_region_add(MemoryListener *listener,
>> >> -                           MemoryRegionSection *section)
>> >> -{
>> >> -    hvf_set_phys_mem(section, true);
>> >> -}
>> >> -
>> >> -static void hvf_region_del(MemoryListener *listener,
>> >> -                           MemoryRegionSection *section)
>> >> -{
>> >> -    hvf_set_phys_mem(section, false);
>> >> -}
>> >> -
>> >> -static MemoryListener hvf_memory_listener = {
>> >> -    .priority = 10,
>> >> -    .region_add = hvf_region_add,
>> >> -    .region_del = hvf_region_del,
>> >> -    .log_start = hvf_log_start,
>> >> -    .log_stop = hvf_log_stop,
>> >> -    .log_sync = hvf_log_sync,
>> >> -};
>> >> -
>> >> -void hvf_vcpu_destroy(CPUState *cpu)
>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu)
>> >>   {
>> >>       X86CPU *x86_cpu = X86_CPU(cpu);
>> >>       CPUX86State *env = &x86_cpu->env;
>> >>
>> >> -    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
>> >>       g_free(env->hvf_mmio_buf);
>> >> -    assert_hvf_ok(ret);
>> >> -}
>> >> -
>> >> -static void dummy_signal(int sig)
>> >> -{
>> >>   }
>> >>
>> >> -int hvf_init_vcpu(CPUState *cpu)
>> >> +int hvf_arch_init_vcpu(CPUState *cpu)
>> >>   {
>> >>
>> >>       X86CPU *x86cpu = X86_CPU(cpu);
>> >>       CPUX86State *env = &x86cpu->env;
>> >> -    int r;
>> >> -
>> >> -    /* init cpu signals */
>> >> -    sigset_t set;
>> >> -    struct sigaction sigact;
>> >> -
>> >> -    memset(&sigact, 0, sizeof(sigact));
>> >> -    sigact.sa_handler = dummy_signal;
>> >> -    sigaction(SIG_IPI, &sigact, NULL);
>> >> -
>> >> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>> >> -    sigdelset(&set, SIG_IPI);
>> >>
>> >>       init_emu();
>> >>       init_decoder();
>> >> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
>> >>       hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
>> >>       env->hvf_mmio_buf = g_new(char, 4096);
>> >>
>> >> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>> >> -    cpu->vcpu_dirty = 1;
>> >> -    assert_hvf_ok(r);
>> >> -
>> >>       if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
>> >>           &hvf_state->hvf_caps->vmx_cap_pinbased)) {
>> >>           abort();
>> >> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
>> >>
>> >>       return ret;
>> >>   }
>> >> -
>> >> -bool hvf_allowed;
>> >> -
>> >> -static int hvf_accel_init(MachineState *ms)
>> >> -{
>> >> -    int x;
>> >> -    hv_return_t ret;
>> >> -    HVFState *s;
>> >> -
>> >> -    ret = hv_vm_create(HV_VM_DEFAULT);
>> >> -    assert_hvf_ok(ret);
>> >> -
>> >> -    s = g_new0(HVFState, 1);
>> >> -
>> >> -    s->num_slots = 32;
>> >> -    for (x = 0; x < s->num_slots; ++x) {
>> >> -        s->slots[x].size = 0;
>> >> -        s->slots[x].slot_id = x;
>> >> -    }
>> >> -
>> >> -    hvf_state = s;
>> >> -    memory_listener_register(&hvf_memory_listener,
>> &address_space_memory);
>> >> -    cpus_register_accel(&hvf_cpus);
>> >> -    return 0;
>> >> -}
>> >> -
>> >> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
>> >> -{
>> >> -    AccelClass *ac = ACCEL_CLASS(oc);
>> >> -    ac->name = "HVF";
>> >> -    ac->init_machine = hvf_accel_init;
>> >> -    ac->allowed = &hvf_allowed;
>> >> -}
>> >> -
>> >> -static const TypeInfo hvf_accel_type = {
>> >> -    .name = TYPE_HVF_ACCEL,
>> >> -    .parent = TYPE_ACCEL,
>> >> -    .class_init = hvf_accel_class_init,
>> >> -};
>> >> -
>> >> -static void hvf_type_init(void)
>> >> -{
>> >> -    type_register_static(&hvf_accel_type);
>> >> -}
>> >> -
>> >> -type_init(hvf_type_init);
>> >> diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
>> >> index 409c9a3f14..c8a43717ee 100644
>> >> --- a/target/i386/hvf/meson.build
>> >> +++ b/target/i386/hvf/meson.build
>> >> @@ -1,6 +1,5 @@
>> >>   i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
>> >>     'hvf.c',
>> >> -  'hvf-cpus.c',
>> >>     'x86.c',
>> >>     'x86_cpuid.c',
>> >>     'x86_decode.c',
>> >> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
>> >> index bbec412b6c..89b8e9d87a 100644
>> >> --- a/target/i386/hvf/x86hvf.c
>> >> +++ b/target/i386/hvf/x86hvf.c
>> >> @@ -20,6 +20,9 @@
>> >>   #include "qemu/osdep.h"
>> >>
>> >>   #include "qemu-common.h"
>> >> +#include "sysemu/hvf.h"
>> >> +#include "sysemu/hvf_int.h"
>> >> +#include "sysemu/hw_accel.h"
>> >>   #include "x86hvf.h"
>> >>   #include "vmx.h"
>> >>   #include "vmcs.h"
>> >> @@ -32,8 +35,6 @@
>> >>   #include <Hypervisor/hv.h>
>> >>   #include <Hypervisor/hv_vmx.h>
>> >>
>> >> -#include "hvf-cpus.h"
>> >> -
>> >>   void hvf_set_segment(struct CPUState *cpu, struct vmx_segment
>> *vmx_seg,
>> >>                        SegmentCache *qseg, bool is_tr)
>> >>   {
>> >> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
>> >>       env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
>> >>
>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
>> >> -        hvf_cpu_synchronize_state(cpu_state);
>> >> +        cpu_synchronize_state(cpu_state);
>> >>           do_cpu_init(cpu);
>> >>       }
>> >>
>> >> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
>> >>           cpu_state->halted = 0;
>> >>       }
>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
>> >> -        hvf_cpu_synchronize_state(cpu_state);
>> >> +        cpu_synchronize_state(cpu_state);
>> >>           do_cpu_sipi(cpu);
>> >>       }
>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
>> >>           cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
>> >> -        hvf_cpu_synchronize_state(cpu_state);
>> >> +        cpu_synchronize_state(cpu_state);
>> > The changes from hvf_cpu_*() to cpu_*() are cleanup and perhaps should
>> > be a separate patch. It follows cpu/accel cleanups Claudio was doing the
>> > summer.
>>
>>
>> The only reason they're in here is because we no longer have access to
>> the hvf_ functions from the file. I am perfectly happy to rebase the
>> patch on top of Claudio's if his goes in first. I'm sure it'll be
>> trivial for him to rebase on top of this too if my series goes in first.
>>
>>
>> >
>> > Phillipe raised the idea that the patch might go ahead of ARM-specific
>> > part (which might involve some discussions) and I agree with that.
>> >
>> > Some sync between Claudio series (CC'd him) and the patch might be need.
>>
>>
>> I would prefer not to hold back because of the sync. Claudio's cleanup
>> is trivial enough to adjust for if it gets merged ahead of this.
>>
>>
>> Alex
>>
>>
>>
>>

Re: [PATCH 2/8] hvf: Move common code out

Posted by Alexander Graf 4 years, 3 months ago

Hi Frank,

Thanks for the update :). Your previous email nudged me into the right 
direction. I previously had implemented WFI through the internal timer 
framework which performed way worse.

Along the way, I stumbled over a few issues though. For starters, the 
signal mask for SIG_IPI was not set correctly, so while pselect() would 
exit, the signal would never get delivered to the thread! For a fix, 
check out

https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/

Please also have a look at my latest stab at WFI emulation. It doesn't 
handle WFE (that's only relevant in overcommitted scenarios). But it 
does handle WFI and even does something similar to hlt polling, albeit 
not with an adaptive threshold.

Also, is there a particular reason you're working on this super 
interesting and useful code in a random downstream fork of QEMU? 
Wouldn't it be more helpful to contribute to the upstream code base instead?


Alex


On 30.11.20 21:15, Frank Yang wrote:
> Update: We're not quite sure how to compare the CNTV_CVAL and CNTVCT. 
> But the high CPU usage seems to be mitigated by having a poll interval 
> (like KVM does) in handling WFI:
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512501 
> <https://android-review.googlesource.com/c/platform/external/qemu/+/1512501>
>
> This is loosely inspired by 
> https://elixir.bootlin.com/linux/v5.10-rc6/source/virt/kvm/kvm_main.c#L2766 
> <https://elixir.bootlin.com/linux/v5.10-rc6/source/virt/kvm/kvm_main.c#L2766> 
> which does seem to specify a poll interval.
>
> It would be cool if we could have a lightweight way to enter sleep and 
> restart the vcpus precisely when CVAL passes, though.
>
> Frank
>
>
> On Fri, Nov 27, 2020 at 3:30 PM Frank Yang <lfy@google.com 
> <mailto:lfy@google.com>> wrote:
>
>     Hi all,
>
>     +Peter Collingbourne <mailto:pcc@google.com>
>
>     I'm a developer on the Android Emulator, which is in a fork of QEMU.
>
>     Peter and I have been working on an HVF Apple Silicon backend with
>     an eye toward Android guests.
>
>     We have gotten things to basically switch to Android userspace
>     already (logcat/shell and graphics available at least)
>
>     Our strategy so far has been to import logic from the KVM
>     implementation and hook into QEMU's software devices
>     that previously assumed to only work with TCG, or have
>     KVM-specific paths.
>
>     Thanks to Alexander for the tip on the 36-bit address space
>     limitation btw; our way of addressing this is to still allow
>     highmem but not put pci high mmio so high.
>
>     Also, note we have a sleep/signal based mechanism to deal with
>     WFx, which might be worth looking into in Alexander's
>     implementation as well:
>
>     https://android-review.googlesource.com/c/platform/external/qemu/+/1512551
>     <https://android-review.googlesource.com/c/platform/external/qemu/+/1512551>
>
>     Patches so far, FYI:
>
>     https://android-review.googlesource.com/c/platform/external/qemu/+/1513429/1
>     <https://android-review.googlesource.com/c/platform/external/qemu/+/1513429/1>
>     https://android-review.googlesource.com/c/platform/external/qemu/+/1512554/3
>     <https://android-review.googlesource.com/c/platform/external/qemu/+/1512554/3>
>     https://android-review.googlesource.com/c/platform/external/qemu/+/1512553/3
>     <https://android-review.googlesource.com/c/platform/external/qemu/+/1512553/3>
>     https://android-review.googlesource.com/c/platform/external/qemu/+/1512552/3
>     <https://android-review.googlesource.com/c/platform/external/qemu/+/1512552/3>
>     https://android-review.googlesource.com/c/platform/external/qemu/+/1512551/3
>     <https://android-review.googlesource.com/c/platform/external/qemu/+/1512551/3>
>
>     https://android.googlesource.com/platform/external/qemu/+/c17eb6a3ffd50047e9646aff6640b710cb8ff48a
>     <https://android.googlesource.com/platform/external/qemu/+/c17eb6a3ffd50047e9646aff6640b710cb8ff48a>
>     https://android.googlesource.com/platform/external/qemu/+/74bed16de1afb41b7a7ab8da1d1861226c9db63b
>     <https://android.googlesource.com/platform/external/qemu/+/74bed16de1afb41b7a7ab8da1d1861226c9db63b>
>     https://android.googlesource.com/platform/external/qemu/+/eccd9e47ab2ccb9003455e3bb721f57f9ebc3c01
>     <https://android.googlesource.com/platform/external/qemu/+/eccd9e47ab2ccb9003455e3bb721f57f9ebc3c01>
>     https://android.googlesource.com/platform/external/qemu/+/54fe3d67ed4698e85826537a4f49b2b9074b2228
>     <https://android.googlesource.com/platform/external/qemu/+/54fe3d67ed4698e85826537a4f49b2b9074b2228>
>     https://android.googlesource.com/platform/external/qemu/+/82ef91a6fede1d1000f36be037ad4d58fbe0d102
>     <https://android.googlesource.com/platform/external/qemu/+/82ef91a6fede1d1000f36be037ad4d58fbe0d102>
>     https://android.googlesource.com/platform/external/qemu/+/c28147aa7c74d98b858e99623d2fe46e74a379f6
>     <https://android.googlesource.com/platform/external/qemu/+/c28147aa7c74d98b858e99623d2fe46e74a379f6>
>
>     Peter's also noticed that there are extra steps needed for M1's to
>     allow TCG to work, as it involves JIT:
>
>     https://android.googlesource.com/platform/external/qemu/+/740e3fe47f88926c6bda9abb22ee6eae1bc254a9
>     <https://android.googlesource.com/platform/external/qemu/+/740e3fe47f88926c6bda9abb22ee6eae1bc254a9>
>
>     We'd appreciate any feedback/comments :)
>
>     Best,
>
>     Frank
>
>     On Fri, Nov 27, 2020 at 1:57 PM Alexander Graf <agraf@csgraf.de
>     <mailto:agraf@csgraf.de>> wrote:
>
>
>         On 27.11.20 21:00, Roman Bolshakov wrote:
>         > On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
>         >> Until now, Hypervisor.framework has only been available on
>         x86_64 systems.
>         >> With Apple Silicon shipping now, it extends its reach to
>         aarch64. To
>         >> prepare for support for multiple architectures, let's move
>         common code out
>         >> into its own accel directory.
>         >>
>         >> Signed-off-by: Alexander Graf <agraf@csgraf.de
>         <mailto:agraf@csgraf.de>>
>         >> ---
>         >>   MAINTAINERS                 |   9 +-
>         >>   accel/hvf/hvf-all.c         |  56 +++++
>         >>   accel/hvf/hvf-cpus.c        | 468
>         ++++++++++++++++++++++++++++++++++++
>         >>   accel/hvf/meson.build       |   7 +
>         >>   accel/meson.build           |   1 +
>         >>   include/sysemu/hvf_int.h    |  69 ++++++
>         >>   target/i386/hvf/hvf-cpus.c  | 131 ----------
>         >>   target/i386/hvf/hvf-cpus.h  |  25 --
>         >>   target/i386/hvf/hvf-i386.h  |  48 +---
>         >>   target/i386/hvf/hvf.c       | 360 +--------------------------
>         >>   target/i386/hvf/meson.build |   1 -
>         >>   target/i386/hvf/x86hvf.c    |  11 +-
>         >>   target/i386/hvf/x86hvf.h    |   2 -
>         >>   13 files changed, 619 insertions(+), 569 deletions(-)
>         >>   create mode 100644 accel/hvf/hvf-all.c
>         >>   create mode 100644 accel/hvf/hvf-cpus.c
>         >>   create mode 100644 accel/hvf/meson.build
>         >>   create mode 100644 include/sysemu/hvf_int.h
>         >>   delete mode 100644 target/i386/hvf/hvf-cpus.c
>         >>   delete mode 100644 target/i386/hvf/hvf-cpus.h
>         >>
>         >> diff --git a/MAINTAINERS b/MAINTAINERS
>         >> index 68bc160f41..ca4b6d9279 100644
>         >> --- a/MAINTAINERS
>         >> +++ b/MAINTAINERS
>         >> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com
>         <mailto:dirty@apple.com>>
>         >>   M: Roman Bolshakov <r.bolshakov@yadro.com
>         <mailto:r.bolshakov@yadro.com>>
>         >>   W: https://wiki.qemu.org/Features/HVF
>         <https://wiki.qemu.org/Features/HVF>
>         >>   S: Maintained
>         >> -F: accel/stubs/hvf-stub.c
>         > There was a patch for that in the RFC series from Claudio.
>
>
>         Yeah, I'm not worried about this hunk :).
>
>
>         >
>         >>   F: target/i386/hvf/
>         >> +
>         >> +HVF
>         >> +M: Cameron Esfahani <dirty@apple.com <mailto:dirty@apple.com>>
>         >> +M: Roman Bolshakov <r.bolshakov@yadro.com
>         <mailto:r.bolshakov@yadro.com>>
>         >> +W: https://wiki.qemu.org/Features/HVF
>         <https://wiki.qemu.org/Features/HVF>
>         >> +S: Maintained
>         >> +F: accel/hvf/
>         >>   F: include/sysemu/hvf.h
>         >> +F: include/sysemu/hvf_int.h
>         >>
>         >>   WHPX CPUs
>         >>   M: Sunil Muthuswamy <sunilmut@microsoft.com
>         <mailto:sunilmut@microsoft.com>>
>         >> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
>         >> new file mode 100644
>         >> index 0000000000..47d77a472a
>         >> --- /dev/null
>         >> +++ b/accel/hvf/hvf-all.c
>         >> @@ -0,0 +1,56 @@
>         >> +/*
>         >> + * QEMU Hypervisor.framework support
>         >> + *
>         >> + * This work is licensed under the terms of the GNU GPL,
>         version 2.  See
>         >> + * the COPYING file in the top-level directory.
>         >> + *
>         >> + * Contributions after 2012-01-13 are licensed under the
>         terms of the
>         >> + * GNU GPL, version 2 or (at your option) any later version.
>         >> + */
>         >> +
>         >> +#include "qemu/osdep.h"
>         >> +#include "qemu-common.h"
>         >> +#include "qemu/error-report.h"
>         >> +#include "sysemu/hvf.h"
>         >> +#include "sysemu/hvf_int.h"
>         >> +#include "sysemu/runstate.h"
>         >> +
>         >> +#include "qemu/main-loop.h"
>         >> +#include "sysemu/accel.h"
>         >> +
>         >> +#include <Hypervisor/Hypervisor.h>
>         >> +
>         >> +bool hvf_allowed;
>         >> +HVFState *hvf_state;
>         >> +
>         >> +void assert_hvf_ok(hv_return_t ret)
>         >> +{
>         >> +    if (ret == HV_SUCCESS) {
>         >> +        return;
>         >> +    }
>         >> +
>         >> +    switch (ret) {
>         >> +    case HV_ERROR:
>         >> +        error_report("Error: HV_ERROR");
>         >> +        break;
>         >> +    case HV_BUSY:
>         >> +        error_report("Error: HV_BUSY");
>         >> +        break;
>         >> +    case HV_BAD_ARGUMENT:
>         >> +        error_report("Error: HV_BAD_ARGUMENT");
>         >> +        break;
>         >> +    case HV_NO_RESOURCES:
>         >> +        error_report("Error: HV_NO_RESOURCES");
>         >> +        break;
>         >> +    case HV_NO_DEVICE:
>         >> +        error_report("Error: HV_NO_DEVICE");
>         >> +        break;
>         >> +    case HV_UNSUPPORTED:
>         >> +        error_report("Error: HV_UNSUPPORTED");
>         >> +        break;
>         >> +    default:
>         >> +        error_report("Unknown Error");
>         >> +    }
>         >> +
>         >> +    abort();
>         >> +}
>         >> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>         >> new file mode 100644
>         >> index 0000000000..f9bb5502b7
>         >> --- /dev/null
>         >> +++ b/accel/hvf/hvf-cpus.c
>         >> @@ -0,0 +1,468 @@
>         >> +/*
>         >> + * Copyright 2008 IBM Corporation
>         >> + *           2008 Red Hat, Inc.
>         >> + * Copyright 2011 Intel Corporation
>         >> + * Copyright 2016 Veertu, Inc.
>         >> + * Copyright 2017 The Android Open Source Project
>         >> + *
>         >> + * QEMU Hypervisor.framework support
>         >> + *
>         >> + * This program is free software; you can redistribute it
>         and/or
>         >> + * modify it under the terms of version 2 of the GNU
>         General Public
>         >> + * License as published by the Free Software Foundation.
>         >> + *
>         >> + * This program is distributed in the hope that it will be
>         useful,
>         >> + * but WITHOUT ANY WARRANTY; without even the implied
>         warranty of
>         >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
>         See the GNU
>         >> + * General Public License for more details.
>         >> + *
>         >> + * You should have received a copy of the GNU General
>         Public License
>         >> + * along with this program; if not, see
>         <http://www.gnu.org/licenses/ <http://www.gnu.org/licenses/>>.
>         >> + *
>         >> + * This file contain code under public domain from the
>         hvdos project:
>         >> + * https://github.com/mist64/hvdos
>         <https://github.com/mist64/hvdos>
>         >> + *
>         >> + * Parts Copyright (c) 2011 NetApp, Inc.
>         >> + * All rights reserved.
>         >> + *
>         >> + * Redistribution and use in source and binary forms, with
>         or without
>         >> + * modification, are permitted provided that the following
>         conditions
>         >> + * are met:
>         >> + * 1. Redistributions of source code must retain the above
>         copyright
>         >> + *    notice, this list of conditions and the following
>         disclaimer.
>         >> + * 2. Redistributions in binary form must reproduce the
>         above copyright
>         >> + *    notice, this list of conditions and the following
>         disclaimer in the
>         >> + *    documentation and/or other materials provided with
>         the distribution.
>         >> + *
>         >> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>         >> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>         LIMITED TO, THE
>         >> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
>         PARTICULAR PURPOSE
>         >> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR
>         CONTRIBUTORS BE LIABLE
>         >> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
>         EXEMPLARY, OR CONSEQUENTIAL
>         >> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
>         SUBSTITUTE GOODS
>         >> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
>         INTERRUPTION)
>         >> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
>         IN CONTRACT, STRICT
>         >> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
>         ARISING IN ANY WAY
>         >> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>         POSSIBILITY OF
>         >> + * SUCH DAMAGE.
>         >> + */
>         >> +
>         >> +#include "qemu/osdep.h"
>         >> +#include "qemu/error-report.h"
>         >> +#include "qemu/main-loop.h"
>         >> +#include "exec/address-spaces.h"
>         >> +#include "exec/exec-all.h"
>         >> +#include "sysemu/cpus.h"
>         >> +#include "sysemu/hvf.h"
>         >> +#include "sysemu/hvf_int.h"
>         >> +#include "sysemu/runstate.h"
>         >> +#include "qemu/guest-random.h"
>         >> +
>         >> +#include <Hypervisor/Hypervisor.h>
>         >> +
>         >> +/* Memory slots */
>         >> +
>         >> +struct mac_slot {
>         >> +    int present;
>         >> +    uint64_t size;
>         >> +    uint64_t gpa_start;
>         >> +    uint64_t gva;
>         >> +};
>         >> +
>         >> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>         >> +{
>         >> +    hvf_slot *slot;
>         >> +    int x;
>         >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>         >> +        slot = &hvf_state->slots[x];
>         >> +        if (slot->size && start < (slot->start +
>         slot->size) &&
>         >> +            (start + size) > slot->start) {
>         >> +            return slot;
>         >> +        }
>         >> +    }
>         >> +    return NULL;
>         >> +}
>         >> +
>         >> +struct mac_slot mac_slots[32];
>         >> +
>         >> +static int do_hvf_set_memory(hvf_slot *slot,
>         hv_memory_flags_t flags)
>         >> +{
>         >> +    struct mac_slot *macslot;
>         >> +    hv_return_t ret;
>         >> +
>         >> +    macslot = &mac_slots[slot->slot_id];
>         >> +
>         >> +    if (macslot->present) {
>         >> +        if (macslot->size != slot->size) {
>         >> +            macslot->present = 0;
>         >> +            ret = hv_vm_unmap(macslot->gpa_start,
>         macslot->size);
>         >> +            assert_hvf_ok(ret);
>         >> +        }
>         >> +    }
>         >> +
>         >> +    if (!slot->size) {
>         >> +        return 0;
>         >> +    }
>         >> +
>         >> +    macslot->present = 1;
>         >> +    macslot->gpa_start = slot->start;
>         >> +    macslot->size = slot->size;
>         >> +    ret = hv_vm_map(slot->mem, slot->start, slot->size,
>         flags);
>         >> +    assert_hvf_ok(ret);
>         >> +    return 0;
>         >> +}
>         >> +
>         >> +static void hvf_set_phys_mem(MemoryRegionSection *section,
>         bool add)
>         >> +{
>         >> +    hvf_slot *mem;
>         >> +    MemoryRegion *area = section->mr;
>         >> +    bool writeable = !area->readonly && !area->rom_device;
>         >> +    hv_memory_flags_t flags;
>         >> +
>         >> +    if (!memory_region_is_ram(area)) {
>         >> +        if (writeable) {
>         >> +            return;
>         >> +        } else if (!memory_region_is_romd(area)) {
>         >> +            /*
>         >> +             * If the memory device is not in romd_mode,
>         then we actually want
>         >> +             * to remove the hvf memory slot so all
>         accesses will trap.
>         >> +             */
>         >> +             add = false;
>         >> +        }
>         >> +    }
>         >> +
>         >> +    mem = hvf_find_overlap_slot(
>         >> + section->offset_within_address_space,
>         >> +            int128_get64(section->size));
>         >> +
>         >> +    if (mem && add) {
>         >> +        if (mem->size == int128_get64(section->size) &&
>         >> +            mem->start ==
>         section->offset_within_address_space &&
>         >> +            mem->mem == (memory_region_get_ram_ptr(area) +
>         >> +            section->offset_within_region)) {
>         >> +            return; /* Same region was attempted to
>         register, go away. */
>         >> +        }
>         >> +    }
>         >> +
>         >> +    /* Region needs to be reset. set the size to 0 and
>         remap it. */
>         >> +    if (mem) {
>         >> +        mem->size = 0;
>         >> +        if (do_hvf_set_memory(mem, 0)) {
>         >> +            error_report("Failed to reset overlapping slot");
>         >> +            abort();
>         >> +        }
>         >> +    }
>         >> +
>         >> +    if (!add) {
>         >> +        return;
>         >> +    }
>         >> +
>         >> +    if (area->readonly ||
>         >> +        (!memory_region_is_ram(area) &&
>         memory_region_is_romd(area))) {
>         >> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>         >> +    } else {
>         >> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE |
>         HV_MEMORY_EXEC;
>         >> +    }
>         >> +
>         >> +    /* Now make a new slot. */
>         >> +    int x;
>         >> +
>         >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>         >> +        mem = &hvf_state->slots[x];
>         >> +        if (!mem->size) {
>         >> +            break;
>         >> +        }
>         >> +    }
>         >> +
>         >> +    if (x == hvf_state->num_slots) {
>         >> +        error_report("No free slots");
>         >> +        abort();
>         >> +    }
>         >> +
>         >> +    mem->size = int128_get64(section->size);
>         >> +    mem->mem = memory_region_get_ram_ptr(area) +
>         section->offset_within_region;
>         >> +    mem->start = section->offset_within_address_space;
>         >> +    mem->region = area;
>         >> +
>         >> +    if (do_hvf_set_memory(mem, flags)) {
>         >> +        error_report("Error registering new memory slot");
>         >> +        abort();
>         >> +    }
>         >> +}
>         >> +
>         >> +static void hvf_set_dirty_tracking(MemoryRegionSection
>         *section, bool on)
>         >> +{
>         >> +    hvf_slot *slot;
>         >> +
>         >> +    slot = hvf_find_overlap_slot(
>         >> + section->offset_within_address_space,
>         >> +            int128_get64(section->size));
>         >> +
>         >> +    /* protect region against writes; begin tracking it */
>         >> +    if (on) {
>         >> +        slot->flags |= HVF_SLOT_LOG;
>         >> +        hv_vm_protect((uintptr_t)slot->start,
>         (size_t)slot->size,
>         >> +                      HV_MEMORY_READ);
>         >> +    /* stop tracking region*/
>         >> +    } else {
>         >> +        slot->flags &= ~HVF_SLOT_LOG;
>         >> +        hv_vm_protect((uintptr_t)slot->start,
>         (size_t)slot->size,
>         >> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>         >> +    }
>         >> +}
>         >> +
>         >> +static void hvf_log_start(MemoryListener *listener,
>         >> +                          MemoryRegionSection *section,
>         int old, int new)
>         >> +{
>         >> +    if (old != 0) {
>         >> +        return;
>         >> +    }
>         >> +
>         >> +    hvf_set_dirty_tracking(section, 1);
>         >> +}
>         >> +
>         >> +static void hvf_log_stop(MemoryListener *listener,
>         >> +                         MemoryRegionSection *section, int
>         old, int new)
>         >> +{
>         >> +    if (new != 0) {
>         >> +        return;
>         >> +    }
>         >> +
>         >> +    hvf_set_dirty_tracking(section, 0);
>         >> +}
>         >> +
>         >> +static void hvf_log_sync(MemoryListener *listener,
>         >> +                         MemoryRegionSection *section)
>         >> +{
>         >> +    /*
>         >> +     * sync of dirty pages is handled elsewhere; just make
>         sure we keep
>         >> +     * tracking the region.
>         >> +     */
>         >> +    hvf_set_dirty_tracking(section, 1);
>         >> +}
>         >> +
>         >> +static void hvf_region_add(MemoryListener *listener,
>         >> +                           MemoryRegionSection *section)
>         >> +{
>         >> +    hvf_set_phys_mem(section, true);
>         >> +}
>         >> +
>         >> +static void hvf_region_del(MemoryListener *listener,
>         >> +                           MemoryRegionSection *section)
>         >> +{
>         >> +    hvf_set_phys_mem(section, false);
>         >> +}
>         >> +
>         >> +static MemoryListener hvf_memory_listener = {
>         >> +    .priority = 10,
>         >> +    .region_add = hvf_region_add,
>         >> +    .region_del = hvf_region_del,
>         >> +    .log_start = hvf_log_start,
>         >> +    .log_stop = hvf_log_stop,
>         >> +    .log_sync = hvf_log_sync,
>         >> +};
>         >> +
>         >> +static void do_hvf_cpu_synchronize_state(CPUState *cpu,
>         run_on_cpu_data arg)
>         >> +{
>         >> +    if (!cpu->vcpu_dirty) {
>         >> +        hvf_get_registers(cpu);
>         >> +        cpu->vcpu_dirty = true;
>         >> +    }
>         >> +}
>         >> +
>         >> +static void hvf_cpu_synchronize_state(CPUState *cpu)
>         >> +{
>         >> +    if (!cpu->vcpu_dirty) {
>         >> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state,
>         RUN_ON_CPU_NULL);
>         >> +    }
>         >> +}
>         >> +
>         >> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>         >> + run_on_cpu_data arg)
>         >> +{
>         >> +    hvf_put_registers(cpu);
>         >> +    cpu->vcpu_dirty = false;
>         >> +}
>         >> +
>         >> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>         >> +{
>         >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset,
>         RUN_ON_CPU_NULL);
>         >> +}
>         >> +
>         >> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>         >> +  run_on_cpu_data arg)
>         >> +{
>         >> +    hvf_put_registers(cpu);
>         >> +    cpu->vcpu_dirty = false;
>         >> +}
>         >> +
>         >> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
>         >> +{
>         >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init,
>         RUN_ON_CPU_NULL);
>         >> +}
>         >> +
>         >> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>         >> + run_on_cpu_data arg)
>         >> +{
>         >> +    cpu->vcpu_dirty = true;
>         >> +}
>         >> +
>         >> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>         >> +{
>         >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm,
>         RUN_ON_CPU_NULL);
>         >> +}
>         >> +
>         >> +static void hvf_vcpu_destroy(CPUState *cpu)
>         >> +{
>         >> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
>         >> +    assert_hvf_ok(ret);
>         >> +
>         >> +    hvf_arch_vcpu_destroy(cpu);
>         >> +}
>         >> +
>         >> +static void dummy_signal(int sig)
>         >> +{
>         >> +}
>         >> +
>         >> +static int hvf_init_vcpu(CPUState *cpu)
>         >> +{
>         >> +    int r;
>         >> +
>         >> +    /* init cpu signals */
>         >> +    sigset_t set;
>         >> +    struct sigaction sigact;
>         >> +
>         >> +    memset(&sigact, 0, sizeof(sigact));
>         >> +    sigact.sa_handler = dummy_signal;
>         >> +    sigaction(SIG_IPI, &sigact, NULL);
>         >> +
>         >> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
>         >> +    sigdelset(&set, SIG_IPI);
>         >> +
>         >> +#ifdef __aarch64__
>         >> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t
>         **)&cpu->hvf_exit, NULL);
>         >> +#else
>         >> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd,
>         HV_VCPU_DEFAULT);
>         >> +#endif
>         > I think the first __aarch64__ bit fits better to arm part of
>         the series.
>
>
>         Oops. Thanks for catching it! Yes, absolutely. It should be
>         part of the
>         ARM enablement.
>
>
>         >
>         >> +    cpu->vcpu_dirty = 1;
>         >> +    assert_hvf_ok(r);
>         >> +
>         >> +    return hvf_arch_init_vcpu(cpu);
>         >> +}
>         >> +
>         >> +/*
>         >> + * The HVF-specific vCPU thread function. This one should
>         only run when the host
>         >> + * CPU supports the VMX "unrestricted guest" feature.
>         >> + */
>         >> +static void *hvf_cpu_thread_fn(void *arg)
>         >> +{
>         >> +    CPUState *cpu = arg;
>         >> +
>         >> +    int r;
>         >> +
>         >> +    assert(hvf_enabled());
>         >> +
>         >> +    rcu_register_thread();
>         >> +
>         >> +    qemu_mutex_lock_iothread();
>         >> +    qemu_thread_get_self(cpu->thread);
>         >> +
>         >> +    cpu->thread_id = qemu_get_thread_id();
>         >> +    cpu->can_do_io = 1;
>         >> +    current_cpu = cpu;
>         >> +
>         >> +    hvf_init_vcpu(cpu);
>         >> +
>         >> +    /* signal CPU creation */
>         >> +    cpu_thread_signal_created(cpu);
>         >> + qemu_guest_random_seed_thread_part2(cpu->random_seed);
>         >> +
>         >> +    do {
>         >> +        if (cpu_can_run(cpu)) {
>         >> +            r = hvf_vcpu_exec(cpu);
>         >> +            if (r == EXCP_DEBUG) {
>         >> +                cpu_handle_guest_debug(cpu);
>         >> +            }
>         >> +        }
>         >> +        qemu_wait_io_event(cpu);
>         >> +    } while (!cpu->unplug || cpu_can_run(cpu));
>         >> +
>         >> +    hvf_vcpu_destroy(cpu);
>         >> +    cpu_thread_signal_destroyed(cpu);
>         >> +    qemu_mutex_unlock_iothread();
>         >> +    rcu_unregister_thread();
>         >> +    return NULL;
>         >> +}
>         >> +
>         >> +static void hvf_start_vcpu_thread(CPUState *cpu)
>         >> +{
>         >> +    char thread_name[VCPU_THREAD_NAME_SIZE];
>         >> +
>         >> +    /*
>         >> +     * HVF currently does not support TCG, and only runs in
>         >> +     * unrestricted-guest mode.
>         >> +     */
>         >> +    assert(hvf_enabled());
>         >> +
>         >> +    cpu->thread = g_malloc0(sizeof(QemuThread));
>         >> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>         >> +    qemu_cond_init(cpu->halt_cond);
>         >> +
>         >> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>         >> +             cpu->cpu_index);
>         >> +    qemu_thread_create(cpu->thread, thread_name,
>         hvf_cpu_thread_fn,
>         >> +                       cpu, QEMU_THREAD_JOINABLE);
>         >> +}
>         >> +
>         >> +static const CpusAccel hvf_cpus = {
>         >> +    .create_vcpu_thread = hvf_start_vcpu_thread,
>         >> +
>         >> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>         >> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>         >> +    .synchronize_state = hvf_cpu_synchronize_state,
>         >> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>         >> +};
>         >> +
>         >> +static int hvf_accel_init(MachineState *ms)
>         >> +{
>         >> +    int x;
>         >> +    hv_return_t ret;
>         >> +    HVFState *s;
>         >> +
>         >> +    ret = hv_vm_create(HV_VM_DEFAULT);
>         >> +    assert_hvf_ok(ret);
>         >> +
>         >> +    s = g_new0(HVFState, 1);
>         >> +
>         >> +    s->num_slots = 32;
>         >> +    for (x = 0; x < s->num_slots; ++x) {
>         >> +        s->slots[x].size = 0;
>         >> +        s->slots[x].slot_id = x;
>         >> +    }
>         >> +
>         >> +    hvf_state = s;
>         >> + memory_listener_register(&hvf_memory_listener,
>         &address_space_memory);
>         >> +    cpus_register_accel(&hvf_cpus);
>         >> +    return 0;
>         >> +}
>         >> +
>         >> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
>         >> +{
>         >> +    AccelClass *ac = ACCEL_CLASS(oc);
>         >> +    ac->name = "HVF";
>         >> +    ac->init_machine = hvf_accel_init;
>         >> +    ac->allowed = &hvf_allowed;
>         >> +}
>         >> +
>         >> +static const TypeInfo hvf_accel_type = {
>         >> +    .name = TYPE_HVF_ACCEL,
>         >> +    .parent = TYPE_ACCEL,
>         >> +    .class_init = hvf_accel_class_init,
>         >> +};
>         >> +
>         >> +static void hvf_type_init(void)
>         >> +{
>         >> +    type_register_static(&hvf_accel_type);
>         >> +}
>         >> +
>         >> +type_init(hvf_type_init);
>         >> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
>         >> new file mode 100644
>         >> index 0000000000..dfd6b68dc7
>         >> --- /dev/null
>         >> +++ b/accel/hvf/meson.build
>         >> @@ -0,0 +1,7 @@
>         >> +hvf_ss = ss.source_set()
>         >> +hvf_ss.add(files(
>         >> +  'hvf-all.c',
>         >> +  'hvf-cpus.c',
>         >> +))
>         >> +
>         >> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
>         >> diff --git a/accel/meson.build b/accel/meson.build
>         >> index b26cca227a..6de12ce5d5 100644
>         >> --- a/accel/meson.build
>         >> +++ b/accel/meson.build
>         >> @@ -1,5 +1,6 @@
>         >>   softmmu_ss.add(files('accel.c'))
>         >>
>         >> +subdir('hvf')
>         >>   subdir('qtest')
>         >>   subdir('kvm')
>         >>   subdir('tcg')
>         >> diff --git a/include/sysemu/hvf_int.h
>         b/include/sysemu/hvf_int.h
>         >> new file mode 100644
>         >> index 0000000000..de9bad23a8
>         >> --- /dev/null
>         >> +++ b/include/sysemu/hvf_int.h
>         >> @@ -0,0 +1,69 @@
>         >> +/*
>         >> + * QEMU Hypervisor.framework (HVF) support
>         >> + *
>         >> + * This work is licensed under the terms of the GNU GPL,
>         version 2 or later.
>         >> + * See the COPYING file in the top-level directory.
>         >> + *
>         >> + */
>         >> +
>         >> +/* header to be included in HVF-specific code */
>         >> +
>         >> +#ifndef HVF_INT_H
>         >> +#define HVF_INT_H
>         >> +
>         >> +#include <Hypervisor/Hypervisor.h>
>         >> +
>         >> +#define HVF_MAX_VCPU 0x10
>         >> +
>         >> +extern struct hvf_state hvf_global;
>         >> +
>         >> +struct hvf_vm {
>         >> +    int id;
>         >> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>         >> +};
>         >> +
>         >> +struct hvf_state {
>         >> +    uint32_t version;
>         >> +    struct hvf_vm *vm;
>         >> +    uint64_t mem_quota;
>         >> +};
>         >> +
>         >> +/* hvf_slot flags */
>         >> +#define HVF_SLOT_LOG (1 << 0)
>         >> +
>         >> +typedef struct hvf_slot {
>         >> +    uint64_t start;
>         >> +    uint64_t size;
>         >> +    uint8_t *mem;
>         >> +    int slot_id;
>         >> +    uint32_t flags;
>         >> +    MemoryRegion *region;
>         >> +} hvf_slot;
>         >> +
>         >> +typedef struct hvf_vcpu_caps {
>         >> +    uint64_t vmx_cap_pinbased;
>         >> +    uint64_t vmx_cap_procbased;
>         >> +    uint64_t vmx_cap_procbased2;
>         >> +    uint64_t vmx_cap_entry;
>         >> +    uint64_t vmx_cap_exit;
>         >> +    uint64_t vmx_cap_preemption_timer;
>         >> +} hvf_vcpu_caps;
>         >> +
>         >> +struct HVFState {
>         >> +    AccelState parent;
>         >> +    hvf_slot slots[32];
>         >> +    int num_slots;
>         >> +
>         >> +    hvf_vcpu_caps *hvf_caps;
>         >> +};
>         >> +extern HVFState *hvf_state;
>         >> +
>         >> +void assert_hvf_ok(hv_return_t ret);
>         >> +int hvf_get_registers(CPUState *cpu);
>         >> +int hvf_put_registers(CPUState *cpu);
>         >> +int hvf_arch_init_vcpu(CPUState *cpu);
>         >> +void hvf_arch_vcpu_destroy(CPUState *cpu);
>         >> +int hvf_vcpu_exec(CPUState *cpu);
>         >> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>         >> +
>         >> +#endif
>         >> diff --git a/target/i386/hvf/hvf-cpus.c
>         b/target/i386/hvf/hvf-cpus.c
>         >> deleted file mode 100644
>         >> index 817b3d7452..0000000000
>         >> --- a/target/i386/hvf/hvf-cpus.c
>         >> +++ /dev/null
>         >> @@ -1,131 +0,0 @@
>         >> -/*
>         >> - * Copyright 2008 IBM Corporation
>         >> - *           2008 Red Hat, Inc.
>         >> - * Copyright 2011 Intel Corporation
>         >> - * Copyright 2016 Veertu, Inc.
>         >> - * Copyright 2017 The Android Open Source Project
>         >> - *
>         >> - * QEMU Hypervisor.framework support
>         >> - *
>         >> - * This program is free software; you can redistribute it
>         and/or
>         >> - * modify it under the terms of version 2 of the GNU
>         General Public
>         >> - * License as published by the Free Software Foundation.
>         >> - *
>         >> - * This program is distributed in the hope that it will be
>         useful,
>         >> - * but WITHOUT ANY WARRANTY; without even the implied
>         warranty of
>         >> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
>         See the GNU
>         >> - * General Public License for more details.
>         >> - *
>         >> - * You should have received a copy of the GNU General
>         Public License
>         >> - * along with this program; if not, see
>         <http://www.gnu.org/licenses/ <http://www.gnu.org/licenses/>>.
>         >> - *
>         >> - * This file contain code under public domain from the
>         hvdos project:
>         >> - * https://github.com/mist64/hvdos
>         <https://github.com/mist64/hvdos>
>         >> - *
>         >> - * Parts Copyright (c) 2011 NetApp, Inc.
>         >> - * All rights reserved.
>         >> - *
>         >> - * Redistribution and use in source and binary forms, with
>         or without
>         >> - * modification, are permitted provided that the following
>         conditions
>         >> - * are met:
>         >> - * 1. Redistributions of source code must retain the above
>         copyright
>         >> - *    notice, this list of conditions and the following
>         disclaimer.
>         >> - * 2. Redistributions in binary form must reproduce the
>         above copyright
>         >> - *    notice, this list of conditions and the following
>         disclaimer in the
>         >> - *    documentation and/or other materials provided with
>         the distribution.
>         >> - *
>         >> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>         >> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>         LIMITED TO, THE
>         >> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
>         PARTICULAR PURPOSE
>         >> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR
>         CONTRIBUTORS BE LIABLE
>         >> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
>         EXEMPLARY, OR CONSEQUENTIAL
>         >> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
>         SUBSTITUTE GOODS
>         >> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
>         INTERRUPTION)
>         >> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
>         IN CONTRACT, STRICT
>         >> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
>         ARISING IN ANY WAY
>         >> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>         POSSIBILITY OF
>         >> - * SUCH DAMAGE.
>         >> - */
>         >> -
>         >> -#include "qemu/osdep.h"
>         >> -#include "qemu/error-report.h"
>         >> -#include "qemu/main-loop.h"
>         >> -#include "sysemu/hvf.h"
>         >> -#include "sysemu/runstate.h"
>         >> -#include "target/i386/cpu.h"
>         >> -#include "qemu/guest-random.h"
>         >> -
>         >> -#include "hvf-cpus.h"
>         >> -
>         >> -/*
>         >> - * The HVF-specific vCPU thread function. This one should
>         only run when the host
>         >> - * CPU supports the VMX "unrestricted guest" feature.
>         >> - */
>         >> -static void *hvf_cpu_thread_fn(void *arg)
>         >> -{
>         >> -    CPUState *cpu = arg;
>         >> -
>         >> -    int r;
>         >> -
>         >> -    assert(hvf_enabled());
>         >> -
>         >> -    rcu_register_thread();
>         >> -
>         >> -    qemu_mutex_lock_iothread();
>         >> -    qemu_thread_get_self(cpu->thread);
>         >> -
>         >> -    cpu->thread_id = qemu_get_thread_id();
>         >> -    cpu->can_do_io = 1;
>         >> -    current_cpu = cpu;
>         >> -
>         >> -    hvf_init_vcpu(cpu);
>         >> -
>         >> -    /* signal CPU creation */
>         >> -    cpu_thread_signal_created(cpu);
>         >> - qemu_guest_random_seed_thread_part2(cpu->random_seed);
>         >> -
>         >> -    do {
>         >> -        if (cpu_can_run(cpu)) {
>         >> -            r = hvf_vcpu_exec(cpu);
>         >> -            if (r == EXCP_DEBUG) {
>         >> -                cpu_handle_guest_debug(cpu);
>         >> -            }
>         >> -        }
>         >> -        qemu_wait_io_event(cpu);
>         >> -    } while (!cpu->unplug || cpu_can_run(cpu));
>         >> -
>         >> -    hvf_vcpu_destroy(cpu);
>         >> -    cpu_thread_signal_destroyed(cpu);
>         >> -    qemu_mutex_unlock_iothread();
>         >> -    rcu_unregister_thread();
>         >> -    return NULL;
>         >> -}
>         >> -
>         >> -static void hvf_start_vcpu_thread(CPUState *cpu)
>         >> -{
>         >> -    char thread_name[VCPU_THREAD_NAME_SIZE];
>         >> -
>         >> -    /*
>         >> -     * HVF currently does not support TCG, and only runs in
>         >> -     * unrestricted-guest mode.
>         >> -     */
>         >> -    assert(hvf_enabled());
>         >> -
>         >> -    cpu->thread = g_malloc0(sizeof(QemuThread));
>         >> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>         >> -    qemu_cond_init(cpu->halt_cond);
>         >> -
>         >> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>         >> -             cpu->cpu_index);
>         >> -    qemu_thread_create(cpu->thread, thread_name,
>         hvf_cpu_thread_fn,
>         >> -                       cpu, QEMU_THREAD_JOINABLE);
>         >> -}
>         >> -
>         >> -const CpusAccel hvf_cpus = {
>         >> -    .create_vcpu_thread = hvf_start_vcpu_thread,
>         >> -
>         >> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>         >> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>         >> -    .synchronize_state = hvf_cpu_synchronize_state,
>         >> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>         >> -};
>         >> diff --git a/target/i386/hvf/hvf-cpus.h
>         b/target/i386/hvf/hvf-cpus.h
>         >> deleted file mode 100644
>         >> index ced31b82c0..0000000000
>         >> --- a/target/i386/hvf/hvf-cpus.h
>         >> +++ /dev/null
>         >> @@ -1,25 +0,0 @@
>         >> -/*
>         >> - * Accelerator CPUS Interface
>         >> - *
>         >> - * Copyright 2020 SUSE LLC
>         >> - *
>         >> - * This work is licensed under the terms of the GNU GPL,
>         version 2 or later.
>         >> - * See the COPYING file in the top-level directory.
>         >> - */
>         >> -
>         >> -#ifndef HVF_CPUS_H
>         >> -#define HVF_CPUS_H
>         >> -
>         >> -#include "sysemu/cpus.h"
>         >> -
>         >> -extern const CpusAccel hvf_cpus;
>         >> -
>         >> -int hvf_init_vcpu(CPUState *);
>         >> -int hvf_vcpu_exec(CPUState *);
>         >> -void hvf_cpu_synchronize_state(CPUState *);
>         >> -void hvf_cpu_synchronize_post_reset(CPUState *);
>         >> -void hvf_cpu_synchronize_post_init(CPUState *);
>         >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
>         >> -void hvf_vcpu_destroy(CPUState *);
>         >> -
>         >> -#endif /* HVF_CPUS_H */
>         >> diff --git a/target/i386/hvf/hvf-i386.h
>         b/target/i386/hvf/hvf-i386.h
>         >> index e0edffd077..6d56f8f6bb 100644
>         >> --- a/target/i386/hvf/hvf-i386.h
>         >> +++ b/target/i386/hvf/hvf-i386.h
>         >> @@ -18,57 +18,11 @@
>         >>
>         >>   #include "sysemu/accel.h"
>         >>   #include "sysemu/hvf.h"
>         >> +#include "sysemu/hvf_int.h"
>         >>   #include "cpu.h"
>         >>   #include "x86.h"
>         >>
>         >> -#define HVF_MAX_VCPU 0x10
>         >> -
>         >> -extern struct hvf_state hvf_global;
>         >> -
>         >> -struct hvf_vm {
>         >> -    int id;
>         >> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>         >> -};
>         >> -
>         >> -struct hvf_state {
>         >> -    uint32_t version;
>         >> -    struct hvf_vm *vm;
>         >> -    uint64_t mem_quota;
>         >> -};
>         >> -
>         >> -/* hvf_slot flags */
>         >> -#define HVF_SLOT_LOG (1 << 0)
>         >> -
>         >> -typedef struct hvf_slot {
>         >> -    uint64_t start;
>         >> -    uint64_t size;
>         >> -    uint8_t *mem;
>         >> -    int slot_id;
>         >> -    uint32_t flags;
>         >> -    MemoryRegion *region;
>         >> -} hvf_slot;
>         >> -
>         >> -typedef struct hvf_vcpu_caps {
>         >> -    uint64_t vmx_cap_pinbased;
>         >> -    uint64_t vmx_cap_procbased;
>         >> -    uint64_t vmx_cap_procbased2;
>         >> -    uint64_t vmx_cap_entry;
>         >> -    uint64_t vmx_cap_exit;
>         >> -    uint64_t vmx_cap_preemption_timer;
>         >> -} hvf_vcpu_caps;
>         >> -
>         >> -struct HVFState {
>         >> -    AccelState parent;
>         >> -    hvf_slot slots[32];
>         >> -    int num_slots;
>         >> -
>         >> -    hvf_vcpu_caps *hvf_caps;
>         >> -};
>         >> -extern HVFState *hvf_state;
>         >> -
>         >> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
>         >>   void hvf_handle_io(CPUArchState *, uint16_t, void *, int,
>         int, int);
>         >> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>         >>
>         >>   #ifdef NEED_CPU_H
>         >>   /* Functions exported to host specific mode */
>         >> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
>         >> index ed9356565c..8b96ecd619 100644
>         >> --- a/target/i386/hvf/hvf.c
>         >> +++ b/target/i386/hvf/hvf.c
>         >> @@ -51,6 +51,7 @@
>         >>   #include "qemu/error-report.h"
>         >>
>         >>   #include "sysemu/hvf.h"
>         >> +#include "sysemu/hvf_int.h"
>         >>   #include "sysemu/runstate.h"
>         >>   #include "hvf-i386.h"
>         >>   #include "vmcs.h"
>         >> @@ -72,171 +73,6 @@
>         >>   #include "sysemu/accel.h"
>         >>   #include "target/i386/cpu.h"
>         >>
>         >> -#include "hvf-cpus.h"
>         >> -
>         >> -HVFState *hvf_state;
>         >> -
>         >> -static void assert_hvf_ok(hv_return_t ret)
>         >> -{
>         >> -    if (ret == HV_SUCCESS) {
>         >> -        return;
>         >> -    }
>         >> -
>         >> -    switch (ret) {
>         >> -    case HV_ERROR:
>         >> -        error_report("Error: HV_ERROR");
>         >> -        break;
>         >> -    case HV_BUSY:
>         >> -        error_report("Error: HV_BUSY");
>         >> -        break;
>         >> -    case HV_BAD_ARGUMENT:
>         >> -        error_report("Error: HV_BAD_ARGUMENT");
>         >> -        break;
>         >> -    case HV_NO_RESOURCES:
>         >> -        error_report("Error: HV_NO_RESOURCES");
>         >> -        break;
>         >> -    case HV_NO_DEVICE:
>         >> -        error_report("Error: HV_NO_DEVICE");
>         >> -        break;
>         >> -    case HV_UNSUPPORTED:
>         >> -        error_report("Error: HV_UNSUPPORTED");
>         >> -        break;
>         >> -    default:
>         >> -        error_report("Unknown Error");
>         >> -    }
>         >> -
>         >> -    abort();
>         >> -}
>         >> -
>         >> -/* Memory slots */
>         >> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>         >> -{
>         >> -    hvf_slot *slot;
>         >> -    int x;
>         >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>         >> -        slot = &hvf_state->slots[x];
>         >> -        if (slot->size && start < (slot->start +
>         slot->size) &&
>         >> -            (start + size) > slot->start) {
>         >> -            return slot;
>         >> -        }
>         >> -    }
>         >> -    return NULL;
>         >> -}
>         >> -
>         >> -struct mac_slot {
>         >> -    int present;
>         >> -    uint64_t size;
>         >> -    uint64_t gpa_start;
>         >> -    uint64_t gva;
>         >> -};
>         >> -
>         >> -struct mac_slot mac_slots[32];
>         >> -
>         >> -static int do_hvf_set_memory(hvf_slot *slot,
>         hv_memory_flags_t flags)
>         >> -{
>         >> -    struct mac_slot *macslot;
>         >> -    hv_return_t ret;
>         >> -
>         >> -    macslot = &mac_slots[slot->slot_id];
>         >> -
>         >> -    if (macslot->present) {
>         >> -        if (macslot->size != slot->size) {
>         >> -            macslot->present = 0;
>         >> -            ret = hv_vm_unmap(macslot->gpa_start,
>         macslot->size);
>         >> -            assert_hvf_ok(ret);
>         >> -        }
>         >> -    }
>         >> -
>         >> -    if (!slot->size) {
>         >> -        return 0;
>         >> -    }
>         >> -
>         >> -    macslot->present = 1;
>         >> -    macslot->gpa_start = slot->start;
>         >> -    macslot->size = slot->size;
>         >> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start,
>         slot->size, flags);
>         >> -    assert_hvf_ok(ret);
>         >> -    return 0;
>         >> -}
>         >> -
>         >> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>         >> -{
>         >> -    hvf_slot *mem;
>         >> -    MemoryRegion *area = section->mr;
>         >> -    bool writeable = !area->readonly && !area->rom_device;
>         >> -    hv_memory_flags_t flags;
>         >> -
>         >> -    if (!memory_region_is_ram(area)) {
>         >> -        if (writeable) {
>         >> -            return;
>         >> -        } else if (!memory_region_is_romd(area)) {
>         >> -            /*
>         >> -             * If the memory device is not in romd_mode,
>         then we actually want
>         >> -             * to remove the hvf memory slot so all
>         accesses will trap.
>         >> -             */
>         >> -             add = false;
>         >> -        }
>         >> -    }
>         >> -
>         >> -    mem = hvf_find_overlap_slot(
>         >> - section->offset_within_address_space,
>         >> -            int128_get64(section->size));
>         >> -
>         >> -    if (mem && add) {
>         >> -        if (mem->size == int128_get64(section->size) &&
>         >> -            mem->start ==
>         section->offset_within_address_space &&
>         >> -            mem->mem == (memory_region_get_ram_ptr(area) +
>         >> -            section->offset_within_region)) {
>         >> -            return; /* Same region was attempted to
>         register, go away. */
>         >> -        }
>         >> -    }
>         >> -
>         >> -    /* Region needs to be reset. set the size to 0 and
>         remap it. */
>         >> -    if (mem) {
>         >> -        mem->size = 0;
>         >> -        if (do_hvf_set_memory(mem, 0)) {
>         >> -            error_report("Failed to reset overlapping slot");
>         >> -            abort();
>         >> -        }
>         >> -    }
>         >> -
>         >> -    if (!add) {
>         >> -        return;
>         >> -    }
>         >> -
>         >> -    if (area->readonly ||
>         >> -        (!memory_region_is_ram(area) &&
>         memory_region_is_romd(area))) {
>         >> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>         >> -    } else {
>         >> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE |
>         HV_MEMORY_EXEC;
>         >> -    }
>         >> -
>         >> -    /* Now make a new slot. */
>         >> -    int x;
>         >> -
>         >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>         >> -        mem = &hvf_state->slots[x];
>         >> -        if (!mem->size) {
>         >> -            break;
>         >> -        }
>         >> -    }
>         >> -
>         >> -    if (x == hvf_state->num_slots) {
>         >> -        error_report("No free slots");
>         >> -        abort();
>         >> -    }
>         >> -
>         >> -    mem->size = int128_get64(section->size);
>         >> -    mem->mem = memory_region_get_ram_ptr(area) +
>         section->offset_within_region;
>         >> -    mem->start = section->offset_within_address_space;
>         >> -    mem->region = area;
>         >> -
>         >> -    if (do_hvf_set_memory(mem, flags)) {
>         >> -        error_report("Error registering new memory slot");
>         >> -        abort();
>         >> -    }
>         >> -}
>         >> -
>         >>   void vmx_update_tpr(CPUState *cpu)
>         >>   {
>         >>       /* TODO: need integrate APIC handling */
>         >> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env,
>         uint16_t port, void *buffer,
>         >>       }
>         >>   }
>         >>
>         >> -static void do_hvf_cpu_synchronize_state(CPUState *cpu,
>         run_on_cpu_data arg)
>         >> -{
>         >> -    if (!cpu->vcpu_dirty) {
>         >> -        hvf_get_registers(cpu);
>         >> -        cpu->vcpu_dirty = true;
>         >> -    }
>         >> -}
>         >> -
>         >> -void hvf_cpu_synchronize_state(CPUState *cpu)
>         >> -{
>         >> -    if (!cpu->vcpu_dirty) {
>         >> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state,
>         RUN_ON_CPU_NULL);
>         >> -    }
>         >> -}
>         >> -
>         >> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>         >> - run_on_cpu_data arg)
>         >> -{
>         >> -    hvf_put_registers(cpu);
>         >> -    cpu->vcpu_dirty = false;
>         >> -}
>         >> -
>         >> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>         >> -{
>         >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset,
>         RUN_ON_CPU_NULL);
>         >> -}
>         >> -
>         >> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>         >> -  run_on_cpu_data arg)
>         >> -{
>         >> -    hvf_put_registers(cpu);
>         >> -    cpu->vcpu_dirty = false;
>         >> -}
>         >> -
>         >> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
>         >> -{
>         >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init,
>         RUN_ON_CPU_NULL);
>         >> -}
>         >> -
>         >> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>         >> - run_on_cpu_data arg)
>         >> -{
>         >> -    cpu->vcpu_dirty = true;
>         >> -}
>         >> -
>         >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>         >> -{
>         >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm,
>         RUN_ON_CPU_NULL);
>         >> -}
>         >> -
>         >>   static bool ept_emulation_fault(hvf_slot *slot, uint64_t
>         gpa, uint64_t ept_qual)
>         >>   {
>         >>       int read, write;
>         >> @@ -370,109 +156,19 @@ static bool
>         ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t
>         ept_qual)
>         >>       return false;
>         >>   }
>         >>
>         >> -static void hvf_set_dirty_tracking(MemoryRegionSection
>         *section, bool on)
>         >> -{
>         >> -    hvf_slot *slot;
>         >> -
>         >> -    slot = hvf_find_overlap_slot(
>         >> - section->offset_within_address_space,
>         >> -            int128_get64(section->size));
>         >> -
>         >> -    /* protect region against writes; begin tracking it */
>         >> -    if (on) {
>         >> -        slot->flags |= HVF_SLOT_LOG;
>         >> - hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>         >> -                      HV_MEMORY_READ);
>         >> -    /* stop tracking region*/
>         >> -    } else {
>         >> -        slot->flags &= ~HVF_SLOT_LOG;
>         >> - hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>         >> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>         >> -    }
>         >> -}
>         >> -
>         >> -static void hvf_log_start(MemoryListener *listener,
>         >> -                          MemoryRegionSection *section,
>         int old, int new)
>         >> -{
>         >> -    if (old != 0) {
>         >> -        return;
>         >> -    }
>         >> -
>         >> -    hvf_set_dirty_tracking(section, 1);
>         >> -}
>         >> -
>         >> -static void hvf_log_stop(MemoryListener *listener,
>         >> -                         MemoryRegionSection *section, int
>         old, int new)
>         >> -{
>         >> -    if (new != 0) {
>         >> -        return;
>         >> -    }
>         >> -
>         >> -    hvf_set_dirty_tracking(section, 0);
>         >> -}
>         >> -
>         >> -static void hvf_log_sync(MemoryListener *listener,
>         >> -                         MemoryRegionSection *section)
>         >> -{
>         >> -    /*
>         >> -     * sync of dirty pages is handled elsewhere; just make
>         sure we keep
>         >> -     * tracking the region.
>         >> -     */
>         >> -    hvf_set_dirty_tracking(section, 1);
>         >> -}
>         >> -
>         >> -static void hvf_region_add(MemoryListener *listener,
>         >> -                           MemoryRegionSection *section)
>         >> -{
>         >> -    hvf_set_phys_mem(section, true);
>         >> -}
>         >> -
>         >> -static void hvf_region_del(MemoryListener *listener,
>         >> -                           MemoryRegionSection *section)
>         >> -{
>         >> -    hvf_set_phys_mem(section, false);
>         >> -}
>         >> -
>         >> -static MemoryListener hvf_memory_listener = {
>         >> -    .priority = 10,
>         >> -    .region_add = hvf_region_add,
>         >> -    .region_del = hvf_region_del,
>         >> -    .log_start = hvf_log_start,
>         >> -    .log_stop = hvf_log_stop,
>         >> -    .log_sync = hvf_log_sync,
>         >> -};
>         >> -
>         >> -void hvf_vcpu_destroy(CPUState *cpu)
>         >> +void hvf_arch_vcpu_destroy(CPUState *cpu)
>         >>   {
>         >>       X86CPU *x86_cpu = X86_CPU(cpu);
>         >>       CPUX86State *env = &x86_cpu->env;
>         >>
>         >> -    hv_return_t ret =
>         hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
>         >>       g_free(env->hvf_mmio_buf);
>         >> -    assert_hvf_ok(ret);
>         >> -}
>         >> -
>         >> -static void dummy_signal(int sig)
>         >> -{
>         >>   }
>         >>
>         >> -int hvf_init_vcpu(CPUState *cpu)
>         >> +int hvf_arch_init_vcpu(CPUState *cpu)
>         >>   {
>         >>
>         >>       X86CPU *x86cpu = X86_CPU(cpu);
>         >>       CPUX86State *env = &x86cpu->env;
>         >> -    int r;
>         >> -
>         >> -    /* init cpu signals */
>         >> -    sigset_t set;
>         >> -    struct sigaction sigact;
>         >> -
>         >> -    memset(&sigact, 0, sizeof(sigact));
>         >> -    sigact.sa_handler = dummy_signal;
>         >> -    sigaction(SIG_IPI, &sigact, NULL);
>         >> -
>         >> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>         >> -    sigdelset(&set, SIG_IPI);
>         >>
>         >>       init_emu();
>         >>       init_decoder();
>         >> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
>         >>       hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
>         >>       env->hvf_mmio_buf = g_new(char, 4096);
>         >>
>         >> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd,
>         HV_VCPU_DEFAULT);
>         >> -    cpu->vcpu_dirty = 1;
>         >> -    assert_hvf_ok(r);
>         >> -
>         >>       if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
>         >>  &hvf_state->hvf_caps->vmx_cap_pinbased)) {
>         >>           abort();
>         >> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
>         >>
>         >>       return ret;
>         >>   }
>         >> -
>         >> -bool hvf_allowed;
>         >> -
>         >> -static int hvf_accel_init(MachineState *ms)
>         >> -{
>         >> -    int x;
>         >> -    hv_return_t ret;
>         >> -    HVFState *s;
>         >> -
>         >> -    ret = hv_vm_create(HV_VM_DEFAULT);
>         >> -    assert_hvf_ok(ret);
>         >> -
>         >> -    s = g_new0(HVFState, 1);
>         >> -
>         >> -    s->num_slots = 32;
>         >> -    for (x = 0; x < s->num_slots; ++x) {
>         >> -        s->slots[x].size = 0;
>         >> -        s->slots[x].slot_id = x;
>         >> -    }
>         >> -
>         >> -    hvf_state = s;
>         >> - memory_listener_register(&hvf_memory_listener,
>         &address_space_memory);
>         >> -    cpus_register_accel(&hvf_cpus);
>         >> -    return 0;
>         >> -}
>         >> -
>         >> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
>         >> -{
>         >> -    AccelClass *ac = ACCEL_CLASS(oc);
>         >> -    ac->name = "HVF";
>         >> -    ac->init_machine = hvf_accel_init;
>         >> -    ac->allowed = &hvf_allowed;
>         >> -}
>         >> -
>         >> -static const TypeInfo hvf_accel_type = {
>         >> -    .name = TYPE_HVF_ACCEL,
>         >> -    .parent = TYPE_ACCEL,
>         >> -    .class_init = hvf_accel_class_init,
>         >> -};
>         >> -
>         >> -static void hvf_type_init(void)
>         >> -{
>         >> -    type_register_static(&hvf_accel_type);
>         >> -}
>         >> -
>         >> -type_init(hvf_type_init);
>         >> diff --git a/target/i386/hvf/meson.build
>         b/target/i386/hvf/meson.build
>         >> index 409c9a3f14..c8a43717ee 100644
>         >> --- a/target/i386/hvf/meson.build
>         >> +++ b/target/i386/hvf/meson.build
>         >> @@ -1,6 +1,5 @@
>         >>   i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true:
>         files(
>         >>     'hvf.c',
>         >> -  'hvf-cpus.c',
>         >>     'x86.c',
>         >>     'x86_cpuid.c',
>         >>     'x86_decode.c',
>         >> diff --git a/target/i386/hvf/x86hvf.c
>         b/target/i386/hvf/x86hvf.c
>         >> index bbec412b6c..89b8e9d87a 100644
>         >> --- a/target/i386/hvf/x86hvf.c
>         >> +++ b/target/i386/hvf/x86hvf.c
>         >> @@ -20,6 +20,9 @@
>         >>   #include "qemu/osdep.h"
>         >>
>         >>   #include "qemu-common.h"
>         >> +#include "sysemu/hvf.h"
>         >> +#include "sysemu/hvf_int.h"
>         >> +#include "sysemu/hw_accel.h"
>         >>   #include "x86hvf.h"
>         >>   #include "vmx.h"
>         >>   #include "vmcs.h"
>         >> @@ -32,8 +35,6 @@
>         >>   #include <Hypervisor/hv.h>
>         >>   #include <Hypervisor/hv_vmx.h>
>         >>
>         >> -#include "hvf-cpus.h"
>         >> -
>         >>   void hvf_set_segment(struct CPUState *cpu, struct
>         vmx_segment *vmx_seg,
>         >>                        SegmentCache *qseg, bool is_tr)
>         >>   {
>         >> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
>         >>       env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
>         >>
>         >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
>         >> -        hvf_cpu_synchronize_state(cpu_state);
>         >> +        cpu_synchronize_state(cpu_state);
>         >>           do_cpu_init(cpu);
>         >>       }
>         >>
>         >> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState
>         *cpu_state)
>         >>           cpu_state->halted = 0;
>         >>       }
>         >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
>         >> -        hvf_cpu_synchronize_state(cpu_state);
>         >> +        cpu_synchronize_state(cpu_state);
>         >>           do_cpu_sipi(cpu);
>         >>       }
>         >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
>         >>           cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
>         >> -        hvf_cpu_synchronize_state(cpu_state);
>         >> +        cpu_synchronize_state(cpu_state);
>         > The changes from hvf_cpu_*() to cpu_*() are cleanup and
>         perhaps should
>         > be a separate patch. It follows cpu/accel cleanups Claudio
>         was doing the
>         > summer.
>
>
>         The only reason they're in here is because we no longer have
>         access to
>         the hvf_ functions from the file. I am perfectly happy to
>         rebase the
>         patch on top of Claudio's if his goes in first. I'm sure it'll be
>         trivial for him to rebase on top of this too if my series goes
>         in first.
>
>
>         >
>         > Phillipe raised the idea that the patch might go ahead of
>         ARM-specific
>         > part (which might involve some discussions) and I agree with
>         that.
>         >
>         > Some sync between Claudio series (CC'd him) and the patch
>         might be need.
>
>
>         I would prefer not to hold back because of the sync. Claudio's
>         cleanup
>         is trivial enough to adjust for if it gets merged ahead of this.
>
>
>         Alex
>
>
>

Re: [PATCH 2/8] hvf: Move common code out

Posted by Frank Yang 4 years, 3 months ago

On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:

> Hi Frank,
>
> Thanks for the update :). Your previous email nudged me into the right
> direction. I previously had implemented WFI through the internal timer
> framework which performed way worse.
>
Cool, glad it's helping. Also, Peter found out that the main thing keeping
us from just using cntpct_el0 on the host directly and compare with cval is
that if we sleep, cval is going to be much < cntpct_el0 by the sleep time.
If we can get either the architecture or macos to read out the sleep time
then we might be able to not have to use a poll interval either!

> Along the way, I stumbled over a few issues though. For starters, the
> signal mask for SIG_IPI was not set correctly, so while pselect() would
> exit, the signal would never get delivered to the thread! For a fix, check
> out
>
>
> https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
>
>
Thanks, we'll take a look :)


> Please also have a look at my latest stab at WFI emulation. It doesn't
> handle WFE (that's only relevant in overcommitted scenarios). But it does
> handle WFI and even does something similar to hlt polling, albeit not with
> an adaptive threshold.
>
> Also, is there a particular reason you're working on this super
> interesting and useful code in a random downstream fork of QEMU? Wouldn't
> it be more helpful to contribute to the upstream code base instead?
>
We'd actually like to contribute upstream too :) We do want to maintain our
own downstream though; Android Emulator codebase needs to work solidly on
macos and windows which has made keeping up with upstream difficult, and
staying on a previous version (2.12) with known quirks easier. (theres also
some android related customization relating to Qt Ui + different set of
virtual devices and snapshot support (incl. snapshots of graphics devices
with OpenGLES state tracking), which we hope to separate into other
libraries/processes, but its not insignificant)

>
> Alex
>
> On 30.11.20 21:15, Frank Yang wrote:
>
> Update: We're not quite sure how to compare the CNTV_CVAL and CNTVCT. But
> the high CPU usage seems to be mitigated by having a poll interval (like
> KVM does) in handling WFI:
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512501
>
> This is loosely inspired by
> https://elixir.bootlin.com/linux/v5.10-rc6/source/virt/kvm/kvm_main.c#L2766
> which does seem to specify a poll interval.
>
> It would be cool if we could have a lightweight way to enter sleep and
> restart the vcpus precisely when CVAL passes, though.
>
> Frank
>
>
> On Fri, Nov 27, 2020 at 3:30 PM Frank Yang <lfy@google.com> wrote:
>
>> Hi all,
>>
>> +Peter Collingbourne <pcc@google.com>
>>
>> I'm a developer on the Android Emulator, which is in a fork of QEMU.
>>
>> Peter and I have been working on an HVF Apple Silicon backend with an eye
>> toward Android guests.
>>
>> We have gotten things to basically switch to Android userspace already
>> (logcat/shell and graphics available at least)
>>
>> Our strategy so far has been to import logic from the KVM implementation
>> and hook into QEMU's software devices that previously assumed to only work
>> with TCG, or have KVM-specific paths.
>>
>> Thanks to Alexander for the tip on the 36-bit address space limitation
>> btw; our way of addressing this is to still allow highmem but not put pci
>> high mmio so high.
>>
>> Also, note we have a sleep/signal based mechanism to deal with WFx, which
>> might be worth looking into in Alexander's implementation as well:
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551
>>
>> Patches so far, FYI:
>>
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1513429/1
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512554/3
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512553/3
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512552/3
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551/3
>>
>>
>> https://android.googlesource.com/platform/external/qemu/+/c17eb6a3ffd50047e9646aff6640b710cb8ff48a
>>
>> https://android.googlesource.com/platform/external/qemu/+/74bed16de1afb41b7a7ab8da1d1861226c9db63b
>>
>> https://android.googlesource.com/platform/external/qemu/+/eccd9e47ab2ccb9003455e3bb721f57f9ebc3c01
>>
>> https://android.googlesource.com/platform/external/qemu/+/54fe3d67ed4698e85826537a4f49b2b9074b2228
>>
>> https://android.googlesource.com/platform/external/qemu/+/82ef91a6fede1d1000f36be037ad4d58fbe0d102
>>
>> https://android.googlesource.com/platform/external/qemu/+/c28147aa7c74d98b858e99623d2fe46e74a379f6
>>
>> Peter's also noticed that there are extra steps needed for M1's to allow
>> TCG to work, as it involves JIT:
>>
>>
>> https://android.googlesource.com/platform/external/qemu/+/740e3fe47f88926c6bda9abb22ee6eae1bc254a9
>>
>> We'd appreciate any feedback/comments :)
>>
>> Best,
>>
>> Frank
>>
>> On Fri, Nov 27, 2020 at 1:57 PM Alexander Graf <agraf@csgraf.de> wrote:
>>
>>>
>>> On 27.11.20 21:00, Roman Bolshakov wrote:
>>> > On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
>>> >> Until now, Hypervisor.framework has only been available on x86_64
>>> systems.
>>> >> With Apple Silicon shipping now, it extends its reach to aarch64. To
>>> >> prepare for support for multiple architectures, let's move common
>>> code out
>>> >> into its own accel directory.
>>> >>
>>> >> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>>> >> ---
>>> >>   MAINTAINERS                 |   9 +-
>>> >>   accel/hvf/hvf-all.c         |  56 +++++
>>> >>   accel/hvf/hvf-cpus.c        | 468
>>> ++++++++++++++++++++++++++++++++++++
>>> >>   accel/hvf/meson.build       |   7 +
>>> >>   accel/meson.build           |   1 +
>>> >>   include/sysemu/hvf_int.h    |  69 ++++++
>>> >>   target/i386/hvf/hvf-cpus.c  | 131 ----------
>>> >>   target/i386/hvf/hvf-cpus.h  |  25 --
>>> >>   target/i386/hvf/hvf-i386.h  |  48 +---
>>> >>   target/i386/hvf/hvf.c       | 360 +--------------------------
>>> >>   target/i386/hvf/meson.build |   1 -
>>> >>   target/i386/hvf/x86hvf.c    |  11 +-
>>> >>   target/i386/hvf/x86hvf.h    |   2 -
>>> >>   13 files changed, 619 insertions(+), 569 deletions(-)
>>> >>   create mode 100644 accel/hvf/hvf-all.c
>>> >>   create mode 100644 accel/hvf/hvf-cpus.c
>>> >>   create mode 100644 accel/hvf/meson.build
>>> >>   create mode 100644 include/sysemu/hvf_int.h
>>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.c
>>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.h
>>> >>
>>> >> diff --git a/MAINTAINERS b/MAINTAINERS
>>> >> index 68bc160f41..ca4b6d9279 100644
>>> >> --- a/MAINTAINERS
>>> >> +++ b/MAINTAINERS
>>> >> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
>>> >>   M: Roman Bolshakov <r.bolshakov@yadro.com>
>>> >>   W: https://wiki.qemu.org/Features/HVF
>>> >>   S: Maintained
>>> >> -F: accel/stubs/hvf-stub.c
>>> > There was a patch for that in the RFC series from Claudio.
>>>
>>>
>>> Yeah, I'm not worried about this hunk :).
>>>
>>>
>>> >
>>> >>   F: target/i386/hvf/
>>> >> +
>>> >> +HVF
>>> >> +M: Cameron Esfahani <dirty@apple.com>
>>> >> +M: Roman Bolshakov <r.bolshakov@yadro.com>
>>> >> +W: https://wiki.qemu.org/Features/HVF
>>> >> +S: Maintained
>>> >> +F: accel/hvf/
>>> >>   F: include/sysemu/hvf.h
>>> >> +F: include/sysemu/hvf_int.h
>>> >>
>>> >>   WHPX CPUs
>>> >>   M: Sunil Muthuswamy <sunilmut@microsoft.com>
>>> >> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
>>> >> new file mode 100644
>>> >> index 0000000000..47d77a472a
>>> >> --- /dev/null
>>> >> +++ b/accel/hvf/hvf-all.c
>>> >> @@ -0,0 +1,56 @@
>>> >> +/*
>>> >> + * QEMU Hypervisor.framework support
>>> >> + *
>>> >> + * This work is licensed under the terms of the GNU GPL, version 2.
>>> See
>>> >> + * the COPYING file in the top-level directory.
>>> >> + *
>>> >> + * Contributions after 2012-01-13 are licensed under the terms of the
>>> >> + * GNU GPL, version 2 or (at your option) any later version.
>>> >> + */
>>> >> +
>>> >> +#include "qemu/osdep.h"
>>> >> +#include "qemu-common.h"
>>> >> +#include "qemu/error-report.h"
>>> >> +#include "sysemu/hvf.h"
>>> >> +#include "sysemu/hvf_int.h"
>>> >> +#include "sysemu/runstate.h"
>>> >> +
>>> >> +#include "qemu/main-loop.h"
>>> >> +#include "sysemu/accel.h"
>>> >> +
>>> >> +#include <Hypervisor/Hypervisor.h>
>>> >> +
>>> >> +bool hvf_allowed;
>>> >> +HVFState *hvf_state;
>>> >> +
>>> >> +void assert_hvf_ok(hv_return_t ret)
>>> >> +{
>>> >> +    if (ret == HV_SUCCESS) {
>>> >> +        return;
>>> >> +    }
>>> >> +
>>> >> +    switch (ret) {
>>> >> +    case HV_ERROR:
>>> >> +        error_report("Error: HV_ERROR");
>>> >> +        break;
>>> >> +    case HV_BUSY:
>>> >> +        error_report("Error: HV_BUSY");
>>> >> +        break;
>>> >> +    case HV_BAD_ARGUMENT:
>>> >> +        error_report("Error: HV_BAD_ARGUMENT");
>>> >> +        break;
>>> >> +    case HV_NO_RESOURCES:
>>> >> +        error_report("Error: HV_NO_RESOURCES");
>>> >> +        break;
>>> >> +    case HV_NO_DEVICE:
>>> >> +        error_report("Error: HV_NO_DEVICE");
>>> >> +        break;
>>> >> +    case HV_UNSUPPORTED:
>>> >> +        error_report("Error: HV_UNSUPPORTED");
>>> >> +        break;
>>> >> +    default:
>>> >> +        error_report("Unknown Error");
>>> >> +    }
>>> >> +
>>> >> +    abort();
>>> >> +}
>>> >> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>>> >> new file mode 100644
>>> >> index 0000000000..f9bb5502b7
>>> >> --- /dev/null
>>> >> +++ b/accel/hvf/hvf-cpus.c
>>> >> @@ -0,0 +1,468 @@
>>> >> +/*
>>> >> + * Copyright 2008 IBM Corporation
>>> >> + *           2008 Red Hat, Inc.
>>> >> + * Copyright 2011 Intel Corporation
>>> >> + * Copyright 2016 Veertu, Inc.
>>> >> + * Copyright 2017 The Android Open Source Project
>>> >> + *
>>> >> + * QEMU Hypervisor.framework support
>>> >> + *
>>> >> + * This program is free software; you can redistribute it and/or
>>> >> + * modify it under the terms of version 2 of the GNU General Public
>>> >> + * License as published by the Free Software Foundation.
>>> >> + *
>>> >> + * This program is distributed in the hope that it will be useful,
>>> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>> >> + * General Public License for more details.
>>> >> + *
>>> >> + * You should have received a copy of the GNU General Public License
>>> >> + * along with this program; if not, see <
>>> http://www.gnu.org/licenses/>.
>>> >> + *
>>> >> + * This file contain code under public domain from the hvdos project:
>>> >> + * https://github.com/mist64/hvdos
>>> >> + *
>>> >> + * Parts Copyright (c) 2011 NetApp, Inc.
>>> >> + * All rights reserved.
>>> >> + *
>>> >> + * Redistribution and use in source and binary forms, with or without
>>> >> + * modification, are permitted provided that the following conditions
>>> >> + * are met:
>>> >> + * 1. Redistributions of source code must retain the above copyright
>>> >> + *    notice, this list of conditions and the following disclaimer.
>>> >> + * 2. Redistributions in binary form must reproduce the above
>>> copyright
>>> >> + *    notice, this list of conditions and the following disclaimer
>>> in the
>>> >> + *    documentation and/or other materials provided with the
>>> distribution.
>>> >> + *
>>> >> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>>> >> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
>>> THE
>>> >> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
>>> PARTICULAR PURPOSE
>>> >> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE
>>> LIABLE
>>> >> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
>>> CONSEQUENTIAL
>>> >> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
>>> GOODS
>>> >> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
>>> INTERRUPTION)
>>> >> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
>>> CONTRACT, STRICT
>>> >> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
>>> ANY WAY
>>> >> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>>> POSSIBILITY OF
>>> >> + * SUCH DAMAGE.
>>> >> + */
>>> >> +
>>> >> +#include "qemu/osdep.h"
>>> >> +#include "qemu/error-report.h"
>>> >> +#include "qemu/main-loop.h"
>>> >> +#include "exec/address-spaces.h"
>>> >> +#include "exec/exec-all.h"
>>> >> +#include "sysemu/cpus.h"
>>> >> +#include "sysemu/hvf.h"
>>> >> +#include "sysemu/hvf_int.h"
>>> >> +#include "sysemu/runstate.h"
>>> >> +#include "qemu/guest-random.h"
>>> >> +
>>> >> +#include <Hypervisor/Hypervisor.h>
>>> >> +
>>> >> +/* Memory slots */
>>> >> +
>>> >> +struct mac_slot {
>>> >> +    int present;
>>> >> +    uint64_t size;
>>> >> +    uint64_t gpa_start;
>>> >> +    uint64_t gva;
>>> >> +};
>>> >> +
>>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>>> >> +{
>>> >> +    hvf_slot *slot;
>>> >> +    int x;
>>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>>> >> +        slot = &hvf_state->slots[x];
>>> >> +        if (slot->size && start < (slot->start + slot->size) &&
>>> >> +            (start + size) > slot->start) {
>>> >> +            return slot;
>>> >> +        }
>>> >> +    }
>>> >> +    return NULL;
>>> >> +}
>>> >> +
>>> >> +struct mac_slot mac_slots[32];
>>> >> +
>>> >> +static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>>> >> +{
>>> >> +    struct mac_slot *macslot;
>>> >> +    hv_return_t ret;
>>> >> +
>>> >> +    macslot = &mac_slots[slot->slot_id];
>>> >> +
>>> >> +    if (macslot->present) {
>>> >> +        if (macslot->size != slot->size) {
>>> >> +            macslot->present = 0;
>>> >> +            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>>> >> +            assert_hvf_ok(ret);
>>> >> +        }
>>> >> +    }
>>> >> +
>>> >> +    if (!slot->size) {
>>> >> +        return 0;
>>> >> +    }
>>> >> +
>>> >> +    macslot->present = 1;
>>> >> +    macslot->gpa_start = slot->start;
>>> >> +    macslot->size = slot->size;
>>> >> +    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
>>> >> +    assert_hvf_ok(ret);
>>> >> +    return 0;
>>> >> +}
>>> >> +
>>> >> +static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>>> >> +{
>>> >> +    hvf_slot *mem;
>>> >> +    MemoryRegion *area = section->mr;
>>> >> +    bool writeable = !area->readonly && !area->rom_device;
>>> >> +    hv_memory_flags_t flags;
>>> >> +
>>> >> +    if (!memory_region_is_ram(area)) {
>>> >> +        if (writeable) {
>>> >> +            return;
>>> >> +        } else if (!memory_region_is_romd(area)) {
>>> >> +            /*
>>> >> +             * If the memory device is not in romd_mode, then we
>>> actually want
>>> >> +             * to remove the hvf memory slot so all accesses will
>>> trap.
>>> >> +             */
>>> >> +             add = false;
>>> >> +        }
>>> >> +    }
>>> >> +
>>> >> +    mem = hvf_find_overlap_slot(
>>> >> +            section->offset_within_address_space,
>>> >> +            int128_get64(section->size));
>>> >> +
>>> >> +    if (mem && add) {
>>> >> +        if (mem->size == int128_get64(section->size) &&
>>> >> +            mem->start == section->offset_within_address_space &&
>>> >> +            mem->mem == (memory_region_get_ram_ptr(area) +
>>> >> +            section->offset_within_region)) {
>>> >> +            return; /* Same region was attempted to register, go
>>> away. */
>>> >> +        }
>>> >> +    }
>>> >> +
>>> >> +    /* Region needs to be reset. set the size to 0 and remap it. */
>>> >> +    if (mem) {
>>> >> +        mem->size = 0;
>>> >> +        if (do_hvf_set_memory(mem, 0)) {
>>> >> +            error_report("Failed to reset overlapping slot");
>>> >> +            abort();
>>> >> +        }
>>> >> +    }
>>> >> +
>>> >> +    if (!add) {
>>> >> +        return;
>>> >> +    }
>>> >> +
>>> >> +    if (area->readonly ||
>>> >> +        (!memory_region_is_ram(area) &&
>>> memory_region_is_romd(area))) {
>>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>>> >> +    } else {
>>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>>> >> +    }
>>> >> +
>>> >> +    /* Now make a new slot. */
>>> >> +    int x;
>>> >> +
>>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>>> >> +        mem = &hvf_state->slots[x];
>>> >> +        if (!mem->size) {
>>> >> +            break;
>>> >> +        }
>>> >> +    }
>>> >> +
>>> >> +    if (x == hvf_state->num_slots) {
>>> >> +        error_report("No free slots");
>>> >> +        abort();
>>> >> +    }
>>> >> +
>>> >> +    mem->size = int128_get64(section->size);
>>> >> +    mem->mem = memory_region_get_ram_ptr(area) +
>>> section->offset_within_region;
>>> >> +    mem->start = section->offset_within_address_space;
>>> >> +    mem->region = area;
>>> >> +
>>> >> +    if (do_hvf_set_memory(mem, flags)) {
>>> >> +        error_report("Error registering new memory slot");
>>> >> +        abort();
>>> >> +    }
>>> >> +}
>>> >> +
>>> >> +static void hvf_set_dirty_tracking(MemoryRegionSection *section,
>>> bool on)
>>> >> +{
>>> >> +    hvf_slot *slot;
>>> >> +
>>> >> +    slot = hvf_find_overlap_slot(
>>> >> +            section->offset_within_address_space,
>>> >> +            int128_get64(section->size));
>>> >> +
>>> >> +    /* protect region against writes; begin tracking it */
>>> >> +    if (on) {
>>> >> +        slot->flags |= HVF_SLOT_LOG;
>>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>>> >> +                      HV_MEMORY_READ);
>>> >> +    /* stop tracking region*/
>>> >> +    } else {
>>> >> +        slot->flags &= ~HVF_SLOT_LOG;
>>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>>> >> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>>> >> +    }
>>> >> +}
>>> >> +
>>> >> +static void hvf_log_start(MemoryListener *listener,
>>> >> +                          MemoryRegionSection *section, int old, int
>>> new)
>>> >> +{
>>> >> +    if (old != 0) {
>>> >> +        return;
>>> >> +    }
>>> >> +
>>> >> +    hvf_set_dirty_tracking(section, 1);
>>> >> +}
>>> >> +
>>> >> +static void hvf_log_stop(MemoryListener *listener,
>>> >> +                         MemoryRegionSection *section, int old, int
>>> new)
>>> >> +{
>>> >> +    if (new != 0) {
>>> >> +        return;
>>> >> +    }
>>> >> +
>>> >> +    hvf_set_dirty_tracking(section, 0);
>>> >> +}
>>> >> +
>>> >> +static void hvf_log_sync(MemoryListener *listener,
>>> >> +                         MemoryRegionSection *section)
>>> >> +{
>>> >> +    /*
>>> >> +     * sync of dirty pages is handled elsewhere; just make sure we
>>> keep
>>> >> +     * tracking the region.
>>> >> +     */
>>> >> +    hvf_set_dirty_tracking(section, 1);
>>> >> +}
>>> >> +
>>> >> +static void hvf_region_add(MemoryListener *listener,
>>> >> +                           MemoryRegionSection *section)
>>> >> +{
>>> >> +    hvf_set_phys_mem(section, true);
>>> >> +}
>>> >> +
>>> >> +static void hvf_region_del(MemoryListener *listener,
>>> >> +                           MemoryRegionSection *section)
>>> >> +{
>>> >> +    hvf_set_phys_mem(section, false);
>>> >> +}
>>> >> +
>>> >> +static MemoryListener hvf_memory_listener = {
>>> >> +    .priority = 10,
>>> >> +    .region_add = hvf_region_add,
>>> >> +    .region_del = hvf_region_del,
>>> >> +    .log_start = hvf_log_start,
>>> >> +    .log_stop = hvf_log_stop,
>>> >> +    .log_sync = hvf_log_sync,
>>> >> +};
>>> >> +
>>> >> +static void do_hvf_cpu_synchronize_state(CPUState *cpu,
>>> run_on_cpu_data arg)
>>> >> +{
>>> >> +    if (!cpu->vcpu_dirty) {
>>> >> +        hvf_get_registers(cpu);
>>> >> +        cpu->vcpu_dirty = true;
>>> >> +    }
>>> >> +}
>>> >> +
>>> >> +static void hvf_cpu_synchronize_state(CPUState *cpu)
>>> >> +{
>>> >> +    if (!cpu->vcpu_dirty) {
>>> >> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state,
>>> RUN_ON_CPU_NULL);
>>> >> +    }
>>> >> +}
>>> >> +
>>> >> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>>> >> +                                              run_on_cpu_data arg)
>>> >> +{
>>> >> +    hvf_put_registers(cpu);
>>> >> +    cpu->vcpu_dirty = false;
>>> >> +}
>>> >> +
>>> >> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>>> >> +{
>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset,
>>> RUN_ON_CPU_NULL);
>>> >> +}
>>> >> +
>>> >> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>>> >> +                                             run_on_cpu_data arg)
>>> >> +{
>>> >> +    hvf_put_registers(cpu);
>>> >> +    cpu->vcpu_dirty = false;
>>> >> +}
>>> >> +
>>> >> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
>>> >> +{
>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init,
>>> RUN_ON_CPU_NULL);
>>> >> +}
>>> >> +
>>> >> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>>> >> +                                              run_on_cpu_data arg)
>>> >> +{
>>> >> +    cpu->vcpu_dirty = true;
>>> >> +}
>>> >> +
>>> >> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>>> >> +{
>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm,
>>> RUN_ON_CPU_NULL);
>>> >> +}
>>> >> +
>>> >> +static void hvf_vcpu_destroy(CPUState *cpu)
>>> >> +{
>>> >> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
>>> >> +    assert_hvf_ok(ret);
>>> >> +
>>> >> +    hvf_arch_vcpu_destroy(cpu);
>>> >> +}
>>> >> +
>>> >> +static void dummy_signal(int sig)
>>> >> +{
>>> >> +}
>>> >> +
>>> >> +static int hvf_init_vcpu(CPUState *cpu)
>>> >> +{
>>> >> +    int r;
>>> >> +
>>> >> +    /* init cpu signals */
>>> >> +    sigset_t set;
>>> >> +    struct sigaction sigact;
>>> >> +
>>> >> +    memset(&sigact, 0, sizeof(sigact));
>>> >> +    sigact.sa_handler = dummy_signal;
>>> >> +    sigaction(SIG_IPI, &sigact, NULL);
>>> >> +
>>> >> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>> >> +    sigdelset(&set, SIG_IPI);
>>> >> +
>>> >> +#ifdef __aarch64__
>>> >> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t
>>> **)&cpu->hvf_exit, NULL);
>>> >> +#else
>>> >> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>>> >> +#endif
>>> > I think the first __aarch64__ bit fits better to arm part of the
>>> series.
>>>
>>>
>>> Oops. Thanks for catching it! Yes, absolutely. It should be part of the
>>> ARM enablement.
>>>
>>>
>>> >
>>> >> +    cpu->vcpu_dirty = 1;
>>> >> +    assert_hvf_ok(r);
>>> >> +
>>> >> +    return hvf_arch_init_vcpu(cpu);
>>> >> +}
>>> >> +
>>> >> +/*
>>> >> + * The HVF-specific vCPU thread function. This one should only run
>>> when the host
>>> >> + * CPU supports the VMX "unrestricted guest" feature.
>>> >> + */
>>> >> +static void *hvf_cpu_thread_fn(void *arg)
>>> >> +{
>>> >> +    CPUState *cpu = arg;
>>> >> +
>>> >> +    int r;
>>> >> +
>>> >> +    assert(hvf_enabled());
>>> >> +
>>> >> +    rcu_register_thread();
>>> >> +
>>> >> +    qemu_mutex_lock_iothread();
>>> >> +    qemu_thread_get_self(cpu->thread);
>>> >> +
>>> >> +    cpu->thread_id = qemu_get_thread_id();
>>> >> +    cpu->can_do_io = 1;
>>> >> +    current_cpu = cpu;
>>> >> +
>>> >> +    hvf_init_vcpu(cpu);
>>> >> +
>>> >> +    /* signal CPU creation */
>>> >> +    cpu_thread_signal_created(cpu);
>>> >> +    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>>> >> +
>>> >> +    do {
>>> >> +        if (cpu_can_run(cpu)) {
>>> >> +            r = hvf_vcpu_exec(cpu);
>>> >> +            if (r == EXCP_DEBUG) {
>>> >> +                cpu_handle_guest_debug(cpu);
>>> >> +            }
>>> >> +        }
>>> >> +        qemu_wait_io_event(cpu);
>>> >> +    } while (!cpu->unplug || cpu_can_run(cpu));
>>> >> +
>>> >> +    hvf_vcpu_destroy(cpu);
>>> >> +    cpu_thread_signal_destroyed(cpu);
>>> >> +    qemu_mutex_unlock_iothread();
>>> >> +    rcu_unregister_thread();
>>> >> +    return NULL;
>>> >> +}
>>> >> +
>>> >> +static void hvf_start_vcpu_thread(CPUState *cpu)
>>> >> +{
>>> >> +    char thread_name[VCPU_THREAD_NAME_SIZE];
>>> >> +
>>> >> +    /*
>>> >> +     * HVF currently does not support TCG, and only runs in
>>> >> +     * unrestricted-guest mode.
>>> >> +     */
>>> >> +    assert(hvf_enabled());
>>> >> +
>>> >> +    cpu->thread = g_malloc0(sizeof(QemuThread));
>>> >> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>>> >> +    qemu_cond_init(cpu->halt_cond);
>>> >> +
>>> >> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>>> >> +             cpu->cpu_index);
>>> >> +    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>>> >> +                       cpu, QEMU_THREAD_JOINABLE);
>>> >> +}
>>> >> +
>>> >> +static const CpusAccel hvf_cpus = {
>>> >> +    .create_vcpu_thread = hvf_start_vcpu_thread,
>>> >> +
>>> >> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>>> >> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>>> >> +    .synchronize_state = hvf_cpu_synchronize_state,
>>> >> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>>> >> +};
>>> >> +
>>> >> +static int hvf_accel_init(MachineState *ms)
>>> >> +{
>>> >> +    int x;
>>> >> +    hv_return_t ret;
>>> >> +    HVFState *s;
>>> >> +
>>> >> +    ret = hv_vm_create(HV_VM_DEFAULT);
>>> >> +    assert_hvf_ok(ret);
>>> >> +
>>> >> +    s = g_new0(HVFState, 1);
>>> >> +
>>> >> +    s->num_slots = 32;
>>> >> +    for (x = 0; x < s->num_slots; ++x) {
>>> >> +        s->slots[x].size = 0;
>>> >> +        s->slots[x].slot_id = x;
>>> >> +    }
>>> >> +
>>> >> +    hvf_state = s;
>>> >> +    memory_listener_register(&hvf_memory_listener,
>>> &address_space_memory);
>>> >> +    cpus_register_accel(&hvf_cpus);
>>> >> +    return 0;
>>> >> +}
>>> >> +
>>> >> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
>>> >> +{
>>> >> +    AccelClass *ac = ACCEL_CLASS(oc);
>>> >> +    ac->name = "HVF";
>>> >> +    ac->init_machine = hvf_accel_init;
>>> >> +    ac->allowed = &hvf_allowed;
>>> >> +}
>>> >> +
>>> >> +static const TypeInfo hvf_accel_type = {
>>> >> +    .name = TYPE_HVF_ACCEL,
>>> >> +    .parent = TYPE_ACCEL,
>>> >> +    .class_init = hvf_accel_class_init,
>>> >> +};
>>> >> +
>>> >> +static void hvf_type_init(void)
>>> >> +{
>>> >> +    type_register_static(&hvf_accel_type);
>>> >> +}
>>> >> +
>>> >> +type_init(hvf_type_init);
>>> >> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
>>> >> new file mode 100644
>>> >> index 0000000000..dfd6b68dc7
>>> >> --- /dev/null
>>> >> +++ b/accel/hvf/meson.build
>>> >> @@ -0,0 +1,7 @@
>>> >> +hvf_ss = ss.source_set()
>>> >> +hvf_ss.add(files(
>>> >> +  'hvf-all.c',
>>> >> +  'hvf-cpus.c',
>>> >> +))
>>> >> +
>>> >> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
>>> >> diff --git a/accel/meson.build b/accel/meson.build
>>> >> index b26cca227a..6de12ce5d5 100644
>>> >> --- a/accel/meson.build
>>> >> +++ b/accel/meson.build
>>> >> @@ -1,5 +1,6 @@
>>> >>   softmmu_ss.add(files('accel.c'))
>>> >>
>>> >> +subdir('hvf')
>>> >>   subdir('qtest')
>>> >>   subdir('kvm')
>>> >>   subdir('tcg')
>>> >> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>>> >> new file mode 100644
>>> >> index 0000000000..de9bad23a8
>>> >> --- /dev/null
>>> >> +++ b/include/sysemu/hvf_int.h
>>> >> @@ -0,0 +1,69 @@
>>> >> +/*
>>> >> + * QEMU Hypervisor.framework (HVF) support
>>> >> + *
>>> >> + * This work is licensed under the terms of the GNU GPL, version 2
>>> or later.
>>> >> + * See the COPYING file in the top-level directory.
>>> >> + *
>>> >> + */
>>> >> +
>>> >> +/* header to be included in HVF-specific code */
>>> >> +
>>> >> +#ifndef HVF_INT_H
>>> >> +#define HVF_INT_H
>>> >> +
>>> >> +#include <Hypervisor/Hypervisor.h>
>>> >> +
>>> >> +#define HVF_MAX_VCPU 0x10
>>> >> +
>>> >> +extern struct hvf_state hvf_global;
>>> >> +
>>> >> +struct hvf_vm {
>>> >> +    int id;
>>> >> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>>> >> +};
>>> >> +
>>> >> +struct hvf_state {
>>> >> +    uint32_t version;
>>> >> +    struct hvf_vm *vm;
>>> >> +    uint64_t mem_quota;
>>> >> +};
>>> >> +
>>> >> +/* hvf_slot flags */
>>> >> +#define HVF_SLOT_LOG (1 << 0)
>>> >> +
>>> >> +typedef struct hvf_slot {
>>> >> +    uint64_t start;
>>> >> +    uint64_t size;
>>> >> +    uint8_t *mem;
>>> >> +    int slot_id;
>>> >> +    uint32_t flags;
>>> >> +    MemoryRegion *region;
>>> >> +} hvf_slot;
>>> >> +
>>> >> +typedef struct hvf_vcpu_caps {
>>> >> +    uint64_t vmx_cap_pinbased;
>>> >> +    uint64_t vmx_cap_procbased;
>>> >> +    uint64_t vmx_cap_procbased2;
>>> >> +    uint64_t vmx_cap_entry;
>>> >> +    uint64_t vmx_cap_exit;
>>> >> +    uint64_t vmx_cap_preemption_timer;
>>> >> +} hvf_vcpu_caps;
>>> >> +
>>> >> +struct HVFState {
>>> >> +    AccelState parent;
>>> >> +    hvf_slot slots[32];
>>> >> +    int num_slots;
>>> >> +
>>> >> +    hvf_vcpu_caps *hvf_caps;
>>> >> +};
>>> >> +extern HVFState *hvf_state;
>>> >> +
>>> >> +void assert_hvf_ok(hv_return_t ret);
>>> >> +int hvf_get_registers(CPUState *cpu);
>>> >> +int hvf_put_registers(CPUState *cpu);
>>> >> +int hvf_arch_init_vcpu(CPUState *cpu);
>>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu);
>>> >> +int hvf_vcpu_exec(CPUState *cpu);
>>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>>> >> +
>>> >> +#endif
>>> >> diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
>>> >> deleted file mode 100644
>>> >> index 817b3d7452..0000000000
>>> >> --- a/target/i386/hvf/hvf-cpus.c
>>> >> +++ /dev/null
>>> >> @@ -1,131 +0,0 @@
>>> >> -/*
>>> >> - * Copyright 2008 IBM Corporation
>>> >> - *           2008 Red Hat, Inc.
>>> >> - * Copyright 2011 Intel Corporation
>>> >> - * Copyright 2016 Veertu, Inc.
>>> >> - * Copyright 2017 The Android Open Source Project
>>> >> - *
>>> >> - * QEMU Hypervisor.framework support
>>> >> - *
>>> >> - * This program is free software; you can redistribute it and/or
>>> >> - * modify it under the terms of version 2 of the GNU General Public
>>> >> - * License as published by the Free Software Foundation.
>>> >> - *
>>> >> - * This program is distributed in the hope that it will be useful,
>>> >> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> >> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>> >> - * General Public License for more details.
>>> >> - *
>>> >> - * You should have received a copy of the GNU General Public License
>>> >> - * along with this program; if not, see <
>>> http://www.gnu.org/licenses/>.
>>> >> - *
>>> >> - * This file contain code under public domain from the hvdos project:
>>> >> - * https://github.com/mist64/hvdos
>>> >> - *
>>> >> - * Parts Copyright (c) 2011 NetApp, Inc.
>>> >> - * All rights reserved.
>>> >> - *
>>> >> - * Redistribution and use in source and binary forms, with or without
>>> >> - * modification, are permitted provided that the following conditions
>>> >> - * are met:
>>> >> - * 1. Redistributions of source code must retain the above copyright
>>> >> - *    notice, this list of conditions and the following disclaimer.
>>> >> - * 2. Redistributions in binary form must reproduce the above
>>> copyright
>>> >> - *    notice, this list of conditions and the following disclaimer
>>> in the
>>> >> - *    documentation and/or other materials provided with the
>>> distribution.
>>> >> - *
>>> >> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>>> >> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
>>> THE
>>> >> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
>>> PARTICULAR PURPOSE
>>> >> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE
>>> LIABLE
>>> >> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
>>> CONSEQUENTIAL
>>> >> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
>>> GOODS
>>> >> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
>>> INTERRUPTION)
>>> >> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
>>> CONTRACT, STRICT
>>> >> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
>>> ANY WAY
>>> >> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>>> POSSIBILITY OF
>>> >> - * SUCH DAMAGE.
>>> >> - */
>>> >> -
>>> >> -#include "qemu/osdep.h"
>>> >> -#include "qemu/error-report.h"
>>> >> -#include "qemu/main-loop.h"
>>> >> -#include "sysemu/hvf.h"
>>> >> -#include "sysemu/runstate.h"
>>> >> -#include "target/i386/cpu.h"
>>> >> -#include "qemu/guest-random.h"
>>> >> -
>>> >> -#include "hvf-cpus.h"
>>> >> -
>>> >> -/*
>>> >> - * The HVF-specific vCPU thread function. This one should only run
>>> when the host
>>> >> - * CPU supports the VMX "unrestricted guest" feature.
>>> >> - */
>>> >> -static void *hvf_cpu_thread_fn(void *arg)
>>> >> -{
>>> >> -    CPUState *cpu = arg;
>>> >> -
>>> >> -    int r;
>>> >> -
>>> >> -    assert(hvf_enabled());
>>> >> -
>>> >> -    rcu_register_thread();
>>> >> -
>>> >> -    qemu_mutex_lock_iothread();
>>> >> -    qemu_thread_get_self(cpu->thread);
>>> >> -
>>> >> -    cpu->thread_id = qemu_get_thread_id();
>>> >> -    cpu->can_do_io = 1;
>>> >> -    current_cpu = cpu;
>>> >> -
>>> >> -    hvf_init_vcpu(cpu);
>>> >> -
>>> >> -    /* signal CPU creation */
>>> >> -    cpu_thread_signal_created(cpu);
>>> >> -    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>>> >> -
>>> >> -    do {
>>> >> -        if (cpu_can_run(cpu)) {
>>> >> -            r = hvf_vcpu_exec(cpu);
>>> >> -            if (r == EXCP_DEBUG) {
>>> >> -                cpu_handle_guest_debug(cpu);
>>> >> -            }
>>> >> -        }
>>> >> -        qemu_wait_io_event(cpu);
>>> >> -    } while (!cpu->unplug || cpu_can_run(cpu));
>>> >> -
>>> >> -    hvf_vcpu_destroy(cpu);
>>> >> -    cpu_thread_signal_destroyed(cpu);
>>> >> -    qemu_mutex_unlock_iothread();
>>> >> -    rcu_unregister_thread();
>>> >> -    return NULL;
>>> >> -}
>>> >> -
>>> >> -static void hvf_start_vcpu_thread(CPUState *cpu)
>>> >> -{
>>> >> -    char thread_name[VCPU_THREAD_NAME_SIZE];
>>> >> -
>>> >> -    /*
>>> >> -     * HVF currently does not support TCG, and only runs in
>>> >> -     * unrestricted-guest mode.
>>> >> -     */
>>> >> -    assert(hvf_enabled());
>>> >> -
>>> >> -    cpu->thread = g_malloc0(sizeof(QemuThread));
>>> >> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>>> >> -    qemu_cond_init(cpu->halt_cond);
>>> >> -
>>> >> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>>> >> -             cpu->cpu_index);
>>> >> -    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>>> >> -                       cpu, QEMU_THREAD_JOINABLE);
>>> >> -}
>>> >> -
>>> >> -const CpusAccel hvf_cpus = {
>>> >> -    .create_vcpu_thread = hvf_start_vcpu_thread,
>>> >> -
>>> >> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>>> >> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>>> >> -    .synchronize_state = hvf_cpu_synchronize_state,
>>> >> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>>> >> -};
>>> >> diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
>>> >> deleted file mode 100644
>>> >> index ced31b82c0..0000000000
>>> >> --- a/target/i386/hvf/hvf-cpus.h
>>> >> +++ /dev/null
>>> >> @@ -1,25 +0,0 @@
>>> >> -/*
>>> >> - * Accelerator CPUS Interface
>>> >> - *
>>> >> - * Copyright 2020 SUSE LLC
>>> >> - *
>>> >> - * This work is licensed under the terms of the GNU GPL, version 2
>>> or later.
>>> >> - * See the COPYING file in the top-level directory.
>>> >> - */
>>> >> -
>>> >> -#ifndef HVF_CPUS_H
>>> >> -#define HVF_CPUS_H
>>> >> -
>>> >> -#include "sysemu/cpus.h"
>>> >> -
>>> >> -extern const CpusAccel hvf_cpus;
>>> >> -
>>> >> -int hvf_init_vcpu(CPUState *);
>>> >> -int hvf_vcpu_exec(CPUState *);
>>> >> -void hvf_cpu_synchronize_state(CPUState *);
>>> >> -void hvf_cpu_synchronize_post_reset(CPUState *);
>>> >> -void hvf_cpu_synchronize_post_init(CPUState *);
>>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
>>> >> -void hvf_vcpu_destroy(CPUState *);
>>> >> -
>>> >> -#endif /* HVF_CPUS_H */
>>> >> diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
>>> >> index e0edffd077..6d56f8f6bb 100644
>>> >> --- a/target/i386/hvf/hvf-i386.h
>>> >> +++ b/target/i386/hvf/hvf-i386.h
>>> >> @@ -18,57 +18,11 @@
>>> >>
>>> >>   #include "sysemu/accel.h"
>>> >>   #include "sysemu/hvf.h"
>>> >> +#include "sysemu/hvf_int.h"
>>> >>   #include "cpu.h"
>>> >>   #include "x86.h"
>>> >>
>>> >> -#define HVF_MAX_VCPU 0x10
>>> >> -
>>> >> -extern struct hvf_state hvf_global;
>>> >> -
>>> >> -struct hvf_vm {
>>> >> -    int id;
>>> >> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>>> >> -};
>>> >> -
>>> >> -struct hvf_state {
>>> >> -    uint32_t version;
>>> >> -    struct hvf_vm *vm;
>>> >> -    uint64_t mem_quota;
>>> >> -};
>>> >> -
>>> >> -/* hvf_slot flags */
>>> >> -#define HVF_SLOT_LOG (1 << 0)
>>> >> -
>>> >> -typedef struct hvf_slot {
>>> >> -    uint64_t start;
>>> >> -    uint64_t size;
>>> >> -    uint8_t *mem;
>>> >> -    int slot_id;
>>> >> -    uint32_t flags;
>>> >> -    MemoryRegion *region;
>>> >> -} hvf_slot;
>>> >> -
>>> >> -typedef struct hvf_vcpu_caps {
>>> >> -    uint64_t vmx_cap_pinbased;
>>> >> -    uint64_t vmx_cap_procbased;
>>> >> -    uint64_t vmx_cap_procbased2;
>>> >> -    uint64_t vmx_cap_entry;
>>> >> -    uint64_t vmx_cap_exit;
>>> >> -    uint64_t vmx_cap_preemption_timer;
>>> >> -} hvf_vcpu_caps;
>>> >> -
>>> >> -struct HVFState {
>>> >> -    AccelState parent;
>>> >> -    hvf_slot slots[32];
>>> >> -    int num_slots;
>>> >> -
>>> >> -    hvf_vcpu_caps *hvf_caps;
>>> >> -};
>>> >> -extern HVFState *hvf_state;
>>> >> -
>>> >> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
>>> >>   void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
>>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>>> >>
>>> >>   #ifdef NEED_CPU_H
>>> >>   /* Functions exported to host specific mode */
>>> >> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
>>> >> index ed9356565c..8b96ecd619 100644
>>> >> --- a/target/i386/hvf/hvf.c
>>> >> +++ b/target/i386/hvf/hvf.c
>>> >> @@ -51,6 +51,7 @@
>>> >>   #include "qemu/error-report.h"
>>> >>
>>> >>   #include "sysemu/hvf.h"
>>> >> +#include "sysemu/hvf_int.h"
>>> >>   #include "sysemu/runstate.h"
>>> >>   #include "hvf-i386.h"
>>> >>   #include "vmcs.h"
>>> >> @@ -72,171 +73,6 @@
>>> >>   #include "sysemu/accel.h"
>>> >>   #include "target/i386/cpu.h"
>>> >>
>>> >> -#include "hvf-cpus.h"
>>> >> -
>>> >> -HVFState *hvf_state;
>>> >> -
>>> >> -static void assert_hvf_ok(hv_return_t ret)
>>> >> -{
>>> >> -    if (ret == HV_SUCCESS) {
>>> >> -        return;
>>> >> -    }
>>> >> -
>>> >> -    switch (ret) {
>>> >> -    case HV_ERROR:
>>> >> -        error_report("Error: HV_ERROR");
>>> >> -        break;
>>> >> -    case HV_BUSY:
>>> >> -        error_report("Error: HV_BUSY");
>>> >> -        break;
>>> >> -    case HV_BAD_ARGUMENT:
>>> >> -        error_report("Error: HV_BAD_ARGUMENT");
>>> >> -        break;
>>> >> -    case HV_NO_RESOURCES:
>>> >> -        error_report("Error: HV_NO_RESOURCES");
>>> >> -        break;
>>> >> -    case HV_NO_DEVICE:
>>> >> -        error_report("Error: HV_NO_DEVICE");
>>> >> -        break;
>>> >> -    case HV_UNSUPPORTED:
>>> >> -        error_report("Error: HV_UNSUPPORTED");
>>> >> -        break;
>>> >> -    default:
>>> >> -        error_report("Unknown Error");
>>> >> -    }
>>> >> -
>>> >> -    abort();
>>> >> -}
>>> >> -
>>> >> -/* Memory slots */
>>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>>> >> -{
>>> >> -    hvf_slot *slot;
>>> >> -    int x;
>>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>>> >> -        slot = &hvf_state->slots[x];
>>> >> -        if (slot->size && start < (slot->start + slot->size) &&
>>> >> -            (start + size) > slot->start) {
>>> >> -            return slot;
>>> >> -        }
>>> >> -    }
>>> >> -    return NULL;
>>> >> -}
>>> >> -
>>> >> -struct mac_slot {
>>> >> -    int present;
>>> >> -    uint64_t size;
>>> >> -    uint64_t gpa_start;
>>> >> -    uint64_t gva;
>>> >> -};
>>> >> -
>>> >> -struct mac_slot mac_slots[32];
>>> >> -
>>> >> -static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>>> >> -{
>>> >> -    struct mac_slot *macslot;
>>> >> -    hv_return_t ret;
>>> >> -
>>> >> -    macslot = &mac_slots[slot->slot_id];
>>> >> -
>>> >> -    if (macslot->present) {
>>> >> -        if (macslot->size != slot->size) {
>>> >> -            macslot->present = 0;
>>> >> -            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>>> >> -            assert_hvf_ok(ret);
>>> >> -        }
>>> >> -    }
>>> >> -
>>> >> -    if (!slot->size) {
>>> >> -        return 0;
>>> >> -    }
>>> >> -
>>> >> -    macslot->present = 1;
>>> >> -    macslot->gpa_start = slot->start;
>>> >> -    macslot->size = slot->size;
>>> >> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size,
>>> flags);
>>> >> -    assert_hvf_ok(ret);
>>> >> -    return 0;
>>> >> -}
>>> >> -
>>> >> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>>> >> -{
>>> >> -    hvf_slot *mem;
>>> >> -    MemoryRegion *area = section->mr;
>>> >> -    bool writeable = !area->readonly && !area->rom_device;
>>> >> -    hv_memory_flags_t flags;
>>> >> -
>>> >> -    if (!memory_region_is_ram(area)) {
>>> >> -        if (writeable) {
>>> >> -            return;
>>> >> -        } else if (!memory_region_is_romd(area)) {
>>> >> -            /*
>>> >> -             * If the memory device is not in romd_mode, then we
>>> actually want
>>> >> -             * to remove the hvf memory slot so all accesses will
>>> trap.
>>> >> -             */
>>> >> -             add = false;
>>> >> -        }
>>> >> -    }
>>> >> -
>>> >> -    mem = hvf_find_overlap_slot(
>>> >> -            section->offset_within_address_space,
>>> >> -            int128_get64(section->size));
>>> >> -
>>> >> -    if (mem && add) {
>>> >> -        if (mem->size == int128_get64(section->size) &&
>>> >> -            mem->start == section->offset_within_address_space &&
>>> >> -            mem->mem == (memory_region_get_ram_ptr(area) +
>>> >> -            section->offset_within_region)) {
>>> >> -            return; /* Same region was attempted to register, go
>>> away. */
>>> >> -        }
>>> >> -    }
>>> >> -
>>> >> -    /* Region needs to be reset. set the size to 0 and remap it. */
>>> >> -    if (mem) {
>>> >> -        mem->size = 0;
>>> >> -        if (do_hvf_set_memory(mem, 0)) {
>>> >> -            error_report("Failed to reset overlapping slot");
>>> >> -            abort();
>>> >> -        }
>>> >> -    }
>>> >> -
>>> >> -    if (!add) {
>>> >> -        return;
>>> >> -    }
>>> >> -
>>> >> -    if (area->readonly ||
>>> >> -        (!memory_region_is_ram(area) &&
>>> memory_region_is_romd(area))) {
>>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>>> >> -    } else {
>>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>>> >> -    }
>>> >> -
>>> >> -    /* Now make a new slot. */
>>> >> -    int x;
>>> >> -
>>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>>> >> -        mem = &hvf_state->slots[x];
>>> >> -        if (!mem->size) {
>>> >> -            break;
>>> >> -        }
>>> >> -    }
>>> >> -
>>> >> -    if (x == hvf_state->num_slots) {
>>> >> -        error_report("No free slots");
>>> >> -        abort();
>>> >> -    }
>>> >> -
>>> >> -    mem->size = int128_get64(section->size);
>>> >> -    mem->mem = memory_region_get_ram_ptr(area) +
>>> section->offset_within_region;
>>> >> -    mem->start = section->offset_within_address_space;
>>> >> -    mem->region = area;
>>> >> -
>>> >> -    if (do_hvf_set_memory(mem, flags)) {
>>> >> -        error_report("Error registering new memory slot");
>>> >> -        abort();
>>> >> -    }
>>> >> -}
>>> >> -
>>> >>   void vmx_update_tpr(CPUState *cpu)
>>> >>   {
>>> >>       /* TODO: need integrate APIC handling */
>>> >> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t
>>> port, void *buffer,
>>> >>       }
>>> >>   }
>>> >>
>>> >> -static void do_hvf_cpu_synchronize_state(CPUState *cpu,
>>> run_on_cpu_data arg)
>>> >> -{
>>> >> -    if (!cpu->vcpu_dirty) {
>>> >> -        hvf_get_registers(cpu);
>>> >> -        cpu->vcpu_dirty = true;
>>> >> -    }
>>> >> -}
>>> >> -
>>> >> -void hvf_cpu_synchronize_state(CPUState *cpu)
>>> >> -{
>>> >> -    if (!cpu->vcpu_dirty) {
>>> >> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state,
>>> RUN_ON_CPU_NULL);
>>> >> -    }
>>> >> -}
>>> >> -
>>> >> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>>> >> -                                              run_on_cpu_data arg)
>>> >> -{
>>> >> -    hvf_put_registers(cpu);
>>> >> -    cpu->vcpu_dirty = false;
>>> >> -}
>>> >> -
>>> >> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>>> >> -{
>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset,
>>> RUN_ON_CPU_NULL);
>>> >> -}
>>> >> -
>>> >> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>>> >> -                                             run_on_cpu_data arg)
>>> >> -{
>>> >> -    hvf_put_registers(cpu);
>>> >> -    cpu->vcpu_dirty = false;
>>> >> -}
>>> >> -
>>> >> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
>>> >> -{
>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init,
>>> RUN_ON_CPU_NULL);
>>> >> -}
>>> >> -
>>> >> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>>> >> -                                              run_on_cpu_data arg)
>>> >> -{
>>> >> -    cpu->vcpu_dirty = true;
>>> >> -}
>>> >> -
>>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>>> >> -{
>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm,
>>> RUN_ON_CPU_NULL);
>>> >> -}
>>> >> -
>>> >>   static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa,
>>> uint64_t ept_qual)
>>> >>   {
>>> >>       int read, write;
>>> >> @@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot
>>> *slot, uint64_t gpa, uint64_t ept_qual)
>>> >>       return false;
>>> >>   }
>>> >>
>>> >> -static void hvf_set_dirty_tracking(MemoryRegionSection *section,
>>> bool on)
>>> >> -{
>>> >> -    hvf_slot *slot;
>>> >> -
>>> >> -    slot = hvf_find_overlap_slot(
>>> >> -            section->offset_within_address_space,
>>> >> -            int128_get64(section->size));
>>> >> -
>>> >> -    /* protect region against writes; begin tracking it */
>>> >> -    if (on) {
>>> >> -        slot->flags |= HVF_SLOT_LOG;
>>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>>> >> -                      HV_MEMORY_READ);
>>> >> -    /* stop tracking region*/
>>> >> -    } else {
>>> >> -        slot->flags &= ~HVF_SLOT_LOG;
>>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>>> >> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>>> >> -    }
>>> >> -}
>>> >> -
>>> >> -static void hvf_log_start(MemoryListener *listener,
>>> >> -                          MemoryRegionSection *section, int old, int
>>> new)
>>> >> -{
>>> >> -    if (old != 0) {
>>> >> -        return;
>>> >> -    }
>>> >> -
>>> >> -    hvf_set_dirty_tracking(section, 1);
>>> >> -}
>>> >> -
>>> >> -static void hvf_log_stop(MemoryListener *listener,
>>> >> -                         MemoryRegionSection *section, int old, int
>>> new)
>>> >> -{
>>> >> -    if (new != 0) {
>>> >> -        return;
>>> >> -    }
>>> >> -
>>> >> -    hvf_set_dirty_tracking(section, 0);
>>> >> -}
>>> >> -
>>> >> -static void hvf_log_sync(MemoryListener *listener,
>>> >> -                         MemoryRegionSection *section)
>>> >> -{
>>> >> -    /*
>>> >> -     * sync of dirty pages is handled elsewhere; just make sure we
>>> keep
>>> >> -     * tracking the region.
>>> >> -     */
>>> >> -    hvf_set_dirty_tracking(section, 1);
>>> >> -}
>>> >> -
>>> >> -static void hvf_region_add(MemoryListener *listener,
>>> >> -                           MemoryRegionSection *section)
>>> >> -{
>>> >> -    hvf_set_phys_mem(section, true);
>>> >> -}
>>> >> -
>>> >> -static void hvf_region_del(MemoryListener *listener,
>>> >> -                           MemoryRegionSection *section)
>>> >> -{
>>> >> -    hvf_set_phys_mem(section, false);
>>> >> -}
>>> >> -
>>> >> -static MemoryListener hvf_memory_listener = {
>>> >> -    .priority = 10,
>>> >> -    .region_add = hvf_region_add,
>>> >> -    .region_del = hvf_region_del,
>>> >> -    .log_start = hvf_log_start,
>>> >> -    .log_stop = hvf_log_stop,
>>> >> -    .log_sync = hvf_log_sync,
>>> >> -};
>>> >> -
>>> >> -void hvf_vcpu_destroy(CPUState *cpu)
>>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu)
>>> >>   {
>>> >>       X86CPU *x86_cpu = X86_CPU(cpu);
>>> >>       CPUX86State *env = &x86_cpu->env;
>>> >>
>>> >> -    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
>>> >>       g_free(env->hvf_mmio_buf);
>>> >> -    assert_hvf_ok(ret);
>>> >> -}
>>> >> -
>>> >> -static void dummy_signal(int sig)
>>> >> -{
>>> >>   }
>>> >>
>>> >> -int hvf_init_vcpu(CPUState *cpu)
>>> >> +int hvf_arch_init_vcpu(CPUState *cpu)
>>> >>   {
>>> >>
>>> >>       X86CPU *x86cpu = X86_CPU(cpu);
>>> >>       CPUX86State *env = &x86cpu->env;
>>> >> -    int r;
>>> >> -
>>> >> -    /* init cpu signals */
>>> >> -    sigset_t set;
>>> >> -    struct sigaction sigact;
>>> >> -
>>> >> -    memset(&sigact, 0, sizeof(sigact));
>>> >> -    sigact.sa_handler = dummy_signal;
>>> >> -    sigaction(SIG_IPI, &sigact, NULL);
>>> >> -
>>> >> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>> >> -    sigdelset(&set, SIG_IPI);
>>> >>
>>> >>       init_emu();
>>> >>       init_decoder();
>>> >> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
>>> >>       hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
>>> >>       env->hvf_mmio_buf = g_new(char, 4096);
>>> >>
>>> >> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>>> >> -    cpu->vcpu_dirty = 1;
>>> >> -    assert_hvf_ok(r);
>>> >> -
>>> >>       if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
>>> >>           &hvf_state->hvf_caps->vmx_cap_pinbased)) {
>>> >>           abort();
>>> >> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
>>> >>
>>> >>       return ret;
>>> >>   }
>>> >> -
>>> >> -bool hvf_allowed;
>>> >> -
>>> >> -static int hvf_accel_init(MachineState *ms)
>>> >> -{
>>> >> -    int x;
>>> >> -    hv_return_t ret;
>>> >> -    HVFState *s;
>>> >> -
>>> >> -    ret = hv_vm_create(HV_VM_DEFAULT);
>>> >> -    assert_hvf_ok(ret);
>>> >> -
>>> >> -    s = g_new0(HVFState, 1);
>>> >> -
>>> >> -    s->num_slots = 32;
>>> >> -    for (x = 0; x < s->num_slots; ++x) {
>>> >> -        s->slots[x].size = 0;
>>> >> -        s->slots[x].slot_id = x;
>>> >> -    }
>>> >> -
>>> >> -    hvf_state = s;
>>> >> -    memory_listener_register(&hvf_memory_listener,
>>> &address_space_memory);
>>> >> -    cpus_register_accel(&hvf_cpus);
>>> >> -    return 0;
>>> >> -}
>>> >> -
>>> >> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
>>> >> -{
>>> >> -    AccelClass *ac = ACCEL_CLASS(oc);
>>> >> -    ac->name = "HVF";
>>> >> -    ac->init_machine = hvf_accel_init;
>>> >> -    ac->allowed = &hvf_allowed;
>>> >> -}
>>> >> -
>>> >> -static const TypeInfo hvf_accel_type = {
>>> >> -    .name = TYPE_HVF_ACCEL,
>>> >> -    .parent = TYPE_ACCEL,
>>> >> -    .class_init = hvf_accel_class_init,
>>> >> -};
>>> >> -
>>> >> -static void hvf_type_init(void)
>>> >> -{
>>> >> -    type_register_static(&hvf_accel_type);
>>> >> -}
>>> >> -
>>> >> -type_init(hvf_type_init);
>>> >> diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
>>> >> index 409c9a3f14..c8a43717ee 100644
>>> >> --- a/target/i386/hvf/meson.build
>>> >> +++ b/target/i386/hvf/meson.build
>>> >> @@ -1,6 +1,5 @@
>>> >>   i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
>>> >>     'hvf.c',
>>> >> -  'hvf-cpus.c',
>>> >>     'x86.c',
>>> >>     'x86_cpuid.c',
>>> >>     'x86_decode.c',
>>> >> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
>>> >> index bbec412b6c..89b8e9d87a 100644
>>> >> --- a/target/i386/hvf/x86hvf.c
>>> >> +++ b/target/i386/hvf/x86hvf.c
>>> >> @@ -20,6 +20,9 @@
>>> >>   #include "qemu/osdep.h"
>>> >>
>>> >>   #include "qemu-common.h"
>>> >> +#include "sysemu/hvf.h"
>>> >> +#include "sysemu/hvf_int.h"
>>> >> +#include "sysemu/hw_accel.h"
>>> >>   #include "x86hvf.h"
>>> >>   #include "vmx.h"
>>> >>   #include "vmcs.h"
>>> >> @@ -32,8 +35,6 @@
>>> >>   #include <Hypervisor/hv.h>
>>> >>   #include <Hypervisor/hv_vmx.h>
>>> >>
>>> >> -#include "hvf-cpus.h"
>>> >> -
>>> >>   void hvf_set_segment(struct CPUState *cpu, struct vmx_segment
>>> *vmx_seg,
>>> >>                        SegmentCache *qseg, bool is_tr)
>>> >>   {
>>> >> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
>>> >>       env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
>>> >>
>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>> >> +        cpu_synchronize_state(cpu_state);
>>> >>           do_cpu_init(cpu);
>>> >>       }
>>> >>
>>> >> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
>>> >>           cpu_state->halted = 0;
>>> >>       }
>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>> >> +        cpu_synchronize_state(cpu_state);
>>> >>           do_cpu_sipi(cpu);
>>> >>       }
>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
>>> >>           cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>> >> +        cpu_synchronize_state(cpu_state);
>>> > The changes from hvf_cpu_*() to cpu_*() are cleanup and perhaps should
>>> > be a separate patch. It follows cpu/accel cleanups Claudio was doing
>>> the
>>> > summer.
>>>
>>>
>>> The only reason they're in here is because we no longer have access to
>>> the hvf_ functions from the file. I am perfectly happy to rebase the
>>> patch on top of Claudio's if his goes in first. I'm sure it'll be
>>> trivial for him to rebase on top of this too if my series goes in first.
>>>
>>>
>>> >
>>> > Phillipe raised the idea that the patch might go ahead of ARM-specific
>>> > part (which might involve some discussions) and I agree with that.
>>> >
>>> > Some sync between Claudio series (CC'd him) and the patch might be
>>> need.
>>>
>>>
>>> I would prefer not to hold back because of the sync. Claudio's cleanup
>>> is trivial enough to adjust for if it gets merged ahead of this.
>>>
>>>
>>> Alex
>>>
>>>
>>>
>>>

Re: [PATCH 2/8] hvf: Move common code out

Posted by Peter Collingbourne 4 years, 3 months ago

On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
>
>
>
> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
>>
>> Hi Frank,
>>
>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
>
> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!
>>
>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
>>
>>   https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
>>
>
> Thanks, we'll take a look :)
>
>>
>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.

Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
I'll reply to your patch here. You have:

+                    /* Set cpu->hvf->sleeping so that we get a
SIG_IPI signal. */
+                    cpu->hvf->sleeping = true;
+                    smp_mb();
+
+                    /* Bail out if we received an IRQ meanwhile */
+                    if (cpu->thread_kicked || (cpu->interrupt_request &
+                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
+                        cpu->hvf->sleeping = false;
+                        break;
+                    }
+
+                    /* nanosleep returns on signal, so we wake up on kick. */
+                    nanosleep(ts, NULL);

and then send the signal conditional on whether sleeping is true, but
I think this is racy. If the signal is sent after sleeping is set to
true but before entering nanosleep then I think it will be ignored and
we will miss the wakeup. That's why in my implementation I block IPI
on the CPU thread at startup and then use pselect to atomically
unblock and begin sleeping. The signal is sent unconditionally so
there's no need to worry about races between actually sleeping and the
"we think we're sleeping" state. It may lead to an extra wakeup but
that's better than missing it entirely.

Peter

>>
>> Also, is there a particular reason you're working on this super interesting and useful code in a random downstream fork of QEMU? Wouldn't it be more helpful to contribute to the upstream code base instead?
>
> We'd actually like to contribute upstream too :) We do want to maintain our own downstream though; Android Emulator codebase needs to work solidly on macos and windows which has made keeping up with upstream difficult, and staying on a previous version (2.12) with known quirks easier. (theres also some android related customization relating to Qt Ui + different set of virtual devices and snapshot support (incl. snapshots of graphics devices with OpenGLES state tracking), which we hope to separate into other libraries/processes, but its not insignificant)
>>
>>
>> Alex
>>
>>
>> On 30.11.20 21:15, Frank Yang wrote:
>>
>> Update: We're not quite sure how to compare the CNTV_CVAL and CNTVCT. But the high CPU usage seems to be mitigated by having a poll interval (like KVM does) in handling WFI:
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512501
>>
>> This is loosely inspired by https://elixir.bootlin.com/linux/v5.10-rc6/source/virt/kvm/kvm_main.c#L2766 which does seem to specify a poll interval.
>>
>> It would be cool if we could have a lightweight way to enter sleep and restart the vcpus precisely when CVAL passes, though.
>>
>> Frank
>>
>>
>> On Fri, Nov 27, 2020 at 3:30 PM Frank Yang <lfy@google.com> wrote:
>>>
>>> Hi all,
>>>
>>> +Peter Collingbourne
>>>
>>> I'm a developer on the Android Emulator, which is in a fork of QEMU.
>>>
>>> Peter and I have been working on an HVF Apple Silicon backend with an eye toward Android guests.
>>>
>>> We have gotten things to basically switch to Android userspace already (logcat/shell and graphics available at least)
>>>
>>> Our strategy so far has been to import logic from the KVM implementation and hook into QEMU's software devices that previously assumed to only work with TCG, or have KVM-specific paths.
>>>
>>> Thanks to Alexander for the tip on the 36-bit address space limitation btw; our way of addressing this is to still allow highmem but not put pci high mmio so high.
>>>
>>> Also, note we have a sleep/signal based mechanism to deal with WFx, which might be worth looking into in Alexander's implementation as well:
>>>
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551
>>>
>>> Patches so far, FYI:
>>>
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1513429/1
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512554/3
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512553/3
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512552/3
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551/3
>>>
>>> https://android.googlesource.com/platform/external/qemu/+/c17eb6a3ffd50047e9646aff6640b710cb8ff48a
>>> https://android.googlesource.com/platform/external/qemu/+/74bed16de1afb41b7a7ab8da1d1861226c9db63b
>>> https://android.googlesource.com/platform/external/qemu/+/eccd9e47ab2ccb9003455e3bb721f57f9ebc3c01
>>> https://android.googlesource.com/platform/external/qemu/+/54fe3d67ed4698e85826537a4f49b2b9074b2228
>>> https://android.googlesource.com/platform/external/qemu/+/82ef91a6fede1d1000f36be037ad4d58fbe0d102
>>> https://android.googlesource.com/platform/external/qemu/+/c28147aa7c74d98b858e99623d2fe46e74a379f6
>>>
>>> Peter's also noticed that there are extra steps needed for M1's to allow TCG to work, as it involves JIT:
>>>
>>> https://android.googlesource.com/platform/external/qemu/+/740e3fe47f88926c6bda9abb22ee6eae1bc254a9
>>>
>>> We'd appreciate any feedback/comments :)
>>>
>>> Best,
>>>
>>> Frank
>>>
>>> On Fri, Nov 27, 2020 at 1:57 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>>
>>>>
>>>> On 27.11.20 21:00, Roman Bolshakov wrote:
>>>> > On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
>>>> >> Until now, Hypervisor.framework has only been available on x86_64 systems.
>>>> >> With Apple Silicon shipping now, it extends its reach to aarch64. To
>>>> >> prepare for support for multiple architectures, let's move common code out
>>>> >> into its own accel directory.
>>>> >>
>>>> >> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>>>> >> ---
>>>> >>   MAINTAINERS                 |   9 +-
>>>> >>   accel/hvf/hvf-all.c         |  56 +++++
>>>> >>   accel/hvf/hvf-cpus.c        | 468 ++++++++++++++++++++++++++++++++++++
>>>> >>   accel/hvf/meson.build       |   7 +
>>>> >>   accel/meson.build           |   1 +
>>>> >>   include/sysemu/hvf_int.h    |  69 ++++++
>>>> >>   target/i386/hvf/hvf-cpus.c  | 131 ----------
>>>> >>   target/i386/hvf/hvf-cpus.h  |  25 --
>>>> >>   target/i386/hvf/hvf-i386.h  |  48 +---
>>>> >>   target/i386/hvf/hvf.c       | 360 +--------------------------
>>>> >>   target/i386/hvf/meson.build |   1 -
>>>> >>   target/i386/hvf/x86hvf.c    |  11 +-
>>>> >>   target/i386/hvf/x86hvf.h    |   2 -
>>>> >>   13 files changed, 619 insertions(+), 569 deletions(-)
>>>> >>   create mode 100644 accel/hvf/hvf-all.c
>>>> >>   create mode 100644 accel/hvf/hvf-cpus.c
>>>> >>   create mode 100644 accel/hvf/meson.build
>>>> >>   create mode 100644 include/sysemu/hvf_int.h
>>>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.c
>>>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.h
>>>> >>
>>>> >> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> >> index 68bc160f41..ca4b6d9279 100644
>>>> >> --- a/MAINTAINERS
>>>> >> +++ b/MAINTAINERS
>>>> >> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
>>>> >>   M: Roman Bolshakov <r.bolshakov@yadro.com>
>>>> >>   W: https://wiki.qemu.org/Features/HVF
>>>> >>   S: Maintained
>>>> >> -F: accel/stubs/hvf-stub.c
>>>> > There was a patch for that in the RFC series from Claudio.
>>>>
>>>>
>>>> Yeah, I'm not worried about this hunk :).
>>>>
>>>>
>>>> >
>>>> >>   F: target/i386/hvf/
>>>> >> +
>>>> >> +HVF
>>>> >> +M: Cameron Esfahani <dirty@apple.com>
>>>> >> +M: Roman Bolshakov <r.bolshakov@yadro.com>
>>>> >> +W: https://wiki.qemu.org/Features/HVF
>>>> >> +S: Maintained
>>>> >> +F: accel/hvf/
>>>> >>   F: include/sysemu/hvf.h
>>>> >> +F: include/sysemu/hvf_int.h
>>>> >>
>>>> >>   WHPX CPUs
>>>> >>   M: Sunil Muthuswamy <sunilmut@microsoft.com>
>>>> >> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
>>>> >> new file mode 100644
>>>> >> index 0000000000..47d77a472a
>>>> >> --- /dev/null
>>>> >> +++ b/accel/hvf/hvf-all.c
>>>> >> @@ -0,0 +1,56 @@
>>>> >> +/*
>>>> >> + * QEMU Hypervisor.framework support
>>>> >> + *
>>>> >> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>>>> >> + * the COPYING file in the top-level directory.
>>>> >> + *
>>>> >> + * Contributions after 2012-01-13 are licensed under the terms of the
>>>> >> + * GNU GPL, version 2 or (at your option) any later version.
>>>> >> + */
>>>> >> +
>>>> >> +#include "qemu/osdep.h"
>>>> >> +#include "qemu-common.h"
>>>> >> +#include "qemu/error-report.h"
>>>> >> +#include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >> +#include "sysemu/runstate.h"
>>>> >> +
>>>> >> +#include "qemu/main-loop.h"
>>>> >> +#include "sysemu/accel.h"
>>>> >> +
>>>> >> +#include <Hypervisor/Hypervisor.h>
>>>> >> +
>>>> >> +bool hvf_allowed;
>>>> >> +HVFState *hvf_state;
>>>> >> +
>>>> >> +void assert_hvf_ok(hv_return_t ret)
>>>> >> +{
>>>> >> +    if (ret == HV_SUCCESS) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    switch (ret) {
>>>> >> +    case HV_ERROR:
>>>> >> +        error_report("Error: HV_ERROR");
>>>> >> +        break;
>>>> >> +    case HV_BUSY:
>>>> >> +        error_report("Error: HV_BUSY");
>>>> >> +        break;
>>>> >> +    case HV_BAD_ARGUMENT:
>>>> >> +        error_report("Error: HV_BAD_ARGUMENT");
>>>> >> +        break;
>>>> >> +    case HV_NO_RESOURCES:
>>>> >> +        error_report("Error: HV_NO_RESOURCES");
>>>> >> +        break;
>>>> >> +    case HV_NO_DEVICE:
>>>> >> +        error_report("Error: HV_NO_DEVICE");
>>>> >> +        break;
>>>> >> +    case HV_UNSUPPORTED:
>>>> >> +        error_report("Error: HV_UNSUPPORTED");
>>>> >> +        break;
>>>> >> +    default:
>>>> >> +        error_report("Unknown Error");
>>>> >> +    }
>>>> >> +
>>>> >> +    abort();
>>>> >> +}
>>>> >> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>>>> >> new file mode 100644
>>>> >> index 0000000000..f9bb5502b7
>>>> >> --- /dev/null
>>>> >> +++ b/accel/hvf/hvf-cpus.c
>>>> >> @@ -0,0 +1,468 @@
>>>> >> +/*
>>>> >> + * Copyright 2008 IBM Corporation
>>>> >> + *           2008 Red Hat, Inc.
>>>> >> + * Copyright 2011 Intel Corporation
>>>> >> + * Copyright 2016 Veertu, Inc.
>>>> >> + * Copyright 2017 The Android Open Source Project
>>>> >> + *
>>>> >> + * QEMU Hypervisor.framework support
>>>> >> + *
>>>> >> + * This program is free software; you can redistribute it and/or
>>>> >> + * modify it under the terms of version 2 of the GNU General Public
>>>> >> + * License as published by the Free Software Foundation.
>>>> >> + *
>>>> >> + * This program is distributed in the hope that it will be useful,
>>>> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>>> >> + * General Public License for more details.
>>>> >> + *
>>>> >> + * You should have received a copy of the GNU General Public License
>>>> >> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
>>>> >> + *
>>>> >> + * This file contain code under public domain from the hvdos project:
>>>> >> + * https://github.com/mist64/hvdos
>>>> >> + *
>>>> >> + * Parts Copyright (c) 2011 NetApp, Inc.
>>>> >> + * All rights reserved.
>>>> >> + *
>>>> >> + * Redistribution and use in source and binary forms, with or without
>>>> >> + * modification, are permitted provided that the following conditions
>>>> >> + * are met:
>>>> >> + * 1. Redistributions of source code must retain the above copyright
>>>> >> + *    notice, this list of conditions and the following disclaimer.
>>>> >> + * 2. Redistributions in binary form must reproduce the above copyright
>>>> >> + *    notice, this list of conditions and the following disclaimer in the
>>>> >> + *    documentation and/or other materials provided with the distribution.
>>>> >> + *
>>>> >> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>>>> >> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>>>> >> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>>>> >> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
>>>> >> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>>>> >> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
>>>> >> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>>>> >> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
>>>> >> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
>>>> >> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>>>> >> + * SUCH DAMAGE.
>>>> >> + */
>>>> >> +
>>>> >> +#include "qemu/osdep.h"
>>>> >> +#include "qemu/error-report.h"
>>>> >> +#include "qemu/main-loop.h"
>>>> >> +#include "exec/address-spaces.h"
>>>> >> +#include "exec/exec-all.h"
>>>> >> +#include "sysemu/cpus.h"
>>>> >> +#include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >> +#include "sysemu/runstate.h"
>>>> >> +#include "qemu/guest-random.h"
>>>> >> +
>>>> >> +#include <Hypervisor/Hypervisor.h>
>>>> >> +
>>>> >> +/* Memory slots */
>>>> >> +
>>>> >> +struct mac_slot {
>>>> >> +    int present;
>>>> >> +    uint64_t size;
>>>> >> +    uint64_t gpa_start;
>>>> >> +    uint64_t gva;
>>>> >> +};
>>>> >> +
>>>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>>>> >> +{
>>>> >> +    hvf_slot *slot;
>>>> >> +    int x;
>>>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> +        slot = &hvf_state->slots[x];
>>>> >> +        if (slot->size && start < (slot->start + slot->size) &&
>>>> >> +            (start + size) > slot->start) {
>>>> >> +            return slot;
>>>> >> +        }
>>>> >> +    }
>>>> >> +    return NULL;
>>>> >> +}
>>>> >> +
>>>> >> +struct mac_slot mac_slots[32];
>>>> >> +
>>>> >> +static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>>>> >> +{
>>>> >> +    struct mac_slot *macslot;
>>>> >> +    hv_return_t ret;
>>>> >> +
>>>> >> +    macslot = &mac_slots[slot->slot_id];
>>>> >> +
>>>> >> +    if (macslot->present) {
>>>> >> +        if (macslot->size != slot->size) {
>>>> >> +            macslot->present = 0;
>>>> >> +            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>>>> >> +            assert_hvf_ok(ret);
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    if (!slot->size) {
>>>> >> +        return 0;
>>>> >> +    }
>>>> >> +
>>>> >> +    macslot->present = 1;
>>>> >> +    macslot->gpa_start = slot->start;
>>>> >> +    macslot->size = slot->size;
>>>> >> +    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
>>>> >> +    assert_hvf_ok(ret);
>>>> >> +    return 0;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>>>> >> +{
>>>> >> +    hvf_slot *mem;
>>>> >> +    MemoryRegion *area = section->mr;
>>>> >> +    bool writeable = !area->readonly && !area->rom_device;
>>>> >> +    hv_memory_flags_t flags;
>>>> >> +
>>>> >> +    if (!memory_region_is_ram(area)) {
>>>> >> +        if (writeable) {
>>>> >> +            return;
>>>> >> +        } else if (!memory_region_is_romd(area)) {
>>>> >> +            /*
>>>> >> +             * If the memory device is not in romd_mode, then we actually want
>>>> >> +             * to remove the hvf memory slot so all accesses will trap.
>>>> >> +             */
>>>> >> +             add = false;
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    mem = hvf_find_overlap_slot(
>>>> >> +            section->offset_within_address_space,
>>>> >> +            int128_get64(section->size));
>>>> >> +
>>>> >> +    if (mem && add) {
>>>> >> +        if (mem->size == int128_get64(section->size) &&
>>>> >> +            mem->start == section->offset_within_address_space &&
>>>> >> +            mem->mem == (memory_region_get_ram_ptr(area) +
>>>> >> +            section->offset_within_region)) {
>>>> >> +            return; /* Same region was attempted to register, go away. */
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    /* Region needs to be reset. set the size to 0 and remap it. */
>>>> >> +    if (mem) {
>>>> >> +        mem->size = 0;
>>>> >> +        if (do_hvf_set_memory(mem, 0)) {
>>>> >> +            error_report("Failed to reset overlapping slot");
>>>> >> +            abort();
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    if (!add) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    if (area->readonly ||
>>>> >> +        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
>>>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>>>> >> +    } else {
>>>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>>>> >> +    }
>>>> >> +
>>>> >> +    /* Now make a new slot. */
>>>> >> +    int x;
>>>> >> +
>>>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> +        mem = &hvf_state->slots[x];
>>>> >> +        if (!mem->size) {
>>>> >> +            break;
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    if (x == hvf_state->num_slots) {
>>>> >> +        error_report("No free slots");
>>>> >> +        abort();
>>>> >> +    }
>>>> >> +
>>>> >> +    mem->size = int128_get64(section->size);
>>>> >> +    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
>>>> >> +    mem->start = section->offset_within_address_space;
>>>> >> +    mem->region = area;
>>>> >> +
>>>> >> +    if (do_hvf_set_memory(mem, flags)) {
>>>> >> +        error_report("Error registering new memory slot");
>>>> >> +        abort();
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
>>>> >> +{
>>>> >> +    hvf_slot *slot;
>>>> >> +
>>>> >> +    slot = hvf_find_overlap_slot(
>>>> >> +            section->offset_within_address_space,
>>>> >> +            int128_get64(section->size));
>>>> >> +
>>>> >> +    /* protect region against writes; begin tracking it */
>>>> >> +    if (on) {
>>>> >> +        slot->flags |= HVF_SLOT_LOG;
>>>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>>>> >> +                      HV_MEMORY_READ);
>>>> >> +    /* stop tracking region*/
>>>> >> +    } else {
>>>> >> +        slot->flags &= ~HVF_SLOT_LOG;
>>>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>>>> >> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_log_start(MemoryListener *listener,
>>>> >> +                          MemoryRegionSection *section, int old, int new)
>>>> >> +{
>>>> >> +    if (old != 0) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    hvf_set_dirty_tracking(section, 1);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_log_stop(MemoryListener *listener,
>>>> >> +                         MemoryRegionSection *section, int old, int new)
>>>> >> +{
>>>> >> +    if (new != 0) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    hvf_set_dirty_tracking(section, 0);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_log_sync(MemoryListener *listener,
>>>> >> +                         MemoryRegionSection *section)
>>>> >> +{
>>>> >> +    /*
>>>> >> +     * sync of dirty pages is handled elsewhere; just make sure we keep
>>>> >> +     * tracking the region.
>>>> >> +     */
>>>> >> +    hvf_set_dirty_tracking(section, 1);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_region_add(MemoryListener *listener,
>>>> >> +                           MemoryRegionSection *section)
>>>> >> +{
>>>> >> +    hvf_set_phys_mem(section, true);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_region_del(MemoryListener *listener,
>>>> >> +                           MemoryRegionSection *section)
>>>> >> +{
>>>> >> +    hvf_set_phys_mem(section, false);
>>>> >> +}
>>>> >> +
>>>> >> +static MemoryListener hvf_memory_listener = {
>>>> >> +    .priority = 10,
>>>> >> +    .region_add = hvf_region_add,
>>>> >> +    .region_del = hvf_region_del,
>>>> >> +    .log_start = hvf_log_start,
>>>> >> +    .log_stop = hvf_log_stop,
>>>> >> +    .log_sync = hvf_log_sync,
>>>> >> +};
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    if (!cpu->vcpu_dirty) {
>>>> >> +        hvf_get_registers(cpu);
>>>> >> +        cpu->vcpu_dirty = true;
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_state(CPUState *cpu)
>>>> >> +{
>>>> >> +    if (!cpu->vcpu_dirty) {
>>>> >> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>>>> >> +                                              run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    hvf_put_registers(cpu);
>>>> >> +    cpu->vcpu_dirty = false;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>>>> >> +{
>>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
>>>> >> +}
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>>>> >> +                                             run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    hvf_put_registers(cpu);
>>>> >> +    cpu->vcpu_dirty = false;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
>>>> >> +{
>>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
>>>> >> +}
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>>>> >> +                                              run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    cpu->vcpu_dirty = true;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>>>> >> +{
>>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_vcpu_destroy(CPUState *cpu)
>>>> >> +{
>>>> >> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
>>>> >> +    assert_hvf_ok(ret);
>>>> >> +
>>>> >> +    hvf_arch_vcpu_destroy(cpu);
>>>> >> +}
>>>> >> +
>>>> >> +static void dummy_signal(int sig)
>>>> >> +{
>>>> >> +}
>>>> >> +
>>>> >> +static int hvf_init_vcpu(CPUState *cpu)
>>>> >> +{
>>>> >> +    int r;
>>>> >> +
>>>> >> +    /* init cpu signals */
>>>> >> +    sigset_t set;
>>>> >> +    struct sigaction sigact;
>>>> >> +
>>>> >> +    memset(&sigact, 0, sizeof(sigact));
>>>> >> +    sigact.sa_handler = dummy_signal;
>>>> >> +    sigaction(SIG_IPI, &sigact, NULL);
>>>> >> +
>>>> >> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>>> >> +    sigdelset(&set, SIG_IPI);
>>>> >> +
>>>> >> +#ifdef __aarch64__
>>>> >> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t **)&cpu->hvf_exit, NULL);
>>>> >> +#else
>>>> >> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>>>> >> +#endif
>>>> > I think the first __aarch64__ bit fits better to arm part of the series.
>>>>
>>>>
>>>> Oops. Thanks for catching it! Yes, absolutely. It should be part of the
>>>> ARM enablement.
>>>>
>>>>
>>>> >
>>>> >> +    cpu->vcpu_dirty = 1;
>>>> >> +    assert_hvf_ok(r);
>>>> >> +
>>>> >> +    return hvf_arch_init_vcpu(cpu);
>>>> >> +}
>>>> >> +
>>>> >> +/*
>>>> >> + * The HVF-specific vCPU thread function. This one should only run when the host
>>>> >> + * CPU supports the VMX "unrestricted guest" feature.
>>>> >> + */
>>>> >> +static void *hvf_cpu_thread_fn(void *arg)
>>>> >> +{
>>>> >> +    CPUState *cpu = arg;
>>>> >> +
>>>> >> +    int r;
>>>> >> +
>>>> >> +    assert(hvf_enabled());
>>>> >> +
>>>> >> +    rcu_register_thread();
>>>> >> +
>>>> >> +    qemu_mutex_lock_iothread();
>>>> >> +    qemu_thread_get_self(cpu->thread);
>>>> >> +
>>>> >> +    cpu->thread_id = qemu_get_thread_id();
>>>> >> +    cpu->can_do_io = 1;
>>>> >> +    current_cpu = cpu;
>>>> >> +
>>>> >> +    hvf_init_vcpu(cpu);
>>>> >> +
>>>> >> +    /* signal CPU creation */
>>>> >> +    cpu_thread_signal_created(cpu);
>>>> >> +    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>>>> >> +
>>>> >> +    do {
>>>> >> +        if (cpu_can_run(cpu)) {
>>>> >> +            r = hvf_vcpu_exec(cpu);
>>>> >> +            if (r == EXCP_DEBUG) {
>>>> >> +                cpu_handle_guest_debug(cpu);
>>>> >> +            }
>>>> >> +        }
>>>> >> +        qemu_wait_io_event(cpu);
>>>> >> +    } while (!cpu->unplug || cpu_can_run(cpu));
>>>> >> +
>>>> >> +    hvf_vcpu_destroy(cpu);
>>>> >> +    cpu_thread_signal_destroyed(cpu);
>>>> >> +    qemu_mutex_unlock_iothread();
>>>> >> +    rcu_unregister_thread();
>>>> >> +    return NULL;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_start_vcpu_thread(CPUState *cpu)
>>>> >> +{
>>>> >> +    char thread_name[VCPU_THREAD_NAME_SIZE];
>>>> >> +
>>>> >> +    /*
>>>> >> +     * HVF currently does not support TCG, and only runs in
>>>> >> +     * unrestricted-guest mode.
>>>> >> +     */
>>>> >> +    assert(hvf_enabled());
>>>> >> +
>>>> >> +    cpu->thread = g_malloc0(sizeof(QemuThread));
>>>> >> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>>>> >> +    qemu_cond_init(cpu->halt_cond);
>>>> >> +
>>>> >> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>>>> >> +             cpu->cpu_index);
>>>> >> +    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>>>> >> +                       cpu, QEMU_THREAD_JOINABLE);
>>>> >> +}
>>>> >> +
>>>> >> +static const CpusAccel hvf_cpus = {
>>>> >> +    .create_vcpu_thread = hvf_start_vcpu_thread,
>>>> >> +
>>>> >> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>>>> >> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>>>> >> +    .synchronize_state = hvf_cpu_synchronize_state,
>>>> >> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>>>> >> +};
>>>> >> +
>>>> >> +static int hvf_accel_init(MachineState *ms)
>>>> >> +{
>>>> >> +    int x;
>>>> >> +    hv_return_t ret;
>>>> >> +    HVFState *s;
>>>> >> +
>>>> >> +    ret = hv_vm_create(HV_VM_DEFAULT);
>>>> >> +    assert_hvf_ok(ret);
>>>> >> +
>>>> >> +    s = g_new0(HVFState, 1);
>>>> >> +
>>>> >> +    s->num_slots = 32;
>>>> >> +    for (x = 0; x < s->num_slots; ++x) {
>>>> >> +        s->slots[x].size = 0;
>>>> >> +        s->slots[x].slot_id = x;
>>>> >> +    }
>>>> >> +
>>>> >> +    hvf_state = s;
>>>> >> +    memory_listener_register(&hvf_memory_listener, &address_space_memory);
>>>> >> +    cpus_register_accel(&hvf_cpus);
>>>> >> +    return 0;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
>>>> >> +{
>>>> >> +    AccelClass *ac = ACCEL_CLASS(oc);
>>>> >> +    ac->name = "HVF";
>>>> >> +    ac->init_machine = hvf_accel_init;
>>>> >> +    ac->allowed = &hvf_allowed;
>>>> >> +}
>>>> >> +
>>>> >> +static const TypeInfo hvf_accel_type = {
>>>> >> +    .name = TYPE_HVF_ACCEL,
>>>> >> +    .parent = TYPE_ACCEL,
>>>> >> +    .class_init = hvf_accel_class_init,
>>>> >> +};
>>>> >> +
>>>> >> +static void hvf_type_init(void)
>>>> >> +{
>>>> >> +    type_register_static(&hvf_accel_type);
>>>> >> +}
>>>> >> +
>>>> >> +type_init(hvf_type_init);
>>>> >> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
>>>> >> new file mode 100644
>>>> >> index 0000000000..dfd6b68dc7
>>>> >> --- /dev/null
>>>> >> +++ b/accel/hvf/meson.build
>>>> >> @@ -0,0 +1,7 @@
>>>> >> +hvf_ss = ss.source_set()
>>>> >> +hvf_ss.add(files(
>>>> >> +  'hvf-all.c',
>>>> >> +  'hvf-cpus.c',
>>>> >> +))
>>>> >> +
>>>> >> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
>>>> >> diff --git a/accel/meson.build b/accel/meson.build
>>>> >> index b26cca227a..6de12ce5d5 100644
>>>> >> --- a/accel/meson.build
>>>> >> +++ b/accel/meson.build
>>>> >> @@ -1,5 +1,6 @@
>>>> >>   softmmu_ss.add(files('accel.c'))
>>>> >>
>>>> >> +subdir('hvf')
>>>> >>   subdir('qtest')
>>>> >>   subdir('kvm')
>>>> >>   subdir('tcg')
>>>> >> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>>>> >> new file mode 100644
>>>> >> index 0000000000..de9bad23a8
>>>> >> --- /dev/null
>>>> >> +++ b/include/sysemu/hvf_int.h
>>>> >> @@ -0,0 +1,69 @@
>>>> >> +/*
>>>> >> + * QEMU Hypervisor.framework (HVF) support
>>>> >> + *
>>>> >> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>>> >> + * See the COPYING file in the top-level directory.
>>>> >> + *
>>>> >> + */
>>>> >> +
>>>> >> +/* header to be included in HVF-specific code */
>>>> >> +
>>>> >> +#ifndef HVF_INT_H
>>>> >> +#define HVF_INT_H
>>>> >> +
>>>> >> +#include <Hypervisor/Hypervisor.h>
>>>> >> +
>>>> >> +#define HVF_MAX_VCPU 0x10
>>>> >> +
>>>> >> +extern struct hvf_state hvf_global;
>>>> >> +
>>>> >> +struct hvf_vm {
>>>> >> +    int id;
>>>> >> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>>>> >> +};
>>>> >> +
>>>> >> +struct hvf_state {
>>>> >> +    uint32_t version;
>>>> >> +    struct hvf_vm *vm;
>>>> >> +    uint64_t mem_quota;
>>>> >> +};
>>>> >> +
>>>> >> +/* hvf_slot flags */
>>>> >> +#define HVF_SLOT_LOG (1 << 0)
>>>> >> +
>>>> >> +typedef struct hvf_slot {
>>>> >> +    uint64_t start;
>>>> >> +    uint64_t size;
>>>> >> +    uint8_t *mem;
>>>> >> +    int slot_id;
>>>> >> +    uint32_t flags;
>>>> >> +    MemoryRegion *region;
>>>> >> +} hvf_slot;
>>>> >> +
>>>> >> +typedef struct hvf_vcpu_caps {
>>>> >> +    uint64_t vmx_cap_pinbased;
>>>> >> +    uint64_t vmx_cap_procbased;
>>>> >> +    uint64_t vmx_cap_procbased2;
>>>> >> +    uint64_t vmx_cap_entry;
>>>> >> +    uint64_t vmx_cap_exit;
>>>> >> +    uint64_t vmx_cap_preemption_timer;
>>>> >> +} hvf_vcpu_caps;
>>>> >> +
>>>> >> +struct HVFState {
>>>> >> +    AccelState parent;
>>>> >> +    hvf_slot slots[32];
>>>> >> +    int num_slots;
>>>> >> +
>>>> >> +    hvf_vcpu_caps *hvf_caps;
>>>> >> +};
>>>> >> +extern HVFState *hvf_state;
>>>> >> +
>>>> >> +void assert_hvf_ok(hv_return_t ret);
>>>> >> +int hvf_get_registers(CPUState *cpu);
>>>> >> +int hvf_put_registers(CPUState *cpu);
>>>> >> +int hvf_arch_init_vcpu(CPUState *cpu);
>>>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu);
>>>> >> +int hvf_vcpu_exec(CPUState *cpu);
>>>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>>>> >> +
>>>> >> +#endif
>>>> >> diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
>>>> >> deleted file mode 100644
>>>> >> index 817b3d7452..0000000000
>>>> >> --- a/target/i386/hvf/hvf-cpus.c
>>>> >> +++ /dev/null
>>>> >> @@ -1,131 +0,0 @@
>>>> >> -/*
>>>> >> - * Copyright 2008 IBM Corporation
>>>> >> - *           2008 Red Hat, Inc.
>>>> >> - * Copyright 2011 Intel Corporation
>>>> >> - * Copyright 2016 Veertu, Inc.
>>>> >> - * Copyright 2017 The Android Open Source Project
>>>> >> - *
>>>> >> - * QEMU Hypervisor.framework support
>>>> >> - *
>>>> >> - * This program is free software; you can redistribute it and/or
>>>> >> - * modify it under the terms of version 2 of the GNU General Public
>>>> >> - * License as published by the Free Software Foundation.
>>>> >> - *
>>>> >> - * This program is distributed in the hope that it will be useful,
>>>> >> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> >> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>>> >> - * General Public License for more details.
>>>> >> - *
>>>> >> - * You should have received a copy of the GNU General Public License
>>>> >> - * along with this program; if not, see <http://www.gnu.org/licenses/>.
>>>> >> - *
>>>> >> - * This file contain code under public domain from the hvdos project:
>>>> >> - * https://github.com/mist64/hvdos
>>>> >> - *
>>>> >> - * Parts Copyright (c) 2011 NetApp, Inc.
>>>> >> - * All rights reserved.
>>>> >> - *
>>>> >> - * Redistribution and use in source and binary forms, with or without
>>>> >> - * modification, are permitted provided that the following conditions
>>>> >> - * are met:
>>>> >> - * 1. Redistributions of source code must retain the above copyright
>>>> >> - *    notice, this list of conditions and the following disclaimer.
>>>> >> - * 2. Redistributions in binary form must reproduce the above copyright
>>>> >> - *    notice, this list of conditions and the following disclaimer in the
>>>> >> - *    documentation and/or other materials provided with the distribution.
>>>> >> - *
>>>> >> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>>>> >> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>>>> >> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>>>> >> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
>>>> >> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>>>> >> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
>>>> >> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>>>> >> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
>>>> >> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
>>>> >> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>>>> >> - * SUCH DAMAGE.
>>>> >> - */
>>>> >> -
>>>> >> -#include "qemu/osdep.h"
>>>> >> -#include "qemu/error-report.h"
>>>> >> -#include "qemu/main-loop.h"
>>>> >> -#include "sysemu/hvf.h"
>>>> >> -#include "sysemu/runstate.h"
>>>> >> -#include "target/i386/cpu.h"
>>>> >> -#include "qemu/guest-random.h"
>>>> >> -
>>>> >> -#include "hvf-cpus.h"
>>>> >> -
>>>> >> -/*
>>>> >> - * The HVF-specific vCPU thread function. This one should only run when the host
>>>> >> - * CPU supports the VMX "unrestricted guest" feature.
>>>> >> - */
>>>> >> -static void *hvf_cpu_thread_fn(void *arg)
>>>> >> -{
>>>> >> -    CPUState *cpu = arg;
>>>> >> -
>>>> >> -    int r;
>>>> >> -
>>>> >> -    assert(hvf_enabled());
>>>> >> -
>>>> >> -    rcu_register_thread();
>>>> >> -
>>>> >> -    qemu_mutex_lock_iothread();
>>>> >> -    qemu_thread_get_self(cpu->thread);
>>>> >> -
>>>> >> -    cpu->thread_id = qemu_get_thread_id();
>>>> >> -    cpu->can_do_io = 1;
>>>> >> -    current_cpu = cpu;
>>>> >> -
>>>> >> -    hvf_init_vcpu(cpu);
>>>> >> -
>>>> >> -    /* signal CPU creation */
>>>> >> -    cpu_thread_signal_created(cpu);
>>>> >> -    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>>>> >> -
>>>> >> -    do {
>>>> >> -        if (cpu_can_run(cpu)) {
>>>> >> -            r = hvf_vcpu_exec(cpu);
>>>> >> -            if (r == EXCP_DEBUG) {
>>>> >> -                cpu_handle_guest_debug(cpu);
>>>> >> -            }
>>>> >> -        }
>>>> >> -        qemu_wait_io_event(cpu);
>>>> >> -    } while (!cpu->unplug || cpu_can_run(cpu));
>>>> >> -
>>>> >> -    hvf_vcpu_destroy(cpu);
>>>> >> -    cpu_thread_signal_destroyed(cpu);
>>>> >> -    qemu_mutex_unlock_iothread();
>>>> >> -    rcu_unregister_thread();
>>>> >> -    return NULL;
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_start_vcpu_thread(CPUState *cpu)
>>>> >> -{
>>>> >> -    char thread_name[VCPU_THREAD_NAME_SIZE];
>>>> >> -
>>>> >> -    /*
>>>> >> -     * HVF currently does not support TCG, and only runs in
>>>> >> -     * unrestricted-guest mode.
>>>> >> -     */
>>>> >> -    assert(hvf_enabled());
>>>> >> -
>>>> >> -    cpu->thread = g_malloc0(sizeof(QemuThread));
>>>> >> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>>>> >> -    qemu_cond_init(cpu->halt_cond);
>>>> >> -
>>>> >> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>>>> >> -             cpu->cpu_index);
>>>> >> -    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>>>> >> -                       cpu, QEMU_THREAD_JOINABLE);
>>>> >> -}
>>>> >> -
>>>> >> -const CpusAccel hvf_cpus = {
>>>> >> -    .create_vcpu_thread = hvf_start_vcpu_thread,
>>>> >> -
>>>> >> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>>>> >> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>>>> >> -    .synchronize_state = hvf_cpu_synchronize_state,
>>>> >> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>>>> >> -};
>>>> >> diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
>>>> >> deleted file mode 100644
>>>> >> index ced31b82c0..0000000000
>>>> >> --- a/target/i386/hvf/hvf-cpus.h
>>>> >> +++ /dev/null
>>>> >> @@ -1,25 +0,0 @@
>>>> >> -/*
>>>> >> - * Accelerator CPUS Interface
>>>> >> - *
>>>> >> - * Copyright 2020 SUSE LLC
>>>> >> - *
>>>> >> - * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>>> >> - * See the COPYING file in the top-level directory.
>>>> >> - */
>>>> >> -
>>>> >> -#ifndef HVF_CPUS_H
>>>> >> -#define HVF_CPUS_H
>>>> >> -
>>>> >> -#include "sysemu/cpus.h"
>>>> >> -
>>>> >> -extern const CpusAccel hvf_cpus;
>>>> >> -
>>>> >> -int hvf_init_vcpu(CPUState *);
>>>> >> -int hvf_vcpu_exec(CPUState *);
>>>> >> -void hvf_cpu_synchronize_state(CPUState *);
>>>> >> -void hvf_cpu_synchronize_post_reset(CPUState *);
>>>> >> -void hvf_cpu_synchronize_post_init(CPUState *);
>>>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
>>>> >> -void hvf_vcpu_destroy(CPUState *);
>>>> >> -
>>>> >> -#endif /* HVF_CPUS_H */
>>>> >> diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
>>>> >> index e0edffd077..6d56f8f6bb 100644
>>>> >> --- a/target/i386/hvf/hvf-i386.h
>>>> >> +++ b/target/i386/hvf/hvf-i386.h
>>>> >> @@ -18,57 +18,11 @@
>>>> >>
>>>> >>   #include "sysemu/accel.h"
>>>> >>   #include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >>   #include "cpu.h"
>>>> >>   #include "x86.h"
>>>> >>
>>>> >> -#define HVF_MAX_VCPU 0x10
>>>> >> -
>>>> >> -extern struct hvf_state hvf_global;
>>>> >> -
>>>> >> -struct hvf_vm {
>>>> >> -    int id;
>>>> >> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>>>> >> -};
>>>> >> -
>>>> >> -struct hvf_state {
>>>> >> -    uint32_t version;
>>>> >> -    struct hvf_vm *vm;
>>>> >> -    uint64_t mem_quota;
>>>> >> -};
>>>> >> -
>>>> >> -/* hvf_slot flags */
>>>> >> -#define HVF_SLOT_LOG (1 << 0)
>>>> >> -
>>>> >> -typedef struct hvf_slot {
>>>> >> -    uint64_t start;
>>>> >> -    uint64_t size;
>>>> >> -    uint8_t *mem;
>>>> >> -    int slot_id;
>>>> >> -    uint32_t flags;
>>>> >> -    MemoryRegion *region;
>>>> >> -} hvf_slot;
>>>> >> -
>>>> >> -typedef struct hvf_vcpu_caps {
>>>> >> -    uint64_t vmx_cap_pinbased;
>>>> >> -    uint64_t vmx_cap_procbased;
>>>> >> -    uint64_t vmx_cap_procbased2;
>>>> >> -    uint64_t vmx_cap_entry;
>>>> >> -    uint64_t vmx_cap_exit;
>>>> >> -    uint64_t vmx_cap_preemption_timer;
>>>> >> -} hvf_vcpu_caps;
>>>> >> -
>>>> >> -struct HVFState {
>>>> >> -    AccelState parent;
>>>> >> -    hvf_slot slots[32];
>>>> >> -    int num_slots;
>>>> >> -
>>>> >> -    hvf_vcpu_caps *hvf_caps;
>>>> >> -};
>>>> >> -extern HVFState *hvf_state;
>>>> >> -
>>>> >> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
>>>> >>   void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
>>>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>>>> >>
>>>> >>   #ifdef NEED_CPU_H
>>>> >>   /* Functions exported to host specific mode */
>>>> >> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
>>>> >> index ed9356565c..8b96ecd619 100644
>>>> >> --- a/target/i386/hvf/hvf.c
>>>> >> +++ b/target/i386/hvf/hvf.c
>>>> >> @@ -51,6 +51,7 @@
>>>> >>   #include "qemu/error-report.h"
>>>> >>
>>>> >>   #include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >>   #include "sysemu/runstate.h"
>>>> >>   #include "hvf-i386.h"
>>>> >>   #include "vmcs.h"
>>>> >> @@ -72,171 +73,6 @@
>>>> >>   #include "sysemu/accel.h"
>>>> >>   #include "target/i386/cpu.h"
>>>> >>
>>>> >> -#include "hvf-cpus.h"
>>>> >> -
>>>> >> -HVFState *hvf_state;
>>>> >> -
>>>> >> -static void assert_hvf_ok(hv_return_t ret)
>>>> >> -{
>>>> >> -    if (ret == HV_SUCCESS) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    switch (ret) {
>>>> >> -    case HV_ERROR:
>>>> >> -        error_report("Error: HV_ERROR");
>>>> >> -        break;
>>>> >> -    case HV_BUSY:
>>>> >> -        error_report("Error: HV_BUSY");
>>>> >> -        break;
>>>> >> -    case HV_BAD_ARGUMENT:
>>>> >> -        error_report("Error: HV_BAD_ARGUMENT");
>>>> >> -        break;
>>>> >> -    case HV_NO_RESOURCES:
>>>> >> -        error_report("Error: HV_NO_RESOURCES");
>>>> >> -        break;
>>>> >> -    case HV_NO_DEVICE:
>>>> >> -        error_report("Error: HV_NO_DEVICE");
>>>> >> -        break;
>>>> >> -    case HV_UNSUPPORTED:
>>>> >> -        error_report("Error: HV_UNSUPPORTED");
>>>> >> -        break;
>>>> >> -    default:
>>>> >> -        error_report("Unknown Error");
>>>> >> -    }
>>>> >> -
>>>> >> -    abort();
>>>> >> -}
>>>> >> -
>>>> >> -/* Memory slots */
>>>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>>>> >> -{
>>>> >> -    hvf_slot *slot;
>>>> >> -    int x;
>>>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> -        slot = &hvf_state->slots[x];
>>>> >> -        if (slot->size && start < (slot->start + slot->size) &&
>>>> >> -            (start + size) > slot->start) {
>>>> >> -            return slot;
>>>> >> -        }
>>>> >> -    }
>>>> >> -    return NULL;
>>>> >> -}
>>>> >> -
>>>> >> -struct mac_slot {
>>>> >> -    int present;
>>>> >> -    uint64_t size;
>>>> >> -    uint64_t gpa_start;
>>>> >> -    uint64_t gva;
>>>> >> -};
>>>> >> -
>>>> >> -struct mac_slot mac_slots[32];
>>>> >> -
>>>> >> -static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>>>> >> -{
>>>> >> -    struct mac_slot *macslot;
>>>> >> -    hv_return_t ret;
>>>> >> -
>>>> >> -    macslot = &mac_slots[slot->slot_id];
>>>> >> -
>>>> >> -    if (macslot->present) {
>>>> >> -        if (macslot->size != slot->size) {
>>>> >> -            macslot->present = 0;
>>>> >> -            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>>>> >> -            assert_hvf_ok(ret);
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    if (!slot->size) {
>>>> >> -        return 0;
>>>> >> -    }
>>>> >> -
>>>> >> -    macslot->present = 1;
>>>> >> -    macslot->gpa_start = slot->start;
>>>> >> -    macslot->size = slot->size;
>>>> >> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size, flags);
>>>> >> -    assert_hvf_ok(ret);
>>>> >> -    return 0;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>>>> >> -{
>>>> >> -    hvf_slot *mem;
>>>> >> -    MemoryRegion *area = section->mr;
>>>> >> -    bool writeable = !area->readonly && !area->rom_device;
>>>> >> -    hv_memory_flags_t flags;
>>>> >> -
>>>> >> -    if (!memory_region_is_ram(area)) {
>>>> >> -        if (writeable) {
>>>> >> -            return;
>>>> >> -        } else if (!memory_region_is_romd(area)) {
>>>> >> -            /*
>>>> >> -             * If the memory device is not in romd_mode, then we actually want
>>>> >> -             * to remove the hvf memory slot so all accesses will trap.
>>>> >> -             */
>>>> >> -             add = false;
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    mem = hvf_find_overlap_slot(
>>>> >> -            section->offset_within_address_space,
>>>> >> -            int128_get64(section->size));
>>>> >> -
>>>> >> -    if (mem && add) {
>>>> >> -        if (mem->size == int128_get64(section->size) &&
>>>> >> -            mem->start == section->offset_within_address_space &&
>>>> >> -            mem->mem == (memory_region_get_ram_ptr(area) +
>>>> >> -            section->offset_within_region)) {
>>>> >> -            return; /* Same region was attempted to register, go away. */
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    /* Region needs to be reset. set the size to 0 and remap it. */
>>>> >> -    if (mem) {
>>>> >> -        mem->size = 0;
>>>> >> -        if (do_hvf_set_memory(mem, 0)) {
>>>> >> -            error_report("Failed to reset overlapping slot");
>>>> >> -            abort();
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    if (!add) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    if (area->readonly ||
>>>> >> -        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
>>>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>>>> >> -    } else {
>>>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>>>> >> -    }
>>>> >> -
>>>> >> -    /* Now make a new slot. */
>>>> >> -    int x;
>>>> >> -
>>>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> -        mem = &hvf_state->slots[x];
>>>> >> -        if (!mem->size) {
>>>> >> -            break;
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    if (x == hvf_state->num_slots) {
>>>> >> -        error_report("No free slots");
>>>> >> -        abort();
>>>> >> -    }
>>>> >> -
>>>> >> -    mem->size = int128_get64(section->size);
>>>> >> -    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
>>>> >> -    mem->start = section->offset_within_address_space;
>>>> >> -    mem->region = area;
>>>> >> -
>>>> >> -    if (do_hvf_set_memory(mem, flags)) {
>>>> >> -        error_report("Error registering new memory slot");
>>>> >> -        abort();
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >>   void vmx_update_tpr(CPUState *cpu)
>>>> >>   {
>>>> >>       /* TODO: need integrate APIC handling */
>>>> >> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t port, void *buffer,
>>>> >>       }
>>>> >>   }
>>>> >>
>>>> >> -static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    if (!cpu->vcpu_dirty) {
>>>> >> -        hvf_get_registers(cpu);
>>>> >> -        cpu->vcpu_dirty = true;
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_state(CPUState *cpu)
>>>> >> -{
>>>> >> -    if (!cpu->vcpu_dirty) {
>>>> >> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>>>> >> -                                              run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    hvf_put_registers(cpu);
>>>> >> -    cpu->vcpu_dirty = false;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>>>> >> -{
>>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
>>>> >> -}
>>>> >> -
>>>> >> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>>>> >> -                                             run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    hvf_put_registers(cpu);
>>>> >> -    cpu->vcpu_dirty = false;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
>>>> >> -{
>>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
>>>> >> -}
>>>> >> -
>>>> >> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>>>> >> -                                              run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    cpu->vcpu_dirty = true;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>>>> >> -{
>>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
>>>> >> -}
>>>> >> -
>>>> >>   static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>>>> >>   {
>>>> >>       int read, write;
>>>> >> @@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>>>> >>       return false;
>>>> >>   }
>>>> >>
>>>> >> -static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
>>>> >> -{
>>>> >> -    hvf_slot *slot;
>>>> >> -
>>>> >> -    slot = hvf_find_overlap_slot(
>>>> >> -            section->offset_within_address_space,
>>>> >> -            int128_get64(section->size));
>>>> >> -
>>>> >> -    /* protect region against writes; begin tracking it */
>>>> >> -    if (on) {
>>>> >> -        slot->flags |= HVF_SLOT_LOG;
>>>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>>>> >> -                      HV_MEMORY_READ);
>>>> >> -    /* stop tracking region*/
>>>> >> -    } else {
>>>> >> -        slot->flags &= ~HVF_SLOT_LOG;
>>>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>>>> >> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_log_start(MemoryListener *listener,
>>>> >> -                          MemoryRegionSection *section, int old, int new)
>>>> >> -{
>>>> >> -    if (old != 0) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    hvf_set_dirty_tracking(section, 1);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_log_stop(MemoryListener *listener,
>>>> >> -                         MemoryRegionSection *section, int old, int new)
>>>> >> -{
>>>> >> -    if (new != 0) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    hvf_set_dirty_tracking(section, 0);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_log_sync(MemoryListener *listener,
>>>> >> -                         MemoryRegionSection *section)
>>>> >> -{
>>>> >> -    /*
>>>> >> -     * sync of dirty pages is handled elsewhere; just make sure we keep
>>>> >> -     * tracking the region.
>>>> >> -     */
>>>> >> -    hvf_set_dirty_tracking(section, 1);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_region_add(MemoryListener *listener,
>>>> >> -                           MemoryRegionSection *section)
>>>> >> -{
>>>> >> -    hvf_set_phys_mem(section, true);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_region_del(MemoryListener *listener,
>>>> >> -                           MemoryRegionSection *section)
>>>> >> -{
>>>> >> -    hvf_set_phys_mem(section, false);
>>>> >> -}
>>>> >> -
>>>> >> -static MemoryListener hvf_memory_listener = {
>>>> >> -    .priority = 10,
>>>> >> -    .region_add = hvf_region_add,
>>>> >> -    .region_del = hvf_region_del,
>>>> >> -    .log_start = hvf_log_start,
>>>> >> -    .log_stop = hvf_log_stop,
>>>> >> -    .log_sync = hvf_log_sync,
>>>> >> -};
>>>> >> -
>>>> >> -void hvf_vcpu_destroy(CPUState *cpu)
>>>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu)
>>>> >>   {
>>>> >>       X86CPU *x86_cpu = X86_CPU(cpu);
>>>> >>       CPUX86State *env = &x86_cpu->env;
>>>> >>
>>>> >> -    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
>>>> >>       g_free(env->hvf_mmio_buf);
>>>> >> -    assert_hvf_ok(ret);
>>>> >> -}
>>>> >> -
>>>> >> -static void dummy_signal(int sig)
>>>> >> -{
>>>> >>   }
>>>> >>
>>>> >> -int hvf_init_vcpu(CPUState *cpu)
>>>> >> +int hvf_arch_init_vcpu(CPUState *cpu)
>>>> >>   {
>>>> >>
>>>> >>       X86CPU *x86cpu = X86_CPU(cpu);
>>>> >>       CPUX86State *env = &x86cpu->env;
>>>> >> -    int r;
>>>> >> -
>>>> >> -    /* init cpu signals */
>>>> >> -    sigset_t set;
>>>> >> -    struct sigaction sigact;
>>>> >> -
>>>> >> -    memset(&sigact, 0, sizeof(sigact));
>>>> >> -    sigact.sa_handler = dummy_signal;
>>>> >> -    sigaction(SIG_IPI, &sigact, NULL);
>>>> >> -
>>>> >> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>>> >> -    sigdelset(&set, SIG_IPI);
>>>> >>
>>>> >>       init_emu();
>>>> >>       init_decoder();
>>>> >> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
>>>> >>       hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
>>>> >>       env->hvf_mmio_buf = g_new(char, 4096);
>>>> >>
>>>> >> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>>>> >> -    cpu->vcpu_dirty = 1;
>>>> >> -    assert_hvf_ok(r);
>>>> >> -
>>>> >>       if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
>>>> >>           &hvf_state->hvf_caps->vmx_cap_pinbased)) {
>>>> >>           abort();
>>>> >> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>> >>
>>>> >>       return ret;
>>>> >>   }
>>>> >> -
>>>> >> -bool hvf_allowed;
>>>> >> -
>>>> >> -static int hvf_accel_init(MachineState *ms)
>>>> >> -{
>>>> >> -    int x;
>>>> >> -    hv_return_t ret;
>>>> >> -    HVFState *s;
>>>> >> -
>>>> >> -    ret = hv_vm_create(HV_VM_DEFAULT);
>>>> >> -    assert_hvf_ok(ret);
>>>> >> -
>>>> >> -    s = g_new0(HVFState, 1);
>>>> >> -
>>>> >> -    s->num_slots = 32;
>>>> >> -    for (x = 0; x < s->num_slots; ++x) {
>>>> >> -        s->slots[x].size = 0;
>>>> >> -        s->slots[x].slot_id = x;
>>>> >> -    }
>>>> >> -
>>>> >> -    hvf_state = s;
>>>> >> -    memory_listener_register(&hvf_memory_listener, &address_space_memory);
>>>> >> -    cpus_register_accel(&hvf_cpus);
>>>> >> -    return 0;
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
>>>> >> -{
>>>> >> -    AccelClass *ac = ACCEL_CLASS(oc);
>>>> >> -    ac->name = "HVF";
>>>> >> -    ac->init_machine = hvf_accel_init;
>>>> >> -    ac->allowed = &hvf_allowed;
>>>> >> -}
>>>> >> -
>>>> >> -static const TypeInfo hvf_accel_type = {
>>>> >> -    .name = TYPE_HVF_ACCEL,
>>>> >> -    .parent = TYPE_ACCEL,
>>>> >> -    .class_init = hvf_accel_class_init,
>>>> >> -};
>>>> >> -
>>>> >> -static void hvf_type_init(void)
>>>> >> -{
>>>> >> -    type_register_static(&hvf_accel_type);
>>>> >> -}
>>>> >> -
>>>> >> -type_init(hvf_type_init);
>>>> >> diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
>>>> >> index 409c9a3f14..c8a43717ee 100644
>>>> >> --- a/target/i386/hvf/meson.build
>>>> >> +++ b/target/i386/hvf/meson.build
>>>> >> @@ -1,6 +1,5 @@
>>>> >>   i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
>>>> >>     'hvf.c',
>>>> >> -  'hvf-cpus.c',
>>>> >>     'x86.c',
>>>> >>     'x86_cpuid.c',
>>>> >>     'x86_decode.c',
>>>> >> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
>>>> >> index bbec412b6c..89b8e9d87a 100644
>>>> >> --- a/target/i386/hvf/x86hvf.c
>>>> >> +++ b/target/i386/hvf/x86hvf.c
>>>> >> @@ -20,6 +20,9 @@
>>>> >>   #include "qemu/osdep.h"
>>>> >>
>>>> >>   #include "qemu-common.h"
>>>> >> +#include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >> +#include "sysemu/hw_accel.h"
>>>> >>   #include "x86hvf.h"
>>>> >>   #include "vmx.h"
>>>> >>   #include "vmcs.h"
>>>> >> @@ -32,8 +35,6 @@
>>>> >>   #include <Hypervisor/hv.h>
>>>> >>   #include <Hypervisor/hv_vmx.h>
>>>> >>
>>>> >> -#include "hvf-cpus.h"
>>>> >> -
>>>> >>   void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
>>>> >>                        SegmentCache *qseg, bool is_tr)
>>>> >>   {
>>>> >> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
>>>> >>       env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
>>>> >>
>>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
>>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>>> >> +        cpu_synchronize_state(cpu_state);
>>>> >>           do_cpu_init(cpu);
>>>> >>       }
>>>> >>
>>>> >> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
>>>> >>           cpu_state->halted = 0;
>>>> >>       }
>>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
>>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>>> >> +        cpu_synchronize_state(cpu_state);
>>>> >>           do_cpu_sipi(cpu);
>>>> >>       }
>>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
>>>> >>           cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
>>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>>> >> +        cpu_synchronize_state(cpu_state);
>>>> > The changes from hvf_cpu_*() to cpu_*() are cleanup and perhaps should
>>>> > be a separate patch. It follows cpu/accel cleanups Claudio was doing the
>>>> > summer.
>>>>
>>>>
>>>> The only reason they're in here is because we no longer have access to
>>>> the hvf_ functions from the file. I am perfectly happy to rebase the
>>>> patch on top of Claudio's if his goes in first. I'm sure it'll be
>>>> trivial for him to rebase on top of this too if my series goes in first.
>>>>
>>>>
>>>> >
>>>> > Phillipe raised the idea that the patch might go ahead of ARM-specific
>>>> > part (which might involve some discussions) and I agree with that.
>>>> >
>>>> > Some sync between Claudio series (CC'd him) and the patch might be need.
>>>>
>>>>
>>>> I would prefer not to hold back because of the sync. Claudio's cleanup
>>>> is trivial enough to adjust for if it gets merged ahead of this.
>>>>
>>>>
>>>> Alex
>>>>
>>>>
>>>>

Re: [PATCH 2/8] hvf: Move common code out

Posted by Alexander Graf 4 years, 3 months ago

Hi Peter,

On 30.11.20 22:08, Peter Collingbourne wrote:
> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
>>
>>
>> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
>>> Hi Frank,
>>>
>>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
>> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!
>>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
>>>
>>>    https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
>>>
>> Thanks, we'll take a look :)
>>
>>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
> I'll reply to your patch here. You have:
>
> +                    /* Set cpu->hvf->sleeping so that we get a
> SIG_IPI signal. */
> +                    cpu->hvf->sleeping = true;
> +                    smp_mb();
> +
> +                    /* Bail out if we received an IRQ meanwhile */
> +                    if (cpu->thread_kicked || (cpu->interrupt_request &
> +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> +                        cpu->hvf->sleeping = false;
> +                        break;
> +                    }
> +
> +                    /* nanosleep returns on signal, so we wake up on kick. */
> +                    nanosleep(ts, NULL);
>
> and then send the signal conditional on whether sleeping is true, but
> I think this is racy. If the signal is sent after sleeping is set to
> true but before entering nanosleep then I think it will be ignored and
> we will miss the wakeup. That's why in my implementation I block IPI
> on the CPU thread at startup and then use pselect to atomically
> unblock and begin sleeping. The signal is sent unconditionally so
> there's no need to worry about races between actually sleeping and the
> "we think we're sleeping" state. It may lead to an extra wakeup but
> that's better than missing it entirely.


Thanks a bunch for the comment! So the trick I was using here is to 
modify the timespec from the kick function before sending the IPI 
signal. That way, we know that either we are inside the sleep (where the 
signal wakes it up) or we are outside the sleep (where timespec={} will 
make it return immediately).

The only race I can think of is if nanosleep does calculations based on 
the timespec and we happen to send the signal right there and then.

The problem with blocking IPIs is basically what Frank was describing 
earlier: How do you unset the IPI signal pending status? If the signal 
is never delivered, how can pselect differentiate "signal from last time 
is still pending" from "new signal because I got an IPI"?


Alex

Re: [PATCH 2/8] hvf: Move common code out

Posted by Peter Collingbourne 4 years, 3 months ago

On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf <agraf@csgraf.de> wrote:
>
> Hi Peter,
>
> On 30.11.20 22:08, Peter Collingbourne wrote:
> > On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
> >>
> >>
> >> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
> >>> Hi Frank,
> >>>
> >>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
> >> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!
> >>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
> >>>
> >>>    https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
> >>>
> >> Thanks, we'll take a look :)
> >>
> >>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
> > Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
> > I'll reply to your patch here. You have:
> >
> > +                    /* Set cpu->hvf->sleeping so that we get a
> > SIG_IPI signal. */
> > +                    cpu->hvf->sleeping = true;
> > +                    smp_mb();
> > +
> > +                    /* Bail out if we received an IRQ meanwhile */
> > +                    if (cpu->thread_kicked || (cpu->interrupt_request &
> > +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > +                        cpu->hvf->sleeping = false;
> > +                        break;
> > +                    }
> > +
> > +                    /* nanosleep returns on signal, so we wake up on kick. */
> > +                    nanosleep(ts, NULL);
> >
> > and then send the signal conditional on whether sleeping is true, but
> > I think this is racy. If the signal is sent after sleeping is set to
> > true but before entering nanosleep then I think it will be ignored and
> > we will miss the wakeup. That's why in my implementation I block IPI
> > on the CPU thread at startup and then use pselect to atomically
> > unblock and begin sleeping. The signal is sent unconditionally so
> > there's no need to worry about races between actually sleeping and the
> > "we think we're sleeping" state. It may lead to an extra wakeup but
> > that's better than missing it entirely.
>
>
> Thanks a bunch for the comment! So the trick I was using here is to
> modify the timespec from the kick function before sending the IPI
> signal. That way, we know that either we are inside the sleep (where the
> signal wakes it up) or we are outside the sleep (where timespec={} will
> make it return immediately).
>
> The only race I can think of is if nanosleep does calculations based on
> the timespec and we happen to send the signal right there and then.

Yes that's the race I was thinking of. Admittedly it's a small window
but it's theoretically possible and part of the reason why pselect was
created.

> The problem with blocking IPIs is basically what Frank was describing
> earlier: How do you unset the IPI signal pending status? If the signal
> is never delivered, how can pselect differentiate "signal from last time
> is still pending" from "new signal because I got an IPI"?

In this case we would take the additional wakeup which should be
harmless since we will take the WFx exit again and put us in the
correct state. But that's a lot better than busy looping.

I reckon that you could improve things a little by unblocking the
signal and then reblocking it before unlocking iothread (e.g. with a
pselect with zero time interval), which would flush any pending
signals. Since any such signal would correspond to a signal from last
time (because we still have the iothread lock) we know that any future
signals should correspond to new IPIs.

Peter

Re: [PATCH 2/8] hvf: Move common code out

Posted by Alexander Graf 4 years, 3 months ago

On 01.12.20 00:01, Peter Collingbourne wrote:
> On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf <agraf@csgraf.de> wrote:
>> Hi Peter,
>>
>> On 30.11.20 22:08, Peter Collingbourne wrote:
>>> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
>>>>
>>>> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>>> Hi Frank,
>>>>>
>>>>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
>>>> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!
>>>>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
>>>>>
>>>>>     https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
>>>>>
>>>> Thanks, we'll take a look :)
>>>>
>>>>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
>>> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
>>> I'll reply to your patch here. You have:
>>>
>>> +                    /* Set cpu->hvf->sleeping so that we get a
>>> SIG_IPI signal. */
>>> +                    cpu->hvf->sleeping = true;
>>> +                    smp_mb();
>>> +
>>> +                    /* Bail out if we received an IRQ meanwhile */
>>> +                    if (cpu->thread_kicked || (cpu->interrupt_request &
>>> +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
>>> +                        cpu->hvf->sleeping = false;
>>> +                        break;
>>> +                    }
>>> +
>>> +                    /* nanosleep returns on signal, so we wake up on kick. */
>>> +                    nanosleep(ts, NULL);
>>>
>>> and then send the signal conditional on whether sleeping is true, but
>>> I think this is racy. If the signal is sent after sleeping is set to
>>> true but before entering nanosleep then I think it will be ignored and
>>> we will miss the wakeup. That's why in my implementation I block IPI
>>> on the CPU thread at startup and then use pselect to atomically
>>> unblock and begin sleeping. The signal is sent unconditionally so
>>> there's no need to worry about races between actually sleeping and the
>>> "we think we're sleeping" state. It may lead to an extra wakeup but
>>> that's better than missing it entirely.
>>
>> Thanks a bunch for the comment! So the trick I was using here is to
>> modify the timespec from the kick function before sending the IPI
>> signal. That way, we know that either we are inside the sleep (where the
>> signal wakes it up) or we are outside the sleep (where timespec={} will
>> make it return immediately).
>>
>> The only race I can think of is if nanosleep does calculations based on
>> the timespec and we happen to send the signal right there and then.
> Yes that's the race I was thinking of. Admittedly it's a small window
> but it's theoretically possible and part of the reason why pselect was
> created.
>
>> The problem with blocking IPIs is basically what Frank was describing
>> earlier: How do you unset the IPI signal pending status? If the signal
>> is never delivered, how can pselect differentiate "signal from last time
>> is still pending" from "new signal because I got an IPI"?
> In this case we would take the additional wakeup which should be
> harmless since we will take the WFx exit again and put us in the
> correct state. But that's a lot better than busy looping.


I'm not sure I follow. I'm thinking of the following scenario:

   - trap into WFI handler
   - go to sleep with blocked SIG_IPI
   - SIG_IPI arrives, pselect() exits
   - signal is still pending because it's blocked
   - enter guest
   - trap into WFI handler
   - run pselect(), but it immediate exits because SIG_IPI is still pending

This was the loop I was seeing when running with SIG_IPI blocked. That's 
part of the reason why I switched to a different model.


> I reckon that you could improve things a little by unblocking the
> signal and then reblocking it before unlocking iothread (e.g. with a
> pselect with zero time interval), which would flush any pending
> signals. Since any such signal would correspond to a signal from last
> time (because we still have the iothread lock) we know that any future
> signals should correspond to new IPIs.


Yeah, I think you actually *have* to do exactly that, because otherwise 
pselect() will always return after 0ns because the signal is still pending.

And yes, I agree that that starts to sound a bit less racy now. But it 
means we can probably also just do

   - WFI handler
   - block SIG_IPI
   - set hvf->sleeping = true
   - check for pending interrupts
   - pselect()
   - unblock SIG_IPI

which means we run with SIG_IPI unmasked by default. I don't think the 
number of signal mask changes is any different with that compared to 
running with SIG_IPI always masked, right?


Alex

Re: [PATCH 2/8] hvf: Move common code out

Posted by Peter Collingbourne 4 years, 3 months ago

On Mon, Nov 30, 2020 at 3:18 PM Alexander Graf <agraf@csgraf.de> wrote:
>
>
> On 01.12.20 00:01, Peter Collingbourne wrote:
> > On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf <agraf@csgraf.de> wrote:
> >> Hi Peter,
> >>
> >> On 30.11.20 22:08, Peter Collingbourne wrote:
> >>> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
> >>>>
> >>>> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
> >>>>> Hi Frank,
> >>>>>
> >>>>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
> >>>> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!
> >>>>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
> >>>>>
> >>>>>     https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
> >>>>>
> >>>> Thanks, we'll take a look :)
> >>>>
> >>>>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
> >>> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
> >>> I'll reply to your patch here. You have:
> >>>
> >>> +                    /* Set cpu->hvf->sleeping so that we get a
> >>> SIG_IPI signal. */
> >>> +                    cpu->hvf->sleeping = true;
> >>> +                    smp_mb();
> >>> +
> >>> +                    /* Bail out if we received an IRQ meanwhile */
> >>> +                    if (cpu->thread_kicked || (cpu->interrupt_request &
> >>> +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> >>> +                        cpu->hvf->sleeping = false;
> >>> +                        break;
> >>> +                    }
> >>> +
> >>> +                    /* nanosleep returns on signal, so we wake up on kick. */
> >>> +                    nanosleep(ts, NULL);
> >>>
> >>> and then send the signal conditional on whether sleeping is true, but
> >>> I think this is racy. If the signal is sent after sleeping is set to
> >>> true but before entering nanosleep then I think it will be ignored and
> >>> we will miss the wakeup. That's why in my implementation I block IPI
> >>> on the CPU thread at startup and then use pselect to atomically
> >>> unblock and begin sleeping. The signal is sent unconditionally so
> >>> there's no need to worry about races between actually sleeping and the
> >>> "we think we're sleeping" state. It may lead to an extra wakeup but
> >>> that's better than missing it entirely.
> >>
> >> Thanks a bunch for the comment! So the trick I was using here is to
> >> modify the timespec from the kick function before sending the IPI
> >> signal. That way, we know that either we are inside the sleep (where the
> >> signal wakes it up) or we are outside the sleep (where timespec={} will
> >> make it return immediately).
> >>
> >> The only race I can think of is if nanosleep does calculations based on
> >> the timespec and we happen to send the signal right there and then.
> > Yes that's the race I was thinking of. Admittedly it's a small window
> > but it's theoretically possible and part of the reason why pselect was
> > created.
> >
> >> The problem with blocking IPIs is basically what Frank was describing
> >> earlier: How do you unset the IPI signal pending status? If the signal
> >> is never delivered, how can pselect differentiate "signal from last time
> >> is still pending" from "new signal because I got an IPI"?
> > In this case we would take the additional wakeup which should be
> > harmless since we will take the WFx exit again and put us in the
> > correct state. But that's a lot better than busy looping.
>
>
> I'm not sure I follow. I'm thinking of the following scenario:
>
>    - trap into WFI handler
>    - go to sleep with blocked SIG_IPI
>    - SIG_IPI arrives, pselect() exits
>    - signal is still pending because it's blocked
>    - enter guest
>    - trap into WFI handler
>    - run pselect(), but it immediate exits because SIG_IPI is still pending
>
> This was the loop I was seeing when running with SIG_IPI blocked. That's
> part of the reason why I switched to a different model.

What I observe is that when returning from a pending signal pselect
consumes the signal (which is also consistent with my understanding of
what pselect does). That means that it doesn't matter if we take a
second WFx exit because once we reach the pselect in the second WFx
exit the signal will have been consumed by the pselect in the first
exit and we will just wait for the next one.

I don't know why things may have been going wrong in your
implementation but it may be related to the issue with
mach_absolute_time() which I posted about separately and was also
causing busy loops for us in some cases. Once that issue was fixed in
our implementation we started seeing sleep until VTIMER due work
properly.

>
>
> > I reckon that you could improve things a little by unblocking the
> > signal and then reblocking it before unlocking iothread (e.g. with a
> > pselect with zero time interval), which would flush any pending
> > signals. Since any such signal would correspond to a signal from last
> > time (because we still have the iothread lock) we know that any future
> > signals should correspond to new IPIs.
>
>
> Yeah, I think you actually *have* to do exactly that, because otherwise
> pselect() will always return after 0ns because the signal is still pending.
>
> And yes, I agree that that starts to sound a bit less racy now. But it
> means we can probably also just do
>
>    - WFI handler
>    - block SIG_IPI
>    - set hvf->sleeping = true
>    - check for pending interrupts
>    - pselect()
>    - unblock SIG_IPI
>
> which means we run with SIG_IPI unmasked by default. I don't think the
> number of signal mask changes is any different with that compared to
> running with SIG_IPI always masked, right?

And unlock/lock iothread around the pselect? I suppose that could work
but as I mentioned it would just be an optimization.

Maybe I can try to make my approach work on top of your series, or if
you already have a patch I can try to debug it. Let me know.

Peter

Re: [PATCH 2/8] hvf: Move common code out

Posted by Roman Bolshakov 4 years, 3 months ago

On Mon, Nov 30, 2020 at 04:00:11PM -0800, Peter Collingbourne wrote:
> On Mon, Nov 30, 2020 at 3:18 PM Alexander Graf <agraf@csgraf.de> wrote:
> >
> >
> > On 01.12.20 00:01, Peter Collingbourne wrote:
> > > On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf <agraf@csgraf.de> wrote:
> > >> Hi Peter,
> > >>
> > >> On 30.11.20 22:08, Peter Collingbourne wrote:
> > >>> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
> > >>>>
> > >>>> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
> > >>>>> Hi Frank,
> > >>>>>
> > >>>>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
> > >>>> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!
> > >>>>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
> > >>>>>
> > >>>>>     https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
> > >>>>>
> > >>>> Thanks, we'll take a look :)
> > >>>>
> > >>>>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
> > >>> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
> > >>> I'll reply to your patch here. You have:
> > >>>
> > >>> +                    /* Set cpu->hvf->sleeping so that we get a
> > >>> SIG_IPI signal. */
> > >>> +                    cpu->hvf->sleeping = true;
> > >>> +                    smp_mb();
> > >>> +
> > >>> +                    /* Bail out if we received an IRQ meanwhile */
> > >>> +                    if (cpu->thread_kicked || (cpu->interrupt_request &
> > >>> +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > >>> +                        cpu->hvf->sleeping = false;
> > >>> +                        break;
> > >>> +                    }
> > >>> +
> > >>> +                    /* nanosleep returns on signal, so we wake up on kick. */
> > >>> +                    nanosleep(ts, NULL);
> > >>>
> > >>> and then send the signal conditional on whether sleeping is true, but
> > >>> I think this is racy. If the signal is sent after sleeping is set to
> > >>> true but before entering nanosleep then I think it will be ignored and
> > >>> we will miss the wakeup. That's why in my implementation I block IPI
> > >>> on the CPU thread at startup and then use pselect to atomically
> > >>> unblock and begin sleeping. The signal is sent unconditionally so
> > >>> there's no need to worry about races between actually sleeping and the
> > >>> "we think we're sleeping" state. It may lead to an extra wakeup but
> > >>> that's better than missing it entirely.
> > >>
> > >> Thanks a bunch for the comment! So the trick I was using here is to > > >> modify the timespec from the kick function before sending the IPI
> > >> signal. That way, we know that either we are inside the sleep (where the
> > >> signal wakes it up) or we are outside the sleep (where timespec={} will
> > >> make it return immediately).
> > >>
> > >> The only race I can think of is if nanosleep does calculations based on
> > >> the timespec and we happen to send the signal right there and then.
> > > Yes that's the race I was thinking of. Admittedly it's a small window
> > > but it's theoretically possible and part of the reason why pselect was
> > > created.
> > >
> > >> The problem with blocking IPIs is basically what Frank was describing
> > >> earlier: How do you unset the IPI signal pending status? If the signal
> > >> is never delivered, how can pselect differentiate "signal from last time
> > >> is still pending" from "new signal because I got an IPI"?
> > > In this case we would take the additional wakeup which should be
> > > harmless since we will take the WFx exit again and put us in the
> > > correct state. But that's a lot better than busy looping.
> >
> >
> > I'm not sure I follow. I'm thinking of the following scenario:
> >
> >    - trap into WFI handler
> >    - go to sleep with blocked SIG_IPI
> >    - SIG_IPI arrives, pselect() exits
> >    - signal is still pending because it's blocked
> >    - enter guest
> >    - trap into WFI handler
> >    - run pselect(), but it immediate exits because SIG_IPI is still pending
> >
> > This was the loop I was seeing when running with SIG_IPI blocked. That's
> > part of the reason why I switched to a different model.
> 
> What I observe is that when returning from a pending signal pselect
> consumes the signal (which is also consistent with my understanding of
> what pselect does). That means that it doesn't matter if we take a
> second WFx exit because once we reach the pselect in the second WFx
> exit the signal will have been consumed by the pselect in the first
> exit and we will just wait for the next one.
> 

Aha! Thanks for the explanation. So, the first WFI in the series of
guest WFIs will likely wake up immediately? After a period without WFIs
there must be a pending SIG_IPI...

It shouldn't be a critical issue though because (as defined in D1.16.2)
"the architecture permits a PE to leave the low-power state for any
reason, it is permissible for a PE to treat WFI as a NOP, but this is
not recommended for lowest power operation."

BTW. I think a bit from the thread should go into the description of
patch 8, because it's not trivial and it would really be helpful to keep
in repo history. At least something like this (taken from an earlier
reply in the thread):

  In this implementation IPI is blocked on the CPU thread at startup and
  pselect() is used to atomically unblock the signal and begin sleeping.
  The signal is sent unconditionally so there's no need to worry about
  races between actually sleeping and the "we think we're sleeping"
  state. It may lead to an extra wakeup but that's better than missing
  it entirely.


Thanks,
Roman

> I don't know why things may have been going wrong in your
> implementation but it may be related to the issue with
> mach_absolute_time() which I posted about separately and was also
> causing busy loops for us in some cases. Once that issue was fixed in
> our implementation we started seeing sleep until VTIMER due work
> properly.
> 
> >
> >
> > > I reckon that you could improve things a little by unblocking the
> > > signal and then reblocking it before unlocking iothread (e.g. with a
> > > pselect with zero time interval), which would flush any pending
> > > signals. Since any such signal would correspond to a signal from last
> > > time (because we still have the iothread lock) we know that any future
> > > signals should correspond to new IPIs.
> >
> >
> > Yeah, I think you actually *have* to do exactly that, because otherwise
> > pselect() will always return after 0ns because the signal is still pending.
> >
> > And yes, I agree that that starts to sound a bit less racy now. But it
> > means we can probably also just do
> >
> >    - WFI handler
> >    - block SIG_IPI
> >    - set hvf->sleeping = true
> >    - check for pending interrupts
> >    - pselect()
> >    - unblock SIG_IPI
> >
> > which means we run with SIG_IPI unmasked by default. I don't think the
> > number of signal mask changes is any different with that compared to
> > running with SIG_IPI always masked, right?
> 

P.S. Just found that Alex already raised my concern. Pending signals
have to be consumed or there should be no pending signals to start
sleeping on the very first WFI.

> And unlock/lock iothread around the pselect? I suppose that could work
> but as I mentioned it would just be an optimization.
> 
> Maybe I can try to make my approach work on top of your series, or if
> you already have a patch I can try to debug it. Let me know.
> 
> Peter

Re: [PATCH 2/8] hvf: Move common code out

Posted by Peter Collingbourne 4 years, 3 months ago

On Thu, Dec 3, 2020 at 1:41 AM Roman Bolshakov <r.bolshakov@yadro.com> wrote:
>
> On Mon, Nov 30, 2020 at 04:00:11PM -0800, Peter Collingbourne wrote:
> > On Mon, Nov 30, 2020 at 3:18 PM Alexander Graf <agraf@csgraf.de> wrote:
> > >
> > >
> > > On 01.12.20 00:01, Peter Collingbourne wrote:
> > > > On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf <agraf@csgraf.de> wrote:
> > > >> Hi Peter,
> > > >>
> > > >> On 30.11.20 22:08, Peter Collingbourne wrote:
> > > >>> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
> > > >>>>
> > > >>>> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
> > > >>>>> Hi Frank,
> > > >>>>>
> > > >>>>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
> > > >>>> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!
> > > >>>>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
> > > >>>>>
> > > >>>>>     https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
> > > >>>>>
> > > >>>> Thanks, we'll take a look :)
> > > >>>>
> > > >>>>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
> > > >>> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
> > > >>> I'll reply to your patch here. You have:
> > > >>>
> > > >>> +                    /* Set cpu->hvf->sleeping so that we get a
> > > >>> SIG_IPI signal. */
> > > >>> +                    cpu->hvf->sleeping = true;
> > > >>> +                    smp_mb();
> > > >>> +
> > > >>> +                    /* Bail out if we received an IRQ meanwhile */
> > > >>> +                    if (cpu->thread_kicked || (cpu->interrupt_request &
> > > >>> +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > > >>> +                        cpu->hvf->sleeping = false;
> > > >>> +                        break;
> > > >>> +                    }
> > > >>> +
> > > >>> +                    /* nanosleep returns on signal, so we wake up on kick. */
> > > >>> +                    nanosleep(ts, NULL);
> > > >>>
> > > >>> and then send the signal conditional on whether sleeping is true, but
> > > >>> I think this is racy. If the signal is sent after sleeping is set to
> > > >>> true but before entering nanosleep then I think it will be ignored and
> > > >>> we will miss the wakeup. That's why in my implementation I block IPI
> > > >>> on the CPU thread at startup and then use pselect to atomically
> > > >>> unblock and begin sleeping. The signal is sent unconditionally so
> > > >>> there's no need to worry about races between actually sleeping and the
> > > >>> "we think we're sleeping" state. It may lead to an extra wakeup but
> > > >>> that's better than missing it entirely.
> > > >>
> > > >> Thanks a bunch for the comment! So the trick I was using here is to > > >> modify the timespec from the kick function before sending the IPI
> > > >> signal. That way, we know that either we are inside the sleep (where the
> > > >> signal wakes it up) or we are outside the sleep (where timespec={} will
> > > >> make it return immediately).
> > > >>
> > > >> The only race I can think of is if nanosleep does calculations based on
> > > >> the timespec and we happen to send the signal right there and then.
> > > > Yes that's the race I was thinking of. Admittedly it's a small window
> > > > but it's theoretically possible and part of the reason why pselect was
> > > > created.
> > > >
> > > >> The problem with blocking IPIs is basically what Frank was describing
> > > >> earlier: How do you unset the IPI signal pending status? If the signal
> > > >> is never delivered, how can pselect differentiate "signal from last time
> > > >> is still pending" from "new signal because I got an IPI"?
> > > > In this case we would take the additional wakeup which should be
> > > > harmless since we will take the WFx exit again and put us in the
> > > > correct state. But that's a lot better than busy looping.
> > >
> > >
> > > I'm not sure I follow. I'm thinking of the following scenario:
> > >
> > >    - trap into WFI handler
> > >    - go to sleep with blocked SIG_IPI
> > >    - SIG_IPI arrives, pselect() exits
> > >    - signal is still pending because it's blocked
> > >    - enter guest
> > >    - trap into WFI handler
> > >    - run pselect(), but it immediate exits because SIG_IPI is still pending
> > >
> > > This was the loop I was seeing when running with SIG_IPI blocked. That's
> > > part of the reason why I switched to a different model.
> >
> > What I observe is that when returning from a pending signal pselect
> > consumes the signal (which is also consistent with my understanding of
> > what pselect does). That means that it doesn't matter if we take a
> > second WFx exit because once we reach the pselect in the second WFx
> > exit the signal will have been consumed by the pselect in the first
> > exit and we will just wait for the next one.
> >
>
> Aha! Thanks for the explanation. So, the first WFI in the series of
> guest WFIs will likely wake up immediately? After a period without WFIs
> there must be a pending SIG_IPI...
>
> It shouldn't be a critical issue though because (as defined in D1.16.2)
> "the architecture permits a PE to leave the low-power state for any
> reason, it is permissible for a PE to treat WFI as a NOP, but this is
> not recommended for lowest power operation."
>
> BTW. I think a bit from the thread should go into the description of
> patch 8, because it's not trivial and it would really be helpful to keep
> in repo history. At least something like this (taken from an earlier
> reply in the thread):
>
>   In this implementation IPI is blocked on the CPU thread at startup and
>   pselect() is used to atomically unblock the signal and begin sleeping.
>   The signal is sent unconditionally so there's no need to worry about
>   races between actually sleeping and the "we think we're sleeping"
>   state. It may lead to an extra wakeup but that's better than missing
>   it entirely.

Okay, I'll add something like that to the next version of the patch I send out.

Peter

>
>
> Thanks,
> Roman
>
> > I don't know why things may have been going wrong in your
> > implementation but it may be related to the issue with
> > mach_absolute_time() which I posted about separately and was also
> > causing busy loops for us in some cases. Once that issue was fixed in
> > our implementation we started seeing sleep until VTIMER due work
> > properly.
> >
> > >
> > >
> > > > I reckon that you could improve things a little by unblocking the
> > > > signal and then reblocking it before unlocking iothread (e.g. with a
> > > > pselect with zero time interval), which would flush any pending
> > > > signals. Since any such signal would correspond to a signal from last
> > > > time (because we still have the iothread lock) we know that any future
> > > > signals should correspond to new IPIs.
> > >
> > >
> > > Yeah, I think you actually *have* to do exactly that, because otherwise
> > > pselect() will always return after 0ns because the signal is still pending.
> > >
> > > And yes, I agree that that starts to sound a bit less racy now. But it
> > > means we can probably also just do
> > >
> > >    - WFI handler
> > >    - block SIG_IPI
> > >    - set hvf->sleeping = true
> > >    - check for pending interrupts
> > >    - pselect()
> > >    - unblock SIG_IPI
> > >
> > > which means we run with SIG_IPI unmasked by default. I don't think the
> > > number of signal mask changes is any different with that compared to
> > > running with SIG_IPI always masked, right?
> >
>
> P.S. Just found that Alex already raised my concern. Pending signals
> have to be consumed or there should be no pending signals to start
> sleeping on the very first WFI.
>
> > And unlock/lock iothread around the pselect? I suppose that could work
> > but as I mentioned it would just be an optimization.
> >
> > Maybe I can try to make my approach work on top of your series, or if
> > you already have a patch I can try to debug it. Let me know.
> >
> > Peter

Re: [PATCH 2/8] hvf: Move common code out

Posted by Alexander Graf 4 years, 3 months ago

On 03.12.20 19:42, Peter Collingbourne wrote:
> On Thu, Dec 3, 2020 at 1:41 AM Roman Bolshakov <r.bolshakov@yadro.com> wrote:
>> On Mon, Nov 30, 2020 at 04:00:11PM -0800, Peter Collingbourne wrote:
>>> On Mon, Nov 30, 2020 at 3:18 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>>
>>>> On 01.12.20 00:01, Peter Collingbourne wrote:
>>>>> On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>>>> Hi Peter,
>>>>>>
>>>>>> On 30.11.20 22:08, Peter Collingbourne wrote:
>>>>>>> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
>>>>>>>> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>>>>>>> Hi Frank,
>>>>>>>>>
>>>>>>>>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
>>>>>>>> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!
>>>>>>>>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
>>>>>>>>>
>>>>>>>>>      https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
>>>>>>>>>
>>>>>>>> Thanks, we'll take a look :)
>>>>>>>>
>>>>>>>>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
>>>>>>> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
>>>>>>> I'll reply to your patch here. You have:
>>>>>>>
>>>>>>> +                    /* Set cpu->hvf->sleeping so that we get a
>>>>>>> SIG_IPI signal. */
>>>>>>> +                    cpu->hvf->sleeping = true;
>>>>>>> +                    smp_mb();
>>>>>>> +
>>>>>>> +                    /* Bail out if we received an IRQ meanwhile */
>>>>>>> +                    if (cpu->thread_kicked || (cpu->interrupt_request &
>>>>>>> +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
>>>>>>> +                        cpu->hvf->sleeping = false;
>>>>>>> +                        break;
>>>>>>> +                    }
>>>>>>> +
>>>>>>> +                    /* nanosleep returns on signal, so we wake up on kick. */
>>>>>>> +                    nanosleep(ts, NULL);
>>>>>>>
>>>>>>> and then send the signal conditional on whether sleeping is true, but
>>>>>>> I think this is racy. If the signal is sent after sleeping is set to
>>>>>>> true but before entering nanosleep then I think it will be ignored and
>>>>>>> we will miss the wakeup. That's why in my implementation I block IPI
>>>>>>> on the CPU thread at startup and then use pselect to atomically
>>>>>>> unblock and begin sleeping. The signal is sent unconditionally so
>>>>>>> there's no need to worry about races between actually sleeping and the
>>>>>>> "we think we're sleeping" state. It may lead to an extra wakeup but
>>>>>>> that's better than missing it entirely.
>>>>>> Thanks a bunch for the comment! So the trick I was using here is to > > >> modify the timespec from the kick function before sending the IPI
>>>>>> signal. That way, we know that either we are inside the sleep (where the
>>>>>> signal wakes it up) or we are outside the sleep (where timespec={} will
>>>>>> make it return immediately).
>>>>>>
>>>>>> The only race I can think of is if nanosleep does calculations based on
>>>>>> the timespec and we happen to send the signal right there and then.
>>>>> Yes that's the race I was thinking of. Admittedly it's a small window
>>>>> but it's theoretically possible and part of the reason why pselect was
>>>>> created.
>>>>>
>>>>>> The problem with blocking IPIs is basically what Frank was describing
>>>>>> earlier: How do you unset the IPI signal pending status? If the signal
>>>>>> is never delivered, how can pselect differentiate "signal from last time
>>>>>> is still pending" from "new signal because I got an IPI"?
>>>>> In this case we would take the additional wakeup which should be
>>>>> harmless since we will take the WFx exit again and put us in the
>>>>> correct state. But that's a lot better than busy looping.
>>>>
>>>> I'm not sure I follow. I'm thinking of the following scenario:
>>>>
>>>>     - trap into WFI handler
>>>>     - go to sleep with blocked SIG_IPI
>>>>     - SIG_IPI arrives, pselect() exits
>>>>     - signal is still pending because it's blocked
>>>>     - enter guest
>>>>     - trap into WFI handler
>>>>     - run pselect(), but it immediate exits because SIG_IPI is still pending
>>>>
>>>> This was the loop I was seeing when running with SIG_IPI blocked. That's
>>>> part of the reason why I switched to a different model.
>>> What I observe is that when returning from a pending signal pselect
>>> consumes the signal (which is also consistent with my understanding of
>>> what pselect does). That means that it doesn't matter if we take a
>>> second WFx exit because once we reach the pselect in the second WFx
>>> exit the signal will have been consumed by the pselect in the first
>>> exit and we will just wait for the next one.
>>>
>> Aha! Thanks for the explanation. So, the first WFI in the series of
>> guest WFIs will likely wake up immediately? After a period without WFIs
>> there must be a pending SIG_IPI...
>>
>> It shouldn't be a critical issue though because (as defined in D1.16.2)
>> "the architecture permits a PE to leave the low-power state for any
>> reason, it is permissible for a PE to treat WFI as a NOP, but this is
>> not recommended for lowest power operation."
>>
>> BTW. I think a bit from the thread should go into the description of
>> patch 8, because it's not trivial and it would really be helpful to keep
>> in repo history. At least something like this (taken from an earlier
>> reply in the thread):
>>
>>    In this implementation IPI is blocked on the CPU thread at startup and
>>    pselect() is used to atomically unblock the signal and begin sleeping.
>>    The signal is sent unconditionally so there's no need to worry about
>>    races between actually sleeping and the "we think we're sleeping"
>>    state. It may lead to an extra wakeup but that's better than missing
>>    it entirely.
> Okay, I'll add something like that to the next version of the patch I send out.


If this is the only change, I've already added it for v4. If you want me 
to change it further, just let me know what to replace the patch 
description with.


Alex

Re: [PATCH 2/8] hvf: Move common code out

Posted by Roman Bolshakov 4 years, 3 months ago

On Thu, Dec 03, 2020 at 11:13:35PM +0100, Alexander Graf wrote:
> 
> On 03.12.20 19:42, Peter Collingbourne wrote:
> > On Thu, Dec 3, 2020 at 1:41 AM Roman Bolshakov <r.bolshakov@yadro.com> wrote:
> > > On Mon, Nov 30, 2020 at 04:00:11PM -0800, Peter Collingbourne wrote:
> > > > What I observe is that when returning from a pending signal pselect
> > > > consumes the signal (which is also consistent with my understanding of
> > > > what pselect does). That means that it doesn't matter if we take a
> > > > second WFx exit because once we reach the pselect in the second WFx
> > > > exit the signal will have been consumed by the pselect in the first
> > > > exit and we will just wait for the next one.
> > > > 
> > > Aha! Thanks for the explanation. So, the first WFI in the series of
> > > guest WFIs will likely wake up immediately? After a period without WFIs
> > > there must be a pending SIG_IPI...
> > > 
> > > It shouldn't be a critical issue though because (as defined in D1.16.2)
> > > "the architecture permits a PE to leave the low-power state for any
> > > reason, it is permissible for a PE to treat WFI as a NOP, but this is
> > > not recommended for lowest power operation."
> > > 
> > > BTW. I think a bit from the thread should go into the description of
> > > patch 8, because it's not trivial and it would really be helpful to keep
> > > in repo history. At least something like this (taken from an earlier
> > > reply in the thread):
> > > 
> > >    In this implementation IPI is blocked on the CPU thread at startup and
> > >    pselect() is used to atomically unblock the signal and begin sleeping.
> > >    The signal is sent unconditionally so there's no need to worry about
> > >    races between actually sleeping and the "we think we're sleeping"
> > >    state. It may lead to an extra wakeup but that's better than missing
> > >    it entirely.
> > Okay, I'll add something like that to the next version of the patch I send out.
> 
> 
> If this is the only change, I've already added it for v4. If you want me to
> change it further, just let me know what to replace the patch description
> with.
> 
> 

Thanks, Alex.

I'm fine with the description and all set.

-Roman

Re: [PATCH 2/8] hvf: Move common code out

Posted by Alexander Graf 4 years, 3 months ago

On 01.12.20 01:00, Peter Collingbourne wrote:
> On Mon, Nov 30, 2020 at 3:18 PM Alexander Graf <agraf@csgraf.de> wrote:
>>
>> On 01.12.20 00:01, Peter Collingbourne wrote:
>>> On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>> Hi Peter,
>>>>
>>>> On 30.11.20 22:08, Peter Collingbourne wrote:
>>>>> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
>>>>>> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>>>>> Hi Frank,
>>>>>>>
>>>>>>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
>>>>>> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!
>>>>>>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
>>>>>>>
>>>>>>>      https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
>>>>>>>
>>>>>> Thanks, we'll take a look :)
>>>>>>
>>>>>>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
>>>>> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
>>>>> I'll reply to your patch here. You have:
>>>>>
>>>>> +                    /* Set cpu->hvf->sleeping so that we get a
>>>>> SIG_IPI signal. */
>>>>> +                    cpu->hvf->sleeping = true;
>>>>> +                    smp_mb();
>>>>> +
>>>>> +                    /* Bail out if we received an IRQ meanwhile */
>>>>> +                    if (cpu->thread_kicked || (cpu->interrupt_request &
>>>>> +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
>>>>> +                        cpu->hvf->sleeping = false;
>>>>> +                        break;
>>>>> +                    }
>>>>> +
>>>>> +                    /* nanosleep returns on signal, so we wake up on kick. */
>>>>> +                    nanosleep(ts, NULL);
>>>>>
>>>>> and then send the signal conditional on whether sleeping is true, but
>>>>> I think this is racy. If the signal is sent after sleeping is set to
>>>>> true but before entering nanosleep then I think it will be ignored and
>>>>> we will miss the wakeup. That's why in my implementation I block IPI
>>>>> on the CPU thread at startup and then use pselect to atomically
>>>>> unblock and begin sleeping. The signal is sent unconditionally so
>>>>> there's no need to worry about races between actually sleeping and the
>>>>> "we think we're sleeping" state. It may lead to an extra wakeup but
>>>>> that's better than missing it entirely.
>>>> Thanks a bunch for the comment! So the trick I was using here is to
>>>> modify the timespec from the kick function before sending the IPI
>>>> signal. That way, we know that either we are inside the sleep (where the
>>>> signal wakes it up) or we are outside the sleep (where timespec={} will
>>>> make it return immediately).
>>>>
>>>> The only race I can think of is if nanosleep does calculations based on
>>>> the timespec and we happen to send the signal right there and then.
>>> Yes that's the race I was thinking of. Admittedly it's a small window
>>> but it's theoretically possible and part of the reason why pselect was
>>> created.
>>>
>>>> The problem with blocking IPIs is basically what Frank was describing
>>>> earlier: How do you unset the IPI signal pending status? If the signal
>>>> is never delivered, how can pselect differentiate "signal from last time
>>>> is still pending" from "new signal because I got an IPI"?
>>> In this case we would take the additional wakeup which should be
>>> harmless since we will take the WFx exit again and put us in the
>>> correct state. But that's a lot better than busy looping.
>>
>> I'm not sure I follow. I'm thinking of the following scenario:
>>
>>     - trap into WFI handler
>>     - go to sleep with blocked SIG_IPI
>>     - SIG_IPI arrives, pselect() exits
>>     - signal is still pending because it's blocked
>>     - enter guest
>>     - trap into WFI handler
>>     - run pselect(), but it immediate exits because SIG_IPI is still pending
>>
>> This was the loop I was seeing when running with SIG_IPI blocked. That's
>> part of the reason why I switched to a different model.
> What I observe is that when returning from a pending signal pselect
> consumes the signal (which is also consistent with my understanding of
> what pselect does). That means that it doesn't matter if we take a
> second WFx exit because once we reach the pselect in the second WFx
> exit the signal will have been consumed by the pselect in the first
> exit and we will just wait for the next one.
>
> I don't know why things may have been going wrong in your
> implementation but it may be related to the issue with
> mach_absolute_time() which I posted about separately and was also
> causing busy loops for us in some cases. Once that issue was fixed in
> our implementation we started seeing sleep until VTIMER due work
> properly.
>
>>
>>> I reckon that you could improve things a little by unblocking the
>>> signal and then reblocking it before unlocking iothread (e.g. with a
>>> pselect with zero time interval), which would flush any pending
>>> signals. Since any such signal would correspond to a signal from last
>>> time (because we still have the iothread lock) we know that any future
>>> signals should correspond to new IPIs.
>>
>> Yeah, I think you actually *have* to do exactly that, because otherwise
>> pselect() will always return after 0ns because the signal is still pending.
>>
>> And yes, I agree that that starts to sound a bit less racy now. But it
>> means we can probably also just do
>>
>>     - WFI handler
>>     - block SIG_IPI
>>     - set hvf->sleeping = true
>>     - check for pending interrupts
>>     - pselect()
>>     - unblock SIG_IPI
>>
>> which means we run with SIG_IPI unmasked by default. I don't think the
>> number of signal mask changes is any different with that compared to
>> running with SIG_IPI always masked, right?
> And unlock/lock iothread around the pselect? I suppose that could work
> but as I mentioned it would just be an optimization.
>
> Maybe I can try to make my approach work on top of your series, or if
> you already have a patch I can try to debug it. Let me know.


I would love to take a patch from you here :). I'll still be stuck for a 
while with the sysreg sync rework that Peter asked for before I can look 
at WFI again.


Alex

[PATCH] arm/hvf: Optimize and simplify WFI handling

Posted by Peter Collingbourne via 4 years, 3 months ago

Sleep on WFx until the VTIMER is due but allow ourselves to be woken
up on IPI.

Signed-off-by: Peter Collingbourne <pcc@google.com>
---
Alexander Graf wrote:
> I would love to take a patch from you here :). I'll still be stuck for a
> while with the sysreg sync rework that Peter asked for before I can look
> at WFI again.

Okay, here's a patch :) It's a relatively straightforward adaptation
of what we have in our fork, which can now boot Android to GUI while
remaining at around 4% CPU when idle.

I'm not set up to boot a full Linux distribution at the moment so I
tested it on upstream QEMU by running a recent mainline Linux kernel
with a rootfs containing an init program that just does sleep(5)
and verified that the qemu process remains at low CPU usage during
the sleep. This was on top of your v2 plus the last patch of your v1
since it doesn't look like you have a replacement for that logic yet.

 accel/hvf/hvf-cpus.c     |  5 +--
 include/sysemu/hvf_int.h |  3 +-
 target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
 3 files changed, 28 insertions(+), 74 deletions(-)

diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
index 4360f64671..b2c8fb57f6 100644
--- a/accel/hvf/hvf-cpus.c
+++ b/accel/hvf/hvf-cpus.c
@@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
     sigact.sa_handler = dummy_signal;
     sigaction(SIG_IPI, &sigact, NULL);
 
-    pthread_sigmask(SIG_BLOCK, NULL, &set);
-    sigdelset(&set, SIG_IPI);
-    pthread_sigmask(SIG_SETMASK, &set, NULL);
+    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
+    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
 
 #ifdef __aarch64__
     r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
index c56baa3ae8..13adf6ea77 100644
--- a/include/sysemu/hvf_int.h
+++ b/include/sysemu/hvf_int.h
@@ -62,8 +62,7 @@ extern HVFState *hvf_state;
 struct hvf_vcpu_state {
     uint64_t fd;
     void *exit;
-    struct timespec ts;
-    bool sleeping;
+    sigset_t unblock_ipi_mask;
 };
 
 void assert_hvf_ok(hv_return_t ret);
diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
index 8fe10966d2..60a361ff38 100644
--- a/target/arm/hvf/hvf.c
+++ b/target/arm/hvf/hvf.c
@@ -2,6 +2,7 @@
  * QEMU Hypervisor.framework support for Apple Silicon
 
  * Copyright 2020 Alexander Graf <agraf@csgraf.de>
+ * Copyright 2020 Google LLC
  *
  * This work is licensed under the terms of the GNU GPL, version 2 or later.
  * See the COPYING file in the top-level directory.
@@ -18,6 +19,7 @@
 #include "sysemu/hw_accel.h"
 
 #include <Hypervisor/Hypervisor.h>
+#include <mach/mach_time.h>
 
 #include "exec/address-spaces.h"
 #include "hw/irq.h"
@@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
 
 void hvf_kick_vcpu_thread(CPUState *cpu)
 {
-    if (cpu->hvf->sleeping) {
-        /*
-         * When sleeping, make sure we always send signals. Also, clear the
-         * timespec, so that an IPI that arrives between setting hvf->sleeping
-         * and the nanosleep syscall still aborts the sleep.
-         */
-        cpu->thread_kicked = false;
-        cpu->hvf->ts = (struct timespec){ };
-        cpus_kick_thread(cpu);
-    } else {
-        hv_vcpus_exit(&cpu->hvf->fd, 1);
-    }
+    cpus_kick_thread(cpu);
+    hv_vcpus_exit(&cpu->hvf->fd, 1);
 }
 
 static int hvf_inject_interrupts(CPUState *cpu)
@@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
         uint64_t syndrome = hvf_exit->exception.syndrome;
         uint32_t ec = syn_get_ec(syndrome);
 
+        qemu_mutex_lock_iothread();
         switch (exit_reason) {
         case HV_EXIT_REASON_EXCEPTION:
             /* This is the main one, handle below. */
             break;
         case HV_EXIT_REASON_VTIMER_ACTIVATED:
-            qemu_mutex_lock_iothread();
             current_cpu = cpu;
             qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
             qemu_mutex_unlock_iothread();
             continue;
         case HV_EXIT_REASON_CANCELED:
             /* we got kicked, no exit to process */
+            qemu_mutex_unlock_iothread();
             continue;
         default:
             assert(0);
@@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
             uint32_t srt = (syndrome >> 16) & 0x1f;
             uint64_t val = 0;
 
-            qemu_mutex_lock_iothread();
             current_cpu = cpu;
 
             DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
@@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
                 hvf_set_reg(cpu, srt, val);
             }
 
-            qemu_mutex_unlock_iothread();
-
             advance_pc = true;
             break;
         }
@@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
         case EC_WFX_TRAP:
             if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
                 (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
-                uint64_t cval, ctl, val, diff, now;
+                uint64_t cval;
 
-                /* Set up a local timer for vtimer if necessary ... */
-                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
-                assert_hvf_ok(r);
                 r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
                 assert_hvf_ok(r);
 
-                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
-                diff = cval - val;
-
-                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
-                      gt_cntfrq_period_ns(arm_cpu);
-
-                /* Timer disabled or masked, just wait for long */
-                if (!(ctl & 1) || (ctl & 2)) {
-                    diff = (120 * NANOSECONDS_PER_SECOND) /
-                           gt_cntfrq_period_ns(arm_cpu);
+                int64_t ticks_to_sleep = cval - mach_absolute_time();
+                if (ticks_to_sleep < 0) {
+                    break;
                 }
 
-                if (diff < INT64_MAX) {
-                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
-                    struct timespec *ts = &cpu->hvf->ts;
-
-                    *ts = (struct timespec){
-                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
-                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
-                    };
-
-                    /*
-                     * Waking up easily takes 1ms, don't go to sleep for smaller
-                     * time periods than 2ms.
-                     */
-                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
-                        advance_pc = true;
-                        break;
-                    }
-
-                    /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
-                    cpu->hvf->sleeping = true;
-                    smp_mb();
-
-                    /* Bail out if we received an IRQ meanwhile */
-                    if (cpu->thread_kicked || (cpu->interrupt_request &
-                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
-                        cpu->hvf->sleeping = false;
-                        break;
-                    }
-
-                    /* nanosleep returns on signal, so we wake up on kick. */
-                    nanosleep(ts, NULL);
-
-                    /* Out of sleep - either naturally or because of a kick */
-                    cpu->hvf->sleeping = false;
-                }
+                uint64_t seconds = ticks_to_sleep / arm_cpu->gt_cntfrq_hz;
+                uint64_t nanos =
+                    (ticks_to_sleep - arm_cpu->gt_cntfrq_hz * seconds) *
+                    1000000000 / arm_cpu->gt_cntfrq_hz;
+                struct timespec ts = { seconds, nanos };
+
+                /*
+                 * Use pselect to sleep so that other threads can IPI us while
+                 * we're sleeping.
+                 */
+                qatomic_mb_set(&cpu->thread_kicked, false);
+                qemu_mutex_unlock_iothread();
+                pselect(0, 0, 0, 0, &ts, &cpu->hvf->unblock_ipi_mask);
+                qemu_mutex_lock_iothread();
 
                 advance_pc = true;
             }
             break;
         case EC_AA64_HVC:
             cpu_synchronize_state(cpu);
-            qemu_mutex_lock_iothread();
             current_cpu = cpu;
             if (arm_is_psci_call(arm_cpu, EXCP_HVC)) {
                 arm_handle_psci_call(arm_cpu);
@@ -562,11 +520,9 @@ int hvf_vcpu_exec(CPUState *cpu)
                 DPRINTF("unknown HVC! %016llx", env->xregs[0]);
                 env->xregs[0] = -1;
             }
-            qemu_mutex_unlock_iothread();
             break;
         case EC_AA64_SMC:
             cpu_synchronize_state(cpu);
-            qemu_mutex_lock_iothread();
             current_cpu = cpu;
             if (arm_is_psci_call(arm_cpu, EXCP_SMC)) {
                 arm_handle_psci_call(arm_cpu);
@@ -575,7 +531,6 @@ int hvf_vcpu_exec(CPUState *cpu)
                 env->xregs[0] = -1;
                 env->pc += 4;
             }
-            qemu_mutex_unlock_iothread();
             break;
         default:
             cpu_synchronize_state(cpu);
@@ -594,6 +549,7 @@ int hvf_vcpu_exec(CPUState *cpu)
             r = hv_vcpu_set_reg(cpu->hvf->fd, HV_REG_PC, pc);
             assert_hvf_ok(r);
         }
+        qemu_mutex_unlock_iothread();
     } while (ret == 0);
 
     qemu_mutex_lock_iothread();
-- 
2.29.2.454.gaff20da3a2-goog

Re: [PATCH] arm/hvf: Optimize and simplify WFI handling

Posted by Alexander Graf 4 years, 3 months ago

Hi Peter,

On 01.12.20 09:21, Peter Collingbourne wrote:
> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> up on IPI.
>
> Signed-off-by: Peter Collingbourne <pcc@google.com>


Thanks a bunch!


> ---
> Alexander Graf wrote:
>> I would love to take a patch from you here :). I'll still be stuck for a
>> while with the sysreg sync rework that Peter asked for before I can look
>> at WFI again.
> Okay, here's a patch :) It's a relatively straightforward adaptation
> of what we have in our fork, which can now boot Android to GUI while
> remaining at around 4% CPU when idle.
>
> I'm not set up to boot a full Linux distribution at the moment so I
> tested it on upstream QEMU by running a recent mainline Linux kernel
> with a rootfs containing an init program that just does sleep(5)
> and verified that the qemu process remains at low CPU usage during
> the sleep. This was on top of your v2 plus the last patch of your v1
> since it doesn't look like you have a replacement for that logic yet.
>
>   accel/hvf/hvf-cpus.c     |  5 +--
>   include/sysemu/hvf_int.h |  3 +-
>   target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
>   3 files changed, 28 insertions(+), 74 deletions(-)
>
> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> index 4360f64671..b2c8fb57f6 100644
> --- a/accel/hvf/hvf-cpus.c
> +++ b/accel/hvf/hvf-cpus.c
> @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
>       sigact.sa_handler = dummy_signal;
>       sigaction(SIG_IPI, &sigact, NULL);
>   
> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> -    sigdelset(&set, SIG_IPI);
> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
> +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);


What will this do to the x86 hvf implementation? We're now not 
unblocking SIG_IPI again for that, right?


>   
>   #ifdef __aarch64__
>       r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> index c56baa3ae8..13adf6ea77 100644
> --- a/include/sysemu/hvf_int.h
> +++ b/include/sysemu/hvf_int.h
> @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
>   struct hvf_vcpu_state {
>       uint64_t fd;
>       void *exit;
> -    struct timespec ts;
> -    bool sleeping;
> +    sigset_t unblock_ipi_mask;
>   };
>   
>   void assert_hvf_ok(hv_return_t ret);
> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> index 8fe10966d2..60a361ff38 100644
> --- a/target/arm/hvf/hvf.c
> +++ b/target/arm/hvf/hvf.c
> @@ -2,6 +2,7 @@
>    * QEMU Hypervisor.framework support for Apple Silicon
>   
>    * Copyright 2020 Alexander Graf <agraf@csgraf.de>
> + * Copyright 2020 Google LLC
>    *
>    * This work is licensed under the terms of the GNU GPL, version 2 or later.
>    * See the COPYING file in the top-level directory.
> @@ -18,6 +19,7 @@
>   #include "sysemu/hw_accel.h"
>   
>   #include <Hypervisor/Hypervisor.h>
> +#include <mach/mach_time.h>
>   
>   #include "exec/address-spaces.h"
>   #include "hw/irq.h"
> @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
>   
>   void hvf_kick_vcpu_thread(CPUState *cpu)
>   {
> -    if (cpu->hvf->sleeping) {
> -        /*
> -         * When sleeping, make sure we always send signals. Also, clear the
> -         * timespec, so that an IPI that arrives between setting hvf->sleeping
> -         * and the nanosleep syscall still aborts the sleep.
> -         */
> -        cpu->thread_kicked = false;
> -        cpu->hvf->ts = (struct timespec){ };
> -        cpus_kick_thread(cpu);
> -    } else {
> -        hv_vcpus_exit(&cpu->hvf->fd, 1);
> -    }
> +    cpus_kick_thread(cpu);
> +    hv_vcpus_exit(&cpu->hvf->fd, 1);


This means your first WFI will almost always return immediately due to a 
pending signal, because there probably was an IRQ pending before on the 
same CPU, no?


>   }
>   
>   static int hvf_inject_interrupts(CPUState *cpu)
> @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
>           uint64_t syndrome = hvf_exit->exception.syndrome;
>           uint32_t ec = syn_get_ec(syndrome);
>   
> +        qemu_mutex_lock_iothread();


Is there a particular reason you're moving the iothread lock out again 
from the individual bits? I would really like to keep a notion of fast 
path exits.


>           switch (exit_reason) {
>           case HV_EXIT_REASON_EXCEPTION:
>               /* This is the main one, handle below. */
>               break;
>           case HV_EXIT_REASON_VTIMER_ACTIVATED:
> -            qemu_mutex_lock_iothread();
>               current_cpu = cpu;
>               qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
>               qemu_mutex_unlock_iothread();
>               continue;
>           case HV_EXIT_REASON_CANCELED:
>               /* we got kicked, no exit to process */
> +            qemu_mutex_unlock_iothread();
>               continue;
>           default:
>               assert(0);
> @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
>               uint32_t srt = (syndrome >> 16) & 0x1f;
>               uint64_t val = 0;
>   
> -            qemu_mutex_lock_iothread();
>               current_cpu = cpu;
>   
>               DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
> @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
>                   hvf_set_reg(cpu, srt, val);
>               }
>   
> -            qemu_mutex_unlock_iothread();
> -
>               advance_pc = true;
>               break;
>           }
> @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
>           case EC_WFX_TRAP:
>               if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
>                   (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> -                uint64_t cval, ctl, val, diff, now;
> +                uint64_t cval;
>   
> -                /* Set up a local timer for vtimer if necessary ... */
> -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
> -                assert_hvf_ok(r);
>                   r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
>                   assert_hvf_ok(r);
>   
> -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
> -                diff = cval - val;
> -
> -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
> -                      gt_cntfrq_period_ns(arm_cpu);
> -
> -                /* Timer disabled or masked, just wait for long */
> -                if (!(ctl & 1) || (ctl & 2)) {
> -                    diff = (120 * NANOSECONDS_PER_SECOND) /
> -                           gt_cntfrq_period_ns(arm_cpu);
> +                int64_t ticks_to_sleep = cval - mach_absolute_time();
> +                if (ticks_to_sleep < 0) {
> +                    break;


This will loop at 100% for Windows, which configures the vtimer as 
cval=0 ctl=7, so with IRQ mask bit set.


Alex


>                   }
>   
> -                if (diff < INT64_MAX) {
> -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> -                    struct timespec *ts = &cpu->hvf->ts;
> -
> -                    *ts = (struct timespec){
> -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
> -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
> -                    };
> -
> -                    /*
> -                     * Waking up easily takes 1ms, don't go to sleep for smaller
> -                     * time periods than 2ms.
> -                     */
> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {


I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to 
return. Without logic like this, super short WFIs will hurt performance 
quite badly.


Alex

> -                        advance_pc = true;
> -                        break;
> -                    }
> -
> -                    /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
> -                    cpu->hvf->sleeping = true;
> -                    smp_mb();
> -
> -                    /* Bail out if we received an IRQ meanwhile */
> -                    if (cpu->thread_kicked || (cpu->interrupt_request &
> -                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> -                        cpu->hvf->sleeping = false;
> -                        break;
> -                    }
> -
> -                    /* nanosleep returns on signal, so we wake up on kick. */
> -                    nanosleep(ts, NULL);
> -
> -                    /* Out of sleep - either naturally or because of a kick */
> -                    cpu->hvf->sleeping = false;
> -                }
> +                uint64_t seconds = ticks_to_sleep / arm_cpu->gt_cntfrq_hz;
> +                uint64_t nanos =
> +                    (ticks_to_sleep - arm_cpu->gt_cntfrq_hz * seconds) *
> +                    1000000000 / arm_cpu->gt_cntfrq_hz;
> +                struct timespec ts = { seconds, nanos };
> +
> +                /*
> +                 * Use pselect to sleep so that other threads can IPI us while
> +                 * we're sleeping.
> +                 */
> +                qatomic_mb_set(&cpu->thread_kicked, false);
> +                qemu_mutex_unlock_iothread();
> +                pselect(0, 0, 0, 0, &ts, &cpu->hvf->unblock_ipi_mask);
> +                qemu_mutex_lock_iothread();
>   
>                   advance_pc = true;
>               }
>               break;
>           case EC_AA64_HVC:
>               cpu_synchronize_state(cpu);
> -            qemu_mutex_lock_iothread();
>               current_cpu = cpu;
>               if (arm_is_psci_call(arm_cpu, EXCP_HVC)) {
>                   arm_handle_psci_call(arm_cpu);
> @@ -562,11 +520,9 @@ int hvf_vcpu_exec(CPUState *cpu)
>                   DPRINTF("unknown HVC! %016llx", env->xregs[0]);
>                   env->xregs[0] = -1;
>               }
> -            qemu_mutex_unlock_iothread();
>               break;
>           case EC_AA64_SMC:
>               cpu_synchronize_state(cpu);
> -            qemu_mutex_lock_iothread();
>               current_cpu = cpu;
>               if (arm_is_psci_call(arm_cpu, EXCP_SMC)) {
>                   arm_handle_psci_call(arm_cpu);
> @@ -575,7 +531,6 @@ int hvf_vcpu_exec(CPUState *cpu)
>                   env->xregs[0] = -1;
>                   env->pc += 4;
>               }
> -            qemu_mutex_unlock_iothread();
>               break;
>           default:
>               cpu_synchronize_state(cpu);
> @@ -594,6 +549,7 @@ int hvf_vcpu_exec(CPUState *cpu)
>               r = hv_vcpu_set_reg(cpu->hvf->fd, HV_REG_PC, pc);
>               assert_hvf_ok(r);
>           }
> +        qemu_mutex_unlock_iothread();
>       } while (ret == 0);
>   
>       qemu_mutex_lock_iothread();

Re: [PATCH] arm/hvf: Optimize and simplify WFI handling

Posted by Peter Collingbourne 4 years, 3 months ago

On Tue, Dec 1, 2020 at 3:16 AM Alexander Graf <agraf@csgraf.de> wrote:
>
> Hi Peter,
>
> On 01.12.20 09:21, Peter Collingbourne wrote:
> > Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> > up on IPI.
> >
> > Signed-off-by: Peter Collingbourne <pcc@google.com>
>
>
> Thanks a bunch!
>
>
> > ---
> > Alexander Graf wrote:
> >> I would love to take a patch from you here :). I'll still be stuck for a
> >> while with the sysreg sync rework that Peter asked for before I can look
> >> at WFI again.
> > Okay, here's a patch :) It's a relatively straightforward adaptation
> > of what we have in our fork, which can now boot Android to GUI while
> > remaining at around 4% CPU when idle.
> >
> > I'm not set up to boot a full Linux distribution at the moment so I
> > tested it on upstream QEMU by running a recent mainline Linux kernel
> > with a rootfs containing an init program that just does sleep(5)
> > and verified that the qemu process remains at low CPU usage during
> > the sleep. This was on top of your v2 plus the last patch of your v1
> > since it doesn't look like you have a replacement for that logic yet.
> >
> >   accel/hvf/hvf-cpus.c     |  5 +--
> >   include/sysemu/hvf_int.h |  3 +-
> >   target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
> >   3 files changed, 28 insertions(+), 74 deletions(-)
> >
> > diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> > index 4360f64671..b2c8fb57f6 100644
> > --- a/accel/hvf/hvf-cpus.c
> > +++ b/accel/hvf/hvf-cpus.c
> > @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
> >       sigact.sa_handler = dummy_signal;
> >       sigaction(SIG_IPI, &sigact, NULL);
> >
> > -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> > -    sigdelset(&set, SIG_IPI);
> > -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> > +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
> > +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
>
>
> What will this do to the x86 hvf implementation? We're now not
> unblocking SIG_IPI again for that, right?

Yes and that was the case before your patch series.

> >
> >   #ifdef __aarch64__
> >       r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
> > diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> > index c56baa3ae8..13adf6ea77 100644
> > --- a/include/sysemu/hvf_int.h
> > +++ b/include/sysemu/hvf_int.h
> > @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
> >   struct hvf_vcpu_state {
> >       uint64_t fd;
> >       void *exit;
> > -    struct timespec ts;
> > -    bool sleeping;
> > +    sigset_t unblock_ipi_mask;
> >   };
> >
> >   void assert_hvf_ok(hv_return_t ret);
> > diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> > index 8fe10966d2..60a361ff38 100644
> > --- a/target/arm/hvf/hvf.c
> > +++ b/target/arm/hvf/hvf.c
> > @@ -2,6 +2,7 @@
> >    * QEMU Hypervisor.framework support for Apple Silicon
> >
> >    * Copyright 2020 Alexander Graf <agraf@csgraf.de>
> > + * Copyright 2020 Google LLC
> >    *
> >    * This work is licensed under the terms of the GNU GPL, version 2 or later.
> >    * See the COPYING file in the top-level directory.
> > @@ -18,6 +19,7 @@
> >   #include "sysemu/hw_accel.h"
> >
> >   #include <Hypervisor/Hypervisor.h>
> > +#include <mach/mach_time.h>
> >
> >   #include "exec/address-spaces.h"
> >   #include "hw/irq.h"
> > @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
> >
> >   void hvf_kick_vcpu_thread(CPUState *cpu)
> >   {
> > -    if (cpu->hvf->sleeping) {
> > -        /*
> > -         * When sleeping, make sure we always send signals. Also, clear the
> > -         * timespec, so that an IPI that arrives between setting hvf->sleeping
> > -         * and the nanosleep syscall still aborts the sleep.
> > -         */
> > -        cpu->thread_kicked = false;
> > -        cpu->hvf->ts = (struct timespec){ };
> > -        cpus_kick_thread(cpu);
> > -    } else {
> > -        hv_vcpus_exit(&cpu->hvf->fd, 1);
> > -    }
> > +    cpus_kick_thread(cpu);
> > +    hv_vcpus_exit(&cpu->hvf->fd, 1);
>
>
> This means your first WFI will almost always return immediately due to a
> pending signal, because there probably was an IRQ pending before on the
> same CPU, no?

That's right. Any approach involving the "sleeping" field would need
to be implemented carefully to avoid races that may result in missed
wakeups so for simplicity I just decided to send both kinds of
wakeups. In particular the approach in the updated patch you sent is
racy and I'll elaborate more in the reply to that patch.

> >   }
> >
> >   static int hvf_inject_interrupts(CPUState *cpu)
> > @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
> >           uint64_t syndrome = hvf_exit->exception.syndrome;
> >           uint32_t ec = syn_get_ec(syndrome);
> >
> > +        qemu_mutex_lock_iothread();
>
>
> Is there a particular reason you're moving the iothread lock out again
> from the individual bits? I would really like to keep a notion of fast
> path exits.

We still need to lock at least once no matter the exit reason to check
the interrupts so I don't think it's worth it to try and avoid locking
like this. It also makes the implementation easier to reason about and
therefore more likely to be correct. In our implementation we just
stay locked the whole time unless we're in hv_vcpu_run() or pselect().

> >           switch (exit_reason) {
> >           case HV_EXIT_REASON_EXCEPTION:
> >               /* This is the main one, handle below. */
> >               break;
> >           case HV_EXIT_REASON_VTIMER_ACTIVATED:
> > -            qemu_mutex_lock_iothread();
> >               current_cpu = cpu;
> >               qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
> >               qemu_mutex_unlock_iothread();
> >               continue;
> >           case HV_EXIT_REASON_CANCELED:
> >               /* we got kicked, no exit to process */
> > +            qemu_mutex_unlock_iothread();
> >               continue;
> >           default:
> >               assert(0);
> > @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> >               uint32_t srt = (syndrome >> 16) & 0x1f;
> >               uint64_t val = 0;
> >
> > -            qemu_mutex_lock_iothread();
> >               current_cpu = cpu;
> >
> >               DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
> > @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> >                   hvf_set_reg(cpu, srt, val);
> >               }
> >
> > -            qemu_mutex_unlock_iothread();
> > -
> >               advance_pc = true;
> >               break;
> >           }
> > @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
> >           case EC_WFX_TRAP:
> >               if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
> >                   (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > -                uint64_t cval, ctl, val, diff, now;
> > +                uint64_t cval;
> >
> > -                /* Set up a local timer for vtimer if necessary ... */
> > -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
> > -                assert_hvf_ok(r);
> >                   r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
> >                   assert_hvf_ok(r);
> >
> > -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
> > -                diff = cval - val;
> > -
> > -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
> > -                      gt_cntfrq_period_ns(arm_cpu);
> > -
> > -                /* Timer disabled or masked, just wait for long */
> > -                if (!(ctl & 1) || (ctl & 2)) {
> > -                    diff = (120 * NANOSECONDS_PER_SECOND) /
> > -                           gt_cntfrq_period_ns(arm_cpu);
> > +                int64_t ticks_to_sleep = cval - mach_absolute_time();
> > +                if (ticks_to_sleep < 0) {
> > +                    break;
>
>
> This will loop at 100% for Windows, which configures the vtimer as
> cval=0 ctl=7, so with IRQ mask bit set.

Okay, but the 120s is kind of arbitrary so we should just sleep until
we get a signal. That can be done by passing null as the timespec
argument to pselect().

>
>
> Alex
>
>
> >                   }
> >
> > -                if (diff < INT64_MAX) {
> > -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> > -                    struct timespec *ts = &cpu->hvf->ts;
> > -
> > -                    *ts = (struct timespec){
> > -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
> > -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
> > -                    };
> > -
> > -                    /*
> > -                     * Waking up easily takes 1ms, don't go to sleep for smaller
> > -                     * time periods than 2ms.
> > -                     */
> > -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
>
>
> I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to
> return. Without logic like this, super short WFIs will hurt performance
> quite badly.

I don't think that's accurate. According to this benchmark it's a few
hundred nanoseconds at most.

pcc@pac-mini /tmp> cat pselect.c
#include <signal.h>
#include <sys/select.h>

int main() {
  sigset_t mask, orig_mask;
  pthread_sigmask(SIG_SETMASK, 0, &mask);
  sigaddset(&mask, SIGUSR1);
  pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);

  for (int i = 0; i != 1000000; ++i) {
    struct timespec ts = { 0, 1 };
    pselect(0, 0, 0, 0, &ts, &orig_mask);
  }
}
pcc@pac-mini /tmp> time ./pselect

________________________________________________________
Executed in  179.87 millis    fish           external
   usr time   77.68 millis   57.00 micros   77.62 millis
   sys time  101.37 millis  852.00 micros  100.52 millis

Besides, all that you're really saving here is the single pselect
call. There are no doubt more expensive syscalls involved in exiting
and entering the VCPU that would dominate here.

Peter

>
>
> Alex
>
> > -                        advance_pc = true;
> > -                        break;
> > -                    }
> > -
> > -                    /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
> > -                    cpu->hvf->sleeping = true;
> > -                    smp_mb();
> > -
> > -                    /* Bail out if we received an IRQ meanwhile */
> > -                    if (cpu->thread_kicked || (cpu->interrupt_request &
> > -                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > -                        cpu->hvf->sleeping = false;
> > -                        break;
> > -                    }
> > -
> > -                    /* nanosleep returns on signal, so we wake up on kick. */
> > -                    nanosleep(ts, NULL);
> > -
> > -                    /* Out of sleep - either naturally or because of a kick */
> > -                    cpu->hvf->sleeping = false;
> > -                }
> > +                uint64_t seconds = ticks_to_sleep / arm_cpu->gt_cntfrq_hz;
> > +                uint64_t nanos =
> > +                    (ticks_to_sleep - arm_cpu->gt_cntfrq_hz * seconds) *
> > +                    1000000000 / arm_cpu->gt_cntfrq_hz;
> > +                struct timespec ts = { seconds, nanos };
> > +
> > +                /*
> > +                 * Use pselect to sleep so that other threads can IPI us while
> > +                 * we're sleeping.
> > +                 */
> > +                qatomic_mb_set(&cpu->thread_kicked, false);
> > +                qemu_mutex_unlock_iothread();
> > +                pselect(0, 0, 0, 0, &ts, &cpu->hvf->unblock_ipi_mask);
> > +                qemu_mutex_lock_iothread();
> >
> >                   advance_pc = true;
> >               }
> >               break;
> >           case EC_AA64_HVC:
> >               cpu_synchronize_state(cpu);
> > -            qemu_mutex_lock_iothread();
> >               current_cpu = cpu;
> >               if (arm_is_psci_call(arm_cpu, EXCP_HVC)) {
> >                   arm_handle_psci_call(arm_cpu);
> > @@ -562,11 +520,9 @@ int hvf_vcpu_exec(CPUState *cpu)
> >                   DPRINTF("unknown HVC! %016llx", env->xregs[0]);
> >                   env->xregs[0] = -1;
> >               }
> > -            qemu_mutex_unlock_iothread();
> >               break;
> >           case EC_AA64_SMC:
> >               cpu_synchronize_state(cpu);
> > -            qemu_mutex_lock_iothread();
> >               current_cpu = cpu;
> >               if (arm_is_psci_call(arm_cpu, EXCP_SMC)) {
> >                   arm_handle_psci_call(arm_cpu);
> > @@ -575,7 +531,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> >                   env->xregs[0] = -1;
> >                   env->pc += 4;
> >               }
> > -            qemu_mutex_unlock_iothread();
> >               break;
> >           default:
> >               cpu_synchronize_state(cpu);
> > @@ -594,6 +549,7 @@ int hvf_vcpu_exec(CPUState *cpu)
> >               r = hv_vcpu_set_reg(cpu->hvf->fd, HV_REG_PC, pc);
> >               assert_hvf_ok(r);
> >           }
> > +        qemu_mutex_unlock_iothread();
> >       } while (ret == 0);
> >
> >       qemu_mutex_lock_iothread();

Re: [PATCH] arm/hvf: Optimize and simplify WFI handling

Posted by Alexander Graf 4 years, 3 months ago

On 01.12.20 19:59, Peter Collingbourne wrote:
> On Tue, Dec 1, 2020 at 3:16 AM Alexander Graf <agraf@csgraf.de> wrote:
>> Hi Peter,
>>
>> On 01.12.20 09:21, Peter Collingbourne wrote:
>>> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
>>> up on IPI.
>>>
>>> Signed-off-by: Peter Collingbourne <pcc@google.com>
>>
>> Thanks a bunch!
>>
>>
>>> ---
>>> Alexander Graf wrote:
>>>> I would love to take a patch from you here :). I'll still be stuck for a
>>>> while with the sysreg sync rework that Peter asked for before I can look
>>>> at WFI again.
>>> Okay, here's a patch :) It's a relatively straightforward adaptation
>>> of what we have in our fork, which can now boot Android to GUI while
>>> remaining at around 4% CPU when idle.
>>>
>>> I'm not set up to boot a full Linux distribution at the moment so I
>>> tested it on upstream QEMU by running a recent mainline Linux kernel
>>> with a rootfs containing an init program that just does sleep(5)
>>> and verified that the qemu process remains at low CPU usage during
>>> the sleep. This was on top of your v2 plus the last patch of your v1
>>> since it doesn't look like you have a replacement for that logic yet.
>>>
>>>    accel/hvf/hvf-cpus.c     |  5 +--
>>>    include/sysemu/hvf_int.h |  3 +-
>>>    target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
>>>    3 files changed, 28 insertions(+), 74 deletions(-)
>>>
>>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>>> index 4360f64671..b2c8fb57f6 100644
>>> --- a/accel/hvf/hvf-cpus.c
>>> +++ b/accel/hvf/hvf-cpus.c
>>> @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
>>>        sigact.sa_handler = dummy_signal;
>>>        sigaction(SIG_IPI, &sigact, NULL);
>>>
>>> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>> -    sigdelset(&set, SIG_IPI);
>>> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
>>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
>>> +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
>>
>> What will this do to the x86 hvf implementation? We're now not
>> unblocking SIG_IPI again for that, right?
> Yes and that was the case before your patch series.


The way I understand Roman, he wanted to unblock the IPI signal on x86:

https://patchwork.kernel.org/project/qemu-devel/patch/20201126215017.41156-3-agraf@csgraf.de/#23807021

I agree that at this point it's not a problem though to break it again. 
I'm not quite sure how to merge your patches within my patch set though, 
given they basically revert half of my previously introduced code...


>
>>>    #ifdef __aarch64__
>>>        r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
>>> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>>> index c56baa3ae8..13adf6ea77 100644
>>> --- a/include/sysemu/hvf_int.h
>>> +++ b/include/sysemu/hvf_int.h
>>> @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
>>>    struct hvf_vcpu_state {
>>>        uint64_t fd;
>>>        void *exit;
>>> -    struct timespec ts;
>>> -    bool sleeping;
>>> +    sigset_t unblock_ipi_mask;
>>>    };
>>>
>>>    void assert_hvf_ok(hv_return_t ret);
>>> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
>>> index 8fe10966d2..60a361ff38 100644
>>> --- a/target/arm/hvf/hvf.c
>>> +++ b/target/arm/hvf/hvf.c
>>> @@ -2,6 +2,7 @@
>>>     * QEMU Hypervisor.framework support for Apple Silicon
>>>
>>>     * Copyright 2020 Alexander Graf <agraf@csgraf.de>
>>> + * Copyright 2020 Google LLC
>>>     *
>>>     * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>>     * See the COPYING file in the top-level directory.
>>> @@ -18,6 +19,7 @@
>>>    #include "sysemu/hw_accel.h"
>>>
>>>    #include <Hypervisor/Hypervisor.h>
>>> +#include <mach/mach_time.h>
>>>
>>>    #include "exec/address-spaces.h"
>>>    #include "hw/irq.h"
>>> @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
>>>
>>>    void hvf_kick_vcpu_thread(CPUState *cpu)
>>>    {
>>> -    if (cpu->hvf->sleeping) {
>>> -        /*
>>> -         * When sleeping, make sure we always send signals. Also, clear the
>>> -         * timespec, so that an IPI that arrives between setting hvf->sleeping
>>> -         * and the nanosleep syscall still aborts the sleep.
>>> -         */
>>> -        cpu->thread_kicked = false;
>>> -        cpu->hvf->ts = (struct timespec){ };
>>> -        cpus_kick_thread(cpu);
>>> -    } else {
>>> -        hv_vcpus_exit(&cpu->hvf->fd, 1);
>>> -    }
>>> +    cpus_kick_thread(cpu);
>>> +    hv_vcpus_exit(&cpu->hvf->fd, 1);
>>
>> This means your first WFI will almost always return immediately due to a
>> pending signal, because there probably was an IRQ pending before on the
>> same CPU, no?
> That's right. Any approach involving the "sleeping" field would need
> to be implemented carefully to avoid races that may result in missed
> wakeups so for simplicity I just decided to send both kinds of
> wakeups. In particular the approach in the updated patch you sent is
> racy and I'll elaborate more in the reply to that patch.
>
>>>    }
>>>
>>>    static int hvf_inject_interrupts(CPUState *cpu)
>>> @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>            uint64_t syndrome = hvf_exit->exception.syndrome;
>>>            uint32_t ec = syn_get_ec(syndrome);
>>>
>>> +        qemu_mutex_lock_iothread();
>>
>> Is there a particular reason you're moving the iothread lock out again
>> from the individual bits? I would really like to keep a notion of fast
>> path exits.
> We still need to lock at least once no matter the exit reason to check
> the interrupts so I don't think it's worth it to try and avoid locking
> like this. It also makes the implementation easier to reason about and
> therefore more likely to be correct. In our implementation we just
> stay locked the whole time unless we're in hv_vcpu_run() or pselect().
>
>>>            switch (exit_reason) {
>>>            case HV_EXIT_REASON_EXCEPTION:
>>>                /* This is the main one, handle below. */
>>>                break;
>>>            case HV_EXIT_REASON_VTIMER_ACTIVATED:
>>> -            qemu_mutex_lock_iothread();
>>>                current_cpu = cpu;
>>>                qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
>>>                qemu_mutex_unlock_iothread();
>>>                continue;
>>>            case HV_EXIT_REASON_CANCELED:
>>>                /* we got kicked, no exit to process */
>>> +            qemu_mutex_unlock_iothread();
>>>                continue;
>>>            default:
>>>                assert(0);
>>> @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>                uint32_t srt = (syndrome >> 16) & 0x1f;
>>>                uint64_t val = 0;
>>>
>>> -            qemu_mutex_lock_iothread();
>>>                current_cpu = cpu;
>>>
>>>                DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
>>> @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>                    hvf_set_reg(cpu, srt, val);
>>>                }
>>>
>>> -            qemu_mutex_unlock_iothread();
>>> -
>>>                advance_pc = true;
>>>                break;
>>>            }
>>> @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>            case EC_WFX_TRAP:
>>>                if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
>>>                    (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
>>> -                uint64_t cval, ctl, val, diff, now;
>>> +                uint64_t cval;
>>>
>>> -                /* Set up a local timer for vtimer if necessary ... */
>>> -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
>>> -                assert_hvf_ok(r);
>>>                    r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
>>>                    assert_hvf_ok(r);
>>>
>>> -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
>>> -                diff = cval - val;
>>> -
>>> -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
>>> -                      gt_cntfrq_period_ns(arm_cpu);
>>> -
>>> -                /* Timer disabled or masked, just wait for long */
>>> -                if (!(ctl & 1) || (ctl & 2)) {
>>> -                    diff = (120 * NANOSECONDS_PER_SECOND) /
>>> -                           gt_cntfrq_period_ns(arm_cpu);
>>> +                int64_t ticks_to_sleep = cval - mach_absolute_time();
>>> +                if (ticks_to_sleep < 0) {
>>> +                    break;
>>
>> This will loop at 100% for Windows, which configures the vtimer as
>> cval=0 ctl=7, so with IRQ mask bit set.
> Okay, but the 120s is kind of arbitrary so we should just sleep until
> we get a signal. That can be done by passing null as the timespec
> argument to pselect().


The reason I capped it at 120s was so that if I do hit a race, you don't 
break everything forever. Only for 2 minutes :).


>
>>
>> Alex
>>
>>
>>>                    }
>>>
>>> -                if (diff < INT64_MAX) {
>>> -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
>>> -                    struct timespec *ts = &cpu->hvf->ts;
>>> -
>>> -                    *ts = (struct timespec){
>>> -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
>>> -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
>>> -                    };
>>> -
>>> -                    /*
>>> -                     * Waking up easily takes 1ms, don't go to sleep for smaller
>>> -                     * time periods than 2ms.
>>> -                     */
>>> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
>>
>> I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to
>> return. Without logic like this, super short WFIs will hurt performance
>> quite badly.
> I don't think that's accurate. According to this benchmark it's a few
> hundred nanoseconds at most.
>
> pcc@pac-mini /tmp> cat pselect.c
> #include <signal.h>
> #include <sys/select.h>
>
> int main() {
>    sigset_t mask, orig_mask;
>    pthread_sigmask(SIG_SETMASK, 0, &mask);
>    sigaddset(&mask, SIGUSR1);
>    pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
>
>    for (int i = 0; i != 1000000; ++i) {
>      struct timespec ts = { 0, 1 };
>      pselect(0, 0, 0, 0, &ts, &orig_mask);
>    }
> }
> pcc@pac-mini /tmp> time ./pselect
>
> ________________________________________________________
> Executed in  179.87 millis    fish           external
>     usr time   77.68 millis   57.00 micros   77.62 millis
>     sys time  101.37 millis  852.00 micros  100.52 millis
>
> Besides, all that you're really saving here is the single pselect
> call. There are no doubt more expensive syscalls involved in exiting
> and entering the VCPU that would dominate here.


I would expect that such a super low ts value has a short-circuit path 
in the kernel as well. Where things start to fall apart is when you're 
at a threshold where rescheduling might be ok, but then you need to take 
all of the additional task switch overhead into account. Try to adapt 
your test code a bit:

#include <signal.h>
#include <sys/select.h>

int main() {
   sigset_t mask, orig_mask;
   pthread_sigmask(SIG_SETMASK, 0, &mask);
   sigaddset(&mask, SIGUSR1);
   pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);

   for (int i = 0; i != 10000; ++i) {
#define SCALE_MS 1000000
     struct timespec ts = { 0, SCALE_MS / 10 };
     pselect(0, 0, 0, 0, &ts, &orig_mask);
   }
}


% time ./pselect
./pselect  0.00s user 0.01s system 1% cpu 1.282 total

You're suddenly seeing 300µs overhead per pselect call then. When I 
measured actual enter/exit times in QEMU, I saw much bigger differences 
between "time I want to sleep for" and "time I did sleep" even when just 
capturing the virtual time before and after the nanosleep/pselect call.


Alex

Re: [PATCH] arm/hvf: Optimize and simplify WFI handling

Posted by Peter Collingbourne 4 years, 3 months ago

On Tue, Dec 1, 2020 at 2:04 PM Alexander Graf <agraf@csgraf.de> wrote:
>
>
> On 01.12.20 19:59, Peter Collingbourne wrote:
> > On Tue, Dec 1, 2020 at 3:16 AM Alexander Graf <agraf@csgraf.de> wrote:
> >> Hi Peter,
> >>
> >> On 01.12.20 09:21, Peter Collingbourne wrote:
> >>> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> >>> up on IPI.
> >>>
> >>> Signed-off-by: Peter Collingbourne <pcc@google.com>
> >>
> >> Thanks a bunch!
> >>
> >>
> >>> ---
> >>> Alexander Graf wrote:
> >>>> I would love to take a patch from you here :). I'll still be stuck for a
> >>>> while with the sysreg sync rework that Peter asked for before I can look
> >>>> at WFI again.
> >>> Okay, here's a patch :) It's a relatively straightforward adaptation
> >>> of what we have in our fork, which can now boot Android to GUI while
> >>> remaining at around 4% CPU when idle.
> >>>
> >>> I'm not set up to boot a full Linux distribution at the moment so I
> >>> tested it on upstream QEMU by running a recent mainline Linux kernel
> >>> with a rootfs containing an init program that just does sleep(5)
> >>> and verified that the qemu process remains at low CPU usage during
> >>> the sleep. This was on top of your v2 plus the last patch of your v1
> >>> since it doesn't look like you have a replacement for that logic yet.
> >>>
> >>>    accel/hvf/hvf-cpus.c     |  5 +--
> >>>    include/sysemu/hvf_int.h |  3 +-
> >>>    target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
> >>>    3 files changed, 28 insertions(+), 74 deletions(-)
> >>>
> >>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> >>> index 4360f64671..b2c8fb57f6 100644
> >>> --- a/accel/hvf/hvf-cpus.c
> >>> +++ b/accel/hvf/hvf-cpus.c
> >>> @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
> >>>        sigact.sa_handler = dummy_signal;
> >>>        sigaction(SIG_IPI, &sigact, NULL);
> >>>
> >>> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> >>> -    sigdelset(&set, SIG_IPI);
> >>> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> >>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
> >>> +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
> >>
> >> What will this do to the x86 hvf implementation? We're now not
> >> unblocking SIG_IPI again for that, right?
> > Yes and that was the case before your patch series.
>
>
> The way I understand Roman, he wanted to unblock the IPI signal on x86:
>
> https://patchwork.kernel.org/project/qemu-devel/patch/20201126215017.41156-3-agraf@csgraf.de/#23807021
>
> I agree that at this point it's not a problem though to break it again.
> I'm not quite sure how to merge your patches within my patch set though,
> given they basically revert half of my previously introduced code...
>
>
> >
> >>>    #ifdef __aarch64__
> >>>        r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
> >>> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> >>> index c56baa3ae8..13adf6ea77 100644
> >>> --- a/include/sysemu/hvf_int.h
> >>> +++ b/include/sysemu/hvf_int.h
> >>> @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
> >>>    struct hvf_vcpu_state {
> >>>        uint64_t fd;
> >>>        void *exit;
> >>> -    struct timespec ts;
> >>> -    bool sleeping;
> >>> +    sigset_t unblock_ipi_mask;
> >>>    };
> >>>
> >>>    void assert_hvf_ok(hv_return_t ret);
> >>> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> >>> index 8fe10966d2..60a361ff38 100644
> >>> --- a/target/arm/hvf/hvf.c
> >>> +++ b/target/arm/hvf/hvf.c
> >>> @@ -2,6 +2,7 @@
> >>>     * QEMU Hypervisor.framework support for Apple Silicon
> >>>
> >>>     * Copyright 2020 Alexander Graf <agraf@csgraf.de>
> >>> + * Copyright 2020 Google LLC
> >>>     *
> >>>     * This work is licensed under the terms of the GNU GPL, version 2 or later.
> >>>     * See the COPYING file in the top-level directory.
> >>> @@ -18,6 +19,7 @@
> >>>    #include "sysemu/hw_accel.h"
> >>>
> >>>    #include <Hypervisor/Hypervisor.h>
> >>> +#include <mach/mach_time.h>
> >>>
> >>>    #include "exec/address-spaces.h"
> >>>    #include "hw/irq.h"
> >>> @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
> >>>
> >>>    void hvf_kick_vcpu_thread(CPUState *cpu)
> >>>    {
> >>> -    if (cpu->hvf->sleeping) {
> >>> -        /*
> >>> -         * When sleeping, make sure we always send signals. Also, clear the
> >>> -         * timespec, so that an IPI that arrives between setting hvf->sleeping
> >>> -         * and the nanosleep syscall still aborts the sleep.
> >>> -         */
> >>> -        cpu->thread_kicked = false;
> >>> -        cpu->hvf->ts = (struct timespec){ };
> >>> -        cpus_kick_thread(cpu);
> >>> -    } else {
> >>> -        hv_vcpus_exit(&cpu->hvf->fd, 1);
> >>> -    }
> >>> +    cpus_kick_thread(cpu);
> >>> +    hv_vcpus_exit(&cpu->hvf->fd, 1);
> >>
> >> This means your first WFI will almost always return immediately due to a
> >> pending signal, because there probably was an IRQ pending before on the
> >> same CPU, no?
> > That's right. Any approach involving the "sleeping" field would need
> > to be implemented carefully to avoid races that may result in missed
> > wakeups so for simplicity I just decided to send both kinds of
> > wakeups. In particular the approach in the updated patch you sent is
> > racy and I'll elaborate more in the reply to that patch.
> >
> >>>    }
> >>>
> >>>    static int hvf_inject_interrupts(CPUState *cpu)
> >>> @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>            uint64_t syndrome = hvf_exit->exception.syndrome;
> >>>            uint32_t ec = syn_get_ec(syndrome);
> >>>
> >>> +        qemu_mutex_lock_iothread();
> >>
> >> Is there a particular reason you're moving the iothread lock out again
> >> from the individual bits? I would really like to keep a notion of fast
> >> path exits.
> > We still need to lock at least once no matter the exit reason to check
> > the interrupts so I don't think it's worth it to try and avoid locking
> > like this. It also makes the implementation easier to reason about and
> > therefore more likely to be correct. In our implementation we just
> > stay locked the whole time unless we're in hv_vcpu_run() or pselect().
> >
> >>>            switch (exit_reason) {
> >>>            case HV_EXIT_REASON_EXCEPTION:
> >>>                /* This is the main one, handle below. */
> >>>                break;
> >>>            case HV_EXIT_REASON_VTIMER_ACTIVATED:
> >>> -            qemu_mutex_lock_iothread();
> >>>                current_cpu = cpu;
> >>>                qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
> >>>                qemu_mutex_unlock_iothread();
> >>>                continue;
> >>>            case HV_EXIT_REASON_CANCELED:
> >>>                /* we got kicked, no exit to process */
> >>> +            qemu_mutex_unlock_iothread();
> >>>                continue;
> >>>            default:
> >>>                assert(0);
> >>> @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>                uint32_t srt = (syndrome >> 16) & 0x1f;
> >>>                uint64_t val = 0;
> >>>
> >>> -            qemu_mutex_lock_iothread();
> >>>                current_cpu = cpu;
> >>>
> >>>                DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
> >>> @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>                    hvf_set_reg(cpu, srt, val);
> >>>                }
> >>>
> >>> -            qemu_mutex_unlock_iothread();
> >>> -
> >>>                advance_pc = true;
> >>>                break;
> >>>            }
> >>> @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>            case EC_WFX_TRAP:
> >>>                if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
> >>>                    (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> >>> -                uint64_t cval, ctl, val, diff, now;
> >>> +                uint64_t cval;
> >>>
> >>> -                /* Set up a local timer for vtimer if necessary ... */
> >>> -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
> >>> -                assert_hvf_ok(r);
> >>>                    r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
> >>>                    assert_hvf_ok(r);
> >>>
> >>> -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
> >>> -                diff = cval - val;
> >>> -
> >>> -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
> >>> -                      gt_cntfrq_period_ns(arm_cpu);
> >>> -
> >>> -                /* Timer disabled or masked, just wait for long */
> >>> -                if (!(ctl & 1) || (ctl & 2)) {
> >>> -                    diff = (120 * NANOSECONDS_PER_SECOND) /
> >>> -                           gt_cntfrq_period_ns(arm_cpu);
> >>> +                int64_t ticks_to_sleep = cval - mach_absolute_time();
> >>> +                if (ticks_to_sleep < 0) {
> >>> +                    break;
> >>
> >> This will loop at 100% for Windows, which configures the vtimer as
> >> cval=0 ctl=7, so with IRQ mask bit set.
> > Okay, but the 120s is kind of arbitrary so we should just sleep until
> > we get a signal. That can be done by passing null as the timespec
> > argument to pselect().
>
>
> The reason I capped it at 120s was so that if I do hit a race, you don't
> break everything forever. Only for 2 minutes :).

I see. I think at this point we want to notice these types of bugs if
they exist instead of hiding them, so I would mildly be in favor of
not capping at 120s.

> >
> >>
> >> Alex
> >>
> >>
> >>>                    }
> >>>
> >>> -                if (diff < INT64_MAX) {
> >>> -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> >>> -                    struct timespec *ts = &cpu->hvf->ts;
> >>> -
> >>> -                    *ts = (struct timespec){
> >>> -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
> >>> -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
> >>> -                    };
> >>> -
> >>> -                    /*
> >>> -                     * Waking up easily takes 1ms, don't go to sleep for smaller
> >>> -                     * time periods than 2ms.
> >>> -                     */
> >>> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
> >>
> >> I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to
> >> return. Without logic like this, super short WFIs will hurt performance
> >> quite badly.
> > I don't think that's accurate. According to this benchmark it's a few
> > hundred nanoseconds at most.
> >
> > pcc@pac-mini /tmp> cat pselect.c
> > #include <signal.h>
> > #include <sys/select.h>
> >
> > int main() {
> >    sigset_t mask, orig_mask;
> >    pthread_sigmask(SIG_SETMASK, 0, &mask);
> >    sigaddset(&mask, SIGUSR1);
> >    pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
> >
> >    for (int i = 0; i != 1000000; ++i) {
> >      struct timespec ts = { 0, 1 };
> >      pselect(0, 0, 0, 0, &ts, &orig_mask);
> >    }
> > }
> > pcc@pac-mini /tmp> time ./pselect
> >
> > ________________________________________________________
> > Executed in  179.87 millis    fish           external
> >     usr time   77.68 millis   57.00 micros   77.62 millis
> >     sys time  101.37 millis  852.00 micros  100.52 millis
> >
> > Besides, all that you're really saving here is the single pselect
> > call. There are no doubt more expensive syscalls involved in exiting
> > and entering the VCPU that would dominate here.
>
>
> I would expect that such a super low ts value has a short-circuit path
> in the kernel as well. Where things start to fall apart is when you're
> at a threshold where rescheduling might be ok, but then you need to take
> all of the additional task switch overhead into account. Try to adapt
> your test code a bit:
>
> #include <signal.h>
> #include <sys/select.h>
>
> int main() {
>    sigset_t mask, orig_mask;
>    pthread_sigmask(SIG_SETMASK, 0, &mask);
>    sigaddset(&mask, SIGUSR1);
>    pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
>
>    for (int i = 0; i != 10000; ++i) {
> #define SCALE_MS 1000000
>      struct timespec ts = { 0, SCALE_MS / 10 };
>      pselect(0, 0, 0, 0, &ts, &orig_mask);
>    }
> }
>
>
> % time ./pselect
> ./pselect  0.00s user 0.01s system 1% cpu 1.282 total
>
> You're suddenly seeing 300µs overhead per pselect call then. When I
> measured actual enter/exit times in QEMU, I saw much bigger differences
> between "time I want to sleep for" and "time I did sleep" even when just
> capturing the virtual time before and after the nanosleep/pselect call.

Okay. So the alternative is that we spin on the CPU, either doing
no-op VCPU entries/exits or something like:

while (mach_absolute_time() < cval);

My intuition is we shouldn't try to subvert the OS scheduler like this
unless it's proven to help with some real world metric since otherwise
we're not being fair to the other processes on the CPU. With CPU
intensive workloads I wouldn't expect these kinds of sleeps to happen
very often if at all so if it's only microbenchmarks and so on that
are affected then my inclination is not to do this for now.

Peter

Re: [PATCH] arm/hvf: Optimize and simplify WFI handling

Posted by Alexander Graf 4 years, 3 months ago

On 02.12.20 02:19, Peter Collingbourne wrote:
> On Tue, Dec 1, 2020 at 2:04 PM Alexander Graf <agraf@csgraf.de> wrote:
>>
>> On 01.12.20 19:59, Peter Collingbourne wrote:
>>> On Tue, Dec 1, 2020 at 3:16 AM Alexander Graf <agraf@csgraf.de> wrote:
>>>> Hi Peter,
>>>>
>>>> On 01.12.20 09:21, Peter Collingbourne wrote:
>>>>> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
>>>>> up on IPI.
>>>>>
>>>>> Signed-off-by: Peter Collingbourne <pcc@google.com>
>>>> Thanks a bunch!
>>>>
>>>>
>>>>> ---
>>>>> Alexander Graf wrote:
>>>>>> I would love to take a patch from you here :). I'll still be stuck for a
>>>>>> while with the sysreg sync rework that Peter asked for before I can look
>>>>>> at WFI again.
>>>>> Okay, here's a patch :) It's a relatively straightforward adaptation
>>>>> of what we have in our fork, which can now boot Android to GUI while
>>>>> remaining at around 4% CPU when idle.
>>>>>
>>>>> I'm not set up to boot a full Linux distribution at the moment so I
>>>>> tested it on upstream QEMU by running a recent mainline Linux kernel
>>>>> with a rootfs containing an init program that just does sleep(5)
>>>>> and verified that the qemu process remains at low CPU usage during
>>>>> the sleep. This was on top of your v2 plus the last patch of your v1
>>>>> since it doesn't look like you have a replacement for that logic yet.
>>>>>
>>>>>     accel/hvf/hvf-cpus.c     |  5 +--
>>>>>     include/sysemu/hvf_int.h |  3 +-
>>>>>     target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
>>>>>     3 files changed, 28 insertions(+), 74 deletions(-)
>>>>>
>>>>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>>>>> index 4360f64671..b2c8fb57f6 100644
>>>>> --- a/accel/hvf/hvf-cpus.c
>>>>> +++ b/accel/hvf/hvf-cpus.c
>>>>> @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
>>>>>         sigact.sa_handler = dummy_signal;
>>>>>         sigaction(SIG_IPI, &sigact, NULL);
>>>>>
>>>>> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>>>> -    sigdelset(&set, SIG_IPI);
>>>>> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
>>>>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
>>>>> +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
>>>> What will this do to the x86 hvf implementation? We're now not
>>>> unblocking SIG_IPI again for that, right?
>>> Yes and that was the case before your patch series.
>>
>> The way I understand Roman, he wanted to unblock the IPI signal on x86:
>>
>> https://patchwork.kernel.org/project/qemu-devel/patch/20201126215017.41156-3-agraf@csgraf.de/#23807021
>>
>> I agree that at this point it's not a problem though to break it again.
>> I'm not quite sure how to merge your patches within my patch set though,
>> given they basically revert half of my previously introduced code...
>>
>>
>>>>>     #ifdef __aarch64__
>>>>>         r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
>>>>> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>>>>> index c56baa3ae8..13adf6ea77 100644
>>>>> --- a/include/sysemu/hvf_int.h
>>>>> +++ b/include/sysemu/hvf_int.h
>>>>> @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
>>>>>     struct hvf_vcpu_state {
>>>>>         uint64_t fd;
>>>>>         void *exit;
>>>>> -    struct timespec ts;
>>>>> -    bool sleeping;
>>>>> +    sigset_t unblock_ipi_mask;
>>>>>     };
>>>>>
>>>>>     void assert_hvf_ok(hv_return_t ret);
>>>>> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
>>>>> index 8fe10966d2..60a361ff38 100644
>>>>> --- a/target/arm/hvf/hvf.c
>>>>> +++ b/target/arm/hvf/hvf.c
>>>>> @@ -2,6 +2,7 @@
>>>>>      * QEMU Hypervisor.framework support for Apple Silicon
>>>>>
>>>>>      * Copyright 2020 Alexander Graf <agraf@csgraf.de>
>>>>> + * Copyright 2020 Google LLC
>>>>>      *
>>>>>      * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>>>>      * See the COPYING file in the top-level directory.
>>>>> @@ -18,6 +19,7 @@
>>>>>     #include "sysemu/hw_accel.h"
>>>>>
>>>>>     #include <Hypervisor/Hypervisor.h>
>>>>> +#include <mach/mach_time.h>
>>>>>
>>>>>     #include "exec/address-spaces.h"
>>>>>     #include "hw/irq.h"
>>>>> @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
>>>>>
>>>>>     void hvf_kick_vcpu_thread(CPUState *cpu)
>>>>>     {
>>>>> -    if (cpu->hvf->sleeping) {
>>>>> -        /*
>>>>> -         * When sleeping, make sure we always send signals. Also, clear the
>>>>> -         * timespec, so that an IPI that arrives between setting hvf->sleeping
>>>>> -         * and the nanosleep syscall still aborts the sleep.
>>>>> -         */
>>>>> -        cpu->thread_kicked = false;
>>>>> -        cpu->hvf->ts = (struct timespec){ };
>>>>> -        cpus_kick_thread(cpu);
>>>>> -    } else {
>>>>> -        hv_vcpus_exit(&cpu->hvf->fd, 1);
>>>>> -    }
>>>>> +    cpus_kick_thread(cpu);
>>>>> +    hv_vcpus_exit(&cpu->hvf->fd, 1);
>>>> This means your first WFI will almost always return immediately due to a
>>>> pending signal, because there probably was an IRQ pending before on the
>>>> same CPU, no?
>>> That's right. Any approach involving the "sleeping" field would need
>>> to be implemented carefully to avoid races that may result in missed
>>> wakeups so for simplicity I just decided to send both kinds of
>>> wakeups. In particular the approach in the updated patch you sent is
>>> racy and I'll elaborate more in the reply to that patch.
>>>
>>>>>     }
>>>>>
>>>>>     static int hvf_inject_interrupts(CPUState *cpu)
>>>>> @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>>>             uint64_t syndrome = hvf_exit->exception.syndrome;
>>>>>             uint32_t ec = syn_get_ec(syndrome);
>>>>>
>>>>> +        qemu_mutex_lock_iothread();
>>>> Is there a particular reason you're moving the iothread lock out again
>>>> from the individual bits? I would really like to keep a notion of fast
>>>> path exits.
>>> We still need to lock at least once no matter the exit reason to check
>>> the interrupts so I don't think it's worth it to try and avoid locking
>>> like this. It also makes the implementation easier to reason about and
>>> therefore more likely to be correct. In our implementation we just
>>> stay locked the whole time unless we're in hv_vcpu_run() or pselect().
>>>
>>>>>             switch (exit_reason) {
>>>>>             case HV_EXIT_REASON_EXCEPTION:
>>>>>                 /* This is the main one, handle below. */
>>>>>                 break;
>>>>>             case HV_EXIT_REASON_VTIMER_ACTIVATED:
>>>>> -            qemu_mutex_lock_iothread();
>>>>>                 current_cpu = cpu;
>>>>>                 qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
>>>>>                 qemu_mutex_unlock_iothread();
>>>>>                 continue;
>>>>>             case HV_EXIT_REASON_CANCELED:
>>>>>                 /* we got kicked, no exit to process */
>>>>> +            qemu_mutex_unlock_iothread();
>>>>>                 continue;
>>>>>             default:
>>>>>                 assert(0);
>>>>> @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>>>                 uint32_t srt = (syndrome >> 16) & 0x1f;
>>>>>                 uint64_t val = 0;
>>>>>
>>>>> -            qemu_mutex_lock_iothread();
>>>>>                 current_cpu = cpu;
>>>>>
>>>>>                 DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
>>>>> @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>>>                     hvf_set_reg(cpu, srt, val);
>>>>>                 }
>>>>>
>>>>> -            qemu_mutex_unlock_iothread();
>>>>> -
>>>>>                 advance_pc = true;
>>>>>                 break;
>>>>>             }
>>>>> @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>>>             case EC_WFX_TRAP:
>>>>>                 if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
>>>>>                     (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
>>>>> -                uint64_t cval, ctl, val, diff, now;
>>>>> +                uint64_t cval;
>>>>>
>>>>> -                /* Set up a local timer for vtimer if necessary ... */
>>>>> -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
>>>>> -                assert_hvf_ok(r);
>>>>>                     r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
>>>>>                     assert_hvf_ok(r);
>>>>>
>>>>> -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
>>>>> -                diff = cval - val;
>>>>> -
>>>>> -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
>>>>> -                      gt_cntfrq_period_ns(arm_cpu);
>>>>> -
>>>>> -                /* Timer disabled or masked, just wait for long */
>>>>> -                if (!(ctl & 1) || (ctl & 2)) {
>>>>> -                    diff = (120 * NANOSECONDS_PER_SECOND) /
>>>>> -                           gt_cntfrq_period_ns(arm_cpu);
>>>>> +                int64_t ticks_to_sleep = cval - mach_absolute_time();
>>>>> +                if (ticks_to_sleep < 0) {
>>>>> +                    break;
>>>> This will loop at 100% for Windows, which configures the vtimer as
>>>> cval=0 ctl=7, so with IRQ mask bit set.
>>> Okay, but the 120s is kind of arbitrary so we should just sleep until
>>> we get a signal. That can be done by passing null as the timespec
>>> argument to pselect().
>>
>> The reason I capped it at 120s was so that if I do hit a race, you don't
>> break everything forever. Only for 2 minutes :).
> I see. I think at this point we want to notice these types of bugs if
> they exist instead of hiding them, so I would mildly be in favor of
> not capping at 120s.


Crossing my fingers that we are at that point already :).


>
>>>> Alex
>>>>
>>>>
>>>>>                     }
>>>>>
>>>>> -                if (diff < INT64_MAX) {
>>>>> -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
>>>>> -                    struct timespec *ts = &cpu->hvf->ts;
>>>>> -
>>>>> -                    *ts = (struct timespec){
>>>>> -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
>>>>> -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
>>>>> -                    };
>>>>> -
>>>>> -                    /*
>>>>> -                     * Waking up easily takes 1ms, don't go to sleep for smaller
>>>>> -                     * time periods than 2ms.
>>>>> -                     */
>>>>> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
>>>> I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to
>>>> return. Without logic like this, super short WFIs will hurt performance
>>>> quite badly.
>>> I don't think that's accurate. According to this benchmark it's a few
>>> hundred nanoseconds at most.
>>>
>>> pcc@pac-mini /tmp> cat pselect.c
>>> #include <signal.h>
>>> #include <sys/select.h>
>>>
>>> int main() {
>>>     sigset_t mask, orig_mask;
>>>     pthread_sigmask(SIG_SETMASK, 0, &mask);
>>>     sigaddset(&mask, SIGUSR1);
>>>     pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
>>>
>>>     for (int i = 0; i != 1000000; ++i) {
>>>       struct timespec ts = { 0, 1 };
>>>       pselect(0, 0, 0, 0, &ts, &orig_mask);
>>>     }
>>> }
>>> pcc@pac-mini /tmp> time ./pselect
>>>
>>> ________________________________________________________
>>> Executed in  179.87 millis    fish           external
>>>      usr time   77.68 millis   57.00 micros   77.62 millis
>>>      sys time  101.37 millis  852.00 micros  100.52 millis
>>>
>>> Besides, all that you're really saving here is the single pselect
>>> call. There are no doubt more expensive syscalls involved in exiting
>>> and entering the VCPU that would dominate here.
>>
>> I would expect that such a super low ts value has a short-circuit path
>> in the kernel as well. Where things start to fall apart is when you're
>> at a threshold where rescheduling might be ok, but then you need to take
>> all of the additional task switch overhead into account. Try to adapt
>> your test code a bit:
>>
>> #include <signal.h>
>> #include <sys/select.h>
>>
>> int main() {
>>     sigset_t mask, orig_mask;
>>     pthread_sigmask(SIG_SETMASK, 0, &mask);
>>     sigaddset(&mask, SIGUSR1);
>>     pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
>>
>>     for (int i = 0; i != 10000; ++i) {
>> #define SCALE_MS 1000000
>>       struct timespec ts = { 0, SCALE_MS / 10 };
>>       pselect(0, 0, 0, 0, &ts, &orig_mask);
>>     }
>> }
>>
>>
>> % time ./pselect
>> ./pselect  0.00s user 0.01s system 1% cpu 1.282 total
>>
>> You're suddenly seeing 300µs overhead per pselect call then. When I
>> measured actual enter/exit times in QEMU, I saw much bigger differences
>> between "time I want to sleep for" and "time I did sleep" even when just
>> capturing the virtual time before and after the nanosleep/pselect call.
> Okay. So the alternative is that we spin on the CPU, either doing
> no-op VCPU entries/exits or something like:
>
> while (mach_absolute_time() < cval);


This won't catch events that arrive during that time, such as 
interrupts, right? I'd just declare the WFI as done and keep looping in 
and out of the guest for now.


> My intuition is we shouldn't try to subvert the OS scheduler like this
> unless it's proven to help with some real world metric since otherwise
> we're not being fair to the other processes on the CPU. With CPU
> intensive workloads I wouldn't expect these kinds of sleeps to happen
> very often if at all so if it's only microbenchmarks and so on that
> are affected then my inclination is not to do this for now.


The problem is that the VM's OS is expecting bare metal timer behavior 
usually. And that gives you much better granularities than what we can 
achieve with a virtualization layer on top. So I do feel strongly about 
leaving this bit in. In the workloads you describe above, you won't ever 
hit that branch anyway.

The workloads that benefit from logic like this are message passing 
ones. Check out this presentation from a KVM colleague of yours for details:

https://www.linux-kvm.org/images/a/ac/02x03-Davit_Matalack-KVM_Message_passing_Performance.pdf
   https://www.youtube.com/watch?v=p85FFrloLFg


Alex

Re: [PATCH] arm/hvf: Optimize and simplify WFI handling

Posted by Peter Collingbourne 4 years, 3 months ago

On Tue, Dec 1, 2020 at 5:53 PM Alexander Graf <agraf@csgraf.de> wrote:
>
>
> On 02.12.20 02:19, Peter Collingbourne wrote:
> > On Tue, Dec 1, 2020 at 2:04 PM Alexander Graf <agraf@csgraf.de> wrote:
> >>
> >> On 01.12.20 19:59, Peter Collingbourne wrote:
> >>> On Tue, Dec 1, 2020 at 3:16 AM Alexander Graf <agraf@csgraf.de> wrote:
> >>>> Hi Peter,
> >>>>
> >>>> On 01.12.20 09:21, Peter Collingbourne wrote:
> >>>>> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> >>>>> up on IPI.
> >>>>>
> >>>>> Signed-off-by: Peter Collingbourne <pcc@google.com>
> >>>> Thanks a bunch!
> >>>>
> >>>>
> >>>>> ---
> >>>>> Alexander Graf wrote:
> >>>>>> I would love to take a patch from you here :). I'll still be stuck for a
> >>>>>> while with the sysreg sync rework that Peter asked for before I can look
> >>>>>> at WFI again.
> >>>>> Okay, here's a patch :) It's a relatively straightforward adaptation
> >>>>> of what we have in our fork, which can now boot Android to GUI while
> >>>>> remaining at around 4% CPU when idle.
> >>>>>
> >>>>> I'm not set up to boot a full Linux distribution at the moment so I
> >>>>> tested it on upstream QEMU by running a recent mainline Linux kernel
> >>>>> with a rootfs containing an init program that just does sleep(5)
> >>>>> and verified that the qemu process remains at low CPU usage during
> >>>>> the sleep. This was on top of your v2 plus the last patch of your v1
> >>>>> since it doesn't look like you have a replacement for that logic yet.
> >>>>>
> >>>>>     accel/hvf/hvf-cpus.c     |  5 +--
> >>>>>     include/sysemu/hvf_int.h |  3 +-
> >>>>>     target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
> >>>>>     3 files changed, 28 insertions(+), 74 deletions(-)
> >>>>>
> >>>>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> >>>>> index 4360f64671..b2c8fb57f6 100644
> >>>>> --- a/accel/hvf/hvf-cpus.c
> >>>>> +++ b/accel/hvf/hvf-cpus.c
> >>>>> @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
> >>>>>         sigact.sa_handler = dummy_signal;
> >>>>>         sigaction(SIG_IPI, &sigact, NULL);
> >>>>>
> >>>>> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> >>>>> -    sigdelset(&set, SIG_IPI);
> >>>>> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> >>>>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
> >>>>> +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
> >>>> What will this do to the x86 hvf implementation? We're now not
> >>>> unblocking SIG_IPI again for that, right?
> >>> Yes and that was the case before your patch series.
> >>
> >> The way I understand Roman, he wanted to unblock the IPI signal on x86:
> >>
> >> https://patchwork.kernel.org/project/qemu-devel/patch/20201126215017.41156-3-agraf@csgraf.de/#23807021
> >>
> >> I agree that at this point it's not a problem though to break it again.
> >> I'm not quite sure how to merge your patches within my patch set though,
> >> given they basically revert half of my previously introduced code...
> >>
> >>
> >>>>>     #ifdef __aarch64__
> >>>>>         r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
> >>>>> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> >>>>> index c56baa3ae8..13adf6ea77 100644
> >>>>> --- a/include/sysemu/hvf_int.h
> >>>>> +++ b/include/sysemu/hvf_int.h
> >>>>> @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
> >>>>>     struct hvf_vcpu_state {
> >>>>>         uint64_t fd;
> >>>>>         void *exit;
> >>>>> -    struct timespec ts;
> >>>>> -    bool sleeping;
> >>>>> +    sigset_t unblock_ipi_mask;
> >>>>>     };
> >>>>>
> >>>>>     void assert_hvf_ok(hv_return_t ret);
> >>>>> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> >>>>> index 8fe10966d2..60a361ff38 100644
> >>>>> --- a/target/arm/hvf/hvf.c
> >>>>> +++ b/target/arm/hvf/hvf.c
> >>>>> @@ -2,6 +2,7 @@
> >>>>>      * QEMU Hypervisor.framework support for Apple Silicon
> >>>>>
> >>>>>      * Copyright 2020 Alexander Graf <agraf@csgraf.de>
> >>>>> + * Copyright 2020 Google LLC
> >>>>>      *
> >>>>>      * This work is licensed under the terms of the GNU GPL, version 2 or later.
> >>>>>      * See the COPYING file in the top-level directory.
> >>>>> @@ -18,6 +19,7 @@
> >>>>>     #include "sysemu/hw_accel.h"
> >>>>>
> >>>>>     #include <Hypervisor/Hypervisor.h>
> >>>>> +#include <mach/mach_time.h>
> >>>>>
> >>>>>     #include "exec/address-spaces.h"
> >>>>>     #include "hw/irq.h"
> >>>>> @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
> >>>>>
> >>>>>     void hvf_kick_vcpu_thread(CPUState *cpu)
> >>>>>     {
> >>>>> -    if (cpu->hvf->sleeping) {
> >>>>> -        /*
> >>>>> -         * When sleeping, make sure we always send signals. Also, clear the
> >>>>> -         * timespec, so that an IPI that arrives between setting hvf->sleeping
> >>>>> -         * and the nanosleep syscall still aborts the sleep.
> >>>>> -         */
> >>>>> -        cpu->thread_kicked = false;
> >>>>> -        cpu->hvf->ts = (struct timespec){ };
> >>>>> -        cpus_kick_thread(cpu);
> >>>>> -    } else {
> >>>>> -        hv_vcpus_exit(&cpu->hvf->fd, 1);
> >>>>> -    }
> >>>>> +    cpus_kick_thread(cpu);
> >>>>> +    hv_vcpus_exit(&cpu->hvf->fd, 1);
> >>>> This means your first WFI will almost always return immediately due to a
> >>>> pending signal, because there probably was an IRQ pending before on the
> >>>> same CPU, no?
> >>> That's right. Any approach involving the "sleeping" field would need
> >>> to be implemented carefully to avoid races that may result in missed
> >>> wakeups so for simplicity I just decided to send both kinds of
> >>> wakeups. In particular the approach in the updated patch you sent is
> >>> racy and I'll elaborate more in the reply to that patch.
> >>>
> >>>>>     }
> >>>>>
> >>>>>     static int hvf_inject_interrupts(CPUState *cpu)
> >>>>> @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>>>             uint64_t syndrome = hvf_exit->exception.syndrome;
> >>>>>             uint32_t ec = syn_get_ec(syndrome);
> >>>>>
> >>>>> +        qemu_mutex_lock_iothread();
> >>>> Is there a particular reason you're moving the iothread lock out again
> >>>> from the individual bits? I would really like to keep a notion of fast
> >>>> path exits.
> >>> We still need to lock at least once no matter the exit reason to check
> >>> the interrupts so I don't think it's worth it to try and avoid locking
> >>> like this. It also makes the implementation easier to reason about and
> >>> therefore more likely to be correct. In our implementation we just
> >>> stay locked the whole time unless we're in hv_vcpu_run() or pselect().
> >>>
> >>>>>             switch (exit_reason) {
> >>>>>             case HV_EXIT_REASON_EXCEPTION:
> >>>>>                 /* This is the main one, handle below. */
> >>>>>                 break;
> >>>>>             case HV_EXIT_REASON_VTIMER_ACTIVATED:
> >>>>> -            qemu_mutex_lock_iothread();
> >>>>>                 current_cpu = cpu;
> >>>>>                 qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
> >>>>>                 qemu_mutex_unlock_iothread();
> >>>>>                 continue;
> >>>>>             case HV_EXIT_REASON_CANCELED:
> >>>>>                 /* we got kicked, no exit to process */
> >>>>> +            qemu_mutex_unlock_iothread();
> >>>>>                 continue;
> >>>>>             default:
> >>>>>                 assert(0);
> >>>>> @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>>>                 uint32_t srt = (syndrome >> 16) & 0x1f;
> >>>>>                 uint64_t val = 0;
> >>>>>
> >>>>> -            qemu_mutex_lock_iothread();
> >>>>>                 current_cpu = cpu;
> >>>>>
> >>>>>                 DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
> >>>>> @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>>>                     hvf_set_reg(cpu, srt, val);
> >>>>>                 }
> >>>>>
> >>>>> -            qemu_mutex_unlock_iothread();
> >>>>> -
> >>>>>                 advance_pc = true;
> >>>>>                 break;
> >>>>>             }
> >>>>> @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>>>             case EC_WFX_TRAP:
> >>>>>                 if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
> >>>>>                     (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> >>>>> -                uint64_t cval, ctl, val, diff, now;
> >>>>> +                uint64_t cval;
> >>>>>
> >>>>> -                /* Set up a local timer for vtimer if necessary ... */
> >>>>> -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
> >>>>> -                assert_hvf_ok(r);
> >>>>>                     r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
> >>>>>                     assert_hvf_ok(r);
> >>>>>
> >>>>> -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
> >>>>> -                diff = cval - val;
> >>>>> -
> >>>>> -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
> >>>>> -                      gt_cntfrq_period_ns(arm_cpu);
> >>>>> -
> >>>>> -                /* Timer disabled or masked, just wait for long */
> >>>>> -                if (!(ctl & 1) || (ctl & 2)) {
> >>>>> -                    diff = (120 * NANOSECONDS_PER_SECOND) /
> >>>>> -                           gt_cntfrq_period_ns(arm_cpu);
> >>>>> +                int64_t ticks_to_sleep = cval - mach_absolute_time();
> >>>>> +                if (ticks_to_sleep < 0) {
> >>>>> +                    break;
> >>>> This will loop at 100% for Windows, which configures the vtimer as
> >>>> cval=0 ctl=7, so with IRQ mask bit set.
> >>> Okay, but the 120s is kind of arbitrary so we should just sleep until
> >>> we get a signal. That can be done by passing null as the timespec
> >>> argument to pselect().
> >>
> >> The reason I capped it at 120s was so that if I do hit a race, you don't
> >> break everything forever. Only for 2 minutes :).
> > I see. I think at this point we want to notice these types of bugs if
> > they exist instead of hiding them, so I would mildly be in favor of
> > not capping at 120s.
>
>
> Crossing my fingers that we are at that point already :).
>
>
> >
> >>>> Alex
> >>>>
> >>>>
> >>>>>                     }
> >>>>>
> >>>>> -                if (diff < INT64_MAX) {
> >>>>> -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> >>>>> -                    struct timespec *ts = &cpu->hvf->ts;
> >>>>> -
> >>>>> -                    *ts = (struct timespec){
> >>>>> -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
> >>>>> -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
> >>>>> -                    };
> >>>>> -
> >>>>> -                    /*
> >>>>> -                     * Waking up easily takes 1ms, don't go to sleep for smaller
> >>>>> -                     * time periods than 2ms.
> >>>>> -                     */
> >>>>> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
> >>>> I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to
> >>>> return. Without logic like this, super short WFIs will hurt performance
> >>>> quite badly.
> >>> I don't think that's accurate. According to this benchmark it's a few
> >>> hundred nanoseconds at most.
> >>>
> >>> pcc@pac-mini /tmp> cat pselect.c
> >>> #include <signal.h>
> >>> #include <sys/select.h>
> >>>
> >>> int main() {
> >>>     sigset_t mask, orig_mask;
> >>>     pthread_sigmask(SIG_SETMASK, 0, &mask);
> >>>     sigaddset(&mask, SIGUSR1);
> >>>     pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
> >>>
> >>>     for (int i = 0; i != 1000000; ++i) {
> >>>       struct timespec ts = { 0, 1 };
> >>>       pselect(0, 0, 0, 0, &ts, &orig_mask);
> >>>     }
> >>> }
> >>> pcc@pac-mini /tmp> time ./pselect
> >>>
> >>> ________________________________________________________
> >>> Executed in  179.87 millis    fish           external
> >>>      usr time   77.68 millis   57.00 micros   77.62 millis
> >>>      sys time  101.37 millis  852.00 micros  100.52 millis
> >>>
> >>> Besides, all that you're really saving here is the single pselect
> >>> call. There are no doubt more expensive syscalls involved in exiting
> >>> and entering the VCPU that would dominate here.
> >>
> >> I would expect that such a super low ts value has a short-circuit path
> >> in the kernel as well. Where things start to fall apart is when you're
> >> at a threshold where rescheduling might be ok, but then you need to take
> >> all of the additional task switch overhead into account. Try to adapt
> >> your test code a bit:
> >>
> >> #include <signal.h>
> >> #include <sys/select.h>
> >>
> >> int main() {
> >>     sigset_t mask, orig_mask;
> >>     pthread_sigmask(SIG_SETMASK, 0, &mask);
> >>     sigaddset(&mask, SIGUSR1);
> >>     pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
> >>
> >>     for (int i = 0; i != 10000; ++i) {
> >> #define SCALE_MS 1000000
> >>       struct timespec ts = { 0, SCALE_MS / 10 };
> >>       pselect(0, 0, 0, 0, &ts, &orig_mask);
> >>     }
> >> }
> >>
> >>
> >> % time ./pselect
> >> ./pselect  0.00s user 0.01s system 1% cpu 1.282 total
> >>
> >> You're suddenly seeing 300µs overhead per pselect call then. When I
> >> measured actual enter/exit times in QEMU, I saw much bigger differences
> >> between "time I want to sleep for" and "time I did sleep" even when just
> >> capturing the virtual time before and after the nanosleep/pselect call.
> > Okay. So the alternative is that we spin on the CPU, either doing
> > no-op VCPU entries/exits or something like:
> >
> > while (mach_absolute_time() < cval);
>
>
> This won't catch events that arrive during that time, such as
> interrupts, right? I'd just declare the WFI as done and keep looping in
> and out of the guest for now.

Oh, that's a good point.

> > My intuition is we shouldn't try to subvert the OS scheduler like this
> > unless it's proven to help with some real world metric since otherwise
> > we're not being fair to the other processes on the CPU. With CPU
> > intensive workloads I wouldn't expect these kinds of sleeps to happen
> > very often if at all so if it's only microbenchmarks and so on that
> > are affected then my inclination is not to do this for now.
>
>
> The problem is that the VM's OS is expecting bare metal timer behavior
> usually. And that gives you much better granularities than what we can
> achieve with a virtualization layer on top. So I do feel strongly about
> leaving this bit in. In the workloads you describe above, you won't ever
> hit that branch anyway.
>
> The workloads that benefit from logic like this are message passing
> ones. Check out this presentation from a KVM colleague of yours for details:
>
> https://www.linux-kvm.org/images/a/ac/02x03-Davit_Matalack-KVM_Message_passing_Performance.pdf
>    https://www.youtube.com/watch?v=p85FFrloLFg

Mm, okay. I personally would not add anything like that at this point
without real-world data but I don't feel too strongly and I suppose
the implementation can always be adjusted later.

Peter

Re: [PATCH] arm/hvf: Optimize and simplify WFI handling

Posted by Roman Bolshakov 4 years, 3 months ago

On Tue, Dec 01, 2020 at 10:59:50AM -0800, Peter Collingbourne wrote:
> On Tue, Dec 1, 2020 at 3:16 AM Alexander Graf <agraf@csgraf.de> wrote:
> >
> > Hi Peter,
> >
> > On 01.12.20 09:21, Peter Collingbourne wrote:
> > > Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> > > up on IPI.
> > >
> > > Signed-off-by: Peter Collingbourne <pcc@google.com>
> >
> >
> > Thanks a bunch!
> >
> >
> > > ---
> > > Alexander Graf wrote:
> > >> I would love to take a patch from you here :). I'll still be stuck for a
> > >> while with the sysreg sync rework that Peter asked for before I can look
> > >> at WFI again.
> > > Okay, here's a patch :) It's a relatively straightforward adaptation
> > > of what we have in our fork, which can now boot Android to GUI while
> > > remaining at around 4% CPU when idle.
> > >
> > > I'm not set up to boot a full Linux distribution at the moment so I
> > > tested it on upstream QEMU by running a recent mainline Linux kernel
> > > with a rootfs containing an init program that just does sleep(5)
> > > and verified that the qemu process remains at low CPU usage during
> > > the sleep. This was on top of your v2 plus the last patch of your v1
> > > since it doesn't look like you have a replacement for that logic yet.
> > >
> > >   accel/hvf/hvf-cpus.c     |  5 +--
> > >   include/sysemu/hvf_int.h |  3 +-
> > >   target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
> > >   3 files changed, 28 insertions(+), 74 deletions(-)
> > >
> > > diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> > > index 4360f64671..b2c8fb57f6 100644
> > > --- a/accel/hvf/hvf-cpus.c
> > > +++ b/accel/hvf/hvf-cpus.c
> > > @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
> > >       sigact.sa_handler = dummy_signal;
> > >       sigaction(SIG_IPI, &sigact, NULL);
> > >
> > > -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> > > -    sigdelset(&set, SIG_IPI);
> > > -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> > > +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
> > > +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
> >
> >
> > What will this do to the x86 hvf implementation? We're now not
> > unblocking SIG_IPI again for that, right?
> 
> Yes and that was the case before your patch series.
> 
> > >
> > >   #ifdef __aarch64__
> > >       r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
> > > diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> > > index c56baa3ae8..13adf6ea77 100644
> > > --- a/include/sysemu/hvf_int.h
> > > +++ b/include/sysemu/hvf_int.h
> > > @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
> > >   struct hvf_vcpu_state {
> > >       uint64_t fd;
> > >       void *exit;
> > > -    struct timespec ts;
> > > -    bool sleeping;
> > > +    sigset_t unblock_ipi_mask;
> > >   };
> > >
> > >   void assert_hvf_ok(hv_return_t ret);
> > > diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> > > index 8fe10966d2..60a361ff38 100644
> > > --- a/target/arm/hvf/hvf.c
> > > +++ b/target/arm/hvf/hvf.c
> > > @@ -2,6 +2,7 @@
> > >    * QEMU Hypervisor.framework support for Apple Silicon
> > >
> > >    * Copyright 2020 Alexander Graf <agraf@csgraf.de>
> > > + * Copyright 2020 Google LLC
> > >    *
> > >    * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > >    * See the COPYING file in the top-level directory.
> > > @@ -18,6 +19,7 @@
> > >   #include "sysemu/hw_accel.h"
> > >
> > >   #include <Hypervisor/Hypervisor.h>
> > > +#include <mach/mach_time.h>
> > >
> > >   #include "exec/address-spaces.h"
> > >   #include "hw/irq.h"
> > > @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
> > >
> > >   void hvf_kick_vcpu_thread(CPUState *cpu)
> > >   {
> > > -    if (cpu->hvf->sleeping) {
> > > -        /*
> > > -         * When sleeping, make sure we always send signals. Also, clear the
> > > -         * timespec, so that an IPI that arrives between setting hvf->sleeping
> > > -         * and the nanosleep syscall still aborts the sleep.
> > > -         */
> > > -        cpu->thread_kicked = false;
> > > -        cpu->hvf->ts = (struct timespec){ };
> > > -        cpus_kick_thread(cpu);
> > > -    } else {
> > > -        hv_vcpus_exit(&cpu->hvf->fd, 1);
> > > -    }
> > > +    cpus_kick_thread(cpu);
> > > +    hv_vcpus_exit(&cpu->hvf->fd, 1);
> >
> >
> > This means your first WFI will almost always return immediately due to a
> > pending signal, because there probably was an IRQ pending before on the
> > same CPU, no?
> 
> That's right. Any approach involving the "sleeping" field would need
> to be implemented carefully to avoid races that may result in missed
> wakeups so for simplicity I just decided to send both kinds of
> wakeups. In particular the approach in the updated patch you sent is
> racy and I'll elaborate more in the reply to that patch.
> 
> > >   }
> > >
> > >   static int hvf_inject_interrupts(CPUState *cpu)
> > > @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
> > >           uint64_t syndrome = hvf_exit->exception.syndrome;
> > >           uint32_t ec = syn_get_ec(syndrome);
> > >
> > > +        qemu_mutex_lock_iothread();
> >
> >
> > Is there a particular reason you're moving the iothread lock out again
> > from the individual bits? I would really like to keep a notion of fast
> > path exits.
> 
> We still need to lock at least once no matter the exit reason to check
> the interrupts so I don't think it's worth it to try and avoid locking
> like this. It also makes the implementation easier to reason about and
> therefore more likely to be correct. In our implementation we just
> stay locked the whole time unless we're in hv_vcpu_run() or pselect().
> 

But does it leaves a small window for a kick loss between
qemu_mutex_unlock_iothread() and hv_vcpu_run()/pselect()?

For x86 it could lose a kick between them. That was a reason for the
sophisticated approach to catch the kick [1] (and related discussions in
v1/v2/v3).  Unfortunately I can't read ARM assembly yet so I don't if
hv_vcpus_exit() suffers from the same issue as x86 hv_vcpu_interrupt().

1. https://patchwork.kernel.org/project/qemu-devel/patch/20200729124832.79375-1-r.bolshakov@yadro.com/

Thanks,
Roman

> > >           switch (exit_reason) {
> > >           case HV_EXIT_REASON_EXCEPTION:
> > >               /* This is the main one, handle below. */
> > >               break;
> > >           case HV_EXIT_REASON_VTIMER_ACTIVATED:
> > > -            qemu_mutex_lock_iothread();
> > >               current_cpu = cpu;
> > >               qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
> > >               qemu_mutex_unlock_iothread();
> > >               continue;
> > >           case HV_EXIT_REASON_CANCELED:
> > >               /* we got kicked, no exit to process */
> > > +            qemu_mutex_unlock_iothread();
> > >               continue;
> > >           default:
> > >               assert(0);
> > > @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> > >               uint32_t srt = (syndrome >> 16) & 0x1f;
> > >               uint64_t val = 0;
> > >
> > > -            qemu_mutex_lock_iothread();
> > >               current_cpu = cpu;
> > >
> > >               DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
> > > @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> > >                   hvf_set_reg(cpu, srt, val);
> > >               }
> > >
> > > -            qemu_mutex_unlock_iothread();
> > > -
> > >               advance_pc = true;
> > >               break;
> > >           }
> > > @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
> > >           case EC_WFX_TRAP:
> > >               if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
> > >                   (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > > -                uint64_t cval, ctl, val, diff, now;
> > > +                uint64_t cval;
> > >
> > > -                /* Set up a local timer for vtimer if necessary ... */
> > > -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
> > > -                assert_hvf_ok(r);
> > >                   r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
> > >                   assert_hvf_ok(r);
> > >
> > > -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
> > > -                diff = cval - val;
> > > -
> > > -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
> > > -                      gt_cntfrq_period_ns(arm_cpu);
> > > -
> > > -                /* Timer disabled or masked, just wait for long */
> > > -                if (!(ctl & 1) || (ctl & 2)) {
> > > -                    diff = (120 * NANOSECONDS_PER_SECOND) /
> > > -                           gt_cntfrq_period_ns(arm_cpu);
> > > +                int64_t ticks_to_sleep = cval - mach_absolute_time();
> > > +                if (ticks_to_sleep < 0) {
> > > +                    break;
> >
> >
> > This will loop at 100% for Windows, which configures the vtimer as
> > cval=0 ctl=7, so with IRQ mask bit set.
> 
> Okay, but the 120s is kind of arbitrary so we should just sleep until
> we get a signal. That can be done by passing null as the timespec
> argument to pselect().
> 
> >
> >
> > Alex
> >
> >
> > >                   }
> > >
> > > -                if (diff < INT64_MAX) {
> > > -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> > > -                    struct timespec *ts = &cpu->hvf->ts;
> > > -
> > > -                    *ts = (struct timespec){
> > > -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
> > > -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
> > > -                    };
> > > -
> > > -                    /*
> > > -                     * Waking up easily takes 1ms, don't go to sleep for smaller
> > > -                     * time periods than 2ms.
> > > -                     */
> > > -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
> >
> >
> > I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to
> > return. Without logic like this, super short WFIs will hurt performance
> > quite badly.
> 
> I don't think that's accurate. According to this benchmark it's a few
> hundred nanoseconds at most.
> 
> pcc@pac-mini /tmp> cat pselect.c
> #include <signal.h>
> #include <sys/select.h>
> 
> int main() {
>   sigset_t mask, orig_mask;
>   pthread_sigmask(SIG_SETMASK, 0, &mask);
>   sigaddset(&mask, SIGUSR1);
>   pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
> 
>   for (int i = 0; i != 1000000; ++i) {
>     struct timespec ts = { 0, 1 };
>     pselect(0, 0, 0, 0, &ts, &orig_mask);
>   }
> }
> pcc@pac-mini /tmp> time ./pselect
> 
> ________________________________________________________
> Executed in  179.87 millis    fish           external
>    usr time   77.68 millis   57.00 micros   77.62 millis
>    sys time  101.37 millis  852.00 micros  100.52 millis
> 
> Besides, all that you're really saving here is the single pselect
> call. There are no doubt more expensive syscalls involved in exiting
> and entering the VCPU that would dominate here.
> 
> Peter
> 
> >
> >
> > Alex
> >
> > > -                        advance_pc = true;
> > > -                        break;
> > > -                    }
> > > -
> > > -                    /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
> > > -                    cpu->hvf->sleeping = true;
> > > -                    smp_mb();
> > > -
> > > -                    /* Bail out if we received an IRQ meanwhile */
> > > -                    if (cpu->thread_kicked || (cpu->interrupt_request &
> > > -                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > > -                        cpu->hvf->sleeping = false;
> > > -                        break;
> > > -                    }
> > > -
> > > -                    /* nanosleep returns on signal, so we wake up on kick. */
> > > -                    nanosleep(ts, NULL);
> > > -
> > > -                    /* Out of sleep - either naturally or because of a kick */
> > > -                    cpu->hvf->sleeping = false;
> > > -                }
> > > +                uint64_t seconds = ticks_to_sleep / arm_cpu->gt_cntfrq_hz;
> > > +                uint64_t nanos =
> > > +                    (ticks_to_sleep - arm_cpu->gt_cntfrq_hz * seconds) *
> > > +                    1000000000 / arm_cpu->gt_cntfrq_hz;
> > > +                struct timespec ts = { seconds, nanos };
> > > +
> > > +                /*
> > > +                 * Use pselect to sleep so that other threads can IPI us while
> > > +                 * we're sleeping.
> > > +                 */
> > > +                qatomic_mb_set(&cpu->thread_kicked, false);
> > > +                qemu_mutex_unlock_iothread();
> > > +                pselect(0, 0, 0, 0, &ts, &cpu->hvf->unblock_ipi_mask);
> > > +                qemu_mutex_lock_iothread();
> > >
> > >                   advance_pc = true;
> > >               }
> > >               break;
> > >           case EC_AA64_HVC:
> > >               cpu_synchronize_state(cpu);
> > > -            qemu_mutex_lock_iothread();
> > >               current_cpu = cpu;
> > >               if (arm_is_psci_call(arm_cpu, EXCP_HVC)) {
> > >                   arm_handle_psci_call(arm_cpu);
> > > @@ -562,11 +520,9 @@ int hvf_vcpu_exec(CPUState *cpu)
> > >                   DPRINTF("unknown HVC! %016llx", env->xregs[0]);
> > >                   env->xregs[0] = -1;
> > >               }
> > > -            qemu_mutex_unlock_iothread();
> > >               break;
> > >           case EC_AA64_SMC:
> > >               cpu_synchronize_state(cpu);
> > > -            qemu_mutex_lock_iothread();
> > >               current_cpu = cpu;
> > >               if (arm_is_psci_call(arm_cpu, EXCP_SMC)) {
> > >                   arm_handle_psci_call(arm_cpu);
> > > @@ -575,7 +531,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> > >                   env->xregs[0] = -1;
> > >                   env->pc += 4;
> > >               }
> > > -            qemu_mutex_unlock_iothread();
> > >               break;
> > >           default:
> > >               cpu_synchronize_state(cpu);
> > > @@ -594,6 +549,7 @@ int hvf_vcpu_exec(CPUState *cpu)
> > >               r = hv_vcpu_set_reg(cpu->hvf->fd, HV_REG_PC, pc);
> > >               assert_hvf_ok(r);
> > >           }
> > > +        qemu_mutex_unlock_iothread();
> > >       } while (ret == 0);
> > >
> > >       qemu_mutex_lock_iothread();

Re: [PATCH] arm/hvf: Optimize and simplify WFI handling

Posted by Peter Collingbourne 4 years, 3 months ago

On Thu, Dec 3, 2020 at 2:12 AM Roman Bolshakov <r.bolshakov@yadro.com> wrote:
>
> On Tue, Dec 01, 2020 at 10:59:50AM -0800, Peter Collingbourne wrote:
> > On Tue, Dec 1, 2020 at 3:16 AM Alexander Graf <agraf@csgraf.de> wrote:
> > >
> > > Hi Peter,
> > >
> > > On 01.12.20 09:21, Peter Collingbourne wrote:
> > > > Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> > > > up on IPI.
> > > >
> > > > Signed-off-by: Peter Collingbourne <pcc@google.com>
> > >
> > >
> > > Thanks a bunch!
> > >
> > >
> > > > ---
> > > > Alexander Graf wrote:
> > > >> I would love to take a patch from you here :). I'll still be stuck for a
> > > >> while with the sysreg sync rework that Peter asked for before I can look
> > > >> at WFI again.
> > > > Okay, here's a patch :) It's a relatively straightforward adaptation
> > > > of what we have in our fork, which can now boot Android to GUI while
> > > > remaining at around 4% CPU when idle.
> > > >
> > > > I'm not set up to boot a full Linux distribution at the moment so I
> > > > tested it on upstream QEMU by running a recent mainline Linux kernel
> > > > with a rootfs containing an init program that just does sleep(5)
> > > > and verified that the qemu process remains at low CPU usage during
> > > > the sleep. This was on top of your v2 plus the last patch of your v1
> > > > since it doesn't look like you have a replacement for that logic yet.
> > > >
> > > >   accel/hvf/hvf-cpus.c     |  5 +--
> > > >   include/sysemu/hvf_int.h |  3 +-
> > > >   target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
> > > >   3 files changed, 28 insertions(+), 74 deletions(-)
> > > >
> > > > diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> > > > index 4360f64671..b2c8fb57f6 100644
> > > > --- a/accel/hvf/hvf-cpus.c
> > > > +++ b/accel/hvf/hvf-cpus.c
> > > > @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
> > > >       sigact.sa_handler = dummy_signal;
> > > >       sigaction(SIG_IPI, &sigact, NULL);
> > > >
> > > > -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> > > > -    sigdelset(&set, SIG_IPI);
> > > > -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> > > > +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
> > > > +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
> > >
> > >
> > > What will this do to the x86 hvf implementation? We're now not
> > > unblocking SIG_IPI again for that, right?
> >
> > Yes and that was the case before your patch series.
> >
> > > >
> > > >   #ifdef __aarch64__
> > > >       r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
> > > > diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> > > > index c56baa3ae8..13adf6ea77 100644
> > > > --- a/include/sysemu/hvf_int.h
> > > > +++ b/include/sysemu/hvf_int.h
> > > > @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
> > > >   struct hvf_vcpu_state {
> > > >       uint64_t fd;
> > > >       void *exit;
> > > > -    struct timespec ts;
> > > > -    bool sleeping;
> > > > +    sigset_t unblock_ipi_mask;
> > > >   };
> > > >
> > > >   void assert_hvf_ok(hv_return_t ret);
> > > > diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> > > > index 8fe10966d2..60a361ff38 100644
> > > > --- a/target/arm/hvf/hvf.c
> > > > +++ b/target/arm/hvf/hvf.c
> > > > @@ -2,6 +2,7 @@
> > > >    * QEMU Hypervisor.framework support for Apple Silicon
> > > >
> > > >    * Copyright 2020 Alexander Graf <agraf@csgraf.de>
> > > > + * Copyright 2020 Google LLC
> > > >    *
> > > >    * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > > >    * See the COPYING file in the top-level directory.
> > > > @@ -18,6 +19,7 @@
> > > >   #include "sysemu/hw_accel.h"
> > > >
> > > >   #include <Hypervisor/Hypervisor.h>
> > > > +#include <mach/mach_time.h>
> > > >
> > > >   #include "exec/address-spaces.h"
> > > >   #include "hw/irq.h"
> > > > @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
> > > >
> > > >   void hvf_kick_vcpu_thread(CPUState *cpu)
> > > >   {
> > > > -    if (cpu->hvf->sleeping) {
> > > > -        /*
> > > > -         * When sleeping, make sure we always send signals. Also, clear the
> > > > -         * timespec, so that an IPI that arrives between setting hvf->sleeping
> > > > -         * and the nanosleep syscall still aborts the sleep.
> > > > -         */
> > > > -        cpu->thread_kicked = false;
> > > > -        cpu->hvf->ts = (struct timespec){ };
> > > > -        cpus_kick_thread(cpu);
> > > > -    } else {
> > > > -        hv_vcpus_exit(&cpu->hvf->fd, 1);
> > > > -    }
> > > > +    cpus_kick_thread(cpu);
> > > > +    hv_vcpus_exit(&cpu->hvf->fd, 1);
> > >
> > >
> > > This means your first WFI will almost always return immediately due to a
> > > pending signal, because there probably was an IRQ pending before on the
> > > same CPU, no?
> >
> > That's right. Any approach involving the "sleeping" field would need
> > to be implemented carefully to avoid races that may result in missed
> > wakeups so for simplicity I just decided to send both kinds of
> > wakeups. In particular the approach in the updated patch you sent is
> > racy and I'll elaborate more in the reply to that patch.
> >
> > > >   }
> > > >
> > > >   static int hvf_inject_interrupts(CPUState *cpu)
> > > > @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
> > > >           uint64_t syndrome = hvf_exit->exception.syndrome;
> > > >           uint32_t ec = syn_get_ec(syndrome);
> > > >
> > > > +        qemu_mutex_lock_iothread();
> > >
> > >
> > > Is there a particular reason you're moving the iothread lock out again
> > > from the individual bits? I would really like to keep a notion of fast
> > > path exits.
> >
> > We still need to lock at least once no matter the exit reason to check
> > the interrupts so I don't think it's worth it to try and avoid locking
> > like this. It also makes the implementation easier to reason about and
> > therefore more likely to be correct. In our implementation we just
> > stay locked the whole time unless we're in hv_vcpu_run() or pselect().
> >
>
> But does it leaves a small window for a kick loss between
> qemu_mutex_unlock_iothread() and hv_vcpu_run()/pselect()?
>
> For x86 it could lose a kick between them. That was a reason for the
> sophisticated approach to catch the kick [1] (and related discussions in
> v1/v2/v3).  Unfortunately I can't read ARM assembly yet so I don't if
> hv_vcpus_exit() suffers from the same issue as x86 hv_vcpu_interrupt().
>
> 1. https://patchwork.kernel.org/project/qemu-devel/patch/20200729124832.79375-1-r.bolshakov@yadro.com/

I addressed pselect() in my other reply.

It isn't on the website but the hv_vcpu.h header says this about
hv_vcpus_exit():

 * @discussion
 *             If a vcpu is not running, the next time hv_vcpu_run is
called for the corresponding
 *             vcpu, it will return immediately without entering the guest.

So at least as documented I think we are okay.

Peter

>
> Thanks,
> Roman
>
> > > >           switch (exit_reason) {
> > > >           case HV_EXIT_REASON_EXCEPTION:
> > > >               /* This is the main one, handle below. */
> > > >               break;
> > > >           case HV_EXIT_REASON_VTIMER_ACTIVATED:
> > > > -            qemu_mutex_lock_iothread();
> > > >               current_cpu = cpu;
> > > >               qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
> > > >               qemu_mutex_unlock_iothread();
> > > >               continue;
> > > >           case HV_EXIT_REASON_CANCELED:
> > > >               /* we got kicked, no exit to process */
> > > > +            qemu_mutex_unlock_iothread();
> > > >               continue;
> > > >           default:
> > > >               assert(0);
> > > > @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> > > >               uint32_t srt = (syndrome >> 16) & 0x1f;
> > > >               uint64_t val = 0;
> > > >
> > > > -            qemu_mutex_lock_iothread();
> > > >               current_cpu = cpu;
> > > >
> > > >               DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
> > > > @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> > > >                   hvf_set_reg(cpu, srt, val);
> > > >               }
> > > >
> > > > -            qemu_mutex_unlock_iothread();
> > > > -
> > > >               advance_pc = true;
> > > >               break;
> > > >           }
> > > > @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
> > > >           case EC_WFX_TRAP:
> > > >               if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
> > > >                   (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > > > -                uint64_t cval, ctl, val, diff, now;
> > > > +                uint64_t cval;
> > > >
> > > > -                /* Set up a local timer for vtimer if necessary ... */
> > > > -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
> > > > -                assert_hvf_ok(r);
> > > >                   r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
> > > >                   assert_hvf_ok(r);
> > > >
> > > > -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
> > > > -                diff = cval - val;
> > > > -
> > > > -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
> > > > -                      gt_cntfrq_period_ns(arm_cpu);
> > > > -
> > > > -                /* Timer disabled or masked, just wait for long */
> > > > -                if (!(ctl & 1) || (ctl & 2)) {
> > > > -                    diff = (120 * NANOSECONDS_PER_SECOND) /
> > > > -                           gt_cntfrq_period_ns(arm_cpu);
> > > > +                int64_t ticks_to_sleep = cval - mach_absolute_time();
> > > > +                if (ticks_to_sleep < 0) {
> > > > +                    break;
> > >
> > >
> > > This will loop at 100% for Windows, which configures the vtimer as
> > > cval=0 ctl=7, so with IRQ mask bit set.
> >
> > Okay, but the 120s is kind of arbitrary so we should just sleep until
> > we get a signal. That can be done by passing null as the timespec
> > argument to pselect().
> >
> > >
> > >
> > > Alex
> > >
> > >
> > > >                   }
> > > >
> > > > -                if (diff < INT64_MAX) {
> > > > -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> > > > -                    struct timespec *ts = &cpu->hvf->ts;
> > > > -
> > > > -                    *ts = (struct timespec){
> > > > -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
> > > > -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
> > > > -                    };
> > > > -
> > > > -                    /*
> > > > -                     * Waking up easily takes 1ms, don't go to sleep for smaller
> > > > -                     * time periods than 2ms.
> > > > -                     */
> > > > -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
> > >
> > >
> > > I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to
> > > return. Without logic like this, super short WFIs will hurt performance
> > > quite badly.
> >
> > I don't think that's accurate. According to this benchmark it's a few
> > hundred nanoseconds at most.
> >
> > pcc@pac-mini /tmp> cat pselect.c
> > #include <signal.h>
> > #include <sys/select.h>
> >
> > int main() {
> >   sigset_t mask, orig_mask;
> >   pthread_sigmask(SIG_SETMASK, 0, &mask);
> >   sigaddset(&mask, SIGUSR1);
> >   pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
> >
> >   for (int i = 0; i != 1000000; ++i) {
> >     struct timespec ts = { 0, 1 };
> >     pselect(0, 0, 0, 0, &ts, &orig_mask);
> >   }
> > }
> > pcc@pac-mini /tmp> time ./pselect
> >
> > ________________________________________________________
> > Executed in  179.87 millis    fish           external
> >    usr time   77.68 millis   57.00 micros   77.62 millis
> >    sys time  101.37 millis  852.00 micros  100.52 millis
> >
> > Besides, all that you're really saving here is the single pselect
> > call. There are no doubt more expensive syscalls involved in exiting
> > and entering the VCPU that would dominate here.
> >
> > Peter
> >
> > >
> > >
> > > Alex
> > >
> > > > -                        advance_pc = true;
> > > > -                        break;
> > > > -                    }
> > > > -
> > > > -                    /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
> > > > -                    cpu->hvf->sleeping = true;
> > > > -                    smp_mb();
> > > > -
> > > > -                    /* Bail out if we received an IRQ meanwhile */
> > > > -                    if (cpu->thread_kicked || (cpu->interrupt_request &
> > > > -                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > > > -                        cpu->hvf->sleeping = false;
> > > > -                        break;
> > > > -                    }
> > > > -
> > > > -                    /* nanosleep returns on signal, so we wake up on kick. */
> > > > -                    nanosleep(ts, NULL);
> > > > -
> > > > -                    /* Out of sleep - either naturally or because of a kick */
> > > > -                    cpu->hvf->sleeping = false;
> > > > -                }
> > > > +                uint64_t seconds = ticks_to_sleep / arm_cpu->gt_cntfrq_hz;
> > > > +                uint64_t nanos =
> > > > +                    (ticks_to_sleep - arm_cpu->gt_cntfrq_hz * seconds) *
> > > > +                    1000000000 / arm_cpu->gt_cntfrq_hz;
> > > > +                struct timespec ts = { seconds, nanos };
> > > > +
> > > > +                /*
> > > > +                 * Use pselect to sleep so that other threads can IPI us while
> > > > +                 * we're sleeping.
> > > > +                 */
> > > > +                qatomic_mb_set(&cpu->thread_kicked, false);
> > > > +                qemu_mutex_unlock_iothread();
> > > > +                pselect(0, 0, 0, 0, &ts, &cpu->hvf->unblock_ipi_mask);
> > > > +                qemu_mutex_lock_iothread();
> > > >
> > > >                   advance_pc = true;
> > > >               }
> > > >               break;
> > > >           case EC_AA64_HVC:
> > > >               cpu_synchronize_state(cpu);
> > > > -            qemu_mutex_lock_iothread();
> > > >               current_cpu = cpu;
> > > >               if (arm_is_psci_call(arm_cpu, EXCP_HVC)) {
> > > >                   arm_handle_psci_call(arm_cpu);
> > > > @@ -562,11 +520,9 @@ int hvf_vcpu_exec(CPUState *cpu)
> > > >                   DPRINTF("unknown HVC! %016llx", env->xregs[0]);
> > > >                   env->xregs[0] = -1;
> > > >               }
> > > > -            qemu_mutex_unlock_iothread();
> > > >               break;
> > > >           case EC_AA64_SMC:
> > > >               cpu_synchronize_state(cpu);
> > > > -            qemu_mutex_lock_iothread();
> > > >               current_cpu = cpu;
> > > >               if (arm_is_psci_call(arm_cpu, EXCP_SMC)) {
> > > >                   arm_handle_psci_call(arm_cpu);
> > > > @@ -575,7 +531,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> > > >                   env->xregs[0] = -1;
> > > >                   env->pc += 4;
> > > >               }
> > > > -            qemu_mutex_unlock_iothread();
> > > >               break;
> > > >           default:
> > > >               cpu_synchronize_state(cpu);
> > > > @@ -594,6 +549,7 @@ int hvf_vcpu_exec(CPUState *cpu)
> > > >               r = hv_vcpu_set_reg(cpu->hvf->fd, HV_REG_PC, pc);
> > > >               assert_hvf_ok(r);
> > > >           }
> > > > +        qemu_mutex_unlock_iothread();
> > > >       } while (ret == 0);
> > > >
> > > >       qemu_mutex_lock_iothread();

Re: [PATCH] arm/hvf: Optimize and simplify WFI handling

Posted by Alexander Graf 4 years, 3 months ago

On 01.12.20 09:21, Peter Collingbourne wrote:
> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> up on IPI.
>
> Signed-off-by: Peter Collingbourne <pcc@google.com>
> ---
> Alexander Graf wrote:
>> I would love to take a patch from you here :). I'll still be stuck for a
>> while with the sysreg sync rework that Peter asked for before I can look
>> at WFI again.
> Okay, here's a patch :) It's a relatively straightforward adaptation
> of what we have in our fork, which can now boot Android to GUI while
> remaining at around 4% CPU when idle.
>
> I'm not set up to boot a full Linux distribution at the moment so I
> tested it on upstream QEMU by running a recent mainline Linux kernel
> with a rootfs containing an init program that just does sleep(5)
> and verified that the qemu process remains at low CPU usage during
> the sleep. This was on top of your v2 plus the last patch of your v1
> since it doesn't look like you have a replacement for that logic yet.


How about something like this instead?


Alex


diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
index 4360f64671..50384013ea 100644
--- a/accel/hvf/hvf-cpus.c
+++ b/accel/hvf/hvf-cpus.c
@@ -337,16 +337,18 @@ static int hvf_init_vcpu(CPUState *cpu)
      cpu->hvf = g_malloc0(sizeof(*cpu->hvf));

      /* init cpu signals */
-    sigset_t set;
      struct sigaction sigact;

      memset(&sigact, 0, sizeof(sigact));
      sigact.sa_handler = dummy_signal;
      sigaction(SIG_IPI, &sigact, NULL);

-    pthread_sigmask(SIG_BLOCK, NULL, &set);
-    sigdelset(&set, SIG_IPI);
-    pthread_sigmask(SIG_SETMASK, &set, NULL);
+    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask);
+    sigdelset(&cpu->hvf->sigmask, SIG_IPI);
+    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
+
+    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask_ipi);
+    sigaddset(&cpu->hvf->sigmask_ipi, SIG_IPI);

  #ifdef __aarch64__
      r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t 
**)&cpu->hvf->exit, NULL);
diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
index c56baa3ae8..6e237f2db0 100644
--- a/include/sysemu/hvf_int.h
+++ b/include/sysemu/hvf_int.h
@@ -62,8 +62,9 @@ extern HVFState *hvf_state;
  struct hvf_vcpu_state {
      uint64_t fd;
      void *exit;
-    struct timespec ts;
      bool sleeping;
+    sigset_t sigmask;
+    sigset_t sigmask_ipi;
  };

  void assert_hvf_ok(hv_return_t ret);
diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
index 0c01a03725..350b845e6e 100644
--- a/target/arm/hvf/hvf.c
+++ b/target/arm/hvf/hvf.c
@@ -320,20 +320,24 @@ int hvf_arch_init_vcpu(CPUState *cpu)

  void hvf_kick_vcpu_thread(CPUState *cpu)
  {
-    if (cpu->hvf->sleeping) {
-        /*
-         * When sleeping, make sure we always send signals. Also, clear the
-         * timespec, so that an IPI that arrives between setting 
hvf->sleeping
-         * and the nanosleep syscall still aborts the sleep.
-         */
-        cpu->thread_kicked = false;
-        cpu->hvf->ts = (struct timespec){ };
+    if (qatomic_read(&cpu->hvf->sleeping)) {
+        /* When sleeping, send a signal to get out of pselect */
          cpus_kick_thread(cpu);
      } else {
          hv_vcpus_exit(&cpu->hvf->fd, 1);
      }
  }

+static void hvf_block_sig_ipi(CPUState *cpu)
+{
+    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask_ipi, NULL);
+}
+
+static void hvf_unblock_sig_ipi(CPUState *cpu)
+{
+    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
+}
+
  static int hvf_inject_interrupts(CPUState *cpu)
  {
      if (cpu->interrupt_request & CPU_INTERRUPT_FIQ) {
@@ -354,6 +358,7 @@ int hvf_vcpu_exec(CPUState *cpu)
      ARMCPU *arm_cpu = ARM_CPU(cpu);
      CPUARMState *env = &arm_cpu->env;
      hv_vcpu_exit_t *hvf_exit = cpu->hvf->exit;
+    const uint32_t irq_mask = CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ;
      hv_return_t r;
      int ret = 0;

@@ -491,8 +496,8 @@ int hvf_vcpu_exec(CPUState *cpu)
              break;
          }
          case EC_WFX_TRAP:
-            if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
-                (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
+            if (!(syndrome & WFX_IS_WFE) &&
+                !(cpu->interrupt_request & irq_mask)) {
                  uint64_t cval, ctl, val, diff, now;

                  /* Set up a local timer for vtimer if necessary ... */
@@ -515,9 +520,7 @@ int hvf_vcpu_exec(CPUState *cpu)

                  if (diff < INT64_MAX) {
                      uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
-                    struct timespec *ts = &cpu->hvf->ts;
-
-                    *ts = (struct timespec){
+                    struct timespec ts = {
                          .tv_sec = ns / NANOSECONDS_PER_SECOND,
                          .tv_nsec = ns % NANOSECONDS_PER_SECOND,
                      };
@@ -526,27 +529,31 @@ int hvf_vcpu_exec(CPUState *cpu)
                       * Waking up easily takes 1ms, don't go to sleep 
for smaller
                       * time periods than 2ms.
                       */
-                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
+                    if (!ts.tv_sec && (ts.tv_nsec < (SCALE_MS * 2))) {
                          advance_pc = true;
                          break;
                      }

+                    /* block SIG_IPI for the sleep */
+                    hvf_block_sig_ipi(cpu);
+                    cpu->thread_kicked = false;
+
                      /* Set cpu->hvf->sleeping so that we get a SIG_IPI 
signal. */
-                    cpu->hvf->sleeping = true;
-                    smp_mb();
+                    qatomic_set(&cpu->hvf->sleeping, true);

-                    /* Bail out if we received an IRQ meanwhile */
-                    if (cpu->thread_kicked || (cpu->interrupt_request &
-                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
-                        cpu->hvf->sleeping = false;
+                    /* Bail out if we received a kick meanwhile */
+                    if (qatomic_read(&cpu->interrupt_request) & irq_mask) {
+ qatomic_set(&cpu->hvf->sleeping, false);
+                        hvf_unblock_sig_ipi(cpu);
                          break;
                      }

-                    /* nanosleep returns on signal, so we wake up on 
kick. */
-                    nanosleep(ts, NULL);
+                    /* pselect returns on kick signal and consumes it */
+                    pselect(0, 0, 0, 0, &ts, &cpu->hvf->sigmask);

                      /* Out of sleep - either naturally or because of a 
kick */
-                    cpu->hvf->sleeping = false;
+                    qatomic_set(&cpu->hvf->sleeping, false);
+                    hvf_unblock_sig_ipi(cpu);
                  }

                  advance_pc = true;

Re: [PATCH] arm/hvf: Optimize and simplify WFI handling

Posted by Peter Collingbourne 4 years, 3 months ago

On Tue, Dec 1, 2020 at 8:26 AM Alexander Graf <agraf@csgraf.de> wrote:
>
>
> On 01.12.20 09:21, Peter Collingbourne wrote:
> > Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> > up on IPI.
> >
> > Signed-off-by: Peter Collingbourne <pcc@google.com>
> > ---
> > Alexander Graf wrote:
> >> I would love to take a patch from you here :). I'll still be stuck for a
> >> while with the sysreg sync rework that Peter asked for before I can look
> >> at WFI again.
> > Okay, here's a patch :) It's a relatively straightforward adaptation
> > of what we have in our fork, which can now boot Android to GUI while
> > remaining at around 4% CPU when idle.
> >
> > I'm not set up to boot a full Linux distribution at the moment so I
> > tested it on upstream QEMU by running a recent mainline Linux kernel
> > with a rootfs containing an init program that just does sleep(5)
> > and verified that the qemu process remains at low CPU usage during
> > the sleep. This was on top of your v2 plus the last patch of your v1
> > since it doesn't look like you have a replacement for that logic yet.
>
>
> How about something like this instead?
>
>
> Alex
>
>
> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> index 4360f64671..50384013ea 100644
> --- a/accel/hvf/hvf-cpus.c
> +++ b/accel/hvf/hvf-cpus.c
> @@ -337,16 +337,18 @@ static int hvf_init_vcpu(CPUState *cpu)
>       cpu->hvf = g_malloc0(sizeof(*cpu->hvf));
>
>       /* init cpu signals */
> -    sigset_t set;
>       struct sigaction sigact;
>
>       memset(&sigact, 0, sizeof(sigact));
>       sigact.sa_handler = dummy_signal;
>       sigaction(SIG_IPI, &sigact, NULL);
>
> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> -    sigdelset(&set, SIG_IPI);
> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask);
> +    sigdelset(&cpu->hvf->sigmask, SIG_IPI);
> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
> +
> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask_ipi);
> +    sigaddset(&cpu->hvf->sigmask_ipi, SIG_IPI);

There's no reason to unblock SIG_IPI while not in pselect and it can
easily lead to missed wakeups. The whole point of pselect is so that
you can guarantee that only one part of your program sees signals
without a possibility of them being missed.

>
>   #ifdef __aarch64__
>       r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t
> **)&cpu->hvf->exit, NULL);
> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> index c56baa3ae8..6e237f2db0 100644
> --- a/include/sysemu/hvf_int.h
> +++ b/include/sysemu/hvf_int.h
> @@ -62,8 +62,9 @@ extern HVFState *hvf_state;
>   struct hvf_vcpu_state {
>       uint64_t fd;
>       void *exit;
> -    struct timespec ts;
>       bool sleeping;
> +    sigset_t sigmask;
> +    sigset_t sigmask_ipi;
>   };
>
>   void assert_hvf_ok(hv_return_t ret);
> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> index 0c01a03725..350b845e6e 100644
> --- a/target/arm/hvf/hvf.c
> +++ b/target/arm/hvf/hvf.c
> @@ -320,20 +320,24 @@ int hvf_arch_init_vcpu(CPUState *cpu)
>
>   void hvf_kick_vcpu_thread(CPUState *cpu)
>   {
> -    if (cpu->hvf->sleeping) {
> -        /*
> -         * When sleeping, make sure we always send signals. Also, clear the
> -         * timespec, so that an IPI that arrives between setting
> hvf->sleeping
> -         * and the nanosleep syscall still aborts the sleep.
> -         */
> -        cpu->thread_kicked = false;
> -        cpu->hvf->ts = (struct timespec){ };
> +    if (qatomic_read(&cpu->hvf->sleeping)) {
> +        /* When sleeping, send a signal to get out of pselect */
>           cpus_kick_thread(cpu);
>       } else {
>           hv_vcpus_exit(&cpu->hvf->fd, 1);
>       }
>   }
>
> +static void hvf_block_sig_ipi(CPUState *cpu)
> +{
> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask_ipi, NULL);
> +}
> +
> +static void hvf_unblock_sig_ipi(CPUState *cpu)
> +{
> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
> +}
> +
>   static int hvf_inject_interrupts(CPUState *cpu)
>   {
>       if (cpu->interrupt_request & CPU_INTERRUPT_FIQ) {
> @@ -354,6 +358,7 @@ int hvf_vcpu_exec(CPUState *cpu)
>       ARMCPU *arm_cpu = ARM_CPU(cpu);
>       CPUARMState *env = &arm_cpu->env;
>       hv_vcpu_exit_t *hvf_exit = cpu->hvf->exit;
> +    const uint32_t irq_mask = CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ;
>       hv_return_t r;
>       int ret = 0;
>
> @@ -491,8 +496,8 @@ int hvf_vcpu_exec(CPUState *cpu)
>               break;
>           }
>           case EC_WFX_TRAP:
> -            if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
> -                (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> +            if (!(syndrome & WFX_IS_WFE) &&
> +                !(cpu->interrupt_request & irq_mask)) {
>                   uint64_t cval, ctl, val, diff, now;

I don't think the access to cpu->interrupt_request is safe because it
is done while not under the iothread lock. That's why to avoid these
types of issues I would prefer to hold the lock almost all of the
time.

>                   /* Set up a local timer for vtimer if necessary ... */
> @@ -515,9 +520,7 @@ int hvf_vcpu_exec(CPUState *cpu)
>
>                   if (diff < INT64_MAX) {
>                       uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> -                    struct timespec *ts = &cpu->hvf->ts;
> -
> -                    *ts = (struct timespec){
> +                    struct timespec ts = {
>                           .tv_sec = ns / NANOSECONDS_PER_SECOND,
>                           .tv_nsec = ns % NANOSECONDS_PER_SECOND,
>                       };
> @@ -526,27 +529,31 @@ int hvf_vcpu_exec(CPUState *cpu)
>                        * Waking up easily takes 1ms, don't go to sleep
> for smaller
>                        * time periods than 2ms.
>                        */
> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
> +                    if (!ts.tv_sec && (ts.tv_nsec < (SCALE_MS * 2))) {
>                           advance_pc = true;
>                           break;
>                       }
>
> +                    /* block SIG_IPI for the sleep */
> +                    hvf_block_sig_ipi(cpu);
> +                    cpu->thread_kicked = false;
> +
>                       /* Set cpu->hvf->sleeping so that we get a SIG_IPI
> signal. */
> -                    cpu->hvf->sleeping = true;
> -                    smp_mb();
> +                    qatomic_set(&cpu->hvf->sleeping, true);

This doesn't protect against races because another thread could call
kvf_vcpu_kick_thread() at any time between when we return from
hv_vcpu_run() and when we set sleeping = true and we would miss the
wakeup (due to kvf_vcpu_kick_thread() seeing sleeping = false and
calling hv_vcpus_exit() instead of pthread_kill()). I don't think it
can be fixed by setting sleeping to true earlier either because no
matter how early you move it, there will always be a window where we
are going to pselect() but sleeping is false, resulting in a missed
wakeup.

Peter

>
> -                    /* Bail out if we received an IRQ meanwhile */
> -                    if (cpu->thread_kicked || (cpu->interrupt_request &
> -                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> -                        cpu->hvf->sleeping = false;
> +                    /* Bail out if we received a kick meanwhile */
> +                    if (qatomic_read(&cpu->interrupt_request) & irq_mask) {
> + qatomic_set(&cpu->hvf->sleeping, false);
> +                        hvf_unblock_sig_ipi(cpu);
>                           break;
>                       }
>
> -                    /* nanosleep returns on signal, so we wake up on
> kick. */
> -                    nanosleep(ts, NULL);
> +                    /* pselect returns on kick signal and consumes it */
> +                    pselect(0, 0, 0, 0, &ts, &cpu->hvf->sigmask);
>
>                       /* Out of sleep - either naturally or because of a
> kick */
> -                    cpu->hvf->sleeping = false;
> +                    qatomic_set(&cpu->hvf->sleeping, false);
> +                    hvf_unblock_sig_ipi(cpu);
>                   }
>
>                   advance_pc = true;
>

Re: [PATCH] arm/hvf: Optimize and simplify WFI handling

Posted by Alexander Graf 4 years, 3 months ago

On 01.12.20 21:03, Peter Collingbourne wrote:
> On Tue, Dec 1, 2020 at 8:26 AM Alexander Graf <agraf@csgraf.de> wrote:
>>
>> On 01.12.20 09:21, Peter Collingbourne wrote:
>>> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
>>> up on IPI.
>>>
>>> Signed-off-by: Peter Collingbourne <pcc@google.com>
>>> ---
>>> Alexander Graf wrote:
>>>> I would love to take a patch from you here :). I'll still be stuck for a
>>>> while with the sysreg sync rework that Peter asked for before I can look
>>>> at WFI again.
>>> Okay, here's a patch :) It's a relatively straightforward adaptation
>>> of what we have in our fork, which can now boot Android to GUI while
>>> remaining at around 4% CPU when idle.
>>>
>>> I'm not set up to boot a full Linux distribution at the moment so I
>>> tested it on upstream QEMU by running a recent mainline Linux kernel
>>> with a rootfs containing an init program that just does sleep(5)
>>> and verified that the qemu process remains at low CPU usage during
>>> the sleep. This was on top of your v2 plus the last patch of your v1
>>> since it doesn't look like you have a replacement for that logic yet.
>>
>> How about something like this instead?
>>
>>
>> Alex
>>
>>
>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>> index 4360f64671..50384013ea 100644
>> --- a/accel/hvf/hvf-cpus.c
>> +++ b/accel/hvf/hvf-cpus.c
>> @@ -337,16 +337,18 @@ static int hvf_init_vcpu(CPUState *cpu)
>>        cpu->hvf = g_malloc0(sizeof(*cpu->hvf));
>>
>>        /* init cpu signals */
>> -    sigset_t set;
>>        struct sigaction sigact;
>>
>>        memset(&sigact, 0, sizeof(sigact));
>>        sigact.sa_handler = dummy_signal;
>>        sigaction(SIG_IPI, &sigact, NULL);
>>
>> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>> -    sigdelset(&set, SIG_IPI);
>> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask);
>> +    sigdelset(&cpu->hvf->sigmask, SIG_IPI);
>> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
>> +
>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask_ipi);
>> +    sigaddset(&cpu->hvf->sigmask_ipi, SIG_IPI);
> There's no reason to unblock SIG_IPI while not in pselect and it can
> easily lead to missed wakeups. The whole point of pselect is so that
> you can guarantee that only one part of your program sees signals
> without a possibility of them being missed.


Hm, I think I start to agree with you here :). We can probably just 
leave SIG_IPI masked at all times and only unmask on pselect. The worst 
thing that will happen is a premature wakeup if we did get an IPI 
incoming while hvf->sleeping is set, but were either not running 
pselect() yet and bailed out or already finished pselect() execution.


>
>>    #ifdef __aarch64__
>>        r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t
>> **)&cpu->hvf->exit, NULL);
>> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>> index c56baa3ae8..6e237f2db0 100644
>> --- a/include/sysemu/hvf_int.h
>> +++ b/include/sysemu/hvf_int.h
>> @@ -62,8 +62,9 @@ extern HVFState *hvf_state;
>>    struct hvf_vcpu_state {
>>        uint64_t fd;
>>        void *exit;
>> -    struct timespec ts;
>>        bool sleeping;
>> +    sigset_t sigmask;
>> +    sigset_t sigmask_ipi;
>>    };
>>
>>    void assert_hvf_ok(hv_return_t ret);
>> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
>> index 0c01a03725..350b845e6e 100644
>> --- a/target/arm/hvf/hvf.c
>> +++ b/target/arm/hvf/hvf.c
>> @@ -320,20 +320,24 @@ int hvf_arch_init_vcpu(CPUState *cpu)
>>
>>    void hvf_kick_vcpu_thread(CPUState *cpu)
>>    {
>> -    if (cpu->hvf->sleeping) {
>> -        /*
>> -         * When sleeping, make sure we always send signals. Also, clear the
>> -         * timespec, so that an IPI that arrives between setting
>> hvf->sleeping
>> -         * and the nanosleep syscall still aborts the sleep.
>> -         */
>> -        cpu->thread_kicked = false;
>> -        cpu->hvf->ts = (struct timespec){ };
>> +    if (qatomic_read(&cpu->hvf->sleeping)) {
>> +        /* When sleeping, send a signal to get out of pselect */
>>            cpus_kick_thread(cpu);
>>        } else {
>>            hv_vcpus_exit(&cpu->hvf->fd, 1);
>>        }
>>    }
>>
>> +static void hvf_block_sig_ipi(CPUState *cpu)
>> +{
>> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask_ipi, NULL);
>> +}
>> +
>> +static void hvf_unblock_sig_ipi(CPUState *cpu)
>> +{
>> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
>> +}
>> +
>>    static int hvf_inject_interrupts(CPUState *cpu)
>>    {
>>        if (cpu->interrupt_request & CPU_INTERRUPT_FIQ) {
>> @@ -354,6 +358,7 @@ int hvf_vcpu_exec(CPUState *cpu)
>>        ARMCPU *arm_cpu = ARM_CPU(cpu);
>>        CPUARMState *env = &arm_cpu->env;
>>        hv_vcpu_exit_t *hvf_exit = cpu->hvf->exit;
>> +    const uint32_t irq_mask = CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ;
>>        hv_return_t r;
>>        int ret = 0;
>>
>> @@ -491,8 +496,8 @@ int hvf_vcpu_exec(CPUState *cpu)
>>                break;
>>            }
>>            case EC_WFX_TRAP:
>> -            if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
>> -                (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
>> +            if (!(syndrome & WFX_IS_WFE) &&
>> +                !(cpu->interrupt_request & irq_mask)) {
>>                    uint64_t cval, ctl, val, diff, now;
> I don't think the access to cpu->interrupt_request is safe because it
> is done while not under the iothread lock. That's why to avoid these
> types of issues I would prefer to hold the lock almost all of the
> time.


In this branch, that's not a problem yet. On stale values, we either 
don't sleep (which is ok), or we go into the sleep path, and reevaluate 
cpu->interrupt_request atomically again after setting hvf->sleeping.


>
>>                    /* Set up a local timer for vtimer if necessary ... */
>> @@ -515,9 +520,7 @@ int hvf_vcpu_exec(CPUState *cpu)
>>
>>                    if (diff < INT64_MAX) {
>>                        uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
>> -                    struct timespec *ts = &cpu->hvf->ts;
>> -
>> -                    *ts = (struct timespec){
>> +                    struct timespec ts = {
>>                            .tv_sec = ns / NANOSECONDS_PER_SECOND,
>>                            .tv_nsec = ns % NANOSECONDS_PER_SECOND,
>>                        };
>> @@ -526,27 +529,31 @@ int hvf_vcpu_exec(CPUState *cpu)
>>                         * Waking up easily takes 1ms, don't go to sleep
>> for smaller
>>                         * time periods than 2ms.
>>                         */
>> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
>> +                    if (!ts.tv_sec && (ts.tv_nsec < (SCALE_MS * 2))) {
>>                            advance_pc = true;
>>                            break;
>>                        }
>>
>> +                    /* block SIG_IPI for the sleep */
>> +                    hvf_block_sig_ipi(cpu);
>> +                    cpu->thread_kicked = false;
>> +
>>                        /* Set cpu->hvf->sleeping so that we get a SIG_IPI
>> signal. */
>> -                    cpu->hvf->sleeping = true;
>> -                    smp_mb();
>> +                    qatomic_set(&cpu->hvf->sleeping, true);
> This doesn't protect against races because another thread could call
> kvf_vcpu_kick_thread() at any time between when we return from
> hv_vcpu_run() and when we set sleeping = true and we would miss the
> wakeup (due to kvf_vcpu_kick_thread() seeing sleeping = false and
> calling hv_vcpus_exit() instead of pthread_kill()). I don't think it
> can be fixed by setting sleeping to true earlier either because no
> matter how early you move it, there will always be a window where we
> are going to pselect() but sleeping is false, resulting in a missed
> wakeup.


I don't follow. If anyone was sending us an IPI, it's because they want 
to notify us about an update to cpu->interrupt_request, right? In that 
case, the atomic read of that field below will catch it and bail out of 
the sleep sequence.


>
> Peter
>
>> -                    /* Bail out if we received an IRQ meanwhile */
>> -                    if (cpu->thread_kicked || (cpu->interrupt_request &
>> -                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
>> -                        cpu->hvf->sleeping = false;
>> +                    /* Bail out if we received a kick meanwhile */
>> +                    if (qatomic_read(&cpu->interrupt_request) & irq_mask) {
>> + qatomic_set(&cpu->hvf->sleeping, false);


^^^


Alex


>> +                        hvf_unblock_sig_ipi(cpu);
>>                            break;
>>                        }
>>
>> -                    /* nanosleep returns on signal, so we wake up on
>> kick. */
>> -                    nanosleep(ts, NULL);
>> +                    /* pselect returns on kick signal and consumes it */
>> +                    pselect(0, 0, 0, 0, &ts, &cpu->hvf->sigmask);
>>
>>                        /* Out of sleep - either naturally or because of a
>> kick */
>> -                    cpu->hvf->sleeping = false;
>> +                    qatomic_set(&cpu->hvf->sleeping, false);
>> +                    hvf_unblock_sig_ipi(cpu);
>>                    }
>>
>>                    advance_pc = true;
>>

Re: [PATCH] arm/hvf: Optimize and simplify WFI handling

Posted by Alexander Graf 4 years, 3 months ago

On 01.12.20 23:09, Alexander Graf wrote:
>
> On 01.12.20 21:03, Peter Collingbourne wrote:
>> On Tue, Dec 1, 2020 at 8:26 AM Alexander Graf <agraf@csgraf.de> wrote:
>>>
>>> On 01.12.20 09:21, Peter Collingbourne wrote:
>>>> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
>>>> up on IPI.
>>>>
>>>> Signed-off-by: Peter Collingbourne <pcc@google.com>
>>>> ---
>>>> Alexander Graf wrote:
>>>>> I would love to take a patch from you here :). I'll still be stuck 
>>>>> for a
>>>>> while with the sysreg sync rework that Peter asked for before I 
>>>>> can look
>>>>> at WFI again.
>>>> Okay, here's a patch :) It's a relatively straightforward adaptation
>>>> of what we have in our fork, which can now boot Android to GUI while
>>>> remaining at around 4% CPU when idle.
>>>>
>>>> I'm not set up to boot a full Linux distribution at the moment so I
>>>> tested it on upstream QEMU by running a recent mainline Linux kernel
>>>> with a rootfs containing an init program that just does sleep(5)
>>>> and verified that the qemu process remains at low CPU usage during
>>>> the sleep. This was on top of your v2 plus the last patch of your v1
>>>> since it doesn't look like you have a replacement for that logic yet.
>>>
>>> How about something like this instead?
>>>
>>>
>>> Alex
>>>
>>>
>>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>>> index 4360f64671..50384013ea 100644
>>> --- a/accel/hvf/hvf-cpus.c
>>> +++ b/accel/hvf/hvf-cpus.c
>>> @@ -337,16 +337,18 @@ static int hvf_init_vcpu(CPUState *cpu)
>>>        cpu->hvf = g_malloc0(sizeof(*cpu->hvf));
>>>
>>>        /* init cpu signals */
>>> -    sigset_t set;
>>>        struct sigaction sigact;
>>>
>>>        memset(&sigact, 0, sizeof(sigact));
>>>        sigact.sa_handler = dummy_signal;
>>>        sigaction(SIG_IPI, &sigact, NULL);
>>>
>>> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>> -    sigdelset(&set, SIG_IPI);
>>> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
>>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask);
>>> +    sigdelset(&cpu->hvf->sigmask, SIG_IPI);
>>> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
>>> +
>>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask_ipi);
>>> +    sigaddset(&cpu->hvf->sigmask_ipi, SIG_IPI);
>> There's no reason to unblock SIG_IPI while not in pselect and it can
>> easily lead to missed wakeups. The whole point of pselect is so that
>> you can guarantee that only one part of your program sees signals
>> without a possibility of them being missed.
>
>
> Hm, I think I start to agree with you here :). We can probably just 
> leave SIG_IPI masked at all times and only unmask on pselect. The 
> worst thing that will happen is a premature wakeup if we did get an 
> IPI incoming while hvf->sleeping is set, but were either not running 
> pselect() yet and bailed out or already finished pselect() execution.


How about this one? Do you really think it's still racy?


Alex


diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
index 4360f64671..e10fca622d 100644
--- a/accel/hvf/hvf-cpus.c
+++ b/accel/hvf/hvf-cpus.c
@@ -337,16 +337,17 @@ static int hvf_init_vcpu(CPUState *cpu)
      cpu->hvf = g_malloc0(sizeof(*cpu->hvf));

      /* init cpu signals */
-    sigset_t set;
      struct sigaction sigact;

      memset(&sigact, 0, sizeof(sigact));
      sigact.sa_handler = dummy_signal;
      sigaction(SIG_IPI, &sigact, NULL);

-    pthread_sigmask(SIG_BLOCK, NULL, &set);
-    sigdelset(&set, SIG_IPI);
-    pthread_sigmask(SIG_SETMASK, &set, NULL);
+    /* Remember unmasked IPI mask for pselect(), leave masked normally */
+    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask_ipi);
+    sigaddset(&cpu->hvf->sigmask_ipi, SIG_IPI);
+    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask_ipi, NULL);
+    sigdelset(&cpu->hvf->sigmask_ipi, SIG_IPI);

  #ifdef __aarch64__
      r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t 
**)&cpu->hvf->exit, NULL);
diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
index c56baa3ae8..8d7d4a6226 100644
--- a/include/sysemu/hvf_int.h
+++ b/include/sysemu/hvf_int.h
@@ -62,8 +62,8 @@ extern HVFState *hvf_state;
  struct hvf_vcpu_state {
      uint64_t fd;
      void *exit;
-    struct timespec ts;
      bool sleeping;
+    sigset_t sigmask_ipi;
  };

  void assert_hvf_ok(hv_return_t ret);
diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
index 0c01a03725..a255a1a7d3 100644
--- a/target/arm/hvf/hvf.c
+++ b/target/arm/hvf/hvf.c
@@ -320,14 +320,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)

  void hvf_kick_vcpu_thread(CPUState *cpu)
  {
-    if (cpu->hvf->sleeping) {
-        /*
-         * When sleeping, make sure we always send signals. Also, clear the
-         * timespec, so that an IPI that arrives between setting 
hvf->sleeping
-         * and the nanosleep syscall still aborts the sleep.
-         */
-        cpu->thread_kicked = false;
-        cpu->hvf->ts = (struct timespec){ };
+    if (qatomic_read(&cpu->hvf->sleeping)) {
+        /* When sleeping, send a signal to get out of pselect */
          cpus_kick_thread(cpu);
      } else {
          hv_vcpus_exit(&cpu->hvf->fd, 1);
@@ -354,6 +348,7 @@ int hvf_vcpu_exec(CPUState *cpu)
      ARMCPU *arm_cpu = ARM_CPU(cpu);
      CPUARMState *env = &arm_cpu->env;
      hv_vcpu_exit_t *hvf_exit = cpu->hvf->exit;
+    const uint32_t irq_mask = CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ;
      hv_return_t r;
      int ret = 0;

@@ -491,8 +486,8 @@ int hvf_vcpu_exec(CPUState *cpu)
              break;
          }
          case EC_WFX_TRAP:
-            if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
-                (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
+            if (!(syndrome & WFX_IS_WFE) &&
+                !(cpu->interrupt_request & irq_mask)) {
                  uint64_t cval, ctl, val, diff, now;

                  /* Set up a local timer for vtimer if necessary ... */
@@ -515,9 +510,7 @@ int hvf_vcpu_exec(CPUState *cpu)

                  if (diff < INT64_MAX) {
                      uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
-                    struct timespec *ts = &cpu->hvf->ts;
-
-                    *ts = (struct timespec){
+                    struct timespec ts = {
                          .tv_sec = ns / NANOSECONDS_PER_SECOND,
                          .tv_nsec = ns % NANOSECONDS_PER_SECOND,
                      };
@@ -526,27 +519,27 @@ int hvf_vcpu_exec(CPUState *cpu)
                       * Waking up easily takes 1ms, don't go to sleep 
for smaller
                       * time periods than 2ms.
                       */
-                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
+                    if (!ts.tv_sec && (ts.tv_nsec < (SCALE_MS * 2))) {
                          advance_pc = true;
                          break;
                      }

+                    cpu->thread_kicked = false;
+
                      /* Set cpu->hvf->sleeping so that we get a SIG_IPI 
signal. */
-                    cpu->hvf->sleeping = true;
-                    smp_mb();
+                    qatomic_set(&cpu->hvf->sleeping, true);

-                    /* Bail out if we received an IRQ meanwhile */
-                    if (cpu->thread_kicked || (cpu->interrupt_request &
-                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
-                        cpu->hvf->sleeping = false;
+                    /* Bail out if we received a kick meanwhile */
+                    if (qatomic_read(&cpu->interrupt_request) & irq_mask) {
+ qatomic_set(&cpu->hvf->sleeping, false);
                          break;
                      }

-                    /* nanosleep returns on signal, so we wake up on 
kick. */
-                    nanosleep(ts, NULL);
+                    /* pselect returns on kick signal and consumes it */
+                    pselect(0, 0, 0, 0, &ts, &cpu->hvf->sigmask_ipi);

                      /* Out of sleep - either naturally or because of a 
kick */
-                    cpu->hvf->sleeping = false;
+                    qatomic_set(&cpu->hvf->sleeping, false);
                  }

                  advance_pc = true;

Re: [PATCH] arm/hvf: Optimize and simplify WFI handling

Posted by Peter Collingbourne 4 years, 3 months ago

On Tue, Dec 1, 2020 at 2:09 PM Alexander Graf <agraf@csgraf.de> wrote:
>
>
> On 01.12.20 21:03, Peter Collingbourne wrote:
> > On Tue, Dec 1, 2020 at 8:26 AM Alexander Graf <agraf@csgraf.de> wrote:
> >>
> >> On 01.12.20 09:21, Peter Collingbourne wrote:
> >>> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> >>> up on IPI.
> >>>
> >>> Signed-off-by: Peter Collingbourne <pcc@google.com>
> >>> ---
> >>> Alexander Graf wrote:
> >>>> I would love to take a patch from you here :). I'll still be stuck for a
> >>>> while with the sysreg sync rework that Peter asked for before I can look
> >>>> at WFI again.
> >>> Okay, here's a patch :) It's a relatively straightforward adaptation
> >>> of what we have in our fork, which can now boot Android to GUI while
> >>> remaining at around 4% CPU when idle.
> >>>
> >>> I'm not set up to boot a full Linux distribution at the moment so I
> >>> tested it on upstream QEMU by running a recent mainline Linux kernel
> >>> with a rootfs containing an init program that just does sleep(5)
> >>> and verified that the qemu process remains at low CPU usage during
> >>> the sleep. This was on top of your v2 plus the last patch of your v1
> >>> since it doesn't look like you have a replacement for that logic yet.
> >>
> >> How about something like this instead?
> >>
> >>
> >> Alex
> >>
> >>
> >> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> >> index 4360f64671..50384013ea 100644
> >> --- a/accel/hvf/hvf-cpus.c
> >> +++ b/accel/hvf/hvf-cpus.c
> >> @@ -337,16 +337,18 @@ static int hvf_init_vcpu(CPUState *cpu)
> >>        cpu->hvf = g_malloc0(sizeof(*cpu->hvf));
> >>
> >>        /* init cpu signals */
> >> -    sigset_t set;
> >>        struct sigaction sigact;
> >>
> >>        memset(&sigact, 0, sizeof(sigact));
> >>        sigact.sa_handler = dummy_signal;
> >>        sigaction(SIG_IPI, &sigact, NULL);
> >>
> >> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> >> -    sigdelset(&set, SIG_IPI);
> >> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> >> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask);
> >> +    sigdelset(&cpu->hvf->sigmask, SIG_IPI);
> >> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
> >> +
> >> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask_ipi);
> >> +    sigaddset(&cpu->hvf->sigmask_ipi, SIG_IPI);
> > There's no reason to unblock SIG_IPI while not in pselect and it can
> > easily lead to missed wakeups. The whole point of pselect is so that
> > you can guarantee that only one part of your program sees signals
> > without a possibility of them being missed.
>
>
> Hm, I think I start to agree with you here :). We can probably just
> leave SIG_IPI masked at all times and only unmask on pselect. The worst
> thing that will happen is a premature wakeup if we did get an IPI
> incoming while hvf->sleeping is set, but were either not running
> pselect() yet and bailed out or already finished pselect() execution.

Ack.

> >
> >>    #ifdef __aarch64__
> >>        r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t
> >> **)&cpu->hvf->exit, NULL);
> >> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> >> index c56baa3ae8..6e237f2db0 100644
> >> --- a/include/sysemu/hvf_int.h
> >> +++ b/include/sysemu/hvf_int.h
> >> @@ -62,8 +62,9 @@ extern HVFState *hvf_state;
> >>    struct hvf_vcpu_state {
> >>        uint64_t fd;
> >>        void *exit;
> >> -    struct timespec ts;
> >>        bool sleeping;
> >> +    sigset_t sigmask;
> >> +    sigset_t sigmask_ipi;
> >>    };
> >>
> >>    void assert_hvf_ok(hv_return_t ret);
> >> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> >> index 0c01a03725..350b845e6e 100644
> >> --- a/target/arm/hvf/hvf.c
> >> +++ b/target/arm/hvf/hvf.c
> >> @@ -320,20 +320,24 @@ int hvf_arch_init_vcpu(CPUState *cpu)
> >>
> >>    void hvf_kick_vcpu_thread(CPUState *cpu)
> >>    {
> >> -    if (cpu->hvf->sleeping) {
> >> -        /*
> >> -         * When sleeping, make sure we always send signals. Also, clear the
> >> -         * timespec, so that an IPI that arrives between setting
> >> hvf->sleeping
> >> -         * and the nanosleep syscall still aborts the sleep.
> >> -         */
> >> -        cpu->thread_kicked = false;
> >> -        cpu->hvf->ts = (struct timespec){ };
> >> +    if (qatomic_read(&cpu->hvf->sleeping)) {
> >> +        /* When sleeping, send a signal to get out of pselect */
> >>            cpus_kick_thread(cpu);
> >>        } else {
> >>            hv_vcpus_exit(&cpu->hvf->fd, 1);
> >>        }
> >>    }
> >>
> >> +static void hvf_block_sig_ipi(CPUState *cpu)
> >> +{
> >> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask_ipi, NULL);
> >> +}
> >> +
> >> +static void hvf_unblock_sig_ipi(CPUState *cpu)
> >> +{
> >> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
> >> +}
> >> +
> >>    static int hvf_inject_interrupts(CPUState *cpu)
> >>    {
> >>        if (cpu->interrupt_request & CPU_INTERRUPT_FIQ) {
> >> @@ -354,6 +358,7 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>        ARMCPU *arm_cpu = ARM_CPU(cpu);
> >>        CPUARMState *env = &arm_cpu->env;
> >>        hv_vcpu_exit_t *hvf_exit = cpu->hvf->exit;
> >> +    const uint32_t irq_mask = CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ;
> >>        hv_return_t r;
> >>        int ret = 0;
> >>
> >> @@ -491,8 +496,8 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>                break;
> >>            }
> >>            case EC_WFX_TRAP:
> >> -            if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
> >> -                (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> >> +            if (!(syndrome & WFX_IS_WFE) &&
> >> +                !(cpu->interrupt_request & irq_mask)) {
> >>                    uint64_t cval, ctl, val, diff, now;
> > I don't think the access to cpu->interrupt_request is safe because it
> > is done while not under the iothread lock. That's why to avoid these
> > types of issues I would prefer to hold the lock almost all of the
> > time.
>
>
> In this branch, that's not a problem yet. On stale values, we either
> don't sleep (which is ok), or we go into the sleep path, and reevaluate
> cpu->interrupt_request atomically again after setting hvf->sleeping.

Okay, this may be a "benign race" (and it may be helped a little by
the M1's sequential consistency extension) but this is the sort of
thing that I'd prefer not to rely on. At least it should be an atomic
read.

> >
> >>                    /* Set up a local timer for vtimer if necessary ... */
> >> @@ -515,9 +520,7 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>
> >>                    if (diff < INT64_MAX) {
> >>                        uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> >> -                    struct timespec *ts = &cpu->hvf->ts;
> >> -
> >> -                    *ts = (struct timespec){
> >> +                    struct timespec ts = {
> >>                            .tv_sec = ns / NANOSECONDS_PER_SECOND,
> >>                            .tv_nsec = ns % NANOSECONDS_PER_SECOND,
> >>                        };
> >> @@ -526,27 +529,31 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>                         * Waking up easily takes 1ms, don't go to sleep
> >> for smaller
> >>                         * time periods than 2ms.
> >>                         */
> >> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
> >> +                    if (!ts.tv_sec && (ts.tv_nsec < (SCALE_MS * 2))) {
> >>                            advance_pc = true;
> >>                            break;
> >>                        }
> >>
> >> +                    /* block SIG_IPI for the sleep */
> >> +                    hvf_block_sig_ipi(cpu);
> >> +                    cpu->thread_kicked = false;
> >> +
> >>                        /* Set cpu->hvf->sleeping so that we get a SIG_IPI
> >> signal. */
> >> -                    cpu->hvf->sleeping = true;
> >> -                    smp_mb();
> >> +                    qatomic_set(&cpu->hvf->sleeping, true);
> > This doesn't protect against races because another thread could call
> > kvf_vcpu_kick_thread() at any time between when we return from
> > hv_vcpu_run() and when we set sleeping = true and we would miss the
> > wakeup (due to kvf_vcpu_kick_thread() seeing sleeping = false and
> > calling hv_vcpus_exit() instead of pthread_kill()). I don't think it
> > can be fixed by setting sleeping to true earlier either because no
> > matter how early you move it, there will always be a window where we
> > are going to pselect() but sleeping is false, resulting in a missed
> > wakeup.
>
>
> I don't follow. If anyone was sending us an IPI, it's because they want
> to notify us about an update to cpu->interrupt_request, right? In that
> case, the atomic read of that field below will catch it and bail out of
> the sleep sequence.

I think there are other possible IPI reasons, e.g. set halted to 1,
I/O events. Now we could check for halted below and maybe some of the
others but the code will be subtle and it seems like a game of
whack-a-mole to get them all. This is an example of what I was talking
about when I said that an approach that relies on the sleeping field
will be difficult to get right. I would strongly prefer to start with
a simple approach and maybe we can consider a more complicated one
later.

Peter

>
>
> >
> > Peter
> >
> >> -                    /* Bail out if we received an IRQ meanwhile */
> >> -                    if (cpu->thread_kicked || (cpu->interrupt_request &
> >> -                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> >> -                        cpu->hvf->sleeping = false;
> >> +                    /* Bail out if we received a kick meanwhile */
> >> +                    if (qatomic_read(&cpu->interrupt_request) & irq_mask) {
> >> + qatomic_set(&cpu->hvf->sleeping, false);
>
>
> ^^^
>
>
> Alex
>
>
> >> +                        hvf_unblock_sig_ipi(cpu);
> >>                            break;
> >>                        }
> >>
> >> -                    /* nanosleep returns on signal, so we wake up on
> >> kick. */
> >> -                    nanosleep(ts, NULL);
> >> +                    /* pselect returns on kick signal and consumes it */
> >> +                    pselect(0, 0, 0, 0, &ts, &cpu->hvf->sigmask);
> >>
> >>                        /* Out of sleep - either naturally or because of a
> >> kick */
> >> -                    cpu->hvf->sleeping = false;
> >> +                    qatomic_set(&cpu->hvf->sleeping, false);
> >> +                    hvf_unblock_sig_ipi(cpu);
> >>                    }
> >>
> >>                    advance_pc = true;
> >>

Re: [PATCH 2/8] hvf: Move common code out

Posted by Roman Bolshakov 4 years, 3 months ago

On Mon, Nov 30, 2020 at 10:40:49PM +0100, Alexander Graf wrote:
> Hi Peter,
> 
> On 30.11.20 22:08, Peter Collingbourne wrote:
> > On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
> > > 
> > > 
> > > On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
> > > > Hi Frank,
> > > > 
> > > > Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
> > > Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!
> > > > Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
> > > > 
> > > >    https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
> > > > 
> > > Thanks, we'll take a look :)
> > > 
> > > > Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
> > Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
> > I'll reply to your patch here. You have:
> > 
> > +                    /* Set cpu->hvf->sleeping so that we get a
> > SIG_IPI signal. */
> > +                    cpu->hvf->sleeping = true;
> > +                    smp_mb();
> > +
> > +                    /* Bail out if we received an IRQ meanwhile */
> > +                    if (cpu->thread_kicked || (cpu->interrupt_request &
> > +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > +                        cpu->hvf->sleeping = false;
> > +                        break;
> > +                    }
> > +
> > +                    /* nanosleep returns on signal, so we wake up on kick. */
> > +                    nanosleep(ts, NULL);
> > 
> > and then send the signal conditional on whether sleeping is true, but
> > I think this is racy. If the signal is sent after sleeping is set to
> > true but before entering nanosleep then I think it will be ignored and
> > we will miss the wakeup. That's why in my implementation I block IPI
> > on the CPU thread at startup and then use pselect to atomically
> > unblock and begin sleeping. The signal is sent unconditionally so
> > there's no need to worry about races between actually sleeping and the
> > "we think we're sleeping" state. It may lead to an extra wakeup but
> > that's better than missing it entirely.
> 
> 
> Thanks a bunch for the comment! So the trick I was using here is to modify
> the timespec from the kick function before sending the IPI signal. That way,
> we know that either we are inside the sleep (where the signal wakes it up)
> or we are outside the sleep (where timespec={} will make it return
> immediately).
> 
> The only race I can think of is if nanosleep does calculations based on the
> timespec and we happen to send the signal right there and then.
> 
> The problem with blocking IPIs is basically what Frank was describing
> earlier: How do you unset the IPI signal pending status? If the signal is
> never delivered, how can pselect differentiate "signal from last time is
> still pending" from "new signal because I got an IPI"?
> 
> 

Hi Alex,

There was a patch for x86 HVF that implements CPU kick and it wasn't
merged (mostly because of my lazyness). It has some changes like you
introduced in the series and VMX-specific handling of preemption timer
to gurantee interrupt delivery without kick loss:

https://patchwork.kernel.org/project/qemu-devel/patch/20200729124832.79375-1-r.bolshakov@yadro.com/

I wonder if it'd possible to have common handling of kicks for both x86
and arm (given that arch-specific bits are wrapped)?

Thanks,
Roman

Re: [PATCH 2/8] hvf: Move common code out

Posted by Peter Maydell 4 years, 3 months ago

On Mon, 30 Nov 2020 at 20:56, Frank Yang <lfy@google.com> wrote:
> We'd actually like to contribute upstream too :) We do want to maintain
> our own downstream though; Android Emulator codebase needs to work
> solidly on macos and windows which has made keeping up with upstream difficult

One of the main reasons why OSX and Windows support upstream is
not so great is because very few people are helping to develop,
test and support it upstream. The way to fix that IMHO is for more
people who do care about those platforms to actively engage
with us upstream to help in making those platforms move closer to
being first class citizens. If you stay on a downstream fork
forever then I don't think you'll ever see things improve.

thanks
-- PMM

Re: [PATCH 2/8] hvf: Move common code out

Posted by Frank Yang 4 years, 3 months ago

On Mon, Nov 30, 2020 at 2:10 PM Peter Maydell <peter.maydell@linaro.org>
wrote:

> On Mon, 30 Nov 2020 at 20:56, Frank Yang <lfy@google.com> wrote:
> > We'd actually like to contribute upstream too :) We do want to maintain
> > our own downstream though; Android Emulator codebase needs to work
> > solidly on macos and windows which has made keeping up with upstream
> difficult
>
> One of the main reasons why OSX and Windows support upstream is
> not so great is because very few people are helping to develop,
> test and support it upstream. The way to fix that IMHO is for more
> people who do care about those platforms to actively engage
> with us upstream to help in making those platforms move closer to
> being first class citizens. If you stay on a downstream fork
> forever then I don't think you'll ever see things improve.
>
> thanks
> -- PMM
>

That's a really good point. I'll definitely be more active about sending
comments upstream in the future :)

Frank

Re: [PATCH 2/8] hvf: Move common code out

Posted by Peter Collingbourne 4 years, 3 months ago

On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
>
>
>
> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
>>
>> Hi Frank,
>>
>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
>
> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!

We tracked down the discrepancies between CNTPCT_EL0 on the guest vs
on the host to the fact that CNTPCT_EL0 on the guest does not
increment while the system is asleep and as such corresponds to
mach_absolute_time() on the host (if you read the XNU sources you will
see that mach_absolute_time() is implemented as CNTPCT_EL0 plus a
constant representing the time spent asleep) while CNTPCT_EL0 on the
host does increment while asleep. This patch switches the
implementation over to using mach_absolute_time() instead of reading
CNTPCT_EL0 directly:

https://android-review.googlesource.com/c/platform/external/qemu/+/1514870

Peter

>>
>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
>>
>>   https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
>>
>
> Thanks, we'll take a look :)
>
>>
>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
>>
>> Also, is there a particular reason you're working on this super interesting and useful code in a random downstream fork of QEMU? Wouldn't it be more helpful to contribute to the upstream code base instead?
>
> We'd actually like to contribute upstream too :) We do want to maintain our own downstream though; Android Emulator codebase needs to work solidly on macos and windows which has made keeping up with upstream difficult, and staying on a previous version (2.12) with known quirks easier. (theres also some android related customization relating to Qt Ui + different set of virtual devices and snapshot support (incl. snapshots of graphics devices with OpenGLES state tracking), which we hope to separate into other libraries/processes, but its not insignificant)
>>
>>
>> Alex
>>
>>
>> On 30.11.20 21:15, Frank Yang wrote:
>>
>> Update: We're not quite sure how to compare the CNTV_CVAL and CNTVCT. But the high CPU usage seems to be mitigated by having a poll interval (like KVM does) in handling WFI:
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512501
>>
>> This is loosely inspired by https://elixir.bootlin.com/linux/v5.10-rc6/source/virt/kvm/kvm_main.c#L2766 which does seem to specify a poll interval.
>>
>> It would be cool if we could have a lightweight way to enter sleep and restart the vcpus precisely when CVAL passes, though.
>>
>> Frank
>>
>>
>> On Fri, Nov 27, 2020 at 3:30 PM Frank Yang <lfy@google.com> wrote:
>>>
>>> Hi all,
>>>
>>> +Peter Collingbourne
>>>
>>> I'm a developer on the Android Emulator, which is in a fork of QEMU.
>>>
>>> Peter and I have been working on an HVF Apple Silicon backend with an eye toward Android guests.
>>>
>>> We have gotten things to basically switch to Android userspace already (logcat/shell and graphics available at least)
>>>
>>> Our strategy so far has been to import logic from the KVM implementation and hook into QEMU's software devices that previously assumed to only work with TCG, or have KVM-specific paths.
>>>
>>> Thanks to Alexander for the tip on the 36-bit address space limitation btw; our way of addressing this is to still allow highmem but not put pci high mmio so high.
>>>
>>> Also, note we have a sleep/signal based mechanism to deal with WFx, which might be worth looking into in Alexander's implementation as well:
>>>
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551
>>>
>>> Patches so far, FYI:
>>>
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1513429/1
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512554/3
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512553/3
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512552/3
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551/3
>>>
>>> https://android.googlesource.com/platform/external/qemu/+/c17eb6a3ffd50047e9646aff6640b710cb8ff48a
>>> https://android.googlesource.com/platform/external/qemu/+/74bed16de1afb41b7a7ab8da1d1861226c9db63b
>>> https://android.googlesource.com/platform/external/qemu/+/eccd9e47ab2ccb9003455e3bb721f57f9ebc3c01
>>> https://android.googlesource.com/platform/external/qemu/+/54fe3d67ed4698e85826537a4f49b2b9074b2228
>>> https://android.googlesource.com/platform/external/qemu/+/82ef91a6fede1d1000f36be037ad4d58fbe0d102
>>> https://android.googlesource.com/platform/external/qemu/+/c28147aa7c74d98b858e99623d2fe46e74a379f6
>>>
>>> Peter's also noticed that there are extra steps needed for M1's to allow TCG to work, as it involves JIT:
>>>
>>> https://android.googlesource.com/platform/external/qemu/+/740e3fe47f88926c6bda9abb22ee6eae1bc254a9
>>>
>>> We'd appreciate any feedback/comments :)
>>>
>>> Best,
>>>
>>> Frank
>>>
>>> On Fri, Nov 27, 2020 at 1:57 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>>
>>>>
>>>> On 27.11.20 21:00, Roman Bolshakov wrote:
>>>> > On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
>>>> >> Until now, Hypervisor.framework has only been available on x86_64 systems.
>>>> >> With Apple Silicon shipping now, it extends its reach to aarch64. To
>>>> >> prepare for support for multiple architectures, let's move common code out
>>>> >> into its own accel directory.
>>>> >>
>>>> >> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>>>> >> ---
>>>> >>   MAINTAINERS                 |   9 +-
>>>> >>   accel/hvf/hvf-all.c         |  56 +++++
>>>> >>   accel/hvf/hvf-cpus.c        | 468 ++++++++++++++++++++++++++++++++++++
>>>> >>   accel/hvf/meson.build       |   7 +
>>>> >>   accel/meson.build           |   1 +
>>>> >>   include/sysemu/hvf_int.h    |  69 ++++++
>>>> >>   target/i386/hvf/hvf-cpus.c  | 131 ----------
>>>> >>   target/i386/hvf/hvf-cpus.h  |  25 --
>>>> >>   target/i386/hvf/hvf-i386.h  |  48 +---
>>>> >>   target/i386/hvf/hvf.c       | 360 +--------------------------
>>>> >>   target/i386/hvf/meson.build |   1 -
>>>> >>   target/i386/hvf/x86hvf.c    |  11 +-
>>>> >>   target/i386/hvf/x86hvf.h    |   2 -
>>>> >>   13 files changed, 619 insertions(+), 569 deletions(-)
>>>> >>   create mode 100644 accel/hvf/hvf-all.c
>>>> >>   create mode 100644 accel/hvf/hvf-cpus.c
>>>> >>   create mode 100644 accel/hvf/meson.build
>>>> >>   create mode 100644 include/sysemu/hvf_int.h
>>>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.c
>>>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.h
>>>> >>
>>>> >> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> >> index 68bc160f41..ca4b6d9279 100644
>>>> >> --- a/MAINTAINERS
>>>> >> +++ b/MAINTAINERS
>>>> >> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
>>>> >>   M: Roman Bolshakov <r.bolshakov@yadro.com>
>>>> >>   W: https://wiki.qemu.org/Features/HVF
>>>> >>   S: Maintained
>>>> >> -F: accel/stubs/hvf-stub.c
>>>> > There was a patch for that in the RFC series from Claudio.
>>>>
>>>>
>>>> Yeah, I'm not worried about this hunk :).
>>>>
>>>>
>>>> >
>>>> >>   F: target/i386/hvf/
>>>> >> +
>>>> >> +HVF
>>>> >> +M: Cameron Esfahani <dirty@apple.com>
>>>> >> +M: Roman Bolshakov <r.bolshakov@yadro.com>
>>>> >> +W: https://wiki.qemu.org/Features/HVF
>>>> >> +S: Maintained
>>>> >> +F: accel/hvf/
>>>> >>   F: include/sysemu/hvf.h
>>>> >> +F: include/sysemu/hvf_int.h
>>>> >>
>>>> >>   WHPX CPUs
>>>> >>   M: Sunil Muthuswamy <sunilmut@microsoft.com>
>>>> >> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
>>>> >> new file mode 100644
>>>> >> index 0000000000..47d77a472a
>>>> >> --- /dev/null
>>>> >> +++ b/accel/hvf/hvf-all.c
>>>> >> @@ -0,0 +1,56 @@
>>>> >> +/*
>>>> >> + * QEMU Hypervisor.framework support
>>>> >> + *
>>>> >> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>>>> >> + * the COPYING file in the top-level directory.
>>>> >> + *
>>>> >> + * Contributions after 2012-01-13 are licensed under the terms of the
>>>> >> + * GNU GPL, version 2 or (at your option) any later version.
>>>> >> + */
>>>> >> +
>>>> >> +#include "qemu/osdep.h"
>>>> >> +#include "qemu-common.h"
>>>> >> +#include "qemu/error-report.h"
>>>> >> +#include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >> +#include "sysemu/runstate.h"
>>>> >> +
>>>> >> +#include "qemu/main-loop.h"
>>>> >> +#include "sysemu/accel.h"
>>>> >> +
>>>> >> +#include <Hypervisor/Hypervisor.h>
>>>> >> +
>>>> >> +bool hvf_allowed;
>>>> >> +HVFState *hvf_state;
>>>> >> +
>>>> >> +void assert_hvf_ok(hv_return_t ret)
>>>> >> +{
>>>> >> +    if (ret == HV_SUCCESS) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    switch (ret) {
>>>> >> +    case HV_ERROR:
>>>> >> +        error_report("Error: HV_ERROR");
>>>> >> +        break;
>>>> >> +    case HV_BUSY:
>>>> >> +        error_report("Error: HV_BUSY");
>>>> >> +        break;
>>>> >> +    case HV_BAD_ARGUMENT:
>>>> >> +        error_report("Error: HV_BAD_ARGUMENT");
>>>> >> +        break;
>>>> >> +    case HV_NO_RESOURCES:
>>>> >> +        error_report("Error: HV_NO_RESOURCES");
>>>> >> +        break;
>>>> >> +    case HV_NO_DEVICE:
>>>> >> +        error_report("Error: HV_NO_DEVICE");
>>>> >> +        break;
>>>> >> +    case HV_UNSUPPORTED:
>>>> >> +        error_report("Error: HV_UNSUPPORTED");
>>>> >> +        break;
>>>> >> +    default:
>>>> >> +        error_report("Unknown Error");
>>>> >> +    }
>>>> >> +
>>>> >> +    abort();
>>>> >> +}
>>>> >> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>>>> >> new file mode 100644
>>>> >> index 0000000000..f9bb5502b7
>>>> >> --- /dev/null
>>>> >> +++ b/accel/hvf/hvf-cpus.c
>>>> >> @@ -0,0 +1,468 @@
>>>> >> +/*
>>>> >> + * Copyright 2008 IBM Corporation
>>>> >> + *           2008 Red Hat, Inc.
>>>> >> + * Copyright 2011 Intel Corporation
>>>> >> + * Copyright 2016 Veertu, Inc.
>>>> >> + * Copyright 2017 The Android Open Source Project
>>>> >> + *
>>>> >> + * QEMU Hypervisor.framework support
>>>> >> + *
>>>> >> + * This program is free software; you can redistribute it and/or
>>>> >> + * modify it under the terms of version 2 of the GNU General Public
>>>> >> + * License as published by the Free Software Foundation.
>>>> >> + *
>>>> >> + * This program is distributed in the hope that it will be useful,
>>>> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>>> >> + * General Public License for more details.
>>>> >> + *
>>>> >> + * You should have received a copy of the GNU General Public License
>>>> >> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
>>>> >> + *
>>>> >> + * This file contain code under public domain from the hvdos project:
>>>> >> + * https://github.com/mist64/hvdos
>>>> >> + *
>>>> >> + * Parts Copyright (c) 2011 NetApp, Inc.
>>>> >> + * All rights reserved.
>>>> >> + *
>>>> >> + * Redistribution and use in source and binary forms, with or without
>>>> >> + * modification, are permitted provided that the following conditions
>>>> >> + * are met:
>>>> >> + * 1. Redistributions of source code must retain the above copyright
>>>> >> + *    notice, this list of conditions and the following disclaimer.
>>>> >> + * 2. Redistributions in binary form must reproduce the above copyright
>>>> >> + *    notice, this list of conditions and the following disclaimer in the
>>>> >> + *    documentation and/or other materials provided with the distribution.
>>>> >> + *
>>>> >> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>>>> >> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>>>> >> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>>>> >> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
>>>> >> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>>>> >> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
>>>> >> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>>>> >> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
>>>> >> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
>>>> >> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>>>> >> + * SUCH DAMAGE.
>>>> >> + */
>>>> >> +
>>>> >> +#include "qemu/osdep.h"
>>>> >> +#include "qemu/error-report.h"
>>>> >> +#include "qemu/main-loop.h"
>>>> >> +#include "exec/address-spaces.h"
>>>> >> +#include "exec/exec-all.h"
>>>> >> +#include "sysemu/cpus.h"
>>>> >> +#include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >> +#include "sysemu/runstate.h"
>>>> >> +#include "qemu/guest-random.h"
>>>> >> +
>>>> >> +#include <Hypervisor/Hypervisor.h>
>>>> >> +
>>>> >> +/* Memory slots */
>>>> >> +
>>>> >> +struct mac_slot {
>>>> >> +    int present;
>>>> >> +    uint64_t size;
>>>> >> +    uint64_t gpa_start;
>>>> >> +    uint64_t gva;
>>>> >> +};
>>>> >> +
>>>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>>>> >> +{
>>>> >> +    hvf_slot *slot;
>>>> >> +    int x;
>>>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> +        slot = &hvf_state->slots[x];
>>>> >> +        if (slot->size && start < (slot->start + slot->size) &&
>>>> >> +            (start + size) > slot->start) {
>>>> >> +            return slot;
>>>> >> +        }
>>>> >> +    }
>>>> >> +    return NULL;
>>>> >> +}
>>>> >> +
>>>> >> +struct mac_slot mac_slots[32];
>>>> >> +
>>>> >> +static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>>>> >> +{
>>>> >> +    struct mac_slot *macslot;
>>>> >> +    hv_return_t ret;
>>>> >> +
>>>> >> +    macslot = &mac_slots[slot->slot_id];
>>>> >> +
>>>> >> +    if (macslot->present) {
>>>> >> +        if (macslot->size != slot->size) {
>>>> >> +            macslot->present = 0;
>>>> >> +            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>>>> >> +            assert_hvf_ok(ret);
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    if (!slot->size) {
>>>> >> +        return 0;
>>>> >> +    }
>>>> >> +
>>>> >> +    macslot->present = 1;
>>>> >> +    macslot->gpa_start = slot->start;
>>>> >> +    macslot->size = slot->size;
>>>> >> +    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
>>>> >> +    assert_hvf_ok(ret);
>>>> >> +    return 0;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>>>> >> +{
>>>> >> +    hvf_slot *mem;
>>>> >> +    MemoryRegion *area = section->mr;
>>>> >> +    bool writeable = !area->readonly && !area->rom_device;
>>>> >> +    hv_memory_flags_t flags;
>>>> >> +
>>>> >> +    if (!memory_region_is_ram(area)) {
>>>> >> +        if (writeable) {
>>>> >> +            return;
>>>> >> +        } else if (!memory_region_is_romd(area)) {
>>>> >> +            /*
>>>> >> +             * If the memory device is not in romd_mode, then we actually want
>>>> >> +             * to remove the hvf memory slot so all accesses will trap.
>>>> >> +             */
>>>> >> +             add = false;
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    mem = hvf_find_overlap_slot(
>>>> >> +            section->offset_within_address_space,
>>>> >> +            int128_get64(section->size));
>>>> >> +
>>>> >> +    if (mem && add) {
>>>> >> +        if (mem->size == int128_get64(section->size) &&
>>>> >> +            mem->start == section->offset_within_address_space &&
>>>> >> +            mem->mem == (memory_region_get_ram_ptr(area) +
>>>> >> +            section->offset_within_region)) {
>>>> >> +            return; /* Same region was attempted to register, go away. */
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    /* Region needs to be reset. set the size to 0 and remap it. */
>>>> >> +    if (mem) {
>>>> >> +        mem->size = 0;
>>>> >> +        if (do_hvf_set_memory(mem, 0)) {
>>>> >> +            error_report("Failed to reset overlapping slot");
>>>> >> +            abort();
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    if (!add) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    if (area->readonly ||
>>>> >> +        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
>>>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>>>> >> +    } else {
>>>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>>>> >> +    }
>>>> >> +
>>>> >> +    /* Now make a new slot. */
>>>> >> +    int x;
>>>> >> +
>>>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> +        mem = &hvf_state->slots[x];
>>>> >> +        if (!mem->size) {
>>>> >> +            break;
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    if (x == hvf_state->num_slots) {
>>>> >> +        error_report("No free slots");
>>>> >> +        abort();
>>>> >> +    }
>>>> >> +
>>>> >> +    mem->size = int128_get64(section->size);
>>>> >> +    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
>>>> >> +    mem->start = section->offset_within_address_space;
>>>> >> +    mem->region = area;
>>>> >> +
>>>> >> +    if (do_hvf_set_memory(mem, flags)) {
>>>> >> +        error_report("Error registering new memory slot");
>>>> >> +        abort();
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
>>>> >> +{
>>>> >> +    hvf_slot *slot;
>>>> >> +
>>>> >> +    slot = hvf_find_overlap_slot(
>>>> >> +            section->offset_within_address_space,
>>>> >> +            int128_get64(section->size));
>>>> >> +
>>>> >> +    /* protect region against writes; begin tracking it */
>>>> >> +    if (on) {
>>>> >> +        slot->flags |= HVF_SLOT_LOG;
>>>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>>>> >> +                      HV_MEMORY_READ);
>>>> >> +    /* stop tracking region*/
>>>> >> +    } else {
>>>> >> +        slot->flags &= ~HVF_SLOT_LOG;
>>>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>>>> >> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_log_start(MemoryListener *listener,
>>>> >> +                          MemoryRegionSection *section, int old, int new)
>>>> >> +{
>>>> >> +    if (old != 0) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    hvf_set_dirty_tracking(section, 1);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_log_stop(MemoryListener *listener,
>>>> >> +                         MemoryRegionSection *section, int old, int new)
>>>> >> +{
>>>> >> +    if (new != 0) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    hvf_set_dirty_tracking(section, 0);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_log_sync(MemoryListener *listener,
>>>> >> +                         MemoryRegionSection *section)
>>>> >> +{
>>>> >> +    /*
>>>> >> +     * sync of dirty pages is handled elsewhere; just make sure we keep
>>>> >> +     * tracking the region.
>>>> >> +     */
>>>> >> +    hvf_set_dirty_tracking(section, 1);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_region_add(MemoryListener *listener,
>>>> >> +                           MemoryRegionSection *section)
>>>> >> +{
>>>> >> +    hvf_set_phys_mem(section, true);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_region_del(MemoryListener *listener,
>>>> >> +                           MemoryRegionSection *section)
>>>> >> +{
>>>> >> +    hvf_set_phys_mem(section, false);
>>>> >> +}
>>>> >> +
>>>> >> +static MemoryListener hvf_memory_listener = {
>>>> >> +    .priority = 10,
>>>> >> +    .region_add = hvf_region_add,
>>>> >> +    .region_del = hvf_region_del,
>>>> >> +    .log_start = hvf_log_start,
>>>> >> +    .log_stop = hvf_log_stop,
>>>> >> +    .log_sync = hvf_log_sync,
>>>> >> +};
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    if (!cpu->vcpu_dirty) {
>>>> >> +        hvf_get_registers(cpu);
>>>> >> +        cpu->vcpu_dirty = true;
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_state(CPUState *cpu)
>>>> >> +{
>>>> >> +    if (!cpu->vcpu_dirty) {
>>>> >> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>>>> >> +                                              run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    hvf_put_registers(cpu);
>>>> >> +    cpu->vcpu_dirty = false;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>>>> >> +{
>>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
>>>> >> +}
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>>>> >> +                                             run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    hvf_put_registers(cpu);
>>>> >> +    cpu->vcpu_dirty = false;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
>>>> >> +{
>>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
>>>> >> +}
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>>>> >> +                                              run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    cpu->vcpu_dirty = true;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>>>> >> +{
>>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_vcpu_destroy(CPUState *cpu)
>>>> >> +{
>>>> >> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
>>>> >> +    assert_hvf_ok(ret);
>>>> >> +
>>>> >> +    hvf_arch_vcpu_destroy(cpu);
>>>> >> +}
>>>> >> +
>>>> >> +static void dummy_signal(int sig)
>>>> >> +{
>>>> >> +}
>>>> >> +
>>>> >> +static int hvf_init_vcpu(CPUState *cpu)
>>>> >> +{
>>>> >> +    int r;
>>>> >> +
>>>> >> +    /* init cpu signals */
>>>> >> +    sigset_t set;
>>>> >> +    struct sigaction sigact;
>>>> >> +
>>>> >> +    memset(&sigact, 0, sizeof(sigact));
>>>> >> +    sigact.sa_handler = dummy_signal;
>>>> >> +    sigaction(SIG_IPI, &sigact, NULL);
>>>> >> +
>>>> >> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>>> >> +    sigdelset(&set, SIG_IPI);
>>>> >> +
>>>> >> +#ifdef __aarch64__
>>>> >> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t **)&cpu->hvf_exit, NULL);
>>>> >> +#else
>>>> >> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>>>> >> +#endif
>>>> > I think the first __aarch64__ bit fits better to arm part of the series.
>>>>
>>>>
>>>> Oops. Thanks for catching it! Yes, absolutely. It should be part of the
>>>> ARM enablement.
>>>>
>>>>
>>>> >
>>>> >> +    cpu->vcpu_dirty = 1;
>>>> >> +    assert_hvf_ok(r);
>>>> >> +
>>>> >> +    return hvf_arch_init_vcpu(cpu);
>>>> >> +}
>>>> >> +
>>>> >> +/*
>>>> >> + * The HVF-specific vCPU thread function. This one should only run when the host
>>>> >> + * CPU supports the VMX "unrestricted guest" feature.
>>>> >> + */
>>>> >> +static void *hvf_cpu_thread_fn(void *arg)
>>>> >> +{
>>>> >> +    CPUState *cpu = arg;
>>>> >> +
>>>> >> +    int r;
>>>> >> +
>>>> >> +    assert(hvf_enabled());
>>>> >> +
>>>> >> +    rcu_register_thread();
>>>> >> +
>>>> >> +    qemu_mutex_lock_iothread();
>>>> >> +    qemu_thread_get_self(cpu->thread);
>>>> >> +
>>>> >> +    cpu->thread_id = qemu_get_thread_id();
>>>> >> +    cpu->can_do_io = 1;
>>>> >> +    current_cpu = cpu;
>>>> >> +
>>>> >> +    hvf_init_vcpu(cpu);
>>>> >> +
>>>> >> +    /* signal CPU creation */
>>>> >> +    cpu_thread_signal_created(cpu);
>>>> >> +    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>>>> >> +
>>>> >> +    do {
>>>> >> +        if (cpu_can_run(cpu)) {
>>>> >> +            r = hvf_vcpu_exec(cpu);
>>>> >> +            if (r == EXCP_DEBUG) {
>>>> >> +                cpu_handle_guest_debug(cpu);
>>>> >> +            }
>>>> >> +        }
>>>> >> +        qemu_wait_io_event(cpu);
>>>> >> +    } while (!cpu->unplug || cpu_can_run(cpu));
>>>> >> +
>>>> >> +    hvf_vcpu_destroy(cpu);
>>>> >> +    cpu_thread_signal_destroyed(cpu);
>>>> >> +    qemu_mutex_unlock_iothread();
>>>> >> +    rcu_unregister_thread();
>>>> >> +    return NULL;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_start_vcpu_thread(CPUState *cpu)
>>>> >> +{
>>>> >> +    char thread_name[VCPU_THREAD_NAME_SIZE];
>>>> >> +
>>>> >> +    /*
>>>> >> +     * HVF currently does not support TCG, and only runs in
>>>> >> +     * unrestricted-guest mode.
>>>> >> +     */
>>>> >> +    assert(hvf_enabled());
>>>> >> +
>>>> >> +    cpu->thread = g_malloc0(sizeof(QemuThread));
>>>> >> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>>>> >> +    qemu_cond_init(cpu->halt_cond);
>>>> >> +
>>>> >> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>>>> >> +             cpu->cpu_index);
>>>> >> +    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>>>> >> +                       cpu, QEMU_THREAD_JOINABLE);
>>>> >> +}
>>>> >> +
>>>> >> +static const CpusAccel hvf_cpus = {
>>>> >> +    .create_vcpu_thread = hvf_start_vcpu_thread,
>>>> >> +
>>>> >> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>>>> >> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>>>> >> +    .synchronize_state = hvf_cpu_synchronize_state,
>>>> >> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>>>> >> +};
>>>> >> +
>>>> >> +static int hvf_accel_init(MachineState *ms)
>>>> >> +{
>>>> >> +    int x;
>>>> >> +    hv_return_t ret;
>>>> >> +    HVFState *s;
>>>> >> +
>>>> >> +    ret = hv_vm_create(HV_VM_DEFAULT);
>>>> >> +    assert_hvf_ok(ret);
>>>> >> +
>>>> >> +    s = g_new0(HVFState, 1);
>>>> >> +
>>>> >> +    s->num_slots = 32;
>>>> >> +    for (x = 0; x < s->num_slots; ++x) {
>>>> >> +        s->slots[x].size = 0;
>>>> >> +        s->slots[x].slot_id = x;
>>>> >> +    }
>>>> >> +
>>>> >> +    hvf_state = s;
>>>> >> +    memory_listener_register(&hvf_memory_listener, &address_space_memory);
>>>> >> +    cpus_register_accel(&hvf_cpus);
>>>> >> +    return 0;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
>>>> >> +{
>>>> >> +    AccelClass *ac = ACCEL_CLASS(oc);
>>>> >> +    ac->name = "HVF";
>>>> >> +    ac->init_machine = hvf_accel_init;
>>>> >> +    ac->allowed = &hvf_allowed;
>>>> >> +}
>>>> >> +
>>>> >> +static const TypeInfo hvf_accel_type = {
>>>> >> +    .name = TYPE_HVF_ACCEL,
>>>> >> +    .parent = TYPE_ACCEL,
>>>> >> +    .class_init = hvf_accel_class_init,
>>>> >> +};
>>>> >> +
>>>> >> +static void hvf_type_init(void)
>>>> >> +{
>>>> >> +    type_register_static(&hvf_accel_type);
>>>> >> +}
>>>> >> +
>>>> >> +type_init(hvf_type_init);
>>>> >> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
>>>> >> new file mode 100644
>>>> >> index 0000000000..dfd6b68dc7
>>>> >> --- /dev/null
>>>> >> +++ b/accel/hvf/meson.build
>>>> >> @@ -0,0 +1,7 @@
>>>> >> +hvf_ss = ss.source_set()
>>>> >> +hvf_ss.add(files(
>>>> >> +  'hvf-all.c',
>>>> >> +  'hvf-cpus.c',
>>>> >> +))
>>>> >> +
>>>> >> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
>>>> >> diff --git a/accel/meson.build b/accel/meson.build
>>>> >> index b26cca227a..6de12ce5d5 100644
>>>> >> --- a/accel/meson.build
>>>> >> +++ b/accel/meson.build
>>>> >> @@ -1,5 +1,6 @@
>>>> >>   softmmu_ss.add(files('accel.c'))
>>>> >>
>>>> >> +subdir('hvf')
>>>> >>   subdir('qtest')
>>>> >>   subdir('kvm')
>>>> >>   subdir('tcg')
>>>> >> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>>>> >> new file mode 100644
>>>> >> index 0000000000..de9bad23a8
>>>> >> --- /dev/null
>>>> >> +++ b/include/sysemu/hvf_int.h
>>>> >> @@ -0,0 +1,69 @@
>>>> >> +/*
>>>> >> + * QEMU Hypervisor.framework (HVF) support
>>>> >> + *
>>>> >> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>>> >> + * See the COPYING file in the top-level directory.
>>>> >> + *
>>>> >> + */
>>>> >> +
>>>> >> +/* header to be included in HVF-specific code */
>>>> >> +
>>>> >> +#ifndef HVF_INT_H
>>>> >> +#define HVF_INT_H
>>>> >> +
>>>> >> +#include <Hypervisor/Hypervisor.h>
>>>> >> +
>>>> >> +#define HVF_MAX_VCPU 0x10
>>>> >> +
>>>> >> +extern struct hvf_state hvf_global;
>>>> >> +
>>>> >> +struct hvf_vm {
>>>> >> +    int id;
>>>> >> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>>>> >> +};
>>>> >> +
>>>> >> +struct hvf_state {
>>>> >> +    uint32_t version;
>>>> >> +    struct hvf_vm *vm;
>>>> >> +    uint64_t mem_quota;
>>>> >> +};
>>>> >> +
>>>> >> +/* hvf_slot flags */
>>>> >> +#define HVF_SLOT_LOG (1 << 0)
>>>> >> +
>>>> >> +typedef struct hvf_slot {
>>>> >> +    uint64_t start;
>>>> >> +    uint64_t size;
>>>> >> +    uint8_t *mem;
>>>> >> +    int slot_id;
>>>> >> +    uint32_t flags;
>>>> >> +    MemoryRegion *region;
>>>> >> +} hvf_slot;
>>>> >> +
>>>> >> +typedef struct hvf_vcpu_caps {
>>>> >> +    uint64_t vmx_cap_pinbased;
>>>> >> +    uint64_t vmx_cap_procbased;
>>>> >> +    uint64_t vmx_cap_procbased2;
>>>> >> +    uint64_t vmx_cap_entry;
>>>> >> +    uint64_t vmx_cap_exit;
>>>> >> +    uint64_t vmx_cap_preemption_timer;
>>>> >> +} hvf_vcpu_caps;
>>>> >> +
>>>> >> +struct HVFState {
>>>> >> +    AccelState parent;
>>>> >> +    hvf_slot slots[32];
>>>> >> +    int num_slots;
>>>> >> +
>>>> >> +    hvf_vcpu_caps *hvf_caps;
>>>> >> +};
>>>> >> +extern HVFState *hvf_state;
>>>> >> +
>>>> >> +void assert_hvf_ok(hv_return_t ret);
>>>> >> +int hvf_get_registers(CPUState *cpu);
>>>> >> +int hvf_put_registers(CPUState *cpu);
>>>> >> +int hvf_arch_init_vcpu(CPUState *cpu);
>>>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu);
>>>> >> +int hvf_vcpu_exec(CPUState *cpu);
>>>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>>>> >> +
>>>> >> +#endif
>>>> >> diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
>>>> >> deleted file mode 100644
>>>> >> index 817b3d7452..0000000000
>>>> >> --- a/target/i386/hvf/hvf-cpus.c
>>>> >> +++ /dev/null
>>>> >> @@ -1,131 +0,0 @@
>>>> >> -/*
>>>> >> - * Copyright 2008 IBM Corporation
>>>> >> - *           2008 Red Hat, Inc.
>>>> >> - * Copyright 2011 Intel Corporation
>>>> >> - * Copyright 2016 Veertu, Inc.
>>>> >> - * Copyright 2017 The Android Open Source Project
>>>> >> - *
>>>> >> - * QEMU Hypervisor.framework support
>>>> >> - *
>>>> >> - * This program is free software; you can redistribute it and/or
>>>> >> - * modify it under the terms of version 2 of the GNU General Public
>>>> >> - * License as published by the Free Software Foundation.
>>>> >> - *
>>>> >> - * This program is distributed in the hope that it will be useful,
>>>> >> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> >> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>>> >> - * General Public License for more details.
>>>> >> - *
>>>> >> - * You should have received a copy of the GNU General Public License
>>>> >> - * along with this program; if not, see <http://www.gnu.org/licenses/>.
>>>> >> - *
>>>> >> - * This file contain code under public domain from the hvdos project:
>>>> >> - * https://github.com/mist64/hvdos
>>>> >> - *
>>>> >> - * Parts Copyright (c) 2011 NetApp, Inc.
>>>> >> - * All rights reserved.
>>>> >> - *
>>>> >> - * Redistribution and use in source and binary forms, with or without
>>>> >> - * modification, are permitted provided that the following conditions
>>>> >> - * are met:
>>>> >> - * 1. Redistributions of source code must retain the above copyright
>>>> >> - *    notice, this list of conditions and the following disclaimer.
>>>> >> - * 2. Redistributions in binary form must reproduce the above copyright
>>>> >> - *    notice, this list of conditions and the following disclaimer in the
>>>> >> - *    documentation and/or other materials provided with the distribution.
>>>> >> - *
>>>> >> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>>>> >> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>>>> >> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>>>> >> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
>>>> >> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>>>> >> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
>>>> >> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>>>> >> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
>>>> >> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
>>>> >> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>>>> >> - * SUCH DAMAGE.
>>>> >> - */
>>>> >> -
>>>> >> -#include "qemu/osdep.h"
>>>> >> -#include "qemu/error-report.h"
>>>> >> -#include "qemu/main-loop.h"
>>>> >> -#include "sysemu/hvf.h"
>>>> >> -#include "sysemu/runstate.h"
>>>> >> -#include "target/i386/cpu.h"
>>>> >> -#include "qemu/guest-random.h"
>>>> >> -
>>>> >> -#include "hvf-cpus.h"
>>>> >> -
>>>> >> -/*
>>>> >> - * The HVF-specific vCPU thread function. This one should only run when the host
>>>> >> - * CPU supports the VMX "unrestricted guest" feature.
>>>> >> - */
>>>> >> -static void *hvf_cpu_thread_fn(void *arg)
>>>> >> -{
>>>> >> -    CPUState *cpu = arg;
>>>> >> -
>>>> >> -    int r;
>>>> >> -
>>>> >> -    assert(hvf_enabled());
>>>> >> -
>>>> >> -    rcu_register_thread();
>>>> >> -
>>>> >> -    qemu_mutex_lock_iothread();
>>>> >> -    qemu_thread_get_self(cpu->thread);
>>>> >> -
>>>> >> -    cpu->thread_id = qemu_get_thread_id();
>>>> >> -    cpu->can_do_io = 1;
>>>> >> -    current_cpu = cpu;
>>>> >> -
>>>> >> -    hvf_init_vcpu(cpu);
>>>> >> -
>>>> >> -    /* signal CPU creation */
>>>> >> -    cpu_thread_signal_created(cpu);
>>>> >> -    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>>>> >> -
>>>> >> -    do {
>>>> >> -        if (cpu_can_run(cpu)) {
>>>> >> -            r = hvf_vcpu_exec(cpu);
>>>> >> -            if (r == EXCP_DEBUG) {
>>>> >> -                cpu_handle_guest_debug(cpu);
>>>> >> -            }
>>>> >> -        }
>>>> >> -        qemu_wait_io_event(cpu);
>>>> >> -    } while (!cpu->unplug || cpu_can_run(cpu));
>>>> >> -
>>>> >> -    hvf_vcpu_destroy(cpu);
>>>> >> -    cpu_thread_signal_destroyed(cpu);
>>>> >> -    qemu_mutex_unlock_iothread();
>>>> >> -    rcu_unregister_thread();
>>>> >> -    return NULL;
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_start_vcpu_thread(CPUState *cpu)
>>>> >> -{
>>>> >> -    char thread_name[VCPU_THREAD_NAME_SIZE];
>>>> >> -
>>>> >> -    /*
>>>> >> -     * HVF currently does not support TCG, and only runs in
>>>> >> -     * unrestricted-guest mode.
>>>> >> -     */
>>>> >> -    assert(hvf_enabled());
>>>> >> -
>>>> >> -    cpu->thread = g_malloc0(sizeof(QemuThread));
>>>> >> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>>>> >> -    qemu_cond_init(cpu->halt_cond);
>>>> >> -
>>>> >> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>>>> >> -             cpu->cpu_index);
>>>> >> -    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>>>> >> -                       cpu, QEMU_THREAD_JOINABLE);
>>>> >> -}
>>>> >> -
>>>> >> -const CpusAccel hvf_cpus = {
>>>> >> -    .create_vcpu_thread = hvf_start_vcpu_thread,
>>>> >> -
>>>> >> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>>>> >> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>>>> >> -    .synchronize_state = hvf_cpu_synchronize_state,
>>>> >> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>>>> >> -};
>>>> >> diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
>>>> >> deleted file mode 100644
>>>> >> index ced31b82c0..0000000000
>>>> >> --- a/target/i386/hvf/hvf-cpus.h
>>>> >> +++ /dev/null
>>>> >> @@ -1,25 +0,0 @@
>>>> >> -/*
>>>> >> - * Accelerator CPUS Interface
>>>> >> - *
>>>> >> - * Copyright 2020 SUSE LLC
>>>> >> - *
>>>> >> - * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>>> >> - * See the COPYING file in the top-level directory.
>>>> >> - */
>>>> >> -
>>>> >> -#ifndef HVF_CPUS_H
>>>> >> -#define HVF_CPUS_H
>>>> >> -
>>>> >> -#include "sysemu/cpus.h"
>>>> >> -
>>>> >> -extern const CpusAccel hvf_cpus;
>>>> >> -
>>>> >> -int hvf_init_vcpu(CPUState *);
>>>> >> -int hvf_vcpu_exec(CPUState *);
>>>> >> -void hvf_cpu_synchronize_state(CPUState *);
>>>> >> -void hvf_cpu_synchronize_post_reset(CPUState *);
>>>> >> -void hvf_cpu_synchronize_post_init(CPUState *);
>>>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
>>>> >> -void hvf_vcpu_destroy(CPUState *);
>>>> >> -
>>>> >> -#endif /* HVF_CPUS_H */
>>>> >> diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
>>>> >> index e0edffd077..6d56f8f6bb 100644
>>>> >> --- a/target/i386/hvf/hvf-i386.h
>>>> >> +++ b/target/i386/hvf/hvf-i386.h
>>>> >> @@ -18,57 +18,11 @@
>>>> >>
>>>> >>   #include "sysemu/accel.h"
>>>> >>   #include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >>   #include "cpu.h"
>>>> >>   #include "x86.h"
>>>> >>
>>>> >> -#define HVF_MAX_VCPU 0x10
>>>> >> -
>>>> >> -extern struct hvf_state hvf_global;
>>>> >> -
>>>> >> -struct hvf_vm {
>>>> >> -    int id;
>>>> >> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>>>> >> -};
>>>> >> -
>>>> >> -struct hvf_state {
>>>> >> -    uint32_t version;
>>>> >> -    struct hvf_vm *vm;
>>>> >> -    uint64_t mem_quota;
>>>> >> -};
>>>> >> -
>>>> >> -/* hvf_slot flags */
>>>> >> -#define HVF_SLOT_LOG (1 << 0)
>>>> >> -
>>>> >> -typedef struct hvf_slot {
>>>> >> -    uint64_t start;
>>>> >> -    uint64_t size;
>>>> >> -    uint8_t *mem;
>>>> >> -    int slot_id;
>>>> >> -    uint32_t flags;
>>>> >> -    MemoryRegion *region;
>>>> >> -} hvf_slot;
>>>> >> -
>>>> >> -typedef struct hvf_vcpu_caps {
>>>> >> -    uint64_t vmx_cap_pinbased;
>>>> >> -    uint64_t vmx_cap_procbased;
>>>> >> -    uint64_t vmx_cap_procbased2;
>>>> >> -    uint64_t vmx_cap_entry;
>>>> >> -    uint64_t vmx_cap_exit;
>>>> >> -    uint64_t vmx_cap_preemption_timer;
>>>> >> -} hvf_vcpu_caps;
>>>> >> -
>>>> >> -struct HVFState {
>>>> >> -    AccelState parent;
>>>> >> -    hvf_slot slots[32];
>>>> >> -    int num_slots;
>>>> >> -
>>>> >> -    hvf_vcpu_caps *hvf_caps;
>>>> >> -};
>>>> >> -extern HVFState *hvf_state;
>>>> >> -
>>>> >> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
>>>> >>   void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
>>>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>>>> >>
>>>> >>   #ifdef NEED_CPU_H
>>>> >>   /* Functions exported to host specific mode */
>>>> >> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
>>>> >> index ed9356565c..8b96ecd619 100644
>>>> >> --- a/target/i386/hvf/hvf.c
>>>> >> +++ b/target/i386/hvf/hvf.c
>>>> >> @@ -51,6 +51,7 @@
>>>> >>   #include "qemu/error-report.h"
>>>> >>
>>>> >>   #include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >>   #include "sysemu/runstate.h"
>>>> >>   #include "hvf-i386.h"
>>>> >>   #include "vmcs.h"
>>>> >> @@ -72,171 +73,6 @@
>>>> >>   #include "sysemu/accel.h"
>>>> >>   #include "target/i386/cpu.h"
>>>> >>
>>>> >> -#include "hvf-cpus.h"
>>>> >> -
>>>> >> -HVFState *hvf_state;
>>>> >> -
>>>> >> -static void assert_hvf_ok(hv_return_t ret)
>>>> >> -{
>>>> >> -    if (ret == HV_SUCCESS) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    switch (ret) {
>>>> >> -    case HV_ERROR:
>>>> >> -        error_report("Error: HV_ERROR");
>>>> >> -        break;
>>>> >> -    case HV_BUSY:
>>>> >> -        error_report("Error: HV_BUSY");
>>>> >> -        break;
>>>> >> -    case HV_BAD_ARGUMENT:
>>>> >> -        error_report("Error: HV_BAD_ARGUMENT");
>>>> >> -        break;
>>>> >> -    case HV_NO_RESOURCES:
>>>> >> -        error_report("Error: HV_NO_RESOURCES");
>>>> >> -        break;
>>>> >> -    case HV_NO_DEVICE:
>>>> >> -        error_report("Error: HV_NO_DEVICE");
>>>> >> -        break;
>>>> >> -    case HV_UNSUPPORTED:
>>>> >> -        error_report("Error: HV_UNSUPPORTED");
>>>> >> -        break;
>>>> >> -    default:
>>>> >> -        error_report("Unknown Error");
>>>> >> -    }
>>>> >> -
>>>> >> -    abort();
>>>> >> -}
>>>> >> -
>>>> >> -/* Memory slots */
>>>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>>>> >> -{
>>>> >> -    hvf_slot *slot;
>>>> >> -    int x;
>>>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> -        slot = &hvf_state->slots[x];
>>>> >> -        if (slot->size && start < (slot->start + slot->size) &&
>>>> >> -            (start + size) > slot->start) {
>>>> >> -            return slot;
>>>> >> -        }
>>>> >> -    }
>>>> >> -    return NULL;
>>>> >> -}
>>>> >> -
>>>> >> -struct mac_slot {
>>>> >> -    int present;
>>>> >> -    uint64_t size;
>>>> >> -    uint64_t gpa_start;
>>>> >> -    uint64_t gva;
>>>> >> -};
>>>> >> -
>>>> >> -struct mac_slot mac_slots[32];
>>>> >> -
>>>> >> -static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>>>> >> -{
>>>> >> -    struct mac_slot *macslot;
>>>> >> -    hv_return_t ret;
>>>> >> -
>>>> >> -    macslot = &mac_slots[slot->slot_id];
>>>> >> -
>>>> >> -    if (macslot->present) {
>>>> >> -        if (macslot->size != slot->size) {
>>>> >> -            macslot->present = 0;
>>>> >> -            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>>>> >> -            assert_hvf_ok(ret);
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    if (!slot->size) {
>>>> >> -        return 0;
>>>> >> -    }
>>>> >> -
>>>> >> -    macslot->present = 1;
>>>> >> -    macslot->gpa_start = slot->start;
>>>> >> -    macslot->size = slot->size;
>>>> >> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size, flags);
>>>> >> -    assert_hvf_ok(ret);
>>>> >> -    return 0;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>>>> >> -{
>>>> >> -    hvf_slot *mem;
>>>> >> -    MemoryRegion *area = section->mr;
>>>> >> -    bool writeable = !area->readonly && !area->rom_device;
>>>> >> -    hv_memory_flags_t flags;
>>>> >> -
>>>> >> -    if (!memory_region_is_ram(area)) {
>>>> >> -        if (writeable) {
>>>> >> -            return;
>>>> >> -        } else if (!memory_region_is_romd(area)) {
>>>> >> -            /*
>>>> >> -             * If the memory device is not in romd_mode, then we actually want
>>>> >> -             * to remove the hvf memory slot so all accesses will trap.
>>>> >> -             */
>>>> >> -             add = false;
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    mem = hvf_find_overlap_slot(
>>>> >> -            section->offset_within_address_space,
>>>> >> -            int128_get64(section->size));
>>>> >> -
>>>> >> -    if (mem && add) {
>>>> >> -        if (mem->size == int128_get64(section->size) &&
>>>> >> -            mem->start == section->offset_within_address_space &&
>>>> >> -            mem->mem == (memory_region_get_ram_ptr(area) +
>>>> >> -            section->offset_within_region)) {
>>>> >> -            return; /* Same region was attempted to register, go away. */
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    /* Region needs to be reset. set the size to 0 and remap it. */
>>>> >> -    if (mem) {
>>>> >> -        mem->size = 0;
>>>> >> -        if (do_hvf_set_memory(mem, 0)) {
>>>> >> -            error_report("Failed to reset overlapping slot");
>>>> >> -            abort();
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    if (!add) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    if (area->readonly ||
>>>> >> -        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
>>>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>>>> >> -    } else {
>>>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>>>> >> -    }
>>>> >> -
>>>> >> -    /* Now make a new slot. */
>>>> >> -    int x;
>>>> >> -
>>>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> -        mem = &hvf_state->slots[x];
>>>> >> -        if (!mem->size) {
>>>> >> -            break;
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    if (x == hvf_state->num_slots) {
>>>> >> -        error_report("No free slots");
>>>> >> -        abort();
>>>> >> -    }
>>>> >> -
>>>> >> -    mem->size = int128_get64(section->size);
>>>> >> -    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
>>>> >> -    mem->start = section->offset_within_address_space;
>>>> >> -    mem->region = area;
>>>> >> -
>>>> >> -    if (do_hvf_set_memory(mem, flags)) {
>>>> >> -        error_report("Error registering new memory slot");
>>>> >> -        abort();
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >>   void vmx_update_tpr(CPUState *cpu)
>>>> >>   {
>>>> >>       /* TODO: need integrate APIC handling */
>>>> >> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t port, void *buffer,
>>>> >>       }
>>>> >>   }
>>>> >>
>>>> >> -static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    if (!cpu->vcpu_dirty) {
>>>> >> -        hvf_get_registers(cpu);
>>>> >> -        cpu->vcpu_dirty = true;
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_state(CPUState *cpu)
>>>> >> -{
>>>> >> -    if (!cpu->vcpu_dirty) {
>>>> >> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>>>> >> -                                              run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    hvf_put_registers(cpu);
>>>> >> -    cpu->vcpu_dirty = false;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>>>> >> -{
>>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
>>>> >> -}
>>>> >> -
>>>> >> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>>>> >> -                                             run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    hvf_put_registers(cpu);
>>>> >> -    cpu->vcpu_dirty = false;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
>>>> >> -{
>>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
>>>> >> -}
>>>> >> -
>>>> >> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>>>> >> -                                              run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    cpu->vcpu_dirty = true;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>>>> >> -{
>>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
>>>> >> -}
>>>> >> -
>>>> >>   static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>>>> >>   {
>>>> >>       int read, write;
>>>> >> @@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>>>> >>       return false;
>>>> >>   }
>>>> >>
>>>> >> -static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
>>>> >> -{
>>>> >> -    hvf_slot *slot;
>>>> >> -
>>>> >> -    slot = hvf_find_overlap_slot(
>>>> >> -            section->offset_within_address_space,
>>>> >> -            int128_get64(section->size));
>>>> >> -
>>>> >> -    /* protect region against writes; begin tracking it */
>>>> >> -    if (on) {
>>>> >> -        slot->flags |= HVF_SLOT_LOG;
>>>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>>>> >> -                      HV_MEMORY_READ);
>>>> >> -    /* stop tracking region*/
>>>> >> -    } else {
>>>> >> -        slot->flags &= ~HVF_SLOT_LOG;
>>>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>>>> >> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_log_start(MemoryListener *listener,
>>>> >> -                          MemoryRegionSection *section, int old, int new)
>>>> >> -{
>>>> >> -    if (old != 0) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    hvf_set_dirty_tracking(section, 1);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_log_stop(MemoryListener *listener,
>>>> >> -                         MemoryRegionSection *section, int old, int new)
>>>> >> -{
>>>> >> -    if (new != 0) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    hvf_set_dirty_tracking(section, 0);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_log_sync(MemoryListener *listener,
>>>> >> -                         MemoryRegionSection *section)
>>>> >> -{
>>>> >> -    /*
>>>> >> -     * sync of dirty pages is handled elsewhere; just make sure we keep
>>>> >> -     * tracking the region.
>>>> >> -     */
>>>> >> -    hvf_set_dirty_tracking(section, 1);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_region_add(MemoryListener *listener,
>>>> >> -                           MemoryRegionSection *section)
>>>> >> -{
>>>> >> -    hvf_set_phys_mem(section, true);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_region_del(MemoryListener *listener,
>>>> >> -                           MemoryRegionSection *section)
>>>> >> -{
>>>> >> -    hvf_set_phys_mem(section, false);
>>>> >> -}
>>>> >> -
>>>> >> -static MemoryListener hvf_memory_listener = {
>>>> >> -    .priority = 10,
>>>> >> -    .region_add = hvf_region_add,
>>>> >> -    .region_del = hvf_region_del,
>>>> >> -    .log_start = hvf_log_start,
>>>> >> -    .log_stop = hvf_log_stop,
>>>> >> -    .log_sync = hvf_log_sync,
>>>> >> -};
>>>> >> -
>>>> >> -void hvf_vcpu_destroy(CPUState *cpu)
>>>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu)
>>>> >>   {
>>>> >>       X86CPU *x86_cpu = X86_CPU(cpu);
>>>> >>       CPUX86State *env = &x86_cpu->env;
>>>> >>
>>>> >> -    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
>>>> >>       g_free(env->hvf_mmio_buf);
>>>> >> -    assert_hvf_ok(ret);
>>>> >> -}
>>>> >> -
>>>> >> -static void dummy_signal(int sig)
>>>> >> -{
>>>> >>   }
>>>> >>
>>>> >> -int hvf_init_vcpu(CPUState *cpu)
>>>> >> +int hvf_arch_init_vcpu(CPUState *cpu)
>>>> >>   {
>>>> >>
>>>> >>       X86CPU *x86cpu = X86_CPU(cpu);
>>>> >>       CPUX86State *env = &x86cpu->env;
>>>> >> -    int r;
>>>> >> -
>>>> >> -    /* init cpu signals */
>>>> >> -    sigset_t set;
>>>> >> -    struct sigaction sigact;
>>>> >> -
>>>> >> -    memset(&sigact, 0, sizeof(sigact));
>>>> >> -    sigact.sa_handler = dummy_signal;
>>>> >> -    sigaction(SIG_IPI, &sigact, NULL);
>>>> >> -
>>>> >> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>>> >> -    sigdelset(&set, SIG_IPI);
>>>> >>
>>>> >>       init_emu();
>>>> >>       init_decoder();
>>>> >> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
>>>> >>       hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
>>>> >>       env->hvf_mmio_buf = g_new(char, 4096);
>>>> >>
>>>> >> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>>>> >> -    cpu->vcpu_dirty = 1;
>>>> >> -    assert_hvf_ok(r);
>>>> >> -
>>>> >>       if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
>>>> >>           &hvf_state->hvf_caps->vmx_cap_pinbased)) {
>>>> >>           abort();
>>>> >> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>> >>
>>>> >>       return ret;
>>>> >>   }
>>>> >> -
>>>> >> -bool hvf_allowed;
>>>> >> -
>>>> >> -static int hvf_accel_init(MachineState *ms)
>>>> >> -{
>>>> >> -    int x;
>>>> >> -    hv_return_t ret;
>>>> >> -    HVFState *s;
>>>> >> -
>>>> >> -    ret = hv_vm_create(HV_VM_DEFAULT);
>>>> >> -    assert_hvf_ok(ret);
>>>> >> -
>>>> >> -    s = g_new0(HVFState, 1);
>>>> >> -
>>>> >> -    s->num_slots = 32;
>>>> >> -    for (x = 0; x < s->num_slots; ++x) {
>>>> >> -        s->slots[x].size = 0;
>>>> >> -        s->slots[x].slot_id = x;
>>>> >> -    }
>>>> >> -
>>>> >> -    hvf_state = s;
>>>> >> -    memory_listener_register(&hvf_memory_listener, &address_space_memory);
>>>> >> -    cpus_register_accel(&hvf_cpus);
>>>> >> -    return 0;
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
>>>> >> -{
>>>> >> -    AccelClass *ac = ACCEL_CLASS(oc);
>>>> >> -    ac->name = "HVF";
>>>> >> -    ac->init_machine = hvf_accel_init;
>>>> >> -    ac->allowed = &hvf_allowed;
>>>> >> -}
>>>> >> -
>>>> >> -static const TypeInfo hvf_accel_type = {
>>>> >> -    .name = TYPE_HVF_ACCEL,
>>>> >> -    .parent = TYPE_ACCEL,
>>>> >> -    .class_init = hvf_accel_class_init,
>>>> >> -};
>>>> >> -
>>>> >> -static void hvf_type_init(void)
>>>> >> -{
>>>> >> -    type_register_static(&hvf_accel_type);
>>>> >> -}
>>>> >> -
>>>> >> -type_init(hvf_type_init);
>>>> >> diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
>>>> >> index 409c9a3f14..c8a43717ee 100644
>>>> >> --- a/target/i386/hvf/meson.build
>>>> >> +++ b/target/i386/hvf/meson.build
>>>> >> @@ -1,6 +1,5 @@
>>>> >>   i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
>>>> >>     'hvf.c',
>>>> >> -  'hvf-cpus.c',
>>>> >>     'x86.c',
>>>> >>     'x86_cpuid.c',
>>>> >>     'x86_decode.c',
>>>> >> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
>>>> >> index bbec412b6c..89b8e9d87a 100644
>>>> >> --- a/target/i386/hvf/x86hvf.c
>>>> >> +++ b/target/i386/hvf/x86hvf.c
>>>> >> @@ -20,6 +20,9 @@
>>>> >>   #include "qemu/osdep.h"
>>>> >>
>>>> >>   #include "qemu-common.h"
>>>> >> +#include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >> +#include "sysemu/hw_accel.h"
>>>> >>   #include "x86hvf.h"
>>>> >>   #include "vmx.h"
>>>> >>   #include "vmcs.h"
>>>> >> @@ -32,8 +35,6 @@
>>>> >>   #include <Hypervisor/hv.h>
>>>> >>   #include <Hypervisor/hv_vmx.h>
>>>> >>
>>>> >> -#include "hvf-cpus.h"
>>>> >> -
>>>> >>   void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
>>>> >>                        SegmentCache *qseg, bool is_tr)
>>>> >>   {
>>>> >> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
>>>> >>       env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
>>>> >>
>>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
>>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>>> >> +        cpu_synchronize_state(cpu_state);
>>>> >>           do_cpu_init(cpu);
>>>> >>       }
>>>> >>
>>>> >> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
>>>> >>           cpu_state->halted = 0;
>>>> >>       }
>>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
>>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>>> >> +        cpu_synchronize_state(cpu_state);
>>>> >>           do_cpu_sipi(cpu);
>>>> >>       }
>>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
>>>> >>           cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
>>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>>> >> +        cpu_synchronize_state(cpu_state);
>>>> > The changes from hvf_cpu_*() to cpu_*() are cleanup and perhaps should
>>>> > be a separate patch. It follows cpu/accel cleanups Claudio was doing the
>>>> > summer.
>>>>
>>>>
>>>> The only reason they're in here is because we no longer have access to
>>>> the hvf_ functions from the file. I am perfectly happy to rebase the
>>>> patch on top of Claudio's if his goes in first. I'm sure it'll be
>>>> trivial for him to rebase on top of this too if my series goes in first.
>>>>
>>>>
>>>> >
>>>> > Phillipe raised the idea that the patch might go ahead of ARM-specific
>>>> > part (which might involve some discussions) and I agree with that.
>>>> >
>>>> > Some sync between Claudio series (CC'd him) and the patch might be need.
>>>>
>>>>
>>>> I would prefer not to hold back because of the sync. Claudio's cleanup
>>>> is trivial enough to adjust for if it gets merged ahead of this.
>>>>
>>>>
>>>> Alex
>>>>
>>>>
>>>>