[PATCH RFC V6 21/24] hw/intc/arm-gicv3-kvm: Pause all vCPUs & cache ICC_CTLR_EL1 for userspace PSCI CPU_ON

salil.mehta@opnsrc.net posted 24 patches 1 month, 2 weeks ago
From: Salil Mehta <salil.mehta@huawei.com>

Problem:
========

When PSCI CPU_ON was handled entirely in KVM, the operation executed under
VGIC/KVM locks at EL2 and appeared atomic to other vCPU threads (intermediate
states were not observable). With the SMCCC forward-to-userspace filter enabled,
PSCI ON/OFF calls now exit to QEMU, where policy checks are performed.

In the userspace CPU_ON handling (during cpu_reset?), QEMU must perform IOCTLs
to fetch ICC_CTLR_EL1 fields that reflect supported features and IRQ-related
configuration (e.g. EOImode, PMHE, CBPR). While these IOCTLs are in flight,
other vCPUs can run and cause transient inconsistency. KVM enforces atomicity by
trying to take all vCPU locks (kvm_trylock_all_vcpus(), which fails with -EBUSY
on contention). QEMU therefore pauses all vCPUs before issuing these IOCTLs,
avoiding lock contention and -EBUSY failures during cpu_reset.

KVM Details: (As I understand and stand ready to be corrected! :))

A userspace fetch of sysreg ICC_CTLR_EL1 results in an access to the ICH_VMCR_EL2 register.
VMCR is per-vCPU and controls the CPU interface. Pending state is recorded in
the distributor for SPIs, and in each redistributor for SGIs and PPIs. Delivery
to the PE depends on the CPU interface configuration (VMCR fields such as PMR,
IGRPEN, EOImode, BPR). Updates to VMCR must therefore be applied atomically with
respect to interrupt injection and deactivation. The KVM ioctl layer first
attempts to lock all vCPU mutexes, and only then takes the VM lock before
calling vgic_v3_attr_regs_access(). This ordering serializes userspace accesses
with IRQ handling (IAR/EOI and SGI delivery?).
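The try-all-or-back-off locking described above can be sketched in plain pthreads. This is an illustration of the pattern only: the real kvm_trylock_all_vcpus() lives in the kernel and operates on vCPU mutexes, and all names below are made up for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/*
 * Sketch of the "lock all vCPU mutexes or fail" idiom: try to take every
 * mutex in order; on the first contended one, drop everything already
 * held and report -EBUSY so the caller can back off and retry.
 */
#define NR_VCPUS 4

static pthread_mutex_t vcpu_mutex[NR_VCPUS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};

static int trylock_all_vcpus(void)
{
    for (int i = 0; i < NR_VCPUS; i++) {
        if (pthread_mutex_trylock(&vcpu_mutex[i]) != 0) {
            /* Contended: release everything taken so far. */
            while (i-- > 0) {
                pthread_mutex_unlock(&vcpu_mutex[i]);
            }
            return -EBUSY;
        }
    }
    return 0;
}

static void unlock_all_vcpus(void)
{
    for (int i = 0; i < NR_VCPUS; i++) {
        pthread_mutex_unlock(&vcpu_mutex[i]);
    }
}
```

Only after all vCPU mutexes are held does the kernel take the VM lock and call vgic_v3_attr_regs_access(), which is why a running vCPU makes the userspace access fail rather than block.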

ICC_CTLR_EL1 initially reflects architectural defaults (e.g. EOImode, PMR).
Most fields are read-only feature indicators that never change. Writable
fields such as EOImode, PMHE and CBPR are configured once by the guest GICv3
driver and then remain pseudo-static. Both the initial defaults and the
guest-configured values can be cached and reused across resets, avoiding
repeated VM-wide pauses to fetch ICC_CTLR_EL1 from KVM on every cpu_reset().
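The cache-on-first-fetch idea above can be sketched as follows. All names are illustrative, not the actual QEMU helpers, and fake_fetch() merely stands in for the heavy KVM_GET_DEVICE_ATTR path that needs all vCPUs paused.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Shadow of a pseudo-static register: one heavy fetch, then reuse. */
typedef struct {
    uint64_t val;
    bool     valid;
} IccCtlrShadow;

static uint64_t cached_icc_ctlr(IccCtlrShadow *sh, uint64_t (*fetch)(void))
{
    if (!sh->valid) {
        /* Slow path: taken once; the result is remembered. */
        sh->val = fetch();
        sh->valid = true;
    }
    return sh->val; /* Fast path on every later cpu_reset(). */
}

/* Example backend standing in for the KVM device access (test scaffold). */
static int fetch_calls;
static uint64_t fake_fetch(void)
{
    fetch_calls++;
    return 0x41;
}
```

Because the cached value is only trusted once the shadow is marked valid, a failed or skipped fetch simply falls back to the heavy path on the next reset.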

Appendix: ICC_CTLR_EL1 layout (for reviewers)
=============================================

ICC_CTLR_EL1 [63:0]

  63                                                                        32
 +----------------------------------------------------------------------------+
 |                                   RES0                                     |
 +----------------------------------------------------------------------------+
  31        20 19 18 17 16 15 14 13 12 11 10   9   8  7  6   5  4  3  2  1  0
 +------------+--+--+--+--+--+--+--+--+--+---+---+---+--+--+--+--+--+--+--+--+
 |    RES0    |Ex|RS|RES0 |A3|SE| IDbits |  PRIbits  |R0|PM|  RES0     |EO|CB|
 +------------+--+--+--+--+--+--+--+--+--+---+---+---+--+--+--+--+--+--+--+--+
              |  |        |  |                       |   |             |  |
              |  |        |  |                       |   |             |  +CBPR
              |  |        |  |                       |   |             +EOImode
              |  |        |  |                       |   +-PMHE
              |  |        |  |                       +----RES0
              |  |        |  +--SEIS
              |  |        +-----A3V
              |  +--------------RSS
              +-----------------ExtRange

 Access: {Ex, RS, A3, SE, IDbits, PRIbits} = RO;
         {PMHE} = RW*;
         {EO, CB} = RW**;
         others = RES0.
 Notes : * impl-def (may be RO when DS=0)
         ** CB may be RO when DS=0 (EO stays RW)

 Source: Arm GIC Architecture Specification (IHI 0069H.b),
         §12.2.6 “ICC_CTLR_EL1”, pp. 12-233…12-237
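For reference, the fields above can be picked out with simple masks. Bit positions are taken from the layout shown; the macro names here are illustrative, not the ones in QEMU's gicv3_internal.h.

```c
#include <assert.h>
#include <stdint.h>

/* ICC_CTLR_EL1 field extraction, per the layout above. */
#define ICC_CTLR_EL1_CBPR        (1ULL << 0)          /* RW** */
#define ICC_CTLR_EL1_EOIMODE     (1ULL << 1)          /* RW   */
#define ICC_CTLR_EL1_PMHE        (1ULL << 6)          /* RW*  */
#define ICC_CTLR_EL1_PRIBITS(v)  (((v) >> 8) & 0x7)   /* RO, [10:8]  */
#define ICC_CTLR_EL1_IDBITS(v)   (((v) >> 11) & 0x7)  /* RO, [13:11] */
#define ICC_CTLR_EL1_SEIS        (1ULL << 14)         /* RO   */
#define ICC_CTLR_EL1_A3V         (1ULL << 15)         /* RO   */
#define ICC_CTLR_EL1_RSS         (1ULL << 18)         /* RO   */
#define ICC_CTLR_EL1_EXTRANGE    (1ULL << 19)         /* RO   */
```

The three low RW bits (CBPR, EOImode, PMHE) are exactly the pseudo-static fields the caching scheme relies on; everything else is a feature indicator fixed for the life of the VM.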

Resets that may trigger ICC_CTLR_EL1 fetch include:
  1. PSCI CPU_ON
  2. qemu_system_reset() during full VM reset
  3. Post-load path on migration
  4. Lazy realization via device_set/-deviceset

It can be expensive to pause the entire VM just to reset one vCPU, especially
for long-lived workloads where hundreds of resets may occur. For such systems,
frequent VM-wide pauses are unacceptable.

Solution:
=========

This patch caches ICC_CTLR_EL1 early, seeding it either from architectural
defaults or on the first PSCI CPU_ON when the guest GICv3 driver has initialized
the interface. The cached value is then reused on every cpu_reset(), avoiding
repeated VM-wide pauses and heavy IOCTLs. The IOCTL path is retained only as a
fallback if the cached shadow is not valid.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 hw/intc/arm_gicv3_kvm.c            | 93 ++++++++++++++++++++++++++++--
 include/hw/intc/arm_gicv3_common.h | 10 ++++
 target/arm/arm-powerctl.c          |  1 +
 target/arm/cpu.h                   |  1 +
 4 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/hw/intc/arm_gicv3_kvm.c b/hw/intc/arm_gicv3_kvm.c
index e97578f59a..62d6016e8a 100644
--- a/hw/intc/arm_gicv3_kvm.c
+++ b/hw/intc/arm_gicv3_kvm.c
@@ -27,6 +27,7 @@
 #include "qemu/module.h"
 #include "system/kvm.h"
 #include "system/runstate.h"
+#include "system/cpus.h"
 #include "kvm_arm.h"
 #include "gicv3_internal.h"
 #include "vgic_common.h"
@@ -681,13 +682,73 @@ static void kvm_arm_gicv3_get(GICv3State *s)
     }
 }
 
+/* Caller must hold the iothread (BQL). */
+static inline void
+kvm_gicc_get_cached_icc_ctlr_el1(GICv3CPUState *c, uint64_t regval[2],
+                                      bool *valid)
+{
+    const uint64_t attr = (uint64_t)KVM_VGIC_ATTR(ICC_CTLR_EL1, c->gicr_typer);
+    const int group = KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS;
+    GICv3State *s = c->gic;
+    uint64_t val = 0;
+    int ret;
+
+    assert(regval && valid);
+
+    if (*valid) {
+        /* Fast path: return cached (no vCPU pausing required). */
+        c->icc_ctlr_el1[GICV3_NS] = regval[GICV3_NS];
+        c->icc_ctlr_el1[GICV3_S] = regval[GICV3_S];
+        return;
+    }
+
+    ret = kvm_device_access(s->dev_fd, group, attr, &val, false, NULL);
+    if (ret == -EBUSY || ret == -EAGAIN) {
+        int tries;
+
+        /* One-time heavy path: avoid contention by pausing all vCPUs. */
+        pause_all_vcpus();
+        /*
+         * Even with vCPUs paused, we cannot fully rule out a non-vCPU context
+         * temporarily holding KVM vCPU mutexes; treat -EBUSY/-EAGAIN as
+         * transient and retry a few times. Final attempt aborts in-loop.
+         */
+        for (tries = 0; tries < 5; tries++) {
+            Error **errp = (tries == 4) ? &error_abort : NULL;
+
+            ret = kvm_device_access(s->dev_fd, group, attr, &val, false, errp);
+            if (!ret) {
+                break;
+            }
+            if (ret != -EBUSY && ret != -EAGAIN) {
+               error_setg_errno(&error_abort, -ret,
+                                "KVM_GET_DEVICE_ATTR failed: Group %d "
+                                "attr 0x%016" PRIx64, group, attr);
+               /* not reached */
+            }
+            g_usleep(50);
+        }
+        resume_all_vcpus();
+    }
+
+    /* Success: publish and seed cache. */
+    c->icc_ctlr_el1[GICV3_NS] = val;
+    c->icc_ctlr_el1[GICV3_S] = val;
+
+    regval[GICV3_NS] = c->icc_ctlr_el1[GICV3_NS];
+    regval[GICV3_S] = c->icc_ctlr_el1[GICV3_S];
+    *valid = true;
+}
+
 static void arm_gicv3_icc_reset(CPUARMState *env, const ARMCPRegInfo *ri)
 {
     GICv3State *s;
     GICv3CPUState *c;
+    ARMCPU *cpu;
 
     c = (GICv3CPUState *)env->gicv3state;
     s = c->gic;
+    cpu = ARM_CPU(c->cpu);
 
     c->icc_pmr_el1 = 0;
     /*
@@ -713,11 +774,33 @@ static void arm_gicv3_icc_reset(CPUARMState *env, const ARMCPRegInfo *ri)
     }
 
     /* Initialize to actual HW supported configuration */
-    kvm_device_access(s->dev_fd, KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS,
-                      KVM_VGIC_ATTR(ICC_CTLR_EL1, c->gicr_typer),
-                      &c->icc_ctlr_el1[GICV3_NS], false, &error_abort);
-
-    c->icc_ctlr_el1[GICV3_S] = c->icc_ctlr_el1[GICV3_NS];
+    /*
+     * Avoid racy VGIC CPU sysreg reads while vCPUs are running. KVM requires
+     * pausing all vCPUs for ICC_* sysreg accesses to prevent races with
+     * in-flight IRQ delivery (e.g. EOImode etc.).
+     *
+     * To keep the reset path fast, cache the architectural default and the
+     * ICC_CTLR_EL1 configured by the guest GICv3 driver on the first access,
+     * then reuse that for subsequent resets. Most fields in this register are
+     * invariant throughout the life of the VM. Fields EOImode, PMHE and CBPR
+     * are pseudo-static and don't change once configured by the guest driver.
+     */
+    if (cpu->first_psci_on_request_seen || s->guest_gicc_initialized) {
+        if (!s->guest_gicc_initialized) {
+            s->guest_gicc_initialized = true;
+        }
+        kvm_gicc_get_cached_icc_ctlr_el1(c, c->icc_ctlr_configured,
+                                         &c->icc_ctlr_configured_valid);
+    } else {
+        /*
+         * The kernel has not loaded yet, so it is safe to assume that no
+         * vCPU other than vCPU 0 is in KVM_RUN at this moment. Just in case
+         * some other privileged KVM context is accessing the register, the
+         * KVM device access can still potentially return -EBUSY.
+         */
+        kvm_gicc_get_cached_icc_ctlr_el1(c, c->icc_ctlr_arch_def,
+                                         &c->icc_ctlr_arch_def_valid);
+    }
 }
 
 static void kvm_arm_gicv3_reset_hold(Object *obj, ResetType type)
diff --git a/include/hw/intc/arm_gicv3_common.h b/include/hw/intc/arm_gicv3_common.h
index a8a84c4687..0282a94edc 100644
--- a/include/hw/intc/arm_gicv3_common.h
+++ b/include/hw/intc/arm_gicv3_common.h
@@ -165,6 +165,15 @@ struct GICv3CPUState {
     uint64_t icc_apr[3][4];
     uint64_t icc_igrpen[3];
     uint64_t icc_ctlr_el3;
+    /*
+     * Shadow copy of ICC_CTLR_EL1 architectural default. Fetched once per-vCPU
+     * when no vCPUs are running, and reused on reset to avoid calling
+     * kvm_device_access() in the hot path.
+     */
+    uint64_t icc_ctlr_arch_def[2]; /* per-secstate (NS=0,S=1) */
+    bool icc_ctlr_arch_def_valid;
+    uint64_t icc_ctlr_configured[2];
+    bool icc_ctlr_configured_valid;
     bool gicc_accessible;
 
     /* Virtualization control interface */
@@ -240,6 +249,7 @@ struct GICv3State {
     bool force_8bit_prio;
     bool irq_reset_nonsecure;
     bool gicd_no_migration_shift_bug;
+    bool guest_gicc_initialized;
 
     int dev_fd; /* kvm device fd if backed by kvm vgic support */
     Error *migration_blocker;
diff --git a/target/arm/arm-powerctl.c b/target/arm/arm-powerctl.c
index 89074918a9..0b65898cec 100644
--- a/target/arm/arm-powerctl.c
+++ b/target/arm/arm-powerctl.c
@@ -68,6 +68,7 @@ static void arm_set_cpu_on_async_work(CPUState *target_cpu_state,
     ARMCPU *target_cpu = ARM_CPU(target_cpu_state);
     struct CpuOnInfo *info = (struct CpuOnInfo *) data.host_ptr;
 
+    target_cpu->first_psci_on_request_seen = true;
     /* Initialize the cpu we are turning on */
     cpu_reset(target_cpu_state);
     arm_emulate_firmware_reset(target_cpu_state, info->target_el);
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index cd5982d362..603e482b3a 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -974,6 +974,7 @@ struct ArchCPU {
 
     /* Current power state, access guarded by BQL */
     ARMPSCIState power_state;
+    bool first_psci_on_request_seen;
 
     /* CPU has virtualization extension */
     bool has_el2;
-- 
2.34.1