KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready

[RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready

Posted by salil.mehta@opnsrc.net 4 months ago

From: Salil Mehta <salil.mehta@huawei.com>

[A rough illustration of the problem and the probable solution]

Userspace reads of ICC_CTLR_EL1 via KVM device attributes currently takes a slow
path that may acquire all vCPU locks. Under workloads that exercise userspace
PSCI CPU_ON flows or frequent vCPU resets, this can cause vCPU lock contention
in KVM and, in the worst cases, -EBUSY returns to userspace.

When PSCI CPU_ON and CPU_OFF calls are handled entirely in KVM, these operations
are executed under KVM vCPU locks in the host kernel (EL1) and appear atomic to
other vCPU threads. In this context, system register accesses are serialized
under KVM vCPU locks, ensuring atomicity with respect to other vCPUs. After
SMCCC filtering was introduced, PSCI CPU_ON and CPU_OFF calls can now exit to
userspace (QEMU). During the handling of PSCI CPU_ON call in userspace, a
cpu_reset() is exerted which reads ICC_CTLR_EL1 through KVM device attribute
IOCTLs. To avoid transient inconsistency and -EBUSY errors, QEMU is forced to
pause all vCPUs before issuing these IOCTLs. Later operation should be avoided
as it ends up in freezing VM till the entire KVM device IOCTL operation is
completed (and this could happen every time when PSCI CPU_ON is requested by
the Guest user). There are also known issues with the implementaiton of the
{pause, resume}_all_vcpus() in the QEMU which can lead to race conditions.

Introduce a per-vCPU shadow of ICC_CTLR_EL1 that allows userspace to read this
register without taking any vCPU-wide locks once the VGIC has been initialized.
The new helper vgic_v3_try_icc_ctlr_el1_lockless_access() attempts a fast,
lockless read of the per-vCPU shadow and returns -EBUSY if the VGIC is not yet
initialized, allowing the caller to fall back to the existing contended path.
This reduces cross-vCPU contention and prevents spurious -EBUSY errors during
userspace-driven CPU_ON sequences, without changing architectural behaviour.

The shadow is updated in set_gic_ctlr(): after VMCR fields are written, we
recompute ICC_CTLR_EL1 via get_gic_ctlr(). Writes remain unchanged and continue
to use the existing synchronized path.

Without this change, QEMU must cache ICC_CTLR_EL1 itself to avoid racy
pause_all_vcpus() usage during cpu_reset(), as described in:
https://lore.kernel.org/qemu-devel/20251001010127.3092631-22-salil.mehta@opnsrc.net/

RFC Discussion (Architectural Considerations):
=============================================

It might have been helpful if the ARM architecture were to have clearer
boundaries between dynamic and static (or pseudo-static) system registers(?)
Registers such as ICC_CTLR_EL1 contain configuration fields that are effectively
static once the VM is initialized (even the EOIMode, CBPR & PMHE are guest
GICv3 driver configured constants), yet the architecture allows them to coexist
with other more dynamic runtime fields of the ICH_ VMCR_EL2. If these
configuration registers had been architecturally immutable after VM
initialization, their accesses could have avoided expensive all-vCPU locking
otherwise required to guarantee atomic visibility across vCPUs.

ICC_CTLR_EL1 [63:0]

  63                                                                       32
 +---------------------------------------------------------------------------+
 |                                   RES0                                    |
 +---------------------------------------------------------------------------+
  31        20 19 18 17 16 15 14 13 12 11 10   9   8  7  6   5  4  3  2  1  0
 +------------+--+--+--+--+--+--+--+--+--+---+---+---+--+--+--+--+--+--+--+--+
 |    RES0    |Ex|RS|RES0 |A3|SE| IDbits |  PRIbits  |R0|PM|  RES0     |EO|CB|
 +------------+--+--+--+--+--+--+--+--+--+---+---+---+--+--+--+--+--+--+--+--+
              |  |        |  |                       |   |             |  |
              |  |        |  |                       |   |             |  +CBPR
              |  |        |  |                       |   |             +EOImode
              |  |        |  |                       |   +-PMHE
              |  |        |  |                       +----RES0
              |  |        |  +--SEIS
              |  |        +-----A3V
              |  +--------------RSS
              +-----------------ExtRange

 Access: {Ex, RS, A3, SE, IDbits, PRIbits} = RO;
         {PMHE} = RW*;
         {EO, CB} = RW**;
	 others = RES0.
 Notes : * impl-def (may be RO when DS=0)
         ** CB may be RO when DS=0 (EO stays RW)

 Source: Arm GIC Architecture Specification (IHI 0069H.b),
         §12.2.6 “ICC_CTLR_EL1”, pp. 12-233…12-237

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 arch/arm64/kvm/vgic-sys-reg-v3.c      | 14 ++++++++++++
 arch/arm64/kvm/vgic/vgic-kvm-device.c | 32 +++++++++++++++++++++++++++
 include/kvm/arm_vgic.h                |  3 +++
 3 files changed, 49 insertions(+)

diff --git a/arch/arm64/kvm/vgic-sys-reg-v3.c b/arch/arm64/kvm/vgic-sys-reg-v3.c
index bdc2d57370b2..80d6e38b4aa7 100644
--- a/arch/arm64/kvm/vgic-sys-reg-v3.c
+++ b/arch/arm64/kvm/vgic-sys-reg-v3.c
@@ -10,12 +10,17 @@
 #include "vgic/vgic.h"
 #include "sys_regs.h"
 
+static int get_gic_ctlr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
+			u64 *valp);
+
 static int set_gic_ctlr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
 			u64 val)
 {
 	u32 host_pri_bits, host_id_bits, host_seis, host_a3v, seis, a3v;
+	struct vgic_v3_cpu_if *cif = &vcpu->arch.vgic_cpu.vgic_v3;
 	struct vgic_cpu *vgic_v3_cpu = &vcpu->arch.vgic_cpu;
 	struct vgic_vmcr vmcr;
+	u64 sysreg;
 
 	vgic_get_vmcr(vcpu, &vmcr);
 
@@ -53,6 +58,15 @@ static int set_gic_ctlr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
 	vmcr.eoim = FIELD_GET(ICC_CTLR_EL1_EOImode_MASK, val);
 	vgic_set_vmcr(vcpu, &vmcr);
 
+	/* update the ICC_CTLR_EL1 shadow, if required */
+	get_gic_ctlr(vcpu, r, &sysreg);
+
+	/*
+	 * Update the ICC_CTLR_EL1 shadow to allow lock free access of static
+	 * or pesudo static register fields from KVM device IOCTLs
+	 */
+	cif->icc_ctlr_el1_shadow = (u32)sysreg;
+
 	return 0;
 }
 
diff --git a/arch/arm64/kvm/vgic/vgic-kvm-device.c b/arch/arm64/kvm/vgic/vgic-kvm-device.c
index 3d1a776b716d..4a2971525c1a 100644
--- a/arch/arm64/kvm/vgic/vgic-kvm-device.c
+++ b/arch/arm64/kvm/vgic/vgic-kvm-device.c
@@ -522,6 +522,29 @@ static bool reg_allowed_pre_init(struct kvm_device_attr *attr)
 	}
 }
 
+/*
+ * Fast, lockless read of ICC_CTLR_EL1 for userspace ioctl. Saves calls from
+ * -EBUSY due to inability to acquire vCPU mutexes for all the vCPUs
+ */
+static inline int
+vgic_v3_try_icc_ctlr_el1_lockless_access(struct kvm_device *dev,
+                                         const struct kvm_device_attr *attr,
+                                         struct kvm_vcpu *vcpu)
+{
+	struct vgic_v3_cpu_if *vgic_cpu_if = &vcpu->arch.vgic_cpu.vgic_v3;
+	u32 __user *uaddr;
+	u32 shadow;
+
+	/* pathological check, in case we are here too early */
+	if (!vgic_initialized(dev->kvm))
+		return -EBUSY;
+
+	shadow = vgic_cpu_if->icc_ctlr_el1_shadow;
+	uaddr = (u32 __user *)(unsigned long)attr->addr;
+
+	return put_user(shadow, uaddr) ? -EFAULT : 0;
+}
+
 /*
  * vgic_v3_attr_regs_access - allows user space to access VGIC v3 state
  *
@@ -533,6 +556,7 @@ static int vgic_v3_attr_regs_access(struct kvm_device *dev,
 				    struct kvm_device_attr *attr,
 				    bool is_write)
 {
+	u32 sysreg = attr->attr & ~KVM_DEV_ARM_VGIC_CPUID_MASK;
 	struct vgic_reg_attr reg_attr;
 	gpa_t addr;
 	struct kvm_vcpu *vcpu;
@@ -549,6 +573,14 @@ static int vgic_v3_attr_regs_access(struct kvm_device *dev,
 
 	switch (attr->group) {
 	case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS:
+		/* Try lockless ICC_CTLR_EL1 read first */
+		if (!is_write && sysreg == SYS_ICC_CTLR_EL1) {
+			ret = vgic_v3_try_icc_ctlr_el1_lockless_access(dev,
+				 attr, vcpu);
+			if(!ret)
+				return ret;
+		}
+
 		/* Sysregs uaccess is performed by the sysreg handling code */
 		uaccess = false;
 		break;
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 4000ff16f295..42c767c62291 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -334,6 +334,9 @@ struct vgic_v3_cpu_if {
 	struct its_vpe	its_vpe;
 
 	unsigned int used_lrs;
+
+	/* ICC_CTLR_EL1 shadow (published to readers) */
+	u32 icc_ctlr_el1_shadow; /* upper 32 bits of this reg are reserved */
 };
 
 struct vgic_cpu {
-- 
2.34.1

Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready

Posted by Marc Zyngier 4 months ago

On Wed, 08 Oct 2025 21:19:55 +0100,
salil.mehta@opnsrc.net wrote:
> 
> From: Salil Mehta <salil.mehta@huawei.com>
> 
> [A rough illustration of the problem and the probable solution]
> 
> Userspace reads of ICC_CTLR_EL1 via KVM device attributes currently takes a slow
> path that may acquire all vCPU locks. Under workloads that exercise userspace
> PSCI CPU_ON flows or frequent vCPU resets, this can cause vCPU lock contention
> in KVM and, in the worst cases, -EBUSY returns to userspace.
> 
> When PSCI CPU_ON and CPU_OFF calls are handled entirely in KVM, these operations
> are executed under KVM vCPU locks in the host kernel (EL1) and appear atomic to
> other vCPU threads. In this context, system register accesses are serialized
> under KVM vCPU locks, ensuring atomicity with respect to other vCPUs. After
> SMCCC filtering was introduced, PSCI CPU_ON and CPU_OFF calls can now exit to
> userspace (QEMU). During the handling of PSCI CPU_ON call in userspace, a
> cpu_reset() is exerted which reads ICC_CTLR_EL1 through KVM device attribute
> IOCTLs. To avoid transient inconsistency and -EBUSY errors, QEMU is forced to
> pause all vCPUs before issuing these IOCTLs.

I'm going to repeat in public what I already said in private.

Why does QEMU need to know this? I don't see how this is related to
PSCI, and outside of save/restore, there is no reason why QEMU should
poke at this. If QEMU needs fixing, please fix QEMU.

Honestly, I don't see why the kernel should even care about this, and
I have no intention of adopting anything of the sort for something
that has all the hallmarks of a userspace bug.

	M.

-- 
Without deviation from the norm, progress is not possible.

RE: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready

Posted by Salil Mehta 3 months, 4 weeks ago

HI Marc,

> From: Marc Zyngier <maz@kernel.org>
> Sent: Thursday, October 9, 2025 2:49 PM
> To: salil.mehta@opnsrc.net
[...]

> 
> On Wed, 08 Oct 2025 21:19:55 +0100,
> salil.mehta@opnsrc.net wrote:
> >
> > From: Salil Mehta <salil.mehta@huawei.com>
> >
> > [A rough illustration of the problem and the probable solution]
> >
> > Userspace reads of ICC_CTLR_EL1 via KVM device attributes currently
> > takes a slow path that may acquire all vCPU locks. Under workloads
> > that exercise userspace PSCI CPU_ON flows or frequent vCPU resets,
> > this can cause vCPU lock contention in KVM and, in the worst cases, -EBUSY
> returns to userspace.
> >
> > When PSCI CPU_ON and CPU_OFF calls are handled entirely in KVM, these
> > operations are executed under KVM vCPU locks in the host kernel (EL1)
> > and appear atomic to other vCPU threads. In this context, system
> > register accesses are serialized under KVM vCPU locks, ensuring
> > atomicity with respect to other vCPUs. After SMCCC filtering was
> > introduced, PSCI CPU_ON and CPU_OFF calls can now exit to userspace
> > (QEMU). During the handling of PSCI CPU_ON call in userspace, a
> > cpu_reset() is exerted which reads ICC_CTLR_EL1 through KVM device
> > attribute IOCTLs. To avoid transient inconsistency and -EBUSY errors,
> > QEMU is forced to pause all vCPUs before issuing these IOCTLs.
> 
> I'm going to repeat in public what I already said in private.
> 
> Why does QEMU need to know this? I don't see how this is related to PSCI,
> and outside of save/restore, there is no reason why QEMU should poke at
> this. If QEMU needs fixing, please fix QEMU.


Sure, and I did not disagree with it earlier but because I was not fully sure
so I refrained from replying prematurely here. 


> 
> Honestly, I don't see why the kernel should even care about this, and I have
> no intention of adopting anything of the sort for something that has all the
> hallmarks of a userspace bug.


I understand your point. So the probable solutions for the problem mentioned
in the patch could be:

1. Remove the KVM device access of ICC_CTLR_EL1 system register during CPU
    reset and only sync with KVM during migration at source & destination?
2. if 1 is not acceptable then cache in user space. 
3.  This KVM shadow register change 

IIUC, you've hinted at 1st as the solution. We've discussed 2 as well and as I
understand you don't have much apprehensions about it? And last point 3,
is of course totally rejected.

Hope I got it right?

Many thanks!

Best regards
Salil.

> 
> 	M.
> 
> --
> Without deviation from the norm, progress is not possible.

Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready

Posted by Peter Maydell 3 months, 4 weeks ago

On Thu, 9 Oct 2025 at 14:48, Marc Zyngier <maz@kernel.org> wrote:
>
> On Wed, 08 Oct 2025 21:19:55 +0100,
> salil.mehta@opnsrc.net wrote:
> >
> > From: Salil Mehta <salil.mehta@huawei.com>
> >
> > [A rough illustration of the problem and the probable solution]
> >
> > Userspace reads of ICC_CTLR_EL1 via KVM device attributes currently takes a slow
> > path that may acquire all vCPU locks. Under workloads that exercise userspace
> > PSCI CPU_ON flows or frequent vCPU resets, this can cause vCPU lock contention
> > in KVM and, in the worst cases, -EBUSY returns to userspace.
> >
> > When PSCI CPU_ON and CPU_OFF calls are handled entirely in KVM, these operations
> > are executed under KVM vCPU locks in the host kernel (EL1) and appear atomic to
> > other vCPU threads. In this context, system register accesses are serialized
> > under KVM vCPU locks, ensuring atomicity with respect to other vCPUs. After
> > SMCCC filtering was introduced, PSCI CPU_ON and CPU_OFF calls can now exit to
> > userspace (QEMU). During the handling of PSCI CPU_ON call in userspace, a
> > cpu_reset() is exerted which reads ICC_CTLR_EL1 through KVM device attribute
> > IOCTLs. To avoid transient inconsistency and -EBUSY errors, QEMU is forced to
> > pause all vCPUs before issuing these IOCTLs.
>
> I'm going to repeat in public what I already said in private.
>
> Why does QEMU need to know this? I don't see how this is related to
> PSCI, and outside of save/restore, there is no reason why QEMU should
> poke at this. If QEMU needs fixing, please fix QEMU.

I don't know the background here, but generally speaking,
when we do a CPU reset that includes writing all the CPU state
of the "this is freshly reset from userspace's point of view" vcpu
back to the kernel. More generally, userspace should be able to
read and write sysregs for a vcpu any time it likes, and not
arbitrarily get back -EBUSY. What does the kernel expect
userspace to do with an errno like that?

-- PMM

Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready

Posted by Marc Zyngier 3 months, 4 weeks ago

On Mon, 13 Oct 2025 09:42:58 +0100,
Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> On Thu, 9 Oct 2025 at 14:48, Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Wed, 08 Oct 2025 21:19:55 +0100,
> > salil.mehta@opnsrc.net wrote:
> > >
> > > From: Salil Mehta <salil.mehta@huawei.com>
> > >
> > > [A rough illustration of the problem and the probable solution]
> > >
> > > Userspace reads of ICC_CTLR_EL1 via KVM device attributes currently takes a slow
> > > path that may acquire all vCPU locks. Under workloads that exercise userspace
> > > PSCI CPU_ON flows or frequent vCPU resets, this can cause vCPU lock contention
> > > in KVM and, in the worst cases, -EBUSY returns to userspace.
> > >
> > > When PSCI CPU_ON and CPU_OFF calls are handled entirely in KVM, these operations
> > > are executed under KVM vCPU locks in the host kernel (EL1) and appear atomic to
> > > other vCPU threads. In this context, system register accesses are serialized
> > > under KVM vCPU locks, ensuring atomicity with respect to other vCPUs. After
> > > SMCCC filtering was introduced, PSCI CPU_ON and CPU_OFF calls can now exit to
> > > userspace (QEMU). During the handling of PSCI CPU_ON call in userspace, a
> > > cpu_reset() is exerted which reads ICC_CTLR_EL1 through KVM device attribute
> > > IOCTLs. To avoid transient inconsistency and -EBUSY errors, QEMU is forced to
> > > pause all vCPUs before issuing these IOCTLs.
> >
> > I'm going to repeat in public what I already said in private.
> >
> > Why does QEMU need to know this? I don't see how this is related to
> > PSCI, and outside of save/restore, there is no reason why QEMU should
> > poke at this. If QEMU needs fixing, please fix QEMU.
> 
> I don't know the background here, but generally speaking,
> when we do a CPU reset that includes writing all the CPU state
> of the "this is freshly reset from userspace's point of view" vcpu
> back to the kernel. More generally, userspace should be able to
> read and write sysregs for a vcpu any time it likes, and not
> arbitrarily get back -EBUSY. What does the kernel expect
> userspace to do with an errno like that?

The main issue here is that GICv3 is modelled as a device, just like
GICv2, and that all the sysregs that are relevant to the GIC have the
same status as the MMIO registers: they can only be accessed when the
vcpus are not running.

These sysregs are not visible through the normal ONE_REG API, and
therefore not subjected to the "do whatever you want" rule.

Should we have done something else when the GICv3 save/restore API was
introduced and agreed upon with the QEMU people? Probably. Can we
change it now? Probably not. The only thing we could relax is the
scope of the lock when accessing a sysreg, so that we only mandate
that the targeted vcpu is not running instead of the whole VM.

And finally, if you object to this API, why should we do for GICv5,
which is so far implemented by following the exact same principles?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready

Posted by Peter Maydell 3 months, 3 weeks ago

On Mon, 13 Oct 2025 at 11:55, Marc Zyngier <maz@kernel.org> wrote:
>
> On Mon, 13 Oct 2025 09:42:58 +0100,
> Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On Thu, 9 Oct 2025 at 14:48, Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Wed, 08 Oct 2025 21:19:55 +0100,
> > > salil.mehta@opnsrc.net wrote:
> > > >
> > > > From: Salil Mehta <salil.mehta@huawei.com>
> > > >
> > > > [A rough illustration of the problem and the probable solution]
> > > >
> > > > Userspace reads of ICC_CTLR_EL1 via KVM device attributes currently takes a slow
> > > > path that may acquire all vCPU locks. Under workloads that exercise userspace
> > > > PSCI CPU_ON flows or frequent vCPU resets, this can cause vCPU lock contention
> > > > in KVM and, in the worst cases, -EBUSY returns to userspace.
> > > >
> > > > When PSCI CPU_ON and CPU_OFF calls are handled entirely in KVM, these operations
> > > > are executed under KVM vCPU locks in the host kernel (EL1) and appear atomic to
> > > > other vCPU threads. In this context, system register accesses are serialized
> > > > under KVM vCPU locks, ensuring atomicity with respect to other vCPUs. After
> > > > SMCCC filtering was introduced, PSCI CPU_ON and CPU_OFF calls can now exit to
> > > > userspace (QEMU). During the handling of PSCI CPU_ON call in userspace, a
> > > > cpu_reset() is exerted which reads ICC_CTLR_EL1 through KVM device attribute
> > > > IOCTLs. To avoid transient inconsistency and -EBUSY errors, QEMU is forced to
> > > > pause all vCPUs before issuing these IOCTLs.
> > >
> > > I'm going to repeat in public what I already said in private.
> > >
> > > Why does QEMU need to know this? I don't see how this is related to
> > > PSCI, and outside of save/restore, there is no reason why QEMU should
> > > poke at this. If QEMU needs fixing, please fix QEMU.
> >
> > I don't know the background here, but generally speaking,
> > when we do a CPU reset that includes writing all the CPU state
> > of the "this is freshly reset from userspace's point of view" vcpu
> > back to the kernel. More generally, userspace should be able to
> > read and write sysregs for a vcpu any time it likes, and not
> > arbitrarily get back -EBUSY. What does the kernel expect
> > userspace to do with an errno like that?
>
> The main issue here is that GICv3 is modelled as a device, just like
> GICv2, and that all the sysregs that are relevant to the GIC have the
> same status as the MMIO registers: they can only be accessed when the
> vcpus are not running.
>
> These sysregs are not visible through the normal ONE_REG API, and
> therefore not subjected to the "do whatever you want" rule.

Ah, I'd forgotten that. But the cpuif registers are still
per-cpu, and they do still need to be reset on vcpu reset,
and that might still happen for a single vcpu when the VM
as a whole is still running.

That said, QEMU's current code for this could be refactored
to avoid the reset-time read of ICC_CTLR_EL1 from the kernel.
We do this so we can set the userspace struct field for this
register to the right value. But we could ask the kernel for
that value once on VM startup since it's not going to change mid-run.

That would bring ICC_CTLR_EL1 into line with the other cpuif
registers, where QEMU assumes it knows what the kernel's
reset value of them is (mostly "0") and doesn't bother to ask.
This is different from how we handle ONE_REG sysregs, where
I'm pretty sure we do ask the kernel the value of all of them
on a vcpu reset. (And then write the values back again, which
is a bit silly but nobody's ever said it was a performance
problem for them :-))

> Should we have done something else when the GICv3 save/restore API was
> introduced and agreed upon with the QEMU people? Probably. Can we
> change it now? Probably not. The only thing we could relax is the
> scope of the lock when accessing a sysreg, so that we only mandate
> that the targeted vcpu is not running instead of the whole VM.
>
> And finally, if you object to this API, why should we do for GICv5,
> which is so far implemented by following the exact same principles?

I don't object to the API inherently (I don't care whether we
do these register reads via a dev ioctl or something else,
from userspace's point of view it's just "do some syscall,
get a value") -- I'm just objecting to the kernel's
implementation of it where it might return EBUSY :-)

(Also, if the kernel had failed EINVAL unconditionally for
an attempt to do this on a not-stopped VM then we'd probably
have found this mismatch in understanding about how the
API should work years ago. "Mostly works but sometimes fails
EBUSY" is the worst of all worlds.)

I haven't yet got as far as thinking about the KVM interface
for GICv5 yet...

-- PMM

Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready

Posted by Marc Zyngier 3 months, 3 weeks ago

On Mon, 13 Oct 2025 17:48:44 +0100,
Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> On Mon, 13 Oct 2025 at 11:55, Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Mon, 13 Oct 2025 09:42:58 +0100,
> > Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > > On Thu, 9 Oct 2025 at 14:48, Marc Zyngier <maz@kernel.org> wrote:
> > > >
> > > > On Wed, 08 Oct 2025 21:19:55 +0100,
> > > > salil.mehta@opnsrc.net wrote:
> > > > >
> > > > > From: Salil Mehta <salil.mehta@huawei.com>
> > > > >
> > > > > [A rough illustration of the problem and the probable solution]
> > > > >
> > > > > Userspace reads of ICC_CTLR_EL1 via KVM device attributes currently takes a slow
> > > > > path that may acquire all vCPU locks. Under workloads that exercise userspace
> > > > > PSCI CPU_ON flows or frequent vCPU resets, this can cause vCPU lock contention
> > > > > in KVM and, in the worst cases, -EBUSY returns to userspace.
> > > > >
> > > > > When PSCI CPU_ON and CPU_OFF calls are handled entirely in KVM, these operations
> > > > > are executed under KVM vCPU locks in the host kernel (EL1) and appear atomic to
> > > > > other vCPU threads. In this context, system register accesses are serialized
> > > > > under KVM vCPU locks, ensuring atomicity with respect to other vCPUs. After
> > > > > SMCCC filtering was introduced, PSCI CPU_ON and CPU_OFF calls can now exit to
> > > > > userspace (QEMU). During the handling of PSCI CPU_ON call in userspace, a
> > > > > cpu_reset() is exerted which reads ICC_CTLR_EL1 through KVM device attribute
> > > > > IOCTLs. To avoid transient inconsistency and -EBUSY errors, QEMU is forced to
> > > > > pause all vCPUs before issuing these IOCTLs.
> > > >
> > > > I'm going to repeat in public what I already said in private.
> > > >
> > > > Why does QEMU need to know this? I don't see how this is related to
> > > > PSCI, and outside of save/restore, there is no reason why QEMU should
> > > > poke at this. If QEMU needs fixing, please fix QEMU.
> > >
> > > I don't know the background here, but generally speaking,
> > > when we do a CPU reset that includes writing all the CPU state
> > > of the "this is freshly reset from userspace's point of view" vcpu
> > > back to the kernel. More generally, userspace should be able to
> > > read and write sysregs for a vcpu any time it likes, and not
> > > arbitrarily get back -EBUSY. What does the kernel expect
> > > userspace to do with an errno like that?
> >
> > The main issue here is that GICv3 is modelled as a device, just like
> > GICv2, and that all the sysregs that are relevant to the GIC have the
> > same status as the MMIO registers: they can only be accessed when the
> > vcpus are not running.
> >
> > These sysregs are not visible through the normal ONE_REG API, and
> > therefore not subjected to the "do whatever you want" rule.
> 
> Ah, I'd forgotten that. But the cpuif registers are still
> per-cpu, and they do still need to be reset on vcpu reset,
> and that might still happen for a single vcpu when the VM
> as a whole is still running.
> 
> That said, QEMU's current code for this could be refactored
> to avoid the reset-time read of ICC_CTLR_EL1 from the kernel.
> We do this so we can set the userspace struct field for this
> register to the right value. But we could ask the kernel for
> that value once on VM startup since it's not going to change mid-run.

The reset value is indeed cast in stone once the GIC has been created.

> That would bring ICC_CTLR_EL1 into line with the other cpuif
> registers, where QEMU assumes it knows what the kernel's
> reset value of them is (mostly "0") and doesn't bother to ask.
> This is different from how we handle ONE_REG sysregs, where
> I'm pretty sure we do ask the kernel the value of all of them
> on a vcpu reset. (And then write the values back again, which
> is a bit silly but nobody's ever said it was a performance
> problem for them :-))
>
> > Should we have done something else when the GICv3 save/restore API was
> > introduced and agreed upon with the QEMU people? Probably. Can we
> > change it now? Probably not. The only thing we could relax is the
> > scope of the lock when accessing a sysreg, so that we only mandate
> > that the targeted vcpu is not running instead of the whole VM.
> >
> > And finally, if you object to this API, why should we do for GICv5,
> > which is so far implemented by following the exact same principles?
> 
> I don't object to the API inherently (I don't care whether we
> do these register reads via a dev ioctl or something else,
> from userspace's point of view it's just "do some syscall,
> get a value") -- I'm just objecting to the kernel's
> implementation of it where it might return EBUSY :-)

To me, EBUSY has a clear meaning: you're otherwise using the resource,
and you need to relinquish it first, while EINVAL indicates that the
kernel doesn't understand what you want.

As I said, I'm happy to look at reducing the locking to only the
target vcpu in the case of a sysreg being accessed, but EBUSY will
stay.

> 
> (Also, if the kernel had failed EINVAL unconditionally for
> an attempt to do this on a not-stopped VM then we'd probably
> have found this mismatch in understanding about how the
> API should work years ago. "Mostly works but sometimes fails
> EBUSY" is the worst of all worlds.)
> 
> I haven't yet got as far as thinking about the KVM interface
> for GICv5 yet...

I guess that for the time being, we'll assume that GICv3 is the
reference.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

RE: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready

Posted by Salil Mehta 3 months, 3 weeks ago

HI Marc,

> From: Marc Zyngier <maz@kernel.org>
> Sent: Tuesday, October 14, 2025 8:45 AM
> To: Peter Maydell <peter.maydell@linaro.org>
> 
> On Mon, 13 Oct 2025 17:48:44 +0100,
> Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On Mon, 13 Oct 2025 at 11:55, Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Mon, 13 Oct 2025 09:42:58 +0100,
> > > Peter Maydell <peter.maydell@linaro.org> wrote:
> > > >
> > > > On Thu, 9 Oct 2025 at 14:48, Marc Zyngier <maz@kernel.org> wrote:
> > > > >
> > > > > On Wed, 08 Oct 2025 21:19:55 +0100, salil.mehta@opnsrc.net
> > > > > wrote:
> > > > > >
> > > > > > From: Salil Mehta <salil.mehta@huawei.com>
> > > > > >
> > > > > > [A rough illustration of the problem and the probable
> > > > > > solution]
> > > > > >
> > > > > > Userspace reads of ICC_CTLR_EL1 via KVM device attributes
> > > > > > currently takes a slow path that may acquire all vCPU locks.
> > > > > > Under workloads that exercise userspace PSCI CPU_ON flows or
> > > > > > frequent vCPU resets, this can cause vCPU lock contention in KVM
> and, in the worst cases, -EBUSY returns to userspace.
> > > > > >
> > > > > > When PSCI CPU_ON and CPU_OFF calls are handled entirely in
> > > > > > KVM, these operations are executed under KVM vCPU locks in the
> > > > > > host kernel (EL1) and appear atomic to other vCPU threads. In
> > > > > > this context, system register accesses are serialized under
> > > > > > KVM vCPU locks, ensuring atomicity with respect to other
> > > > > > vCPUs. After SMCCC filtering was introduced, PSCI CPU_ON and
> > > > > > CPU_OFF calls can now exit to userspace (QEMU). During the
> > > > > > handling of PSCI CPU_ON call in userspace, a
> > > > > > cpu_reset() is exerted which reads ICC_CTLR_EL1 through KVM
> > > > > > device attribute IOCTLs. To avoid transient inconsistency and
> > > > > > -EBUSY errors, QEMU is forced to pause all vCPUs before issuing
> these IOCTLs.
> > > > >
> > > > > I'm going to repeat in public what I already said in private.
> > > > >
> > > > > Why does QEMU need to know this? I don't see how this is related
> > > > > to PSCI, and outside of save/restore, there is no reason why
> > > > > QEMU should poke at this. If QEMU needs fixing, please fix QEMU.
> > > >
> > > > I don't know the background here, but generally speaking, when we
> > > > do a CPU reset that includes writing all the CPU state of the
> > > > "this is freshly reset from userspace's point of view" vcpu back
> > > > to the kernel. More generally, userspace should be able to read
> > > > and write sysregs for a vcpu any time it likes, and not
> > > > arbitrarily get back -EBUSY. What does the kernel expect userspace
> > > > to do with an errno like that?
> > >
> > > The main issue here is that GICv3 is modelled as a device, just like
> > > GICv2, and that all the sysregs that are relevant to the GIC have
> > > the same status as the MMIO registers: they can only be accessed
> > > when the vcpus are not running.
> > >
> > > These sysregs are not visible through the normal ONE_REG API, and
> > > therefore not subjected to the "do whatever you want" rule.
> >
> > Ah, I'd forgotten that. But the cpuif registers are still per-cpu, and
> > they do still need to be reset on vcpu reset, and that might still
> > happen for a single vcpu when the VM as a whole is still running.
> >
> > That said, QEMU's current code for this could be refactored to avoid
> > the reset-time read of ICC_CTLR_EL1 from the kernel.
> > We do this so we can set the userspace struct field for this register
> > to the right value. But we could ask the kernel for that value once on
> > VM startup since it's not going to change mid-run.
> 
> The reset value is indeed cast in stone once the GIC has been created.
> 
> > That would bring ICC_CTLR_EL1 into line with the other cpuif
> > registers, where QEMU assumes it knows what the kernel's reset value
> > of them is (mostly "0") and doesn't bother to ask.
> > This is different from how we handle ONE_REG sysregs, where I'm pretty
> > sure we do ask the kernel the value of all of them on a vcpu reset.
> > (And then write the values back again, which is a bit silly but
> > nobody's ever said it was a performance problem for them :-))
> >
> > > Should we have done something else when the GICv3 save/restore API
> > > was introduced and agreed upon with the QEMU people? Probably. Can
> > > we change it now? Probably not. The only thing we could relax is the
> > > scope of the lock when accessing a sysreg, so that we only mandate
> > > that the targeted vcpu is not running instead of the whole VM.
> > >
> > > And finally, if you object to this API, why should we do for GICv5,
> > > which is so far implemented by following the exact same principles?
> >
> > I don't object to the API inherently (I don't care whether we do these
> > register reads via a dev ioctl or something else, from userspace's
> > point of view it's just "do some syscall, get a value") -- I'm just
> > objecting to the kernel's implementation of it where it might return
> > EBUSY :-)
> 
> To me, EBUSY has a clear meaning: you're otherwise using the resource, and
> you need to relinquish it first, while EINVAL indicates that the kernel doesn't
> understand what you want.
> 
> As I said, I'm happy to look at reducing the locking to only the target vcpu in
> the case of a sysreg being accessed, but EBUSY will stay.


I forgot to mention that we also saw once the issue of -EBUSY happening when
userspace GICv3 ITS resets during guest reboot and it calls KVM_DEV_ARM_ITS_CTRL_RESET.

File: kvm/vgic/vgic-its.c

static int vgic_its_ctrl(struct kvm *kvm, struct vgic_its *its, u64 attr)
{
[...]
	mutex_lock(&kvm->lock);

	if (kvm_trylock_all_vcpus(kvm)) {
		mutex_unlock(&kvm->lock);
		return -EBUSY;
	}

	mutex_lock(&kvm->arch.config_lock);
	mutex_lock(&its->its_lock);

	switch (attr) {
	case KVM_DEV_ARM_ITS_CTRL_RESET:
		vgic_its_reset(kvm, its);
		break;
[...]
}


Best regards
Salil.

> 
> >
> > (Also, if the kernel had failed EINVAL unconditionally for an attempt
> > to do this on a not-stopped VM then we'd probably have found this
> > mismatch in understanding about how the API should work years ago.
> > "Mostly works but sometimes fails EBUSY" is the worst of all worlds.)
> >
> > I haven't yet got as far as thinking about the KVM interface for GICv5
> > yet...
> 
> I guess that for the time being, we'll assume that GICv3 is the reference.
> 
> Thanks,
> 
> 	M.
> 
> --
> Without deviation from the norm, progress is not possible.

Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready

Posted by Peter Maydell 3 months, 3 weeks ago

On Tue, 14 Oct 2025 at 08:44, Marc Zyngier <maz@kernel.org> wrote:
>
> On Mon, 13 Oct 2025 17:48:44 +0100,
> Peter Maydell <peter.maydell@linaro.org> wrote:
> > I don't object to the API inherently (I don't care whether we
> > do these register reads via a dev ioctl or something else,
> > from userspace's point of view it's just "do some syscall,
> > get a value") -- I'm just objecting to the kernel's
> > implementation of it where it might return EBUSY :-)
>
> To me, EBUSY has a clear meaning: you're otherwise using the resource,
> and you need to relinquish it first, while EINVAL indicates that the
> kernel doesn't understand what you want.
>
> As I said, I'm happy to look at reducing the locking to only the
> target vcpu in the case of a sysreg being accessed, but EBUSY will
> stay.

I don't particularly have a strong feeling about the errno
value. I just think that it's much harder to accidentally
misuse an API which consistently returns an error if userspace
tries to call it in the wrong context, than if it
mostly works but occasionally fails.

(The horse has bolted for this specific case, of course:
if we made it fail consistently then that would probably
break existing deployed QEMU versions.)

-- PMM

Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready

Posted by Salil Mehta 3 months, 3 weeks ago

Hi Peter

On Mon, Oct 13, 2025 at 4:48 PM Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Mon, 13 Oct 2025 at 11:55, Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Mon, 13 Oct 2025 09:42:58 +0100,
> > Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > > On Thu, 9 Oct 2025 at 14:48, Marc Zyngier <maz@kernel.org> wrote:
> > > >
> > > > On Wed, 08 Oct 2025 21:19:55 +0100,
> > > > salil.mehta@opnsrc.net wrote:
> > > > >
> > > > > From: Salil Mehta <salil.mehta@huawei.com>
> > > > >
> > > > > [A rough illustration of the problem and the probable solution]
> > > > >
> > > > > Userspace reads of ICC_CTLR_EL1 via KVM device attributes currently takes a slow
> > > > > path that may acquire all vCPU locks. Under workloads that exercise userspace
> > > > > PSCI CPU_ON flows or frequent vCPU resets, this can cause vCPU lock contention
> > > > > in KVM and, in the worst cases, -EBUSY returns to userspace.
> > > > >
> > > > > When PSCI CPU_ON and CPU_OFF calls are handled entirely in KVM, these operations
> > > > > are executed under KVM vCPU locks in the host kernel (EL1) and appear atomic to
> > > > > other vCPU threads. In this context, system register accesses are serialized
> > > > > under KVM vCPU locks, ensuring atomicity with respect to other vCPUs. After
> > > > > SMCCC filtering was introduced, PSCI CPU_ON and CPU_OFF calls can now exit to
> > > > > userspace (QEMU). During the handling of PSCI CPU_ON call in userspace, a
> > > > > cpu_reset() is exerted which reads ICC_CTLR_EL1 through KVM device attribute
> > > > > IOCTLs. To avoid transient inconsistency and -EBUSY errors, QEMU is forced to
> > > > > pause all vCPUs before issuing these IOCTLs.
> > > >
> > > > I'm going to repeat in public what I already said in private.
> > > >
> > > > Why does QEMU need to know this? I don't see how this is related to
> > > > PSCI, and outside of save/restore, there is no reason why QEMU should
> > > > poke at this. If QEMU needs fixing, please fix QEMU.
> > >
> > > I don't know the background here, but generally speaking,
> > > when we do a CPU reset that includes writing all the CPU state
> > > of the "this is freshly reset from userspace's point of view" vcpu
> > > back to the kernel. More generally, userspace should be able to
> > > read and write sysregs for a vcpu any time it likes, and not
> > > arbitrarily get back -EBUSY. What does the kernel expect
> > > userspace to do with an errno like that?
> >
> > The main issue here is that GICv3 is modelled as a device, just like
> > GICv2, and that all the sysregs that are relevant to the GIC have the
> > same status as the MMIO registers: they can only be accessed when the
> > vcpus are not running.
> >
> > These sysregs are not visible through the normal ONE_REG API, and
> > therefore not subjected to the "do whatever you want" rule.
>
> Ah, I'd forgotten that. But the cpuif registers are still
> per-cpu, and they do still need to be reset on vcpu reset,
> and that might still happen for a single vcpu when the VM
> as a whole is still running.
>
> That said, QEMU's current code for this could be refactored
> to avoid the reset-time read of ICC_CTLR_EL1 from the kernel.
> We do this so we can set the userspace struct field for this
> register to the right value. But we could ask the kernel for
> that value once on VM startup since it's not going to change mid-run.
>
> That would bring ICC_CTLR_EL1 into line with the other cpuif
> registers, where QEMU assumes it knows what the kernel's
> reset value of them is (mostly "0") and doesn't bother to ask.
> This is different from how we handle ONE_REG sysregs, where
> I'm pretty sure we do ask the kernel the value of all of them
> on a vcpu reset. (And then write the values back again, which
> is a bit silly but nobody's ever said it was a performance
> problem for them :-))


This is effectively what the mentioned patch in the commit-log is doing.
Pasting here again:

https://lore.kernel.org/qemu-devel/20251001010127.3092631-22-salil.mehta@opnsrc.net/

Can this patch be accepted independently then?


>
> > Should we have done something else when the GICv3 save/restore API was
> > introduced and agreed upon with the QEMU people? Probably. Can we
> > change it now? Probably not. The only thing we could relax is the
> > scope of the lock when accessing a sysreg, so that we only mandate
> > that the targeted vcpu is not running instead of the whole VM.
> >
> > And finally, if you object to this API, why should we do for GICv5,
> > which is so far implemented by following the exact same principles?
>
> I don't object to the API inherently (I don't care whether we
> do these register reads via a dev ioctl or something else,
> from userspace's point of view it's just "do some syscall,
> get a value") -- I'm just objecting to the kernel's
> implementation of it where it might return EBUSY :-)
>
> (Also, if the kernel had failed EINVAL unconditionally for
> an attempt to do this on a not-stopped VM then we'd probably
> have found this mismatch in understanding about how the
> API should work years ago. "Mostly works but sometimes fails
> EBUSY" is the worst of all worlds.)
>
> I haven't yet got as far as thinking about the KVM interface
> for GICv5 yet...
>
> -- PMM

Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready

Posted by Peter Maydell 3 months, 3 weeks ago

On Tue, 14 Oct 2025 at 04:02, Salil Mehta <salil.mehta@opnsrc.net> wrote:
> On Mon, Oct 13, 2025 at 4:48 PM Peter Maydell <peter.maydell@linaro.org> wrote:
> > That said, QEMU's current code for this could be refactored
> > to avoid the reset-time read of ICC_CTLR_EL1 from the kernel.
> > We do this so we can set the userspace struct field for this
> > register to the right value. But we could ask the kernel for
> > that value once on VM startup since it's not going to change mid-run.
> >
> > That would bring ICC_CTLR_EL1 into line with the other cpuif
> > registers, where QEMU assumes it knows what the kernel's
> > reset value of them is (mostly "0") and doesn't bother to ask.
> > This is different from how we handle ONE_REG sysregs, where
> > I'm pretty sure we do ask the kernel the value of all of them
> > on a vcpu reset. (And then write the values back again, which
> > is a bit silly but nobody's ever said it was a performance
> > problem for them :-))
>
>
> This is effectively what the mentioned patch in the commit-log is doing.
> Pasting here again:
>
> https://lore.kernel.org/qemu-devel/20251001010127.3092631-22-salil.mehta@opnsrc.net/

No, that is not what I have in mind. What I mean is that
we can just read the ICC_CTLR_EL1 value once on startup
(in our realize method, I think) and then use that value in
reset. That should not require any of that machinery to pause
the VM and keep retrying.

-- PMM

RE: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready

Posted by Salil Mehta 3 months, 3 weeks ago

Hi Peter,

> From: Peter Maydell <peter.maydell@linaro.org>
> Sent: Tuesday, October 14, 2025 10:31 AM
> 
> On Tue, 14 Oct 2025 at 04:02, Salil Mehta <salil.mehta@opnsrc.net> wrote:
> > On Mon, Oct 13, 2025 at 4:48 PM Peter Maydell
> <peter.maydell@linaro.org> wrote:
> > > That said, QEMU's current code for this could be refactored to avoid
> > > the reset-time read of ICC_CTLR_EL1 from the kernel.
> > > We do this so we can set the userspace struct field for this
> > > register to the right value. But we could ask the kernel for that
> > > value once on VM startup since it's not going to change mid-run.
> > >
> > > That would bring ICC_CTLR_EL1 into line with the other cpuif
> > > registers, where QEMU assumes it knows what the kernel's reset value
> > > of them is (mostly "0") and doesn't bother to ask.
> > > This is different from how we handle ONE_REG sysregs, where I'm
> > > pretty sure we do ask the kernel the value of all of them on a vcpu
> > > reset. (And then write the values back again, which is a bit silly
> > > but nobody's ever said it was a performance problem for them :-))
> >
> >
> > This is effectively what the mentioned patch in the commit-log is doing.
> > Pasting here again:
> >
> > https://lore.kernel.org/qemu-devel/20251001010127.3092631-22-salil.meh
> > ta@opnsrc.net/
> 
> No, that is not what I have in mind. What I mean is that we can just read the
> ICC_CTLR_EL1 value once on startup (in our realize method, I think) and then
> use that value in reset. That should not require any of that machinery to
> pause the VM and keep retrying.


Indeed and that’s what I did initially and had fetched the value once in GICv3
realize but it hung the VM initialization leg during access.

BTW, can we safely assume that value of this register will be same across all
vCPUs (i.e. is just for SMP case and ignore heterogenous)? 


Best regards
Salil


> 
> -- PMM