Add a toggleable VM capability to modify several reset related code
paths. The goals are to
1) Allow userspace to reset any VCPU.
2) Allow userspace to provide the initial VCPU state.
(Right now, the boot VCPU isn't reset by KVM; KVM sets the state for
VCPUs brought up by sbi_hart_start, while userspace sets it for all
others.)
The goals are achieved with the following changes:
* Reset the VCPU when setting MP_STATE_INIT_RECEIVED through IOCTL.
* Preserve the userspace initialized VCPU state on sbi_hart_start.
* Return to userspace on sbi_hart_stop.
* Don't make VCPU reset request on sbi_system_suspend.
The patch is reusing MP_STATE_INIT_RECEIVED, because we didn't want to
add a new IOCTL, sorry. :)
Signed-off-by: Radim Krčmář <rkrcmar@ventanamicro.com>
---
If you search for cap 7.42 in api.rst, you'll see that it has a wrong
number, which is why we're 7.43, in case someone bothers to fix ARM.
I was also strongly considering creating all VCPUs in RUNNABLE state --
do you know of any similar quirks that aren't important, but could be
fixed with the new userspace toggle?
---
Documentation/virt/kvm/api.rst | 15 +++++++++++
arch/riscv/include/asm/kvm_host.h | 3 +++
arch/riscv/include/asm/kvm_vcpu_sbi.h | 1 +
arch/riscv/kvm/vcpu.c | 36 +++++++++++++++++----------
arch/riscv/kvm/vcpu_sbi.c | 17 +++++++++++++
arch/riscv/kvm/vcpu_sbi_hsm.c | 7 +++++-
arch/riscv/kvm/vcpu_sbi_system.c | 3 ++-
arch/riscv/kvm/vm.c | 13 ++++++++++
include/uapi/linux/kvm.h | 1 +
9 files changed, 81 insertions(+), 15 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 47c7c3f92314..63e6d23d34f0 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8496,6 +8496,21 @@ aforementioned registers before the first KVM_RUN. These registers are VM
scoped, meaning that the same set of values are presented on all vCPUs in a
given VM.
+7.43 KVM_CAP_RISCV_MP_STATE_RESET
+---------------------------------
+
+:Architectures: riscv
+:Type: VM
+:Parameters: None
+:Returns: 0 on success, -EINVAL if arg[0] is not zero
+
+When this capability is enabled, KVM:
+* Resets the VCPU when setting MP_STATE_INIT_RECEIVED through IOCTL.
+ The original MP_STATE is preserved.
+* Preserves the userspace initialized VCPU state on sbi_hart_start.
+* Returns to userspace on sbi_hart_stop.
+* Doesn't make VCPU reset request on sbi_system_suspend.
+
8. Other capabilities.
======================
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index f673ebfdadf3..85cfebc32e4c 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -119,6 +119,9 @@ struct kvm_arch {
/* AIA Guest/VM context */
struct kvm_aia aia;
+
+ /* KVM_CAP_RISCV_MP_STATE_RESET */
+ bool mp_state_reset;
};
struct kvm_cpu_trap {
diff --git a/arch/riscv/include/asm/kvm_vcpu_sbi.h b/arch/riscv/include/asm/kvm_vcpu_sbi.h
index da28235939d1..439ab2b3534f 100644
--- a/arch/riscv/include/asm/kvm_vcpu_sbi.h
+++ b/arch/riscv/include/asm/kvm_vcpu_sbi.h
@@ -57,6 +57,7 @@ void kvm_riscv_vcpu_sbi_system_reset(struct kvm_vcpu *vcpu,
u32 type, u64 flags);
void kvm_riscv_vcpu_sbi_request_reset(struct kvm_vcpu *vcpu,
unsigned long pc, unsigned long a1);
+void kvm_riscv_vcpu_sbi_load_reset_state(struct kvm_vcpu *vcpu);
int kvm_riscv_vcpu_sbi_return(struct kvm_vcpu *vcpu, struct kvm_run *run);
int kvm_riscv_vcpu_set_reg_sbi_ext(struct kvm_vcpu *vcpu,
const struct kvm_one_reg *reg);
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index a78f9ec2fa0e..961b22c05981 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -51,11 +51,11 @@ const struct kvm_stats_header kvm_vcpu_stats_header = {
sizeof(kvm_vcpu_stats_desc),
};
-static void kvm_riscv_vcpu_context_reset(struct kvm_vcpu *vcpu)
+static void kvm_riscv_vcpu_context_reset(struct kvm_vcpu *vcpu,
+ bool kvm_sbi_reset)
{
struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr;
struct kvm_cpu_context *cntx = &vcpu->arch.guest_context;
- struct kvm_vcpu_reset_state *reset_state = &vcpu->arch.reset_state;
void *vector_datap = cntx->vector.datap;
memset(cntx, 0, sizeof(*cntx));
@@ -65,13 +65,8 @@ static void kvm_riscv_vcpu_context_reset(struct kvm_vcpu *vcpu)
/* Restore datap as it's not a part of the guest context. */
cntx->vector.datap = vector_datap;
- /* Load SBI reset values */
- cntx->a0 = vcpu->vcpu_id;
-
- spin_lock(&reset_state->lock);
- cntx->sepc = reset_state->pc;
- cntx->a1 = reset_state->a1;
- spin_unlock(&reset_state->lock);
+ if (kvm_sbi_reset)
+ kvm_riscv_vcpu_sbi_load_reset_state(vcpu);
/* Setup reset state of shadow SSTATUS and HSTATUS CSRs */
cntx->sstatus = SR_SPP | SR_SPIE;
@@ -84,10 +79,19 @@ static void kvm_riscv_vcpu_context_reset(struct kvm_vcpu *vcpu)
csr->scounteren = 0x7;
}
-static void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu)
+static void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu, bool kvm_sbi_reset)
{
bool loaded;
+ /*
+ * The userspace should have triggered a full reset earlier and could
+ * have set initial state that needs to be preserved.
+ */
+ if (kvm_sbi_reset && vcpu->kvm->arch.mp_state_reset) {
+ kvm_riscv_vcpu_sbi_load_reset_state(vcpu);
+ return;
+ }
+
/**
* The preemption should be disabled here because it races with
* kvm_sched_out/kvm_sched_in(called from preempt notifiers) which
@@ -100,7 +104,7 @@ static void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu)
vcpu->arch.last_exit_cpu = -1;
- kvm_riscv_vcpu_context_reset(vcpu);
+ kvm_riscv_vcpu_context_reset(vcpu, kvm_sbi_reset);
kvm_riscv_vcpu_fp_reset(vcpu);
@@ -177,7 +181,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
kvm_riscv_vcpu_sbi_init(vcpu);
/* Reset VCPU */
- kvm_riscv_reset_vcpu(vcpu);
+ kvm_riscv_reset_vcpu(vcpu, false);
return 0;
}
@@ -526,6 +530,12 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
case KVM_MP_STATE_STOPPED:
__kvm_riscv_vcpu_power_off(vcpu);
break;
+ case KVM_MP_STATE_INIT_RECEIVED:
+ if (vcpu->kvm->arch.mp_state_reset)
+ kvm_riscv_reset_vcpu(vcpu, false);
+ else
+ ret = -EINVAL;
+ break;
default:
ret = -EINVAL;
}
@@ -714,7 +724,7 @@ static void kvm_riscv_check_vcpu_requests(struct kvm_vcpu *vcpu)
}
if (kvm_check_request(KVM_REQ_VCPU_RESET, vcpu))
- kvm_riscv_reset_vcpu(vcpu);
+ kvm_riscv_reset_vcpu(vcpu, true);
if (kvm_check_request(KVM_REQ_UPDATE_HGATP, vcpu))
kvm_riscv_gstage_update_hgatp(vcpu);
diff --git a/arch/riscv/kvm/vcpu_sbi.c b/arch/riscv/kvm/vcpu_sbi.c
index 0afef0bb261d..31fd3cc98d66 100644
--- a/arch/riscv/kvm/vcpu_sbi.c
+++ b/arch/riscv/kvm/vcpu_sbi.c
@@ -167,6 +167,23 @@ void kvm_riscv_vcpu_sbi_request_reset(struct kvm_vcpu *vcpu,
kvm_make_request(KVM_REQ_VCPU_RESET, vcpu);
}
+void kvm_riscv_vcpu_sbi_load_reset_state(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr;
+ struct kvm_cpu_context *cntx = &vcpu->arch.guest_context;
+ struct kvm_vcpu_reset_state *reset_state = &vcpu->arch.reset_state;
+
+ cntx->a0 = vcpu->vcpu_id;
+
+ spin_lock(&vcpu->arch.reset_state.lock);
+ cntx->sepc = reset_state->pc;
+ cntx->a1 = reset_state->a1;
+ spin_unlock(&vcpu->arch.reset_state.lock);
+
+ cntx->sstatus &= ~SR_SIE;
+ csr->vsatp = 0;
+}
+
int kvm_riscv_vcpu_sbi_return(struct kvm_vcpu *vcpu, struct kvm_run *run)
{
struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
diff --git a/arch/riscv/kvm/vcpu_sbi_hsm.c b/arch/riscv/kvm/vcpu_sbi_hsm.c
index f26207f84bab..d1bf1348eefd 100644
--- a/arch/riscv/kvm/vcpu_sbi_hsm.c
+++ b/arch/riscv/kvm/vcpu_sbi_hsm.c
@@ -89,7 +89,12 @@ static int kvm_sbi_ext_hsm_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
ret = kvm_sbi_hsm_vcpu_start(vcpu);
break;
case SBI_EXT_HSM_HART_STOP:
- ret = kvm_sbi_hsm_vcpu_stop(vcpu);
+ if (vcpu->kvm->arch.mp_state_reset) {
+ kvm_riscv_vcpu_sbi_forward(vcpu, run);
+ retdata->uexit = true;
+ } else {
+ ret = kvm_sbi_hsm_vcpu_stop(vcpu);
+ }
break;
case SBI_EXT_HSM_HART_STATUS:
ret = kvm_sbi_hsm_vcpu_get_status(vcpu);
diff --git a/arch/riscv/kvm/vcpu_sbi_system.c b/arch/riscv/kvm/vcpu_sbi_system.c
index 359be90b0fc5..0482968705f8 100644
--- a/arch/riscv/kvm/vcpu_sbi_system.c
+++ b/arch/riscv/kvm/vcpu_sbi_system.c
@@ -44,7 +44,8 @@ static int kvm_sbi_ext_susp_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
}
}
- kvm_riscv_vcpu_sbi_request_reset(vcpu, cp->a1, cp->a2);
+ if (!vcpu->kvm->arch.mp_state_reset)
+ kvm_riscv_vcpu_sbi_request_reset(vcpu, cp->a1, cp->a2);
/* userspace provides the suspend implementation */
kvm_riscv_vcpu_sbi_forward(vcpu, run);
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 7396b8654f45..b27ec8f96697 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -209,6 +209,19 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
return r;
}
+int kvm_vm_ioctl_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+ switch (cap->cap) {
+ case KVM_CAP_RISCV_MP_STATE_RESET:
+ if (cap->flags)
+ return -EINVAL;
+ kvm->arch.mp_state_reset = true;
+ return 0;
+ default:
+ return -EINVAL;
+ }
+}
+
int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
{
return -EINVAL;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b6ae8ad8934b..454b7d4a0448 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -930,6 +930,7 @@ struct kvm_enable_cap {
#define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
#define KVM_CAP_X86_GUEST_MODE 238
#define KVM_CAP_ARM_WRITABLE_IMP_ID_REGS 239
+#define KVM_CAP_RISCV_MP_STATE_RESET 240
struct kvm_irq_routing_irqchip {
__u32 irqchip;
--
2.49.0
On Thu, May 8, 2025 at 8:01 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>
> Add a toggleable VM capability to modify several reset related code
> paths. The goals are to
> 1) Allow userspace to reset any VCPU.
> 2) Allow userspace to provide the initial VCPU state.
>
> (Right now, the boot VCPU isn't reset by KVM and KVM sets the state for
> VCPUs brought up by sbi_hart_start while userspace for all others.)
>
> The goals are achieved with the following changes:
> * Reset the VCPU when setting MP_STATE_INIT_RECEIVED through IOCTL.
Rather than using a separate MP_STATE_INIT_RECEIVED ioctl(), we can
define a capability which, when set, makes the set_mpstate ioctl() reset
the VCPU upon changing the VCPU state from RUNNABLE to STOPPED.
> * Preserve the userspace initialized VCPU state on sbi_hart_start.
> * Return to userspace on sbi_hart_stop.
There is no userspace involvement required when a Guest VCPU
stops itself using SBI HSM stop() call so STRONG NO to this change.
> * Don't make VCPU reset request on sbi_system_suspend.
The entry state of the initiating VCPU is already available on the SBI
system suspend call. The initiating VCPU must be reset and the entry
state of the initiating VCPU must be set up.
>
> The patch is reusing MP_STATE_INIT_RECEIVED, because we didn't want to
> add a new IOCTL, sorry. :)
>
> Signed-off-by: Radim Krčmář <rkrcmar@ventanamicro.com>
> ---
> If you search for cap 7.42 in api.rst, you'll see that it has a wrong
> number, which is why we're 7.43, in case someone bothers to fix ARM.
>
> I was also strongly considering creating all VCPUs in RUNNABLE state --
> do you know of any similar quirks that aren't important, but could be
> fixed with the new userspace toggle?
Upon creating a VM, only one VCPU should be RUNNABLE and all
other VCPUs must remain in OFF state. This is intentional because
imagine a large number of VCPUs entering Guest OS at the same
time. We have spent a lot of effort in the past to get away from this
situation even in the host boot flow. We can't expect user space to
correctly set the initial MP_STATE of all VCPUs. We can certainly
think of some mechanism using which user space can specify
which VCPU should be runnable upon VM creation.
The current approach is to do HSM state management in kernel
space itself and not rely on user space. Allowing userspace to
reset any VCPU is fine, but this should not affect the flow for
SBI HSM, SBI System Reset, and SBI System Suspend.
Regards,
Anup
2025-05-09T12:25:24+05:30, Anup Patel <anup@brainfault.org>:
> On Thu, May 8, 2025 at 8:01 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>>
>> Add a toggleable VM capability to modify several reset related code
>> paths. The goals are to
>> 1) Allow userspace to reset any VCPU.
>> 2) Allow userspace to provide the initial VCPU state.
>>
>> (Right now, the boot VCPU isn't reset by KVM and KVM sets the state for
>> VCPUs brought up by sbi_hart_start while userspace for all others.)
>>
>> The goals are achieved with the following changes:
>> * Reset the VCPU when setting MP_STATE_INIT_RECEIVED through IOCTL.
>
> Rather than using separate MP_STATE_INIT_RECEIVED ioctl(), we can
> define a capability which when set, the set_mpstate ioctl() will reset the
> VCPU upon changing VCPU state from RUNNABLE to STOPPED state.
Yeah, I started with that and then realized it has two drawbacks:
* It will require larger changes in userspaces, because for
example QEMU now first loads the initial state and then toggles the
mp_state, which would incorrectly reset the state.
* It will also require an extra IOCTL if a stopped VCPU should be
reset
1) STOPPED -> RUNNING (= reset)
2) RUNNING -> STOPPED (VCPU should be stopped)
or if the current state of a VCPU is not known.
1) ??? -> STOPPED
2) STOPPED -> RUNNING
3) RUNNING -> STOPPED
I can do that for v3 if you think it's better.
>> * Preserve the userspace initialized VCPU state on sbi_hart_start.
>> * Return to userspace on sbi_hart_stop.
>
> There is no userspace involvement required when a Guest VCPU
> stops itself using SBI HSM stop() call so STRONG NO to this change.
Ok, I'll drop it from v3 -- it can be handled by future patches that
trap SBI calls to userspace.
The lack of userspace involvement is the issue. KVM doesn't know what
the initial state should be.
>> * Don't make VCPU reset request on sbi_system_suspend.
>
> The entry state of initiating VCPU is already available on SBI system
> suspend call. The initiating VCPU must be resetted and entry state of
> initiating VCPU must be setup.
Userspace would simply call the VCPU reset and set the complete state,
because the userspace exit already provides all the sbi information.
I'll drop this change. It doesn't make much sense if we aren't fixing
the sbi_hart_start reset.
>> The patch is reusing MP_STATE_INIT_RECEIVED, because we didn't want to
>> add a new IOCTL, sorry. :)
>>
>> Signed-off-by: Radim Krčmář <rkrcmar@ventanamicro.com>
>> ---
>> If you search for cap 7.42 in api.rst, you'll see that it has a wrong
>> number, which is why we're 7.43, in case someone bothers to fix ARM.
>>
>> I was also strongly considering creating all VCPUs in RUNNABLE state --
>> do you know of any similar quirks that aren't important, but could be
>> fixed with the new userspace toggle?
>
> Upon creating a VM, only one VCPU should be RUNNABLE and all
> other VCPUs must remain in OFF state. This is intentional because
> imagine a large number of VCPUs entering Guest OS at the same
> time. We have spent a lot of effort in the past to get away from this
> situation even in the host boot flow. We can't expect user space to
> correctly set the initial MP_STATE of all VCPUs. We can certainly
> think of some mechanism using which user space can specify
> which VCPU should be runnable upon VM creation.
We already do have the mechanism -- the userspace will set the MP_STATE
of VCPU 0 to STOPPED, and of whatever VCPUs it wants to boot with to
RUNNABLE, before running all the VCPUs for the first time.
The userspace must correctly set the initial MP state anyway, because a
resume will want a different mp_state than a fresh boot.
> The current approach is to do HSM state management in kernel
> space itself and not rely on user space. Allowing userspace to
> resetting any VCPU is fine but this should not affect the flow for
> SBI HSM, SBI System Reset, and SBI System Suspend.
Yes, that is the design I was trying to change. I think userspace
should have control over all aspects of the guest it executes in KVM.
Accelerating SBI in KVM is good, but userspace should be able to say how
the unspecified parts are implemented. Trapping to userspace is the
simplest option. (And sufficient for ecalls that are not a hot path.)
Thanks.
On Fri, May 9, 2025 at 2:16 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
> * It will also require an extra IOCTL if a stopped VCPU should be
>   reset [...]
>
> I can do that for v3 if you think it's better.

Okay, for now keep the MP_STATE_INIT_RECEIVED ioctl().

> Ok, I'll drop it from v3 -- it can be handled by future patches that
> trap SBI calls to userspace.
>
> The lack of userspace involvement is the issue. KVM doesn't know what
> the initial state should be.

The SBI HSM virtualization does not need any KVM userspace
involvement.

When a VCPU stops itself using SBI HSM stop(), the Guest itself
provides the entry address and argument when starting the VCPU using
SBI HSM start(), without any KVM userspace involvement.

In fact, even at Guest boot time, all non-boot VCPUs are brought up
using SBI HSM start() by the boot VCPU, where the Guest itself provides
the entry address and argument without any KVM userspace involvement.

> Userspace would simply call the VCPU reset and set the complete state,
> because the userspace exit already provides all the sbi information.
>
> I'll drop this change. It doesn't make much sense if we aren't fixing
> the sbi_hart_start reset.
>
> We already do have the mechanism -- the userspace will set MP_STATE of
> VCPU 0 to STOPPED and whatever VCPUs it wants to boot with to RUNNABLE
> before running all the VCPUs for the first time.

Okay, nothing to be done on this front.

> Yes, that is the design I was trying to change. I think userspace
> should have control over all aspects of the guest it executes in KVM.

For SBI HSM, the kernel space should be the only entity managing it.

> Accelerating SBI in KVM is good, but userspace should be able to say how
> the unspecified parts are implemented. Trapping to userspace is the
> simplest option. (And sufficient for ecalls that are not a hot path.)

For the unspecified parts, we already have KVM exits at the appropriate
places, e.g. SBI system reset, SBI system suspend, etc.

Regards,
Anup
On Fri, May 09, 2025 at 05:33:49PM +0530, Anup Patel wrote:
> On Fri, May 9, 2025 at 2:16 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
> >
> > 2025-05-09T12:25:24+05:30, Anup Patel <anup@brainfault.org>:
> > > On Thu, May 8, 2025 at 8:01 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
> > >>
> > >> Add a toggleable VM capability to modify several reset related code
> > >> paths. The goals are to
> > >> 1) Allow userspace to reset any VCPU.
> > >> 2) Allow userspace to provide the initial VCPU state.
> > >>
> > >> (Right now, the boot VCPU isn't reset by KVM and KVM sets the state for
> > >> VCPUs brought up by sbi_hart_start while userspace for all others.)
> > >>
> > >> The goals are achieved with the following changes:
> > >> * Reset the VCPU when setting MP_STATE_INIT_RECEIVED through IOCTL.
> > >
> > > Rather than using separate MP_STATE_INIT_RECEIVED ioctl(), we can
> > > define a capability which when set, the set_mpstate ioctl() will reset the
> > > VCPU upon changing VCPU state from RUNNABLE to STOPPED state.
> >
> > Yeah, I started with that and then realized it has two drawbacks:
> >
> > * It will require larger changes in userspaces, because for
> >   example QEMU now first loads the initial state and then toggles the
> >   mp_state, which would incorrectly reset the state.
> >
> > * It will also require an extra IOCTL if a stopped VCPU should be
> >   reset
> >   1) STOPPED -> RUNNING (= reset)
> >   2) RUNNING -> STOPPED (VCPU should be stopped)
> >   or if the current state of a VCPU is not known.
> >   1) ??? -> STOPPED
> >   2) STOPPED -> RUNNING
> >   3) RUNNING -> STOPPED
> >
> > I can do that for v3 if you think it's better.
>
> Okay, for now keep the MP_STATE_INIT_RECEIVED ioctl().
>
> > >> * Preserve the userspace initialized VCPU state on sbi_hart_start.
> > >> * Return to userspace on sbi_hart_stop.
> > >
> > > There is no userspace involvement required when a Guest VCPU
> > > stops itself using SBI HSM stop() call so STRONG NO to this change.
> >
> > Ok, I'll drop it from v3 -- it can be handled by future patches that
> > trap SBI calls to userspace.
> >
> > The lack of userspace involvement is the issue. KVM doesn't know what
> > the initial state should be.
>
> The SBI HSM virtualization does not need any KVM userspace
> involvement.
>
> When a VCPU stops itself using SBI HSM stop(), the Guest itself
> provides the entry address and argument when starting the VCPU
> using SBI HSM start() without any KVM userspace involvement.
>
> In fact, even at Guest boot time all non-boot VCPUs are brought-up
> using SBI HSM start() by the boot VCPU where the Guest itself
> provides entry address and argument without any KVM userspace
> involvement.

HSM only specifies the state of a few registers and the ISA only a few
more. All other registers have IMPDEF reset values. Zeros, like KVM
selects, are a good choice and the best default, but if the VMM disagrees,
then it should be allowed to select what it likes, as the VMM/user is the
policy maker and KVM is "just" the accelerator.

> > >> * Don't make VCPU reset request on sbi_system_suspend.
> > >
> > > The entry state of initiating VCPU is already available on SBI system
> > > suspend call. The initiating VCPU must be reset and the entry state of
> > > the initiating VCPU must be set up.
> >
> > Userspace would simply call the VCPU reset and set the complete state,
> > because the userspace exit already provides all the sbi information.
> >
> > I'll drop this change. It doesn't make much sense if we aren't fixing
> > the sbi_hart_start reset.
> >
> > >> The patch is reusing MP_STATE_INIT_RECEIVED, because we didn't want to
> > >> add a new IOCTL, sorry. :)
> > >>
> > >> Signed-off-by: Radim Krčmář <rkrcmar@ventanamicro.com>
> > >> ---
> > >> If you search for cap 7.42 in api.rst, you'll see that it has a wrong
> > >> number, which is why we're 7.43, in case someone bothers to fix ARM.
> > >>
> > >> I was also strongly considering creating all VCPUs in RUNNABLE state --
> > >> do you know of any similar quirks that aren't important, but could be
> > >> fixed with the new userspace toggle?
> > >
> > > Upon creating a VM, only one VCPU should be RUNNABLE and all
> > > other VCPUs must remain in OFF state. This is intentional because
> > > imagine a large number of VCPUs entering Guest OS at the same
> > > time. We have spent a lot of effort in the past to get away from this
> > > situation even in the host boot flow. We can't expect user space to
> > > correctly set the initial MP_STATE of all VCPUs. We can certainly
> > > think of some mechanism using which user space can specify
> > > which VCPU should be runnable upon VM creation.
> >
> > We already do have the mechanism -- the userspace will set MP_STATE of
> > VCPU 0 to STOPPED and whatever VCPUs it wants to boot with to RUNNABLE
> > before running all the VCPUs for the first time.
>
> Okay, nothing to be done on this front.
>
> > The userspace must correctly set the initial MP state anyway, because a
> > resume will want a different mp_state than a fresh boot.
> >
> > > The current approach is to do HSM state management in kernel
> > > space itself and not rely on user space. Allowing userspace to
> > > reset any VCPU is fine but this should not affect the flow for
> > > SBI HSM, SBI System Reset, and SBI System Suspend.
> >
> > Yes, that is the design I was trying to change. I think userspace
> > should have control over all aspects of the guest it executes in KVM.
>
> For SBI HSM, the kernel space should be the only entity managing it.

The VMM should always be the only manager. KVM can provide defaults in
order to support simple VMMs that don't have opinions, but VMMs concerned
with managing all state on behalf of their users' choices and to ensure
successful migrations will want to be involved.

Thanks,
drew

> > Accelerating SBI in KVM is good, but userspace should be able to say how
> > the unspecified parts are implemented. Trapping to userspace is the
> > simplest option. (And sufficient for ecalls that are not a hot path.)
>
> For the unspecified parts, we have KVM exits at appropriate places,
> e.g. SBI system reset, SBI system suspend, etc.
>
> Regards,
> Anup
On Fri, May 9, 2025 at 5:49 PM Andrew Jones <ajones@ventanamicro.com> wrote:
>
> On Fri, May 09, 2025 at 05:33:49PM +0530, Anup Patel wrote:
> > On Fri, May 9, 2025 at 2:16 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
> > >
> > > 2025-05-09T12:25:24+05:30, Anup Patel <anup@brainfault.org>:
> > > > On Thu, May 8, 2025 at 8:01 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
> > > >>
> > > >> Add a toggleable VM capability to modify several reset related code
> > > >> paths. The goals are to
> > > >> 1) Allow userspace to reset any VCPU.
> > > >> 2) Allow userspace to provide the initial VCPU state.
> > > >>
> > > >> (Right now, the boot VCPU isn't reset by KVM and KVM sets the state for
> > > >> VCPUs brought up by sbi_hart_start while userspace for all others.)
> > > >>
> > > >> The goals are achieved with the following changes:
> > > >> * Reset the VCPU when setting MP_STATE_INIT_RECEIVED through IOCTL.
> > > >
> > > > Rather than using separate MP_STATE_INIT_RECEIVED ioctl(), we can
> > > > define a capability which when set, the set_mpstate ioctl() will reset the
> > > > VCPU upon changing VCPU state from RUNNABLE to STOPPED state.
> > >
> > > Yeah, I started with that and then realized it has two drawbacks:
> > >
> > > * It will require larger changes in userspaces, because for
> > >   example QEMU now first loads the initial state and then toggles the
> > >   mp_state, which would incorrectly reset the state.
> > >
> > > * It will also require an extra IOCTL if a stopped VCPU should be
> > >   reset
> > >   1) STOPPED -> RUNNING (= reset)
> > >   2) RUNNING -> STOPPED (VCPU should be stopped)
> > >   or if the current state of a VCPU is not known.
> > >   1) ??? -> STOPPED
> > >   2) STOPPED -> RUNNING
> > >   3) RUNNING -> STOPPED
> > >
> > > I can do that for v3 if you think it's better.
> >
> > Okay, for now keep the MP_STATE_INIT_RECEIVED ioctl().
> >
> > > >> * Preserve the userspace initialized VCPU state on sbi_hart_start.
> > > >> * Return to userspace on sbi_hart_stop.
> > > >
> > > > There is no userspace involvement required when a Guest VCPU
> > > > stops itself using SBI HSM stop() call so STRONG NO to this change.
> > >
> > > Ok, I'll drop it from v3 -- it can be handled by future patches that
> > > trap SBI calls to userspace.
> > >
> > > The lack of userspace involvement is the issue. KVM doesn't know what
> > > the initial state should be.
> >
> > The SBI HSM virtualization does not need any KVM userspace
> > involvement.
> >
> > When a VCPU stops itself using SBI HSM stop(), the Guest itself
> > provides the entry address and argument when starting the VCPU
> > using SBI HSM start() without any KVM userspace involvement.
> >
> > In fact, even at Guest boot time all non-boot VCPUs are brought-up
> > using SBI HSM start() by the boot VCPU where the Guest itself
> > provides entry address and argument without any KVM userspace
> > involvement.
>
> HSM only specifies the state of a few registers and the ISA only a few
> more. All other registers have IMPDEF reset values. Zeros, like KVM
> selects, are a good choice and the best default, but if the VMM disagrees,
> then it should be allowed to select what it likes, as the VMM/user is the
> policy maker and KVM is "just" the accelerator.

Till now, no such IMPDEF reset values are expected. We will
cross that bridge when needed. Although, I doubt we will ever need it.

> > > >> * Don't make VCPU reset request on sbi_system_suspend.
> > > >
> > > > The entry state of initiating VCPU is already available on SBI system
> > > > suspend call. The initiating VCPU must be reset and the entry state of
> > > > the initiating VCPU must be set up.
> > >
> > > Userspace would simply call the VCPU reset and set the complete state,
> > > because the userspace exit already provides all the sbi information.
> > >
> > > I'll drop this change. It doesn't make much sense if we aren't fixing
> > > the sbi_hart_start reset.
> > >
> > > >> The patch is reusing MP_STATE_INIT_RECEIVED, because we didn't want to
> > > >> add a new IOCTL, sorry. :)
> > > >>
> > > >> Signed-off-by: Radim Krčmář <rkrcmar@ventanamicro.com>
> > > >> ---
> > > >> If you search for cap 7.42 in api.rst, you'll see that it has a wrong
> > > >> number, which is why we're 7.43, in case someone bothers to fix ARM.
> > > >>
> > > >> I was also strongly considering creating all VCPUs in RUNNABLE state --
> > > >> do you know of any similar quirks that aren't important, but could be
> > > >> fixed with the new userspace toggle?
> > > >
> > > > Upon creating a VM, only one VCPU should be RUNNABLE and all
> > > > other VCPUs must remain in OFF state. This is intentional because
> > > > imagine a large number of VCPUs entering Guest OS at the same
> > > > time. We have spent a lot of effort in the past to get away from this
> > > > situation even in the host boot flow. We can't expect user space to
> > > > correctly set the initial MP_STATE of all VCPUs. We can certainly
> > > > think of some mechanism using which user space can specify
> > > > which VCPU should be runnable upon VM creation.
> > >
> > > We already do have the mechanism -- the userspace will set MP_STATE of
> > > VCPU 0 to STOPPED and whatever VCPUs it wants to boot with to RUNNABLE
> > > before running all the VCPUs for the first time.
> >
> > Okay, nothing to be done on this front.
> >
> > > The userspace must correctly set the initial MP state anyway, because a
> > > resume will want a different mp_state than a fresh boot.
> > >
> > > > The current approach is to do HSM state management in kernel
> > > > space itself and not rely on user space. Allowing userspace to
> > > > reset any VCPU is fine but this should not affect the flow for
> > > > SBI HSM, SBI System Reset, and SBI System Suspend.
> > >
> > > Yes, that is the design I was trying to change. I think userspace
> > > should have control over all aspects of the guest it executes in KVM.
> >
> > For SBI HSM, the kernel space should be the only entity managing it.
>
> The VMM should always be the only manager. KVM can provide defaults in
> order to support simple VMMs that don't have opinions, but VMMs concerned
> with managing all state on behalf of their users' choices and to ensure
> successful migrations will want to be involved.

I think you misunderstood my comment. VMM is still the overall manager
but the VCPU SBI HSM states are managed by the kernel KVM.

Regards,
Anup

> Thanks,
> drew
>
> > > Accelerating SBI in KVM is good, but userspace should be able to say how
> > > the unspecified parts are implemented. Trapping to userspace is the
> > > simplest option. (And sufficient for ecalls that are not a hot path.)
> >
> > For the unspecified parts, we have KVM exits at appropriate places,
> > e.g. SBI system reset, SBI system suspend, etc.
> >
> > Regards,
> > Anup
2025-05-09T17:59:28+05:30, Anup Patel <anup@brainfault.org>:
> On Fri, May 9, 2025 at 5:49 PM Andrew Jones <ajones@ventanamicro.com> wrote:
>> On Fri, May 09, 2025 at 05:33:49PM +0530, Anup Patel wrote:
>> > On Fri, May 9, 2025 at 2:16 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>> > > 2025-05-09T12:25:24+05:30, Anup Patel <anup@brainfault.org>:
>> > > > On Thu, May 8, 2025 at 8:01 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>> > > >> * Preserve the userspace initialized VCPU state on sbi_hart_start.
>> > > >> * Return to userspace on sbi_hart_stop.
>> > > >
>> > > > There is no userspace involvement required when a Guest VCPU
>> > > > stops itself using SBI HSM stop() call so STRONG NO to this change.
>> > >
>> > > Ok, I'll drop it from v3 -- it can be handled by future patches that
>> > > trap SBI calls to userspace.
>> > >
>> > > The lack of userspace involvement is the issue. KVM doesn't know what
>> > > the initial state should be.
>> >
>> > The SBI HSM virtualization does not need any KVM userspace
>> > involvement.
>> >
>> > When a VCPU stops itself using SBI HSM stop(), the Guest itself
>> > provides the entry address and argument when starting the VCPU
>> > using SBI HSM start() without any KVM userspace involvement.
>> >
>> > In fact, even at Guest boot time all non-boot VCPUs are brought-up
>> > using SBI HSM start() by the boot VCPU where the Guest itself
>> > provides entry address and argument without any KVM userspace
>> > involvement.
>>
>> HSM only specifies the state of a few registers and the ISA only a few
>> more. All other registers have IMPDEF reset values. Zeros, like KVM
>> selects, are a good choice and the best default, but if the VMM disagrees,
>> then it should be allowed to select what it likes, as the VMM/user is the
>> policy maker and KVM is "just" the accelerator.
>
> Till now, no such IMPDEF reset values are expected. We will
> cross that bridge when needed. Although, I doubt we will ever need it.
The IMPDEF issue already exists. KVM resets scounteren to 7, but userspace wants it to be different, likely 0.
On Fri, May 9, 2025 at 7:27 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>
> 2025-05-09T17:59:28+05:30, Anup Patel <anup@brainfault.org>:
> > On Fri, May 9, 2025 at 5:49 PM Andrew Jones <ajones@ventanamicro.com> wrote:
> >> On Fri, May 09, 2025 at 05:33:49PM +0530, Anup Patel wrote:
> >> > On Fri, May 9, 2025 at 2:16 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
> >> > > 2025-05-09T12:25:24+05:30, Anup Patel <anup@brainfault.org>:
> >> > > > On Thu, May 8, 2025 at 8:01 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
> >> > > >> * Preserve the userspace initialized VCPU state on sbi_hart_start.
> >> > > >> * Return to userspace on sbi_hart_stop.
> >> > > >
> >> > > > There is no userspace involvement required when a Guest VCPU
> >> > > > stops itself using SBI HSM stop() call so STRONG NO to this change.
> >> > >
> >> > > Ok, I'll drop it from v3 -- it can be handled by future patches that
> >> > > trap SBI calls to userspace.
> >> > >
> >> > > The lack of userspace involvement is the issue. KVM doesn't know what
> >> > > the initial state should be.
> >> >
> >> > The SBI HSM virtualization does not need any KVM userspace
> >> > involvement.
> >> >
> >> > When a VCPU stops itself using SBI HSM stop(), the Guest itself
> >> > provides the entry address and argument when starting the VCPU
> >> > using SBI HSM start() without any KVM userspace involvement.
> >> >
> >> > In fact, even at Guest boot time all non-boot VCPUs are brought-up
> >> > using SBI HSM start() by the boot VCPU where the Guest itself
> >> > provides entry address and argument without any KVM userspace
> >> > involvement.
> >>
> >> HSM only specifies the state of a few registers and the ISA only a few
> >> more. All other registers have IMPDEF reset values. Zeros, like KVM
> >> selects, are a good choice and the best default, but if the VMM disagrees,
> >> then it should be allowed to select what it likes, as the VMM/user is the
> >> policy maker and KVM is "just" the accelerator.
> >
> > Till now, no such IMPDEF reset values are expected. We will
> > cross that bridge when needed. Although, I doubt we will ever need it.
>
> The IMPDEF issue already exists. KVM resets scounteren to 7, but
> userspace wants it to be different, likely 0.

The scounteren set to 7 is a temporary workaround which is supposed to
be removed once Guest Linux initializes scounteren correctly at boot
time. At this point in time, we can remove this scounteren workaround
from KVM because the SBI PMU driver has been initializing scounteren
correctly for many kernel releases.
(Refer, pmu_sbi_starting_cpu() in drivers/perf/riscv_pmu_sbi.c)

If we absolutely need a mechanism for userspace to provide custom reset
values of CSRs then we should use the existing ONE_REG interface to
define a new type using which CSR reset values can be set at VM
creation time. But, I still think we don't need such a mechanism
immediately.

Regards,
Anup