Beware, this patch is "breaking" the userspace interface, because it
fixes a KVM/QEMU bug where the boot VCPU is not being reset by KVM.
The VCPU reset paths are inconsistent right now. KVM resets VCPUs that
are brought up by KVM-accelerated SBI calls, but does nothing for VCPUs
brought up through ioctls.
We need to perform a KVM reset even when the VCPU is started through an
ioctl. This patch is one of the ways we can achieve it.
Assume that userspace has no business setting the post-reset state.
KVM is the de facto SBI implementation, as the SBI HSM acceleration
cannot be disabled and userspace cannot control the reset state, so KVM
should be in full control of the post-reset state.
Do not reset the pc and a1 registers, because SBI reset is expected to
provide them and KVM has no idea what these registers should be -- only
userspace knows where it put the data.
An important consideration is resume. Userspace might want to start
with non-reset state. Check ran_atleast_once to allow this, because
KVM-SBI HSM creates some VCPUs as STOPPED.
The drawback is that userspace can still start the boot VCPU with an
incorrect reset state, because there is no way to distinguish a freshly
reset new VCPU on the KVM side (userspace might set some values by
mistake) from a restored VCPU (userspace must set all values).
The advantage of this solution is that it fixes current QEMU and makes
some sense with the assumption that KVM implements SBI HSM.
I do not like it too much, so I'd be in favor of a different solution if
we can still afford to drop support for current userspaces.
For a cleaner solution, we should add interfaces to perform the KVM-SBI
reset request on userspace demand. I think it would also be much better
if userspace was in control of the post-reset state.
Signed-off-by: Radim Krčmář <rkrcmar@ventanamicro.com>
---
arch/riscv/include/asm/kvm_host.h | 1 +
arch/riscv/include/asm/kvm_vcpu_sbi.h | 3 +++
arch/riscv/kvm/vcpu.c | 9 +++++++++
arch/riscv/kvm/vcpu_sbi.c | 21 +++++++++++++++++++--
4 files changed, 32 insertions(+), 2 deletions(-)
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 0c8c9c05af91..9bbf8c4a286b 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -195,6 +195,7 @@ struct kvm_vcpu_smstateen_csr {
struct kvm_vcpu_reset_state {
spinlock_t lock;
+ bool active;
unsigned long pc;
unsigned long a1;
};
diff --git a/arch/riscv/include/asm/kvm_vcpu_sbi.h b/arch/riscv/include/asm/kvm_vcpu_sbi.h
index aaaa81355276..2c334a87e02a 100644
--- a/arch/riscv/include/asm/kvm_vcpu_sbi.h
+++ b/arch/riscv/include/asm/kvm_vcpu_sbi.h
@@ -57,6 +57,9 @@ void kvm_riscv_vcpu_sbi_system_reset(struct kvm_vcpu *vcpu,
u32 type, u64 flags);
void kvm_riscv_vcpu_sbi_request_reset(struct kvm_vcpu *vcpu,
unsigned long pc, unsigned long a1);
+void __kvm_riscv_vcpu_set_reset_state(struct kvm_vcpu *vcpu,
+ unsigned long pc, unsigned long a1);
+void kvm_riscv_vcpu_sbi_request_reset_from_userspace(struct kvm_vcpu *vcpu);
int kvm_riscv_vcpu_sbi_return(struct kvm_vcpu *vcpu, struct kvm_run *run);
int kvm_riscv_vcpu_set_reg_sbi_ext(struct kvm_vcpu *vcpu,
const struct kvm_one_reg *reg);
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index b8485c1c1ce4..4578863a39e3 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -58,6 +58,11 @@ static void kvm_riscv_vcpu_context_reset(struct kvm_vcpu *vcpu)
struct kvm_vcpu_reset_state *reset_state = &vcpu->arch.reset_state;
void *vector_datap = cntx->vector.datap;
+ spin_lock(&reset_state->lock);
+ if (!reset_state->active)
+ __kvm_riscv_vcpu_set_reset_state(vcpu, cntx->sepc, cntx->a1);
+ spin_unlock(&reset_state->lock);
+
memset(cntx, 0, sizeof(*cntx));
memset(csr, 0, sizeof(*csr));
@@ -520,6 +525,10 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
switch (mp_state->mp_state) {
case KVM_MP_STATE_RUNNABLE:
+ if (riscv_vcpu_supports_sbi_ext(vcpu, KVM_RISCV_SBI_EXT_HSM) &&
+ vcpu->arch.ran_atleast_once &&
+ kvm_riscv_vcpu_stopped(vcpu))
+ kvm_riscv_vcpu_sbi_request_reset_from_userspace(vcpu);
WRITE_ONCE(vcpu->arch.mp_state, *mp_state);
break;
case KVM_MP_STATE_STOPPED:
diff --git a/arch/riscv/kvm/vcpu_sbi.c b/arch/riscv/kvm/vcpu_sbi.c
index 3d7955e05cc3..77f9f0bd3842 100644
--- a/arch/riscv/kvm/vcpu_sbi.c
+++ b/arch/riscv/kvm/vcpu_sbi.c
@@ -156,12 +156,29 @@ void kvm_riscv_vcpu_sbi_system_reset(struct kvm_vcpu *vcpu,
run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
}
+/* Must be called while holding vcpu->arch.reset_state.lock. */
+void __kvm_riscv_vcpu_set_reset_state(struct kvm_vcpu *vcpu,
+ unsigned long pc, unsigned long a1)
+{
+ vcpu->arch.reset_state.active = true;
+ vcpu->arch.reset_state.pc = pc;
+ vcpu->arch.reset_state.a1 = a1;
+}
+
void kvm_riscv_vcpu_sbi_request_reset(struct kvm_vcpu *vcpu,
unsigned long pc, unsigned long a1)
{
spin_lock(&vcpu->arch.reset_state.lock);
- vcpu->arch.reset_state.pc = pc;
- vcpu->arch.reset_state.a1 = a1;
+ __kvm_riscv_vcpu_set_reset_state(vcpu, pc, a1);
+ spin_unlock(&vcpu->arch.reset_state.lock);
+
+ kvm_make_request(KVM_REQ_VCPU_RESET, vcpu);
+}
+
+void kvm_riscv_vcpu_sbi_request_reset_from_userspace(struct kvm_vcpu *vcpu)
+{
+ spin_lock(&vcpu->arch.reset_state.lock);
+ vcpu->arch.reset_state.active = false;
spin_unlock(&vcpu->arch.reset_state.lock);
kvm_make_request(KVM_REQ_VCPU_RESET, vcpu);
--
2.48.1
On Thu, Apr 3, 2025 at 5:02 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>
> Beware, this patch is "breaking" the userspace interface, because it
> fixes a KVM/QEMU bug where the boot VCPU is not being reset by KVM.
>
> The VCPU reset paths are inconsistent right now. KVM resets VCPUs that
> are brought up by KVM-accelerated SBI calls, but does nothing for VCPUs
> brought up through ioctls.
>
> We need to perform a KVM reset even when the VCPU is started through an
> ioctl. This patch is one of the ways we can achieve it.
>
> Assume that userspace has no business setting the post-reset state.
> KVM is de-facto the SBI implementation, as the SBI HSM acceleration
> cannot be disabled and userspace cannot control the reset state, so KVM
> should be in full control of the post-reset state.
>
> Do not reset the pc and a1 registers, because SBI reset is expected to
> provide them and KVM has no idea what these registers should be -- only
> the userspace knows where it put the data.
>
> An important consideration is resume. Userspace might want to start
> with non-reset state. Check ran_atleast_once to allow this, because
> KVM-SBI HSM creates some VCPUs as STOPPED.
>
> The drawback is that userspace can still start the boot VCPU with an
> incorrect reset state, because there is no way to distinguish a freshly
> reset new VCPU on the KVM side (userspace might set some values by
> mistake) from a restored VCPU (userspace must set all values).
>
> The advantage of this solution is that it fixes current QEMU and makes
> some sense with the assumption that KVM implements SBI HSM.
> I do not like it too much, so I'd be in favor of a different solution if
> we can still afford to drop support for current userspaces.
>
> For a cleaner solution, we should add interfaces to perform the KVM-SBI
> reset request on userspace demand. I think it would also be much better
> if userspace was in control of the post-reset state.
Apart from breaking KVM user-space, this patch is incorrect and
does not align with:
1) the SBI spec
2) the OS boot protocol.
The SBI spec only defines the entry state of certain CPU registers
(namely, PC, A0, and A1) when a CPU enters S-mode:
1) Upon SBI HSM start call from some other CPU
2) Upon resuming from non-retentive SBI HSM suspend or
   SBI system suspend

The S-mode entry state of the boot CPU is defined by the
OS boot protocol and not by the SBI spec. Due to this reason,
KVM RISC-V expects user-space to set up the S-mode entry
state of the boot CPU upon system reset.
Due to the above reasons, we should not go ahead with this patch.
Regards,
Anup
2025-04-28T17:52:25+05:30, Anup Patel <anup@brainfault.org>:
> On Thu, Apr 3, 2025 at 5:02 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>> For a cleaner solution, we should add interfaces to perform the KVM-SBI
>> reset request on userspace demand. I think it would also be much better
>> if userspace was in control of the post-reset state.
>
> Apart from breaking KVM user-space, this patch is incorrect and
> does not align with:
> 1) the SBI spec
> 2) the OS boot protocol.
>
> The SBI spec only defines the entry state of certain CPU registers
> (namely, PC, A0, and A1) when a CPU enters S-mode:
> 1) Upon SBI HSM start call from some other CPU
> 2) Upon resuming from non-retentive SBI HSM suspend or
>    SBI system suspend
>
> The S-mode entry state of the boot CPU is defined by the
> OS boot protocol and not by the SBI spec. Due to this reason,
> KVM RISC-V expects user-space to set up the S-mode entry
> state of the boot CPU upon system reset.

We can handle the initial state consistency in other patches.
What needs addressing is a way to trigger the KVM reset from userspace,
even if only to clear the internal KVM state.

I think mp_state is currently the best signal that KVM should
reset, so I added it there.

What would be your preferred interface for that?

Thanks.
On Mon, Apr 28, 2025 at 11:15 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>
> 2025-04-28T17:52:25+05:30, Anup Patel <anup@brainfault.org>:
> > [...]
> > The S-mode entry state of the boot CPU is defined by the
> > OS boot protocol and not by the SBI spec. Due to this reason,
> > KVM RISC-V expects user-space to set up the S-mode entry
> > state of the boot CPU upon system reset.
>
> We can handle the initial state consistency in other patches.
> What needs addressing is a way to trigger the KVM reset from userspace,
> even if only to clear the internal KVM state.
>
> I think mp_state is currently the best signal that KVM should
> reset, so I added it there.
>
> What would be your preferred interface for that?

Instead of creating a new interface, I would prefer that the VCPU
which initiates SBI System Reset should be reset immediately
in kernel space before forwarding the system reset request to
user space. This way we also force KVM user-space to explicitly
set the PC, A0, and A1 before running the VCPU again after
system reset.

Regards,
Anup
2025-04-29T11:25:35+05:30, Anup Patel <apatel@ventanamicro.com>:
> On Mon, Apr 28, 2025 at 11:15 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>> [...]
>> I think mp_state is currently the best signal that KVM should
>> reset, so I added it there.
>>
>> What would be your preferred interface for that?
>
> Instead of creating a new interface, I would prefer that the VCPU
> which initiates SBI System Reset should be reset immediately
> in kernel space before forwarding the system reset request to
> user space.

The initiating VCPU might not be the boot VCPU.
It would be safer to reset all of them.

You also previously mentioned that we need to preserve the pre-reset
state for userspace, which I completely agree with and it is why the
reset happens later.

> This way we also force KVM user-space to explicitly
> set the PC, A0, and A1 before running the VCPU again after
> system reset.

We also want to consider reset from emulation outside of KVM.

There is a "simple" solution that covers everything (except speed) --
the userspace can tear down the whole VM and re-create it.
Do we want to do this instead and drop all resets from KVM?
On Tue, Apr 29, 2025 at 3:55 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>
> 2025-04-29T11:25:35+05:30, Anup Patel <apatel@ventanamicro.com>:
> > [...]
> > Instead of creating a new interface, I would prefer that the VCPU
> > which initiates SBI System Reset should be reset immediately
> > in kernel space before forwarding the system reset request to
> > user space.
>
> The initiating VCPU might not be the boot VCPU.
> It would be safer to reset all of them.

I meant the initiating VCPU and not the boot VCPU. Currently, the
non-initiating VCPUs are already reset by VCPU requests
so nothing special needs to be done.

>
> You also previously mentioned that we need to preserve the pre-reset
> state for userspace, which I completely agree with and it is why the
> reset happens later.

Yes, that was only for debug purposes from user space. At the
moment, no one is using this for debug purposes so we
can sacrifice that.

>
> > This way we also force KVM user-space to explicitly
> > set the PC, A0, and A1 before running the VCPU again after
> > system reset.
>
> We also want to consider reset from emulation outside of KVM.
>
> There is a "simple" solution that covers everything (except speed) --
> the userspace can tear down the whole VM and re-create it.
> Do we want to do this instead and drop all resets from KVM?

I think we should keep the VCPU resets in KVM so that handling
of system reset in user space remains simple. The user
space can also re-create the VM upon system reset but that is
a user space choice.

Regards,
Anup
2025-04-29T20:31:18+05:30, Anup Patel <anup@brainfault.org>:
> On Tue, Apr 29, 2025 at 3:55 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>> [...]
>> The initiating VCPU might not be the boot VCPU.
>> It would be safer to reset all of them.
>
> I meant the initiating VCPU and not the boot VCPU. Currently, the
> non-initiating VCPUs are already reset by VCPU requests
> so nothing special needs to be done.

Currently, we make the request only for VCPUs brought up by HSM -- the
non-boot VCPUs. There is a single VCPU not being reset, and resetting
the reset-initiating VCPU changes nothing. e.g.

1) VCPU 1 initiates the reset through an ecall.
2) All VCPUs are stopped and return to userspace.
3) Userspace prepares VCPU 0 as the boot VCPU.
4) VCPU 0 executes without going through KVM reset paths.

The point of this patch is to reset the boot VCPU, so we reset the VCPU
that is made runnable by the KVM_SET_MP_STATE IOCTL.

For design alternatives, it is also possible to reset immediately in an
IOCTL instead of making the reset request.

>> You also previously mentioned that we need to preserve the pre-reset
>> state for userspace, which I completely agree with and it is why the
>> reset happens later.
>
> Yes, that was only for debug purposes from user space. At the
> moment, no one is using this for debug purposes so we
> can sacrifice that.

We still can't immediately reset the boot VCPU, because it might already
be in userspace. We don't really benefit from immediately resetting the
initiating VCPU.
Also, making the reset request for all VCPUs from the initiating VCPU
has some undesirable race conditions we would have to prevent, so I do
prefer we go the IOCTL reset way.

>> > This way we also force KVM user-space to explicitly
>> > set the PC, A0, and A1 before running the VCPU again after
>> > system reset.
>>
>> We also want to consider reset from emulation outside of KVM.
>>
>> There is a "simple" solution that covers everything (except speed) --
>> the userspace can tear down the whole VM and re-create it.
>> Do we want to do this instead and drop all resets from KVM?
>
> I think we should keep the VCPU resets in KVM so that handling
> of system reset in user space remains simple. The user
> space can also re-create the VM upon system reset but that is
> a user space choice.

Ok.
On Tue, Apr 29, 2025 at 9:51 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>
> 2025-04-29T20:31:18+05:30, Anup Patel <anup@brainfault.org>:
> > [...]
> > I meant the initiating VCPU and not the boot VCPU. Currently, the
> > non-initiating VCPUs are already reset by VCPU requests
> > so nothing special needs to be done.

There is no designated boot VCPU for KVM so let us only use the
terms "initiating" and "non-initiating" VCPU in the context of
system reset.

>
> Currently, we make the request only for VCPUs brought up by HSM -- the
> non-boot VCPUs. There is a single VCPU not being reset, and resetting
> the reset-initiating VCPU changes nothing. e.g.
>
> 1) VCPU 1 initiates the reset through an ecall.
> 2) All VCPUs are stopped and return to userspace.

When all VCPUs are stopped, all VCPUs except VCPU1
(in this example) will SLEEP because we do
"kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_SLEEP)",
so none of the VCPUs except VCPU1 (in this case) will
return to userspace.

> 3) Userspace prepares VCPU 0 as the boot VCPU.
> 4) VCPU 0 executes without going through KVM reset paths.

Userspace will see a system reset event exit for the
initiating VCPU; by that time all other VCPUs are already
sleeping with mp_state == KVM_MP_STATE_STOPPED.

>
> The point of this patch is to reset the boot VCPU, so we reset the VCPU
> that is made runnable by the KVM_SET_MP_STATE IOCTL.

Like I said before, we don't need to do this. The initiating VCPU
can be reset just before exiting to user space for the system reset
event exit.

>
> For design alternatives, it is also possible to reset immediately in an
> IOCTL instead of making the reset request.
>
> >> You also previously mentioned that we need to preserve the pre-reset
> >> state for userspace, which I completely agree with and it is why the
> >> reset happens later.
> >
> > Yes, that was only for debug purposes from user space. At the
> > moment, no one is using this for debug purposes so we
> > can sacrifice that.
>
> We still can't immediately reset the boot VCPU, because it might already
> be in userspace. We don't really benefit from immediately resetting the
> initiating VCPU.
> Also, making the reset request for all VCPUs from the initiating VCPU
> has some undesirable race conditions we would have to prevent, so I do
> prefer we go the IOCTL reset way.

All VCPUs are sleeping with mp_state == KVM_MP_STATE_STOPPED
when userspace sees the system reset exit on the initiating VCPU, so I
don't see any race condition if we also reset the initiating VCPU
before exiting to userspace.

> >> > This way we also force KVM user-space to explicitly
> >> > set the PC, A0, and A1 before running the VCPU again after
> >> > system reset.
> >>
> >> We also want to consider reset from emulation outside of KVM.
> >>
> >> There is a "simple" solution that covers everything (except speed) --
> >> the userspace can tear down the whole VM and re-create it.
> >> Do we want to do this instead and drop all resets from KVM?
> >
> > I think we should keep the VCPU resets in KVM so that handling
> > of system reset in user space remains simple. The user
> > space can also re-create the VM upon system reset but that is
> > a user space choice.
>
> Ok.

Regards,
Anup
On Wed, Apr 30, 2025 at 9:52 AM Anup Patel <anup@brainfault.org> wrote:
>
> On Tue, Apr 29, 2025 at 9:51 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
> >
> > 2025-04-29T20:31:18+05:30, Anup Patel <anup@brainfault.org>:
> > > On Tue, Apr 29, 2025 at 3:55 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
> > >> [...]
> > >>
> > >> The initiating VCPU might not be the boot VCPU.
> > >> It would be safer to reset all of them.
> > >
> > > I meant the initiating VCPU and not the boot VCPU. Currently, the
> > > non-initiating VCPUs are already reset by VCPU requests
> > > so nothing special needs to be done.
>
> There is no designated boot VCPU for KVM so let us only use the
> term "initiating" or "non-initiating" VCPUs in context of system reset.
>
> >
> > Currently, we make the request only for VCPUs brought up by HSM -- the
> > non-boot VCPUs. There is a single VCPU not being reset and resetting
> > the reset initiating VCPU changes nothing. e.g.
> >
> > 1) VCPU 1 initiates the reset through an ecall.
> > 2) All VCPUs are stopped and return to userspace.
>
> When all VCPUs are stopped, all VCPUs except VCPU1
> (in this example) will SLEEP because we do
> "kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_SLEEP)"
> so none of the VCPUs except VCPU1 (in this case) will
> return to userspace.
>
> > 3) Userspace prepares VCPU 0 as the boot VCPU.
> > 4) VCPU 0 executes without going through KVM reset paths.
>
> Userspace will see a system reset event exit for the
> initiating VCPU; by that time, all other VCPUs are already
> sleeping with mp_state == KVM_MP_STATE_STOPPED.
>
> >
> > The point of this patch is to reset the boot VCPU, so we reset the VCPU
> > that is made runnable by the KVM_SET_MP_STATE IOCTL.
>
> Like I said before, we don't need to do this. The initiating VCPU
> can be reset just before exiting to user space for the system reset
> event exit.
>
Below is what I am suggesting. This change completely removes
dependency of kvm_sbi_hsm_vcpu_start() on "reset" structures.
diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index 0e9c2fab6378..6bd12469852d 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -396,6 +396,7 @@ int kvm_riscv_vcpu_get_reg(struct kvm_vcpu *vcpu,
int kvm_riscv_vcpu_set_reg(struct kvm_vcpu *vcpu,
const struct kvm_one_reg *reg);
+void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu);
int kvm_riscv_vcpu_set_interrupt(struct kvm_vcpu *vcpu, unsigned int irq);
int kvm_riscv_vcpu_unset_interrupt(struct kvm_vcpu *vcpu, unsigned int irq);
void kvm_riscv_vcpu_flush_interrupts(struct kvm_vcpu *vcpu);
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 02635bac91f1..801c6a1a1aef 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -51,7 +51,7 @@ const struct kvm_stats_header kvm_vcpu_stats_header = {
sizeof(kvm_vcpu_stats_desc),
};
-static void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu)
+void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu)
{
struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr;
struct kvm_vcpu_csr *reset_csr = &vcpu->arch.guest_reset_csr;
@@ -689,6 +689,9 @@ static void kvm_riscv_check_vcpu_requests(struct kvm_vcpu *vcpu)
struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
if (kvm_request_pending(vcpu)) {
+ if (kvm_check_request(KVM_REQ_VCPU_RESET, vcpu))
+ kvm_riscv_reset_vcpu(vcpu);
+
if (kvm_check_request(KVM_REQ_SLEEP, vcpu)) {
kvm_vcpu_srcu_read_unlock(vcpu);
rcuwait_wait_event(wait,
@@ -705,9 +708,6 @@ static void kvm_riscv_check_vcpu_requests(struct kvm_vcpu *vcpu)
}
}
- if (kvm_check_request(KVM_REQ_VCPU_RESET, vcpu))
- kvm_riscv_reset_vcpu(vcpu);
-
if (kvm_check_request(KVM_REQ_UPDATE_HGATP, vcpu))
kvm_riscv_gstage_update_hgatp(vcpu);
diff --git a/arch/riscv/kvm/vcpu_sbi.c b/arch/riscv/kvm/vcpu_sbi.c
index d1c83a77735e..79477e7f240a 100644
--- a/arch/riscv/kvm/vcpu_sbi.c
+++ b/arch/riscv/kvm/vcpu_sbi.c
@@ -146,9 +146,15 @@ void kvm_riscv_vcpu_sbi_system_reset(struct kvm_vcpu *vcpu,
spin_lock(&vcpu->arch.mp_state_lock);
WRITE_ONCE(tmp->arch.mp_state.mp_state, KVM_MP_STATE_STOPPED);
spin_unlock(&vcpu->arch.mp_state_lock);
+ if (tmp != vcpu) {
+ kvm_make_request(KVM_REQ_VCPU_RESET, tmp);
+ kvm_make_request(KVM_REQ_SLEEP, tmp);
+ kvm_vcpu_kick(tmp);
+ } else {
+ kvm_make_request(KVM_REQ_SLEEP, tmp);
+ }
}
- kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_SLEEP);
+ kvm_riscv_reset_vcpu(vcpu);
memset(&run->system_event, 0, sizeof(run->system_event));
run->system_event.type = type;
run->system_event.ndata = 1;
diff --git a/arch/riscv/kvm/vcpu_sbi_hsm.c b/arch/riscv/kvm/vcpu_sbi_hsm.c
index 3070bb31745d..30d7d59db5a5 100644
--- a/arch/riscv/kvm/vcpu_sbi_hsm.c
+++ b/arch/riscv/kvm/vcpu_sbi_hsm.c
@@ -15,15 +15,15 @@
static int kvm_sbi_hsm_vcpu_start(struct kvm_vcpu *vcpu)
{
- struct kvm_cpu_context *reset_cntx;
- struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
- struct kvm_vcpu *target_vcpu;
+ struct kvm_cpu_context *target_cp, *cp = &vcpu->arch.guest_context;
unsigned long target_vcpuid = cp->a0;
+ struct kvm_vcpu *target_vcpu;
int ret = 0;
target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, target_vcpuid);
if (!target_vcpu)
return SBI_ERR_INVALID_PARAM;
+ target_cp = &target_vcpu->arch.guest_context;
spin_lock(&target_vcpu->arch.mp_state_lock);
@@ -32,17 +32,12 @@ static int kvm_sbi_hsm_vcpu_start(struct kvm_vcpu *vcpu)
goto out;
}
- spin_lock(&target_vcpu->arch.reset_cntx_lock);
- reset_cntx = &target_vcpu->arch.guest_reset_context;
/* start address */
- reset_cntx->sepc = cp->a1;
+ target_cp->sepc = cp->a1;
/* target vcpu id to start */
- reset_cntx->a0 = target_vcpuid;
+ target_cp->a0 = target_vcpuid;
/* private data passed from kernel */
- reset_cntx->a1 = cp->a2;
- spin_unlock(&target_vcpu->arch.reset_cntx_lock);
-
- kvm_make_request(KVM_REQ_VCPU_RESET, target_vcpu);
+ target_cp->a1 = cp->a2;
__kvm_riscv_vcpu_power_on(target_vcpu);
@@ -63,6 +58,7 @@ static int kvm_sbi_hsm_vcpu_stop(struct kvm_vcpu *vcpu)
goto out;
}
+ kvm_make_request(KVM_REQ_VCPU_RESET, vcpu);
__kvm_riscv_vcpu_power_off(vcpu);
out:
Regards,
Anup
2025-04-30T10:56:35+05:30, Anup Patel <anup@brainfault.org>:
> On Wed, Apr 30, 2025 at 9:52 AM Anup Patel <anup@brainfault.org> wrote:
>>
>> On Tue, Apr 29, 2025 at 9:51 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>> >
>> > [...]
>>
>> There is no designated boot VCPU for KVM so let us only use the
>> term "initiating" or "non-initiating" VCPUs in context of system reset.

That is exactly how I use it. Some VCPU will be the boot VCPU (the VCPU
made runnable by KVM_SET_MP_STATE) and loaded with state from userspace.

RISC-V doesn't guarantee that the boot VCPU is the reset-initiating
VCPU, so I think KVM should allow it.

>> > Currently, we make the request only for VCPUs brought up by HSM -- the
>> > non-boot VCPUs. There is a single VCPU not being reset and resetting
>> > the reset-initiating VCPU changes nothing. e.g.
>> >
>> > 1) VCPU 1 initiates the reset through an ecall.
>> > 2) All VCPUs are stopped and return to userspace.
>>
>> When all VCPUs are stopped, all VCPUs except VCPU1
>> (in this example) will SLEEP because we do
>> "kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_SLEEP)"
>> so none of the VCPUs except VCPU1 (in this case) will
>> return to userspace.

Userspace should be able to do whatever it likes -- in my example, all
the VCPUs are brought to userspace and a different boot VCPU is
selected.

(Perhaps userspace wanted to record their pre-reset state, or maybe it
really wants to boot with a designated VCPU.)

>> > 3) Userspace prepares VCPU 0 as the boot VCPU.
>> > 4) VCPU 0 executes without going through KVM reset paths.
>>
>> Userspace will see a system reset event exit for the
>> initiating VCPU; by that time, all other VCPUs are already
>> sleeping with mp_state == KVM_MP_STATE_STOPPED.
>>
>> >
>> > The point of this patch is to reset the boot VCPU, so we reset the VCPU
>> > that is made runnable by the KVM_SET_MP_STATE IOCTL.
>>
>> Like I said before, we don't need to do this. The initiating VCPU
>> can be reset just before exiting to user space for the system reset
>> event exit.

You assume initiating VCPU == boot VCPU.

We should prevent the KVM_SET_MP_STATE IOCTL for all non-initiating
VCPUs if we decide to accept the assumption.

I'd rather choose a different design, though.

How about a new userspace interface for IOCTL reset?
(Can be a capability toggle for KVM_SET_MP_STATE or a straight new IOCTL.)

That wouldn't "fix" current userspaces, but would significantly improve
the sanity of the KVM interface.

> Below is what I am suggesting. This change completely removes
> dependency of kvm_sbi_hsm_vcpu_start() on "reset" structures.

I'd keep the reset structure in this series -- it's small enough, and
locklessly accessing the state of another VCPU needs a lot of
consideration to prevent all possible race conditions.

Thanks.
On Wed, Apr 30, 2025 at 1:59 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>
> 2025-04-30T10:56:35+05:30, Anup Patel <anup@brainfault.org>:
> > On Wed, Apr 30, 2025 at 9:52 AM Anup Patel <anup@brainfault.org> wrote:
> >>
> >> [...]
> >>
> >> There is no designated boot VCPU for KVM so let us only use the
> >> term "initiating" or "non-initiating" VCPUs in context of system reset.
>
> That is exactly how I use it. Some VCPU will be the boot VCPU (the VCPU
> made runnable by KVM_SET_MP_STATE) and loaded with state from userspace.
>
> RISC-V doesn't guarantee that the boot VCPU is the reset-initiating
> VCPU, so I think KVM should allow it.

We do allow any VCPU to be the boot VCPU. I am not sure why you
think otherwise.

> >> > Currently, we make the request only for VCPUs brought up by HSM -- the
> >> > non-boot VCPUs. There is a single VCPU not being reset and resetting
> >> > the reset-initiating VCPU changes nothing. e.g.
> >> >
> >> > 1) VCPU 1 initiates the reset through an ecall.
> >> > 2) All VCPUs are stopped and return to userspace.
> >>
> >> When all VCPUs are stopped, all VCPUs except VCPU1
> >> (in this example) will SLEEP because we do
> >> "kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_SLEEP)"
> >> so none of the VCPUs except VCPU1 (in this case) will
> >> return to userspace.
>
> Userspace should be able to do whatever it likes -- in my example, all
> the VCPUs are brought to userspace and a different boot VCPU is
> selected.

In your example, VCPU1 (the initiating VCPU) need not be the boot
VCPU after system reset. The user space can set up some other VCPU as
the boot VCPU (by setting its MPSTATE, PC, A0, and A1) and simply do
ioctl_run() for the initiating VCPU without changing its MPSTATE.

> (Perhaps userspace wanted to record their pre-reset state, or maybe it
> really wants to boot with a designated VCPU.)
>
> >> > 3) Userspace prepares VCPU 0 as the boot VCPU.
> >> > 4) VCPU 0 executes without going through KVM reset paths.
> >>
> >> Userspace will see a system reset event exit for the
> >> initiating VCPU; by that time, all other VCPUs are already
> >> sleeping with mp_state == KVM_MP_STATE_STOPPED.
> >>
> >> >
> >> > The point of this patch is to reset the boot VCPU, so we reset the VCPU
> >> > that is made runnable by the KVM_SET_MP_STATE IOCTL.
> >>
> >> Like I said before, we don't need to do this. The initiating VCPU
> >> can be reset just before exiting to user space for the system reset
> >> event exit.
>
> You assume initiating VCPU == boot VCPU.
>
> We should prevent the KVM_SET_MP_STATE IOCTL for all non-initiating
> VCPUs if we decide to accept the assumption.

There is no such assumption.

> I'd rather choose a different design, though.
>
> How about a new userspace interface for IOCTL reset?
> (Can be a capability toggle for KVM_SET_MP_STATE or a straight new IOCTL.)
>
> That wouldn't "fix" current userspaces, but would significantly improve
> the sanity of the KVM interface.

I believe the current implementation needs a few improvements, that's
all. We certainly don't need to introduce any new IOCTL.

Also, keep in mind that so far we have avoided any RISC-V specific
KVM IOCTLs and we should try to keep it that way as long as we can.

Regards,
Anup
2025-04-30T15:47:13+05:30, Anup Patel <anup@brainfault.org>:
> On Wed, Apr 30, 2025 at 1:59 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>> 2025-04-30T10:56:35+05:30, Anup Patel <anup@brainfault.org>:
>> > On Wed, Apr 30, 2025 at 9:52 AM Anup Patel <anup@brainfault.org> wrote:
>> >> On Tue, Apr 29, 2025 at 9:51 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>> >> > The point of this patch is to reset the boot VCPU, so we reset the VCPU
>> >> > that is made runnable by the KVM_SET_MP_STATE IOCTL.
>> >>
>> >> Like I said before, we don't need to do this. The initiating VCPU
>> >> can be resetted just before exiting to user space for system reset
>> >> event exit.
>>
>> You assume initiating VCPU == boot VCPU.
>>
>> We should prevent KVM_SET_MP_STATE IOCTL for all non-initiating VCPUs if
>> we decide to accept the assumption.
>
> There is no such assumption.
You probably haven't intended it:
1) VCPU 0 is "chilling" in userspace.
2) VCPU 1 initiates SBI reset.
3) VCPU 1 makes a reset request to VCPU 0.
4) VCPU 1 returns to userspace.
5) Userspace knows it should reset the VM.
6) VCPU 0 still hasn't entered KVM.
7) Userspace sets the initial state of VCPU 0 and enters KVM.
8) VCPU 0 is reset in KVM, because of the pending request.
9) The initial boot state from userspace is lost.
>> I'd rather choose a different design, though.
>>
>> How about a new userspace interface for IOCTL reset?
>> (Can be capability toggle for KVM_SET_MP_STATE or a straight new IOCTL.)
>>
>> That wouldn't "fix" current userspaces, but would significantly improve
>> the sanity of the KVM interface.
>
> I believe the current implementation needs a few improvements
> that's all. We certainly don't need to introduce any new IOCTL.
I do too. The whole patch could have been a single line:
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index d3d957a9e5c4..b3e6ad87e1cd 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -511,6 +511,7 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
switch (mp_state->mp_state) {
case KVM_MP_STATE_RUNNABLE:
+ kvm_riscv_reset_vcpu(vcpu);
WRITE_ONCE(vcpu->arch.mp_state, *mp_state);
break;
case KVM_MP_STATE_STOPPED:
It is the backward compatibility and trying to fix current userspaces
that's making it ugly. I already gave up on the latter, so we can have
a decently clean solution with the former.
> Also, keep in mind that so far we have avoided any RISC-V
> specific KVM IOCTLs and we should try to keep it that way
> as long as we can.
We can re-use KVM_SET_MP_STATE and add a KVM capability.
Userspace will opt-in to reset the VCPU through the existing IOCTL.
This design will also allow userspace to trigger a VCPU reset without
tearing down the whole VM.
On Wed, Apr 30, 2025 at 5:15 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>
> 2025-04-30T15:47:13+05:30, Anup Patel <anup@brainfault.org>:
> > On Wed, Apr 30, 2025 at 1:59 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
> >> 2025-04-30T10:56:35+05:30, Anup Patel <anup@brainfault.org>:
> >> > On Wed, Apr 30, 2025 at 9:52 AM Anup Patel <anup@brainfault.org> wrote:
> >> >> On Tue, Apr 29, 2025 at 9:51 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
> >> >> > The point of this patch is to reset the boot VCPU, so we reset the VCPU
> >> >> > that is made runnable by the KVM_SET_MP_STATE IOCTL.
> >> >>
> >> >> Like I said before, we don't need to do this. The initiating VCPU
> >> >> can be resetted just before exiting to user space for system reset
> >> >> event exit.
> >>
> >> You assume initiating VCPU == boot VCPU.
> >>
> >> We should prevent KVM_SET_MP_STATE IOCTL for all non-initiating VCPUs if
> >> we decide to accept the assumption.
> >
> > There is no such assumption.
>
> You probably haven't intended it:
>
> 1) VCPU 0 is "chilling" in userspace.
> 2) VCPU 1 initiates SBI reset.
> 3) VCPU 1 makes a reset request to VCPU 0.
> 4) VCPU 1 returns to userspace.
> 5) Userspace knows it should reset the VM.
> 6) VCPU 0 still hasn't entered KVM.
> 7) Userspace sets the initial state of VCPU 0 and enters KVM.
> 8) VCPU 0 is reset in KVM, because of the pending request.
> 9) The initial boot state from userspace is lost.
>
> >> I'd rather choose a different design, though.
> >>
> >> How about a new userspace interface for IOCTL reset?
> >> (Can be capability toggle for KVM_SET_MP_STATE or a straight new IOCTL.)
> >>
> >> That wouldn't "fix" current userspaces, but would significantly improve
> >> the sanity of the KVM interface.
> >
> > I believe the current implementation needs a few improvements
> > that's all. We certainly don't need to introduce any new IOCTL.
>
> I do too. The whole patch could have been a single line:
>
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index d3d957a9e5c4..b3e6ad87e1cd 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -511,6 +511,7 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
>
> switch (mp_state->mp_state) {
> case KVM_MP_STATE_RUNNABLE:
> + kvm_riscv_reset_vcpu(vcpu);
> WRITE_ONCE(vcpu->arch.mp_state, *mp_state);
> break;
> case KVM_MP_STATE_STOPPED:
>
> It is the backward compatibility and trying to fix current userspaces
> that's making it ugly. I already gave up on the latter, so we can have
> a decently clean solution with the former.
>
> > Also, keep in mind that so far we have avoided any RISC-V
> > specific KVM IOCTLs and we should try to keep it that way
> > as long as we can.
>
> We can re-use KVM_SET_MP_STATE and add a KVM capability.
> Userspace will opt-in to reset the VCPU through the existing IOCTL.
>
> This design will also allow userspace to trigger a VCPU reset without
> tearing down the whole VM.
Okay, let's go ahead with a KVM capability which user space can opt in
to for the KVM_SET_MP_STATE ioctl().
Keep in mind that at runtime the guest can still do CPU hotplug using SBI
HSM start/stop and do system suspend using SBI SUSP, so we should
continue to have VCPU reset requests for both these SBI extensions.
Regards,
Anup
2025-04-30T18:32:31+05:30, Anup Patel <anup@brainfault.org>:
> On Wed, Apr 30, 2025 at 5:15 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>> We can re-use KVM_SET_MP_STATE and add a KVM capability.
>> Userspace will opt-in to reset the VCPU through the existing IOCTL.
>>
>> This design will also allow userspace to trigger a VCPU reset without
>> tearing down the whole VM.
>
> Okay, let's go ahead with a KVM capability which user space can opt in
> to for the KVM_SET_MP_STATE ioctl().
>
> Keep in mind that at runtime the guest can still do CPU hotplug using SBI
> HSM start/stop and do system suspend using SBI SUSP, so we should
> continue to have VCPU reset requests for both these SBI extensions.

Will do, thanks.
On Thu, Apr 03, 2025 at 01:25:23PM +0200, Radim Krčmář wrote:
> Beware, this patch is "breaking" the userspace interface, because it
> fixes a KVM/QEMU bug where the boot VCPU is not being reset by KVM.
>
> The VCPU reset paths are inconsistent right now. KVM resets VCPUs that
> are brought up by KVM-accelerated SBI calls, but does nothing for VCPUs
> brought up through ioctls.
I guess we currently expect userspace to make a series of set-one-reg
ioctls in order to prepare ("reset") newly created vcpus, and I guess
the problem is that KVM isn't capturing the resulting configuration
in order to replay it when SBI HSM reset is invoked by the guest. But,
instead of capture-replay we could just exit to userspace on an SBI
HSM reset call and let userspace repeat what it did at vcpu-create
time.
>
> We need to perform a KVM reset even when the VCPU is started through an
> ioctl. This patch is one of the ways we can achieve it.
>
> Assume that userspace has no business setting the post-reset state.
> KVM is de-facto the SBI implementation, as the SBI HSM acceleration
> cannot be disabled and userspace cannot control the reset state, so KVM
> should be in full control of the post-reset state.
>
> Do not reset the pc and a1 registers, because SBI reset is expected to
> provide them and KVM has no idea what these registers should be -- only
> the userspace knows where it put the data.
s/userspace/guest/
>
> An important consideration is resume. Userspace might want to start
> with non-reset state. Check ran_atleast_once to allow this, because
> KVM-SBI HSM creates some VCPUs as STOPPED.
>
> The drawback is that userspace can still start the boot VCPU with an
> incorrect reset state, because there is no way to distinguish a freshly
> reset new VCPU on the KVM side (userspace might set some values by
> mistake) from a restored VCPU (userspace must set all values).
If there's a correct vs. incorrect reset state that KVM needs to enforce,
then we'll need a different API than just a bunch of set-one-reg calls,
or set/get-one-reg should be WARL for userspace.
>
> The advantage of this solution is that it fixes current QEMU and makes
> some sense with the assumption that KVM implements SBI HSM.
> I do not like it too much, so I'd be in favor of a different solution if
> we can still afford to drop support for current userspaces.
>
> For a cleaner solution, we should add interfaces to perform the KVM-SBI
> reset request on userspace demand.
That's what the change to kvm_arch_vcpu_ioctl_set_mpstate() in this
patch is providing, right?
> I think it would also be much better
> if userspace was in control of the post-reset state.
Agreed. Can we just exit to userspace on SBI HSM reset?
Thanks,
drew
>
> Signed-off-by: Radim Krčmář <rkrcmar@ventanamicro.com>
> ---
> arch/riscv/include/asm/kvm_host.h | 1 +
> arch/riscv/include/asm/kvm_vcpu_sbi.h | 3 +++
> arch/riscv/kvm/vcpu.c | 9 +++++++++
> arch/riscv/kvm/vcpu_sbi.c | 21 +++++++++++++++++++--
> 4 files changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> index 0c8c9c05af91..9bbf8c4a286b 100644
> --- a/arch/riscv/include/asm/kvm_host.h
> +++ b/arch/riscv/include/asm/kvm_host.h
> @@ -195,6 +195,7 @@ struct kvm_vcpu_smstateen_csr {
>
> struct kvm_vcpu_reset_state {
> spinlock_t lock;
> + bool active;
> unsigned long pc;
> unsigned long a1;
> };
> diff --git a/arch/riscv/include/asm/kvm_vcpu_sbi.h b/arch/riscv/include/asm/kvm_vcpu_sbi.h
> index aaaa81355276..2c334a87e02a 100644
> --- a/arch/riscv/include/asm/kvm_vcpu_sbi.h
> +++ b/arch/riscv/include/asm/kvm_vcpu_sbi.h
> @@ -57,6 +57,9 @@ void kvm_riscv_vcpu_sbi_system_reset(struct kvm_vcpu *vcpu,
> u32 type, u64 flags);
> void kvm_riscv_vcpu_sbi_request_reset(struct kvm_vcpu *vcpu,
> unsigned long pc, unsigned long a1);
> +void __kvm_riscv_vcpu_set_reset_state(struct kvm_vcpu *vcpu,
> + unsigned long pc, unsigned long a1);
> +void kvm_riscv_vcpu_sbi_request_reset_from_userspace(struct kvm_vcpu *vcpu);
> int kvm_riscv_vcpu_sbi_return(struct kvm_vcpu *vcpu, struct kvm_run *run);
> int kvm_riscv_vcpu_set_reg_sbi_ext(struct kvm_vcpu *vcpu,
> const struct kvm_one_reg *reg);
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index b8485c1c1ce4..4578863a39e3 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -58,6 +58,11 @@ static void kvm_riscv_vcpu_context_reset(struct kvm_vcpu *vcpu)
> struct kvm_vcpu_reset_state *reset_state = &vcpu->arch.reset_state;
> void *vector_datap = cntx->vector.datap;
>
> + spin_lock(&reset_state->lock);
> + if (!reset_state->active)
> + __kvm_riscv_vcpu_set_reset_state(vcpu, cntx->sepc, cntx->a1);
> + spin_unlock(&reset_state->lock);
> +
> memset(cntx, 0, sizeof(*cntx));
> memset(csr, 0, sizeof(*csr));
>
> @@ -520,6 +525,10 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
>
> switch (mp_state->mp_state) {
> case KVM_MP_STATE_RUNNABLE:
> + if (riscv_vcpu_supports_sbi_ext(vcpu, KVM_RISCV_SBI_EXT_HSM) &&
> + vcpu->arch.ran_atleast_once &&
> + kvm_riscv_vcpu_stopped(vcpu))
> + kvm_riscv_vcpu_sbi_request_reset_from_userspace(vcpu);
> WRITE_ONCE(vcpu->arch.mp_state, *mp_state);
> break;
> case KVM_MP_STATE_STOPPED:
> diff --git a/arch/riscv/kvm/vcpu_sbi.c b/arch/riscv/kvm/vcpu_sbi.c
> index 3d7955e05cc3..77f9f0bd3842 100644
> --- a/arch/riscv/kvm/vcpu_sbi.c
> +++ b/arch/riscv/kvm/vcpu_sbi.c
> @@ -156,12 +156,29 @@ void kvm_riscv_vcpu_sbi_system_reset(struct kvm_vcpu *vcpu,
> run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
> }
>
> +/* must be called with held vcpu->arch.reset_state.lock */
> +void __kvm_riscv_vcpu_set_reset_state(struct kvm_vcpu *vcpu,
> + unsigned long pc, unsigned long a1)
> +{
> + vcpu->arch.reset_state.active = true;
> + vcpu->arch.reset_state.pc = pc;
> + vcpu->arch.reset_state.a1 = a1;
> +}
> +
> void kvm_riscv_vcpu_sbi_request_reset(struct kvm_vcpu *vcpu,
> unsigned long pc, unsigned long a1)
> {
> spin_lock(&vcpu->arch.reset_state.lock);
> - vcpu->arch.reset_state.pc = pc;
> - vcpu->arch.reset_state.a1 = a1;
> + __kvm_riscv_vcpu_set_reset_state(vcpu, pc, a1);
> + spin_unlock(&vcpu->arch.reset_state.lock);
> +
> + kvm_make_request(KVM_REQ_VCPU_RESET, vcpu);
> +}
> +
> +void kvm_riscv_vcpu_sbi_request_reset_from_userspace(struct kvm_vcpu *vcpu)
> +{
> + spin_lock(&vcpu->arch.reset_state.lock);
> + vcpu->arch.reset_state.active = false;
> spin_unlock(&vcpu->arch.reset_state.lock);
>
> kvm_make_request(KVM_REQ_VCPU_RESET, vcpu);
> --
> 2.48.1
>
2025-04-25T15:26:08+02:00, Andrew Jones <ajones@ventanamicro.com>:
> On Thu, Apr 03, 2025 at 01:25:23PM +0200, Radim Krčmář wrote:
>> Beware, this patch is "breaking" the userspace interface, because it
>> fixes a KVM/QEMU bug where the boot VCPU is not being reset by KVM.
>>
>> The VCPU reset paths are inconsistent right now. KVM resets VCPUs that
>> are brought up by KVM-accelerated SBI calls, but does nothing for VCPUs
>> brought up through ioctls.
>
> I guess we currently expect userspace to make a series of set-one-reg
> ioctls in order to prepare ("reset") newly created vcpus,
Userspace should currently get-one-reg a freshly reset VCPU to know what
KVM SBI decides is the correct reset. Userspace shouldn't set-one-reg
anything other than what KVM decides, hence we can currently just let
KVM do it.
> and I guess
> the problem is that KVM isn't capturing the resulting configuration
> in order to replay it when SBI HSM reset is invoked by the guest.
That can also be a solution, but it's not possible to capture the
desired reset state with the current IOCTLs, because the first run of a
VCPU can just as well be a resume from mid-execution.
> But,
> instead of capture-replay we could just exit to userspace on an SBI
> HSM reset call and let userspace repeat what it did at vcpu-create
> time.
Right, I like the idea. (It doesn't fix current userspaces, though.)
>> We need to perform a KVM reset even when the VCPU is started through an
>> ioctl. This patch is one of the ways we can achieve it.
>>
>> Assume that userspace has no business setting the post-reset state.
>> KVM is de-facto the SBI implementation, as the SBI HSM acceleration
>> cannot be disabled and userspace cannot control the reset state, so KVM
>> should be in full control of the post-reset state.
>>
>> Do not reset the pc and a1 registers, because SBI reset is expected to
>> provide them and KVM has no idea what these registers should be -- only
>> the userspace knows where it put the data.
>
> s/userspace/guest/
Both are correct... I should have made the context clearer here.
I meant the initial hart boot, where userspace loads code/dt and sets
pc/a1 to them.
>> An important consideration is resume. Userspace might want to start
>> with non-reset state. Check ran_atleast_once to allow this, because
>> KVM-SBI HSM creates some VCPUs as STOPPED.
>>
>> The drawback is that userspace can still start the boot VCPU with an
>> incorrect reset state, because there is no way to distinguish a freshly
>> reset new VCPU on the KVM side (userspace might set some values by
>> mistake) from a restored VCPU (userspace must set all values).
>
> If there's a correct vs. incorrect reset state that KVM needs to enforce,
> then we'll need a different API than just a bunch of set-one-reg calls,
> or set/get-one-reg should be WARL for userpace.
Incorrect might have been too strong a word... while the SBI reset state
is technically UNSPECIFIED, I think it's just asking for bugs if the
harts have different initial states based on their reset method.
> set/get-one-reg should be WARL for userpace.
WAAAA! :)
>> The advantage of this solution is that it fixes current QEMU and makes
>> some sense with the assumption that KVM implements SBI HSM.
>> I do not like it too much, so I'd be in favor of a different solution if
>> we can still afford to drop support for current userspaces.
>>
>> For a cleaner solution, we should add interfaces to perform the KVM-SBI
>> reset request on userspace demand.
>
> That's what the change to kvm_arch_vcpu_ioctl_set_mpstate() in this
> patch is providing, right?
It does. With conditions to be as compatible as possible.
>> I think it would also be much better
>> if userspace was in control of the post-reset state.
>
> Agreed. Can we just exit to userspace on SBI HSM reset?
Yes. (It needs a userspace toggle if we care about
backward compatibility, though.)
How much do we want to fix/break current userspaces?
Thanks.