[PATCH v3 0/2] RISC-V: KVM: VCPU reset fixes

Radim Krčmář posted 2 patches 7 months, 1 week ago
Posted by Radim Krčmář 7 months, 1 week ago
Hello,

the design still requires a discussion.

[v3 1/2] removes most of the additional changes that the KVM capability
was doing in v2.  [v3 2/2] is new and previews a general solution to the
lack of userspace control over KVM SBI.

A possible QEMU implementation for both capabilities can be seen in
https://github.com/radimkrcmar/qemu/tree/reset_fixes_v3
The next step would be to forward the HSM ecalls to QEMU.

v2: https://lore.kernel.org/kvm-riscv/20250508142842.1496099-2-rkrcmar@ventanamicro.com/
v1: https://lore.kernel.org/kvm-riscv/20250403112522.1566629-3-rkrcmar@ventanamicro.com/

Radim Krčmář (2):
  RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET
  RISC-V: KVM: add KVM_CAP_RISCV_USERSPACE_SBI

 Documentation/virt/kvm/api.rst        | 22 ++++++++++++++++++++++
 arch/riscv/include/asm/kvm_host.h     |  6 ++++++
 arch/riscv/include/asm/kvm_vcpu_sbi.h |  1 +
 arch/riscv/kvm/vcpu.c                 | 27 ++++++++++++++-------------
 arch/riscv/kvm/vcpu_sbi.c             | 27 +++++++++++++++++++++++++--
 arch/riscv/kvm/vm.c                   | 18 ++++++++++++++++++
 include/uapi/linux/kvm.h              |  2 ++
 7 files changed, 88 insertions(+), 15 deletions(-)

-- 
2.49.0

Re: [PATCH v3 0/2] RISC-V: KVM: VCPU reset fixes
Posted by Atish Patra 7 months ago
On 5/15/25 7:37 AM, Radim Krčmář wrote:
> Hello,
> 
> the design still requires a discussion.
> 
> [v3 1/2] removes most of the additional changes that the KVM capability
> was doing in v2.  [v3 2/2] is new and previews a general solution to the
> lack of userspace control over KVM SBI.
> 

I am still missing the motivation behind it. If the motivation is SBI 
HSM suspend, the PATCH2 doesn't achieve that as it forwards every call 
to the user space. Why do you want to control hsm start/stop from the 
user space ?


> A possible QEMU implementation for both capabilities can be seen in
> https://github.com/radimkrcmar/qemu/tree/reset_fixes_v3
> The next step would be to forward the HSM ecalls to QEMU.
> 
> v2: https://lore.kernel.org/kvm-riscv/20250508142842.1496099-2-rkrcmar@ventanamicro.com/
> v1: https://lore.kernel.org/kvm-riscv/20250403112522.1566629-3-rkrcmar@ventanamicro.com/
> 
> Radim Krčmář (2):
>    RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET
>    RISC-V: KVM: add KVM_CAP_RISCV_USERSPACE_SBI
> 
>   Documentation/virt/kvm/api.rst        | 22 ++++++++++++++++++++++
>   arch/riscv/include/asm/kvm_host.h     |  6 ++++++
>   arch/riscv/include/asm/kvm_vcpu_sbi.h |  1 +
>   arch/riscv/kvm/vcpu.c                 | 27 ++++++++++++++-------------
>   arch/riscv/kvm/vcpu_sbi.c             | 27 +++++++++++++++++++++++++--
>   arch/riscv/kvm/vm.c                   | 18 ++++++++++++++++++
>   include/uapi/linux/kvm.h              |  2 ++
>   7 files changed, 88 insertions(+), 15 deletions(-)
> 

Re: [PATCH v3 0/2] RISC-V: KVM: VCPU reset fixes
Posted by Radim Krčmář 7 months ago
2025-05-22T14:43:40-07:00, Atish Patra <atish.patra@linux.dev>:
> On 5/15/25 7:37 AM, Radim Krčmář wrote:
>> Hello,
>> 
>> the design still requires a discussion.
>> 
>> [v3 1/2] removes most of the additional changes that the KVM capability
>> was doing in v2.  [v3 2/2] is new and previews a general solution to the
>> lack of userspace control over KVM SBI.
>> 
>
> I am still missing the motivation behind it. If the motivation is SBI 
> HSM suspend, the PATCH2 doesn't achieve that as it forwards every call 
> to the user space. Why do you want to control hsm start/stop from the 
> user space ?

HSM needs fixing, because KVM doesn't know what the state after
sbi_hart_start should be.
For example, we had a discussion about scounteren and regardless of what
default we choose in KVM, the userspace might want a different value.
I don't think that HSM start/stop is a hot path, so trapping to
userspace seems better than adding more kernel code.

Forwarding all the unimplemented SBI ecalls shouldn't be a performance
issue, because S-mode software would hopefully learn after the first
error and stop trying again.

Allowing userspace to fully implement the ecall instruction is one of the
motivations as well -- SBI is not a part of RISC-V ISA, so someone might
be interested in accelerating a different M-mode software with KVM.

I'll send v4 later today -- there is a missing part in [2/2], because
userspace also needs to be able to emulate the base SBI extension.

Re: [PATCH v3 0/2] RISC-V: KVM: VCPU reset fixes
Posted by Anup Patel 7 months ago
On Fri, May 23, 2025 at 12:47 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>
> 2025-05-22T14:43:40-07:00, Atish Patra <atish.patra@linux.dev>:
> > On 5/15/25 7:37 AM, Radim Krčmář wrote:
> >> Hello,
> >>
> >> the design still requires a discussion.
> >>
> >> [v3 1/2] removes most of the additional changes that the KVM capability
> >> was doing in v2.  [v3 2/2] is new and previews a general solution to the
> >> lack of userspace control over KVM SBI.
> >>
> >
> > I am still missing the motivation behind it. If the motivation is SBI
> > HSM suspend, the PATCH2 doesn't achieve that as it forwards every call
> > to the user space. Why do you want to control hsm start/stop from the
> > user space ?
>
> HSM needs fixing, because KVM doesn't know what the state after
> sbi_hart_start should be.
> For example, we had a discussion about scounteren and regardless of what
> default we choose in KVM, the userspace might want a different value.
> I don't think that HSM start/stop is a hot path, so trapping to
> userspace seems better than adding more kernel code.

There are no implementation specific S-mode CSR reset values
required at the moment. Whenever the need arises, we will extend
the ONE_REG interface so that user space can specify custom
CSR reset values at Guest/VM creation time. We don't need to
forward SBI HSM calls to user space for custom S-mode CSR
reset values.

>
> Forwarding all the unimplemented SBI ecalls shouldn't be a performance
> issue, because S-mode software would hopefully learn after the first
> error and stop trying again.
>
> Allowing userspace to fully implement the ecall instruction is one of the
> motivations as well -- SBI is not a part of RISC-V ISA, so someone might
> be interested in accelerating a different M-mode software with KVM.
>
> I'll send v4 later today -- there is a missing part in [2/2], because
> userspace also needs to be able to emulate the base SBI extension.
>

Emulating the entire SBI in user space has many challenges; here
are a few:

1) SBI IPI in userspace will require an ioctl to trigger VCPU local
interrupt which does not exist. We only have KVM ioctls to trigger
external interrupts and MSIs.

2) SBI RFENCE in userspace will require HFENCE operations in
user space, which are not allowed by the RISC-V ISA.

3) SBI PMU uses Linux perf framework APIs to share counters
between host and guest. The Linux perf APIs for guest perf events
are not available to userspace as syscall or ioctl.

4) SBI STA uses sched_info.run_delay which I am sure is not
available to user space.

5) SBI NACL when implemented will be using tons of HS-mode
functionality (HS-mode CSRs, HFENCEs, etc.) to achieve the
nested world-switch and none of these are accessible to userspace.

6) SBI FWFT may require programming hstateenX CSRs which
are not accessible to userspace.

7) SBI DBTR requires direct coordination between the KVM RISC-V
and kernel hw_breakpoint driver to share the debug triggers.

... and so on ...

Based on the above, emulating the entire SBI in user space is
a non-starter. The best approach is to selectively forward SBI
calls to user space where needed (e.g. SBI system reset,
SBI system suspend, SBI debug console, etc.).
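
As a rough illustration of the selective-forwarding idea above, the
routing decision can be modelled as a small lookup table. This is only
a sketch and not KVM code: the extension IDs are the ones from the SBI
specification, but enum sbi_route, struct sbi_route_entry, the table
contents and sbi_route_for() are hypothetical names chosen for this
example, and the routing choices simply mirror the examples given in
this mail.

```c
#include <stddef.h>

/* Extension IDs as defined by the SBI specification. */
#define SBI_EXT_BASE   0x10UL
#define SBI_EXT_IPI    0x735049UL
#define SBI_EXT_RFENCE 0x52464E43UL
#define SBI_EXT_HSM    0x48534DUL
#define SBI_EXT_SRST   0x53525354UL
#define SBI_EXT_PMU    0x504D55UL
#define SBI_EXT_DBCN   0x4442434EUL
#define SBI_EXT_SUSP   0x53555350UL

/* Hypothetical routing decision; not a type that exists in KVM. */
enum sbi_route {
	SBI_ROUTE_KERNEL,      /* needs HS-mode or in-kernel facilities */
	SBI_ROUTE_USERSPACE,   /* can reasonably be completed by the VMM */
	SBI_ROUTE_UNSUPPORTED, /* not listed in the table */
};

struct sbi_route_entry {
	unsigned long ext_id;
	enum sbi_route route;
};

/* Assumed policy for illustration, following the examples in the text. */
static const struct sbi_route_entry sbi_routes[] = {
	{ SBI_EXT_BASE,   SBI_ROUTE_KERNEL },
	{ SBI_EXT_IPI,    SBI_ROUTE_KERNEL },
	{ SBI_EXT_RFENCE, SBI_ROUTE_KERNEL },
	{ SBI_EXT_HSM,    SBI_ROUTE_KERNEL },
	{ SBI_EXT_PMU,    SBI_ROUTE_KERNEL },
	{ SBI_EXT_SRST,   SBI_ROUTE_USERSPACE },
	{ SBI_EXT_SUSP,   SBI_ROUTE_USERSPACE },
	{ SBI_EXT_DBCN,   SBI_ROUTE_USERSPACE },
};

/* Linear search is fine here: the table is tiny. */
enum sbi_route sbi_route_for(unsigned long ext_id)
{
	size_t i;

	for (i = 0; i < sizeof(sbi_routes) / sizeof(sbi_routes[0]); i++)
		if (sbi_routes[i].ext_id == ext_id)
			return sbi_routes[i].route;

	return SBI_ROUTE_UNSUPPORTED;
}
```

The point of the table form is that the kernel, not userspace, decides
which extensions are eligible for forwarding at all.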

Regards,
Anup

Re: [PATCH v3 0/2] RISC-V: KVM: VCPU reset fixes
Posted by Radim Krčmář 7 months ago
2025-05-23T13:38:26+05:30, Anup Patel <apatel@ventanamicro.com>:
> On Fri, May 23, 2025 at 12:47 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>>
>> 2025-05-22T14:43:40-07:00, Atish Patra <atish.patra@linux.dev>:
>> > On 5/15/25 7:37 AM, Radim Krčmář wrote:
>> >> Hello,
>> >>
>> >> the design still requires a discussion.
>> >>
>> >> [v3 1/2] removes most of the additional changes that the KVM capability
>> >> was doing in v2.  [v3 2/2] is new and previews a general solution to the
>> >> lack of userspace control over KVM SBI.
>> >>
>> >
>> > I am still missing the motivation behind it. If the motivation is SBI
>> > HSM suspend, the PATCH2 doesn't achieve that as it forwards every call
>> > to the user space. Why do you want to control hsm start/stop from the
>> > user space ?
>>
>> HSM needs fixing, because KVM doesn't know what the state after
>> sbi_hart_start should be.
>> For example, we had a discussion about scounteren and regardless of what
>> default we choose in KVM, the userspace might want a different value.
>> I don't think that HSM start/stop is a hot path, so trapping to
>> userspace seems better than adding more kernel code.
>
> There are no implementation specific S-mode CSR reset values
> required at the moment.

Jessica mentioned that BSD requires scounteren to be non-zero, so
userspace should be able to provide that value.

I would prefer if KVM could avoid getting into those discussions.
We can just let userspace be as crazy as it wants.

>                         Whenever the need arises, we will extend
> the ONE_REG interface so that user space can specify custom
> CSR reset values at Guest/VM creation time. We don't need to
> forward SBI HSM calls to user space for custom S-mode CSR
> reset values.

The benefits of adding a new ONE_REG interface seem very small compared
to the drawbacks of having extra kernel code.

If userspace wanted to reset or set up new multi-VCPU VMs often, we
could add an interface that loads the whole register state from
userspace in a single IOCTL, because ONE_REG is not the best interface
for bulk data transfer either.

>> Forwarding all the unimplemented SBI ecalls shouldn't be a performance
>> issue, because S-mode software would hopefully learn after the first
>> error and stop trying again.
>>
>> Allowing userspace to fully implement the ecall instruction is one of the
>> motivations as well -- SBI is not a part of RISC-V ISA, so someone might
>> be interested in accelerating a different M-mode software with KVM.
>>
>> I'll send v4 later today -- there is a missing part in [2/2], because
>> userspace also needs to be able to emulate the base SBI extension.
>>
>
> [...]          The best approach is to selectively forward SBI
> calls to user space where needed (e.g. SBI system reset,
> SBI system suspend, SBI debug console, etc.).

That is exactly what my proposal does, it's just that the userspace says
what is "needed".

If we started with this mechanism, KVM would not have needed to add
SRST/SUSP/DBCN SBI emulation at all -- they would be forwarded as any
other unhandled ecall.

Re: [PATCH v3 0/2] RISC-V: KVM: VCPU reset fixes
Posted by Anup Patel 6 months, 4 weeks ago
On Fri, May 23, 2025 at 2:50 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>
> 2025-05-23T13:38:26+05:30, Anup Patel <apatel@ventanamicro.com>:
> > On Fri, May 23, 2025 at 12:47 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
> >>
> >> 2025-05-22T14:43:40-07:00, Atish Patra <atish.patra@linux.dev>:
> >> > On 5/15/25 7:37 AM, Radim Krčmář wrote:
> >> >> Hello,
> >> >>
> >> >> the design still requires a discussion.
> >> >>
> >> >> [v3 1/2] removes most of the additional changes that the KVM capability
> >> >> was doing in v2.  [v3 2/2] is new and previews a general solution to the
> >> >> lack of userspace control over KVM SBI.
> >> >>
> >> >
> >> > I am still missing the motivation behind it. If the motivation is SBI
> >> > HSM suspend, the PATCH2 doesn't achieve that as it forwards every call
> >> > to the user space. Why do you want to control hsm start/stop from the
> >> > user space ?
> >>
> >> HSM needs fixing, because KVM doesn't know what the state after
> >> sbi_hart_start should be.
> >> For example, we had a discussion about scounteren and regardless of what
> >> default we choose in KVM, the userspace might want a different value.
> >> I don't think that HSM start/stop is a hot path, so trapping to
> >> userspace seems better than adding more kernel code.
> >
> > There are no implementation specific S-mode CSR reset values
> > required at the moment.
>
> Jessica mentioned that BSD requires scounteren to be non-zero, so
> userspace should be able to provide that value.
>
> I would prefer if KVM could avoid getting into those discussions.
> We can just let userspace be as crazy as it wants.

The supervisor OS must not expect a particular state of S-mode
CSRs other than what is defined in the boot protocol or the SBI
specification.

Like mentioned before, scounteren setup in KVM RISC-V and
OpenSBI is a HACK for buggy OSes which don't set up scounteren
CSR correctly when a HART comes-up. Even KVM user space
should not entertain such HACKs.

>
> >                         Whenever the need arises, we will extend
> > the ONE_REG interface so that user space can specify custom
> > CSR reset values at Guest/VM creation time. We don't need to
> > forward SBI HSM calls to user space for custom S-mode CSR
> > reset values.
>
> The benefits of adding a new ONE_REG interface seem very small compared
> to the drawbacks of having extra kernel code.

Forwarding HSM stop to userspace will slow down CPU hotplug
on Guest side. Further, this directly impacts SBI system suspend
performance for Guest because Guest is supposed to turn-off all
VCPUs except one before entering the SBI system suspend state.

>
> If userspace would want to reset or setup new multi-VCPUs VMs often, we
> could add an interface that loads the whole register state from
> userspace in a single IOCTL, because ONE_REG is not the best interface
> for bulk data transfer either.

Instead of inventing a new interface, we can simply improve the
ONE_REG interface to allow bulk read/write of multiple ONE_REG
registers which will benefit other architectures as well.

If required in the future, this bulk ONE_REG read/write interface
can also be used to load reset state of VCPU CSRs.

>
> >> Forwarding all the unimplemented SBI ecalls shouldn't be a performance
> >> issue, because S-mode software would hopefully learn after the first
> >> error and stop trying again.
> >>
> >> Allowing userspace to fully implement the ecall instruction is one of the
> >> motivations as well -- SBI is not a part of RISC-V ISA, so someone might
> >> be interested in accelerating a different M-mode software with KVM.
> >>
> >> I'll send v4 later today -- there is a missing part in [2/2], because
> >> userspace also needs to be able to emulate the base SBI extension.
> >>
> >
> > [...]          The best approach is to selectively forward SBI
> > calls to user space where needed (e.g. SBI system reset,
> > SBI system suspend, SBI debug console, etc.).
>
> That is exactly what my proposal does, it's just that the userspace says
> what is "needed".

Nope, the approach taken by your patch is problematic because
for example userspace might disable SBI RFENCE or SBI PMU
with no means to implement these SBI extensions in user space.

We can't blindly forward an SBI extension to userspace when
userspace lacks the capability to implement this extension.

>
> If we started with this mechanism, KVM would not have needed to add
> SRST/SUSP/DBCN SBI emulation at all -- they would be forwarded as any
> other unhandled ecall.

SBI SRST extension is implemented in kernel space because
we are re-using the existing KVM_EXIT_SYSTEM_EVENT so
that we can also re-use existing KVM_EXIT_SYSTEM_EVENT
related code on userspace side.

SBI SUSP and DBCN are already forwarded to user space and
we only have minimal code in kernel space to ensure that:
1) In-kernel SBI BASE extension is aware of these extensions
2) These are forwarded to userspace only when userspace
    enables these extensions.

In addition to the above, we are blindly forwarding SBI
experimental and vendor extensions to user space so
user space can do its own thing by implementing these
extensions.
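
The two guarantees listed above, plus the blanket forwarding of the
experimental and vendor ranges, could be sketched roughly as follows.
This is a simplified model, not the in-kernel implementation: struct
vm_sbi_config, sbi_forward_to_user() and sbi_probe_opt_in() are
hypothetical names made up for this example. Only the extension IDs
and the experimental/vendor ID ranges are taken from the SBI
specification.

```c
#include <stdbool.h>

/* Extension IDs and reserved ID ranges from the SBI specification. */
#define SBI_EXT_DBCN               0x4442434EUL
#define SBI_EXT_SUSP               0x53555350UL
#define SBI_EXT_EXPERIMENTAL_START 0x08000000UL
#define SBI_EXT_EXPERIMENTAL_END   0x08FFFFFFUL
#define SBI_EXT_VENDOR_START       0x09000000UL
#define SBI_EXT_VENDOR_END         0x09FFFFFFUL

/* Hypothetical per-VM opt-in state set by userspace. */
struct vm_sbi_config {
	bool dbcn_enabled;
	bool susp_enabled;
};

static bool in_range(unsigned long v, unsigned long lo, unsigned long hi)
{
	return v >= lo && v <= hi;
}

/* Should a call to this extension exit to userspace? */
bool sbi_forward_to_user(const struct vm_sbi_config *cfg,
			 unsigned long ext_id)
{
	/* Opt-in extensions are forwarded only when userspace enabled them. */
	if (ext_id == SBI_EXT_DBCN)
		return cfg->dbcn_enabled;
	if (ext_id == SBI_EXT_SUSP)
		return cfg->susp_enabled;

	/* Experimental and vendor ranges are forwarded unconditionally. */
	return in_range(ext_id, SBI_EXT_EXPERIMENTAL_START,
			SBI_EXT_EXPERIMENTAL_END) ||
	       in_range(ext_id, SBI_EXT_VENDOR_START, SBI_EXT_VENDOR_END);
}

/*
 * Model of what a BASE probe would report for the opt-in extensions:
 * nonzero if available, 0 otherwise. Other extensions are left out of
 * this sketch and report 0.
 */
long sbi_probe_opt_in(const struct vm_sbi_config *cfg, unsigned long ext_id)
{
	if (ext_id == SBI_EXT_DBCN)
		return cfg->dbcn_enabled ? 1 : 0;
	if (ext_id == SBI_EXT_SUSP)
		return cfg->susp_enabled ? 1 : 0;
	return 0;
}
```

Keeping the probe result and the forwarding decision tied to the same
opt-in state is what keeps the guest's view of available extensions
consistent with what userspace can actually handle.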

Regards,
Anup

Re: [PATCH v3 0/2] RISC-V: KVM: VCPU reset fixes
Posted by Atish Patra 7 months ago
On 5/23/25 2:20 AM, Radim Krčmář wrote:
> 2025-05-23T13:38:26+05:30, Anup Patel <apatel@ventanamicro.com>:
>> On Fri, May 23, 2025 at 12:47 PM Radim Krčmář <rkrcmar@ventanamicro.com> wrote:
>>> 2025-05-22T14:43:40-07:00, Atish Patra <atish.patra@linux.dev>:
>>>> On 5/15/25 7:37 AM, Radim Krčmář wrote:
>>>>> Hello,
>>>>>
>>>>> the design still requires a discussion.
>>>>>
>>>>> [v3 1/2] removes most of the additional changes that the KVM capability
>>>>> was doing in v2.  [v3 2/2] is new and previews a general solution to the
>>>>> lack of userspace control over KVM SBI.
>>>>>
>>>> I am still missing the motivation behind it. If the motivation is SBI
>>>> HSM suspend, the PATCH2 doesn't achieve that as it forwards every call
>>>> to the user space. Why do you want to control hsm start/stop from the
>>>> user space ?
>>> HSM needs fixing, because KVM doesn't know what the state after
>>> sbi_hart_start should be.
>>> For example, we had a discussion about scounteren and regardless of what
>>> default we choose in KVM, the userspace might want a different value.
>>> I don't think that HSM start/stop is a hot path, so trapping to
>>> userspace seems better than adding more kernel code.
>> There are no implementation specific S-mode CSR reset values
>> required at the moment.
> Jessica mentioned that BSD requires scounteren to be non-zero, so
> userspace should be able to provide that value.

Jessica admitted that it was a bug which should be fixed.

> I would prefer if KVM could avoid getting into those discussions.
> We can just let userspace be as crazy as it wants.

The scounteren state you mentioned is already fixed now.

I would prefer to do this if there are more of these issues. Otherwise,
we may gain little by just delegating more work to the userspace for no 
reason.

>>                          Whenever the need arises, we will extend
>> the ONE_REG interface so that user space can specify custom
>> CSR reset values at Guest/VM creation time. We don't need to
>> forward SBI HSM calls to user space for custom S-mode CSR
>> reset values.
> The benefits of adding a new ONE_REG interface seem very small compared
> to the drawbacks of having extra kernel code.

How? The extra kernel code is just a few lines that register an SBI
extension and forward it to userspace. That covers the entire extension.

For extensions like HSM, only selected functions should be forwarded to
userspace, which defeats the purpose.

Let's not try to fix something that is not broken yet.

> If userspace would want to reset or setup new multi-VCPUs VMs often, we
> could add an interface that loads the whole register state from
> userspace in a single IOCTL, because ONE_REG is not the best interface
> for bulk data transfer either.
>
>>> Forwarding all the unimplemented SBI ecalls shouldn't be a performance
>>> issue, because S-mode software would hopefully learn after the first
>>> error and stop trying again.
>>>
>>> Allowing userspace to fully implement the ecall instruction is one of the
>>> motivations as well -- SBI is not a part of RISC-V ISA, so someone might
>>> be interested in accelerating a different M-mode software with KVM.
>>>
>>> I'll send v4 later today -- there is a missing part in [2/2], because
>>> userspace also needs to be able to emulate the base SBI extension.
>>>
>> [...]          The best approach is to selectively forward SBI
>> calls to user space where needed (e.g. SBI system reset,
>> SBI system suspend, SBI debug console, etc.).
> That is exactly what my proposal does, it's just that the userspace says
> what is "needed".
>
> If we started with this mechanism, KVM would not have needed to add
> SRST/SUSP/DBCN SBI emulation at all -- they would be forwarded as any
> other unhandled ecall.