> From: Mostafa Saleh <smostafa@google.com>
> Sent: Wednesday, January 8, 2025 8:10 PM
>
> On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > I am open to any suggestions, but I believe any solution considered for
> > > merge should have enough features to be usable on actual systems
> > > (a translating IOMMU can be used, for example), so either para-virt as
> > > this series or full nesting as the PoC above (or maybe both?), which
> > > IMO comes down to the trade-off mentioned above.
> >
> > IMHO no, you can have a completely usable solution without host/guest
> > controlled translation. This is equivalent to a bare metal system with
> > no IOMMU HW. This exists and is still broadly useful. The majority of
> > cloud VMs out there are in this configuration.
> >
> > That is the simplest/smallest thing to start with. Adding host/guest
> > controlled translation is a build-on-top exercise that seems to have
> > a lot of options and people may end up wanting to do all of them.
> >
> > I don't think you need to show that host/guest controlled translation
> > is possible to make progress, of course it is possible. Just getting
> > to the point where pKVM can own the SMMU HW and provide DMA isolation
> > between all of its direct host/guests is a good step.
>
> My plan was basically:
> 1) Finish and send nested SMMUv3 as RFC, with more insights about
>    performance and complexity trade-offs of both approaches.
>
> 2) Discuss next steps for the upstream solution in an upcoming conference
>    (like LPC, or earlier if possible) and work on upstreaming it.
>
> 3) Work on guest device passthrough and IOMMU support.
>
> I am open to gradually upstreaming this as you mentioned, where as a first
> step pKVM would establish DMA isolation without translation for the host;
> that should be enough to have a functional pKVM and run protected
> workloads.

Does that approach assume starting from a full-fledged SMMU driver
inside pKVM, or do we still expect the host to enumerate/initialize
the hw (but skip any translation) so the pKVM part can focus only
on managing translation?

I'm curious about the burden of maintaining another IOMMU
subsystem under the KVM directory. It's not built into the host kernel
image, but it is hosted in the same kernel repo. This series tried to
reduce the duplication via io-pgtable-arm, but considerable duplication
still exists (~2000 LOC in pKVM). That would be very confusing
moving forward and hard to maintain, e.g. ensuring bugs are fixed on
both sides.

The CPU side is a different story. IIUC KVM-ARM has been a split driver model
from day one for nVHE. It's kept even for VHE, with the only difference
being the use of a hypercall vs a direct function call. pKVM is added
incrementally on top of nVHE, hence it's natural to maintain the
pKVM logic in the kernel repo. No duplication.

But there is no such thing on the IOMMU side. Probably we'd want to
try reusing the entire IOMMU subsystem in pKVM if it's agreed
to use full-fledged drivers in pKVM. Or, if continuing the split-driver
model, should we try splitting the existing drivers into two parts and
connecting the two together via function call on native and via hypercall
in pKVM (similar to how KVM-ARM does it)?

Thanks
Kevin

Hi Kevin,

On Thu, Jan 16, 2025 at 08:51:11AM +0000, Tian, Kevin wrote:
> > From: Mostafa Saleh <smostafa@google.com>
> > Sent: Wednesday, January 8, 2025 8:10 PM
> >
> > On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > > I am open to any suggestions, but I believe any solution considered for
> > > > merge should have enough features to be usable on actual systems
> > > > (a translating IOMMU can be used, for example), so either para-virt as
> > > > this series or full nesting as the PoC above (or maybe both?), which
> > > > IMO comes down to the trade-off mentioned above.
> > >
> > > IMHO no, you can have a completely usable solution without host/guest
> > > controlled translation. This is equivalent to a bare metal system with
> > > no IOMMU HW. This exists and is still broadly useful. The majority of
> > > cloud VMs out there are in this configuration.
> > >
> > > That is the simplest/smallest thing to start with. Adding host/guest
> > > controlled translation is a build-on-top exercise that seems to have
> > > a lot of options and people may end up wanting to do all of them.
> > >
> > > I don't think you need to show that host/guest controlled translation
> > > is possible to make progress, of course it is possible. Just getting
> > > to the point where pKVM can own the SMMU HW and provide DMA isolation
> > > between all of its direct host/guests is a good step.
> >
> > My plan was basically:
> > 1) Finish and send nested SMMUv3 as RFC, with more insights about
> >    performance and complexity trade-offs of both approaches.
> >
> > 2) Discuss next steps for the upstream solution in an upcoming conference
> >    (like LPC, or earlier if possible) and work on upstreaming it.
> >
> > 3) Work on guest device passthrough and IOMMU support.
> >
> > I am open to gradually upstreaming this as you mentioned, where as a first
> > step pKVM would establish DMA isolation without translation for the host;
> > that should be enough to have a functional pKVM and run protected
> > workloads.
>
> Does that approach assume starting from a full-fledged SMMU driver
> inside pKVM, or do we still expect the host to enumerate/initialize
> the hw (but skip any translation) so the pKVM part can focus only
> on managing translation?

I have been thinking about this, and I think most of the initialization
won’t be changed; we would do any possible initialization in the
kernel, avoiding complexity in the hypervisor (parsing
device-tree/ACPI, ...). That also makes code re-use easier if both
drivers do it in kernel space.

>
> I'm curious about the burden of maintaining another IOMMU
> subsystem under the KVM directory. It's not built into the host kernel
> image, but it is hosted in the same kernel repo. This series tried to
> reduce the duplication via io-pgtable-arm, but considerable duplication
> still exists (~2000 LOC in pKVM). That would be very confusing
> moving forward and hard to maintain, e.g. ensuring bugs are fixed on
> both sides.

The KVM IOMMU subsystem is very different from the kernel one; it’s about
paravirtualization and abstraction. I tried my best to make sure all
possible code can be re-used, by splitting out arm-smmu-v3-common.c and
io-pgtable-arm-common.c and even re-using iommu_iotlb_gather from the
iommu code.
So my guess is there won't be much of that effort, as there is no
duplication in logic.

I am still thinking about what v3 will look like, but as mentioned I am
inclined towards Jason’s suggestion to reduce the series, remove the
paravirtualization stuff and only establish DMA isolation as a starting
point. That will remove a lot of code from the KVM IOMMU for now, but
we'd need to address that later, and we can build on top of this code
either a para-virtual approach or a nested-emulation one.

>
> The CPU side is a different story. IIUC KVM-ARM has been a split driver model
> from day one for nVHE. It's kept even for VHE, with the only difference
> being the use of a hypercall vs a direct function call. pKVM is added
> incrementally on top of nVHE, hence it's natural to maintain the
> pKVM logic in the kernel repo. No duplication.
>
> But there is no such thing on the IOMMU side. Probably we'd want to
> try reusing the entire IOMMU subsystem in pKVM if it's agreed
> to use full-fledged drivers in pKVM. Or, if continuing the split-driver
> model, should we try splitting the existing drivers into two parts and
> connecting the two together via function call on native and via hypercall
> in pKVM (similar to how KVM-ARM does it)?

For the IOMMU KVM code, it’s quite different from the kernel one and
serves different purposes, so there is no logic duplication there.
The idea of using hypercalls/function calls in some places for VHE/nVHE
doesn’t really translate here, as the driver is already abstracted by
iommu_ops, unlike KVM, which has one code base for everything. As I
mentioned in another reply, we can standardize the hypercall part of the
kernel driver into an IOMMU-agnostic file (as virtio-iommu does), and the
KVM SMMUv3 kernel driver would only be responsible for initialization;
that should be the closest to the split model in nVHE.
Also, pKVM has some different code paths in the kernel; for example, pKVM
has a different mem abort handler and different initialization (pkvm.c).

Thanks,
Mostafa

>
> Thanks
> Kevin
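
[For illustration, a minimal sketch of the para-virtual split described
above: an IOMMU-agnostic hypercall layer in the host kernel, with pKVM
owning and validating the page tables. The hypercall function IDs and
helper names here are hypothetical, not taken from the series.]

/*
 * Sketch (not from the series) of a para-virtual IOMMU map path: the
 * host kernel driver forwards map/unmap requests to the hypervisor
 * through generic hypercalls, and pKVM installs the mappings in page
 * tables it owns. Function IDs and names are hypothetical.
 */
#include <linux/arm-smccc.h>
#include <linux/errno.h>
#include <linux/types.h>

#define HVC_PKVM_IOMMU_MAP	0xc6000040	/* hypothetical function ID */
#define HVC_PKVM_IOMMU_UNMAP	0xc6000041	/* hypothetical function ID */

static int pkvm_iommu_map(u32 domain_id, unsigned long iova,
			  phys_addr_t paddr, size_t size, u32 prot)
{
	struct arm_smccc_res res;

	/*
	 * Trap to the hypervisor; pKVM is expected to check that the host
	 * still owns paddr (i.e. it was not donated to a protected guest)
	 * before installing the mapping.
	 */
	arm_smccc_1_1_hvc(HVC_PKVM_IOMMU_MAP, domain_id, iova, paddr,
			  size, prot, &res);
	return res.a0 ? -EINVAL : 0;
}

static int pkvm_iommu_unmap(u32 domain_id, unsigned long iova, size_t size)
{
	struct arm_smccc_res res;

	arm_smccc_1_1_hvc(HVC_PKVM_IOMMU_UNMAP, domain_id, iova, size, &res);
	return res.a0 ? -EINVAL : 0;
}

[A para-virtual kernel driver would call something like this roughly
where a normal driver's map_pages/unmap_pages callbacks do their work,
leaving only probe/initialization code vendor-specific.]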

> From: Mostafa Saleh <smostafa@google.com>
> Sent: Wednesday, January 22, 2025 7:29 PM
>
> Hi Kevin,
>
> On Thu, Jan 16, 2025 at 08:51:11AM +0000, Tian, Kevin wrote:
> > > From: Mostafa Saleh <smostafa@google.com>
> > > Sent: Wednesday, January 8, 2025 8:10 PM
> > >
> > > My plan was basically:
> > > 1) Finish and send nested SMMUv3 as RFC, with more insights about
> > >    performance and complexity trade-offs of both approaches.
> > >
> > > 2) Discuss next steps for the upstream solution in an upcoming conference
> > >    (like LPC, or earlier if possible) and work on upstreaming it.
> > >
> > > 3) Work on guest device passthrough and IOMMU support.
> > >
> > > I am open to gradually upstreaming this as you mentioned, where as a first
> > > step pKVM would establish DMA isolation without translation for the host;
> > > that should be enough to have a functional pKVM and run protected
> > > workloads.
> >
> > Does that approach assume starting from a full-fledged SMMU driver
> > inside pKVM, or do we still expect the host to enumerate/initialize
> > the hw (but skip any translation) so the pKVM part can focus only
> > on managing translation?
>
> I have been thinking about this, and I think most of the initialization
> won’t be changed; we would do any possible initialization in the
> kernel, avoiding complexity in the hypervisor (parsing
> device-tree/ACPI, ...). That also makes code re-use easier if both
> drivers do it in kernel space.

yeah that'd make sense for now.

>
> >
> > I'm curious about the burden of maintaining another IOMMU
> > subsystem under the KVM directory. It's not built into the host kernel
> > image, but it is hosted in the same kernel repo. This series tried to
> > reduce the duplication via io-pgtable-arm, but considerable duplication
> > still exists (~2000 LOC in pKVM). That would be very confusing
> > moving forward and hard to maintain, e.g. ensuring bugs are fixed on
> > both sides.
>
> The KVM IOMMU subsystem is very different from the kernel one; it’s about
> paravirtualization and abstraction. I tried my best to make sure all
> possible code can be re-used, by splitting out arm-smmu-v3-common.c and
> io-pgtable-arm-common.c and even re-using iommu_iotlb_gather from the
> iommu code.
> So my guess is there won't be much of that effort, as there is no
> duplication in logic.

I'm not sure how different it is. In concept it still manages iommu
mappings, just with additional restrictions. Bear with me, I haven't
looked into the details of the 2000 LOC pKVM SMMU driver,
but the size does scare me, especially considering the case when
other vendors are supported later.

Let's keep it in mind and re-check after you have v3. It will be simpler,
so the actual difference between a pKVM IOMMU driver and a normal kernel
IOMMU driver can be judged more easily than it can now.

The learning here would be beneficial to the design of other pKVM
components, e.g. when porting pKVM to x86. Currently KVM x86 is
monolithic. Maintaining pKVM under KVM/x86 would be a much
bigger challenge than doing it under KVM/arm. There will also be
questions about what can be shared and how to better maintain
the pKVM-specific logic in KVM/x86.

Overall my gut feeling is that the pKVM-specific code must be small
enough, otherwise maintaining a run-time irrelevant project in the
kernel repo would be questionable. 😊

On Thu, Jan 23, 2025 at 08:25:13AM +0000, Tian, Kevin wrote:
> > From: Mostafa Saleh <smostafa@google.com>
> > Sent: Wednesday, January 22, 2025 7:29 PM
> >
> > Hi Kevin,
> >
> > On Thu, Jan 16, 2025 at 08:51:11AM +0000, Tian, Kevin wrote:
> > > > From: Mostafa Saleh <smostafa@google.com>
> > > > Sent: Wednesday, January 8, 2025 8:10 PM
> > > >
> > > > My plan was basically:
> > > > 1) Finish and send nested SMMUv3 as RFC, with more insights about
> > > >    performance and complexity trade-offs of both approaches.
> > > >
> > > > 2) Discuss next steps for the upstream solution in an upcoming conference
> > > >    (like LPC, or earlier if possible) and work on upstreaming it.
> > > >
> > > > 3) Work on guest device passthrough and IOMMU support.
> > > >
> > > > I am open to gradually upstreaming this as you mentioned, where as a first
> > > > step pKVM would establish DMA isolation without translation for the host;
> > > > that should be enough to have a functional pKVM and run protected
> > > > workloads.
> > >
> > > Does that approach assume starting from a full-fledged SMMU driver
> > > inside pKVM, or do we still expect the host to enumerate/initialize
> > > the hw (but skip any translation) so the pKVM part can focus only
> > > on managing translation?
> >
> > I have been thinking about this, and I think most of the initialization
> > won’t be changed; we would do any possible initialization in the
> > kernel, avoiding complexity in the hypervisor (parsing
> > device-tree/ACPI, ...). That also makes code re-use easier if both
> > drivers do it in kernel space.
>
> yeah that'd make sense for now.
>
> >
> > >
> > > I'm curious about the burden of maintaining another IOMMU
> > > subsystem under the KVM directory. It's not built into the host kernel
> > > image, but it is hosted in the same kernel repo. This series tried to
> > > reduce the duplication via io-pgtable-arm, but considerable duplication
> > > still exists (~2000 LOC in pKVM). That would be very confusing
> > > moving forward and hard to maintain, e.g. ensuring bugs are fixed on
> > > both sides.
> >
> > The KVM IOMMU subsystem is very different from the kernel one; it’s about
> > paravirtualization and abstraction. I tried my best to make sure all
> > possible code can be re-used, by splitting out arm-smmu-v3-common.c and
> > io-pgtable-arm-common.c and even re-using iommu_iotlb_gather from the
> > iommu code.
> > So my guess is there won't be much of that effort, as there is no
> > duplication in logic.
>
> I'm not sure how different it is. In concept it still manages iommu
> mappings, just with additional restrictions. Bear with me, I haven't
> looked into the details of the 2000 LOC pKVM SMMU driver,
> but the size does scare me, especially considering the case when
> other vendors are supported later.
>
> Let's keep it in mind and re-check after you have v3. It will be simpler,
> so the actual difference between a pKVM IOMMU driver and a normal kernel
> IOMMU driver can be judged more easily than it can now.

I see, I believe we can reduce the size by re-using more data-structure
types + more refactoring on the kernel side.
Also, we can make many parts of the code standard outside the driver,
such as issuing hypercalls, dealing with memory allocation, and so on,
so other IOMMUs will only add minimal code.

>
> The learning here would be beneficial to the design of other pKVM
> components, e.g. when porting pKVM to x86. Currently KVM x86 is
> monolithic. Maintaining pKVM under KVM/x86 would be a much
> bigger challenge than doing it under KVM/arm.
> There will also be
> questions about what can be shared and how to better maintain
> the pKVM-specific logic in KVM/x86.
>
> Overall my gut feeling is that the pKVM-specific code must be small
> enough, otherwise maintaining a run-time irrelevant project in the
> kernel repo would be questionable. 😊
>

I am not sure I understand, but I don’t see how pKVM is irrelevant:
it’s a mode in KVM (just like nVHE/hVHE, where they run in two exception
levels) and it can’t be separated from the kernel, as that would defeat the
point of KVM; it would mean that all hypercalls have to be a stable ABI,
and the same for the shared data, shared structs, types...

Thanks,
Mostafa
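
[As an illustration of Mostafa's point above about keeping the
vendor-specific surface small, the hypervisor-side IOMMU core could own
domain tracking, hypercall dispatch and page-table memory, while each
vendor driver fills in a small ops structure. The struct and callback
names below are hypothetical, not taken from the series.]

/*
 * Hypothetical sketch: a common hypervisor-side IOMMU core handles the
 * map/unmap hypercalls and page-table allocation, and a vendor driver
 * (e.g. SMMUv3) only provides a few hardware-specific callbacks.
 * All names are illustrative.
 */
#include <linux/types.h>

struct kvm_hyp_iommu_ops {
	int	(*init)(void);				/* reset/configure the hardware   */
	int	(*attach_dev)(u32 domain_id, u32 sid);	/* route a stream ID to a domain  */
	int	(*detach_dev)(u32 domain_id, u32 sid);
	void	(*tlb_sync)(u32 domain_id);		/* flush after map/unmap batches  */
};

/* A vendor driver would register its ops once, at de-privilege time. */
int kvm_hyp_iommu_register(const struct kvm_hyp_iommu_ops *ops);

[With something along these lines, supporting another IOMMU would mostly
mean implementing these callbacks, while the hypercall handling and
page-table code stay shared.]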

> From: Mostafa Saleh <smostafa@google.com>
> Sent: Wednesday, January 29, 2025 8:21 PM
>
> On Thu, Jan 23, 2025 at 08:25:13AM +0000, Tian, Kevin wrote:
> >
> > The learning here would be beneficial to the design of other pKVM
> > components, e.g. when porting pKVM to x86. Currently KVM x86 is
> > monolithic. Maintaining pKVM under KVM/x86 would be a much
> > bigger challenge than doing it under KVM/arm. There will also be
> > questions about what can be shared and how to better maintain
> > the pKVM-specific logic in KVM/x86.
> >
> > Overall my gut feeling is that the pKVM-specific code must be small
> > enough, otherwise maintaining a run-time irrelevant project in the
> > kernel repo would be questionable. 😊
> >
>
> I am not sure I understand, but I don’t see how pKVM is irrelevant:
> it’s a mode in KVM (just like nVHE/hVHE, where they run in two exception
> levels) and it can’t be separated from the kernel, as that would defeat the
> point of KVM; it would mean that all hypercalls have to be a stable ABI,
> and the same for the shared data, shared structs, types...
>

Yes, pKVM doesn't favor a stable ABI. My point was more that nVHE is a
hardware limitation, so KVM-ARM coped with it from day one, and adding
the concept of pKVM on top was relatively easy; but changing other
subsystems to support this split model just for pKVM adds more
maintenance burden. Then the maintainers may challenge the value of
supporting pKVM if the cost of maintaining the split model becomes too
large...

Anyway, we will see how it turns out with more discussions on your next
version.

On Wed, Jan 29, 2025 at 12:21:01PM +0000, Mostafa Saleh wrote:
> levels) and it can’t be separated from the kernel, as that would defeat the
> point of KVM; it would mean that all hypercalls have to be a stable ABI,
> and the same for the shared data, shared structs, types...

Sorry, just trying to understand this sentence: today pKVM has no
stable ABI, right? That is the whole point of building it into the
kernel?

Things like the CC world are creating stable ABIs for their pKVM-like
environments because they are not built into the kernel? And thus they
take the pain of that?

Jason

On Wed, Jan 29, 2025 at 09:50:53AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 12:21:01PM +0000, Mostafa Saleh wrote:
> > levels) and it can’t be separated from the kernel, as that would defeat the
> > point of KVM; it would mean that all hypercalls have to be a stable ABI,
> > and the same for the shared data, shared structs, types...
>
> Sorry, just trying to understand this sentence: today pKVM has no
> stable ABI, right? That is the whole point of building it into the
> kernel?

Yes.

>
> Things like the CC world are creating stable ABIs for their pKVM-like
> environments because they are not built into the kernel? And thus they
> take the pain of that?

Yes. My point is that we can't just separate pKVM out, as Kevin was
suggesting, since it has no stable ABI and is tightly coupled with the
kernel.

Thanks,
Mostafa

>
> Jason