[PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops

Posted by Mostafa Saleh 2 months, 1 week ago
Register the SMMUv3 through the IOMMU ops, which only support identity
domains. This allows the driver to know which devices are currently in
use, so it can properly enable/disable them.

Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c   | 92 ++++++++++++++++++-
 1 file changed, 91 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
index 2e51e211250d..d7b2a50a4cb2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -96,6 +96,95 @@ static int kvm_arm_smmu_device_reset(struct host_arm_smmu_device *host_smmu)
 	return 0;
 }
 
+static struct platform_driver kvm_arm_smmu_driver;
+static struct arm_smmu_device *
+kvm_arm_smmu_get_by_fwnode(struct fwnode_handle *fwnode)
+{
+	struct device *dev;
+
+	dev = driver_find_device_by_fwnode(&kvm_arm_smmu_driver.driver, fwnode);
+	put_device(dev);
+	return dev ? dev_get_drvdata(dev) : NULL;
+}
+
+static struct iommu_device *kvm_arm_smmu_probe_device(struct device *dev)
+{
+	struct arm_smmu_device *smmu;
+	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+
+	if (WARN_ON_ONCE(dev_iommu_priv_get(dev)))
+		return ERR_PTR(-EBUSY);
+
+	smmu = kvm_arm_smmu_get_by_fwnode(fwspec->iommu_fwnode);
+	if (!smmu)
+		return ERR_PTR(-ENODEV);
+
+	dev_iommu_priv_set(dev, smmu);
+	return &smmu->iommu;
+}
+
+static void kvm_arm_smmu_release_device(struct device *dev)
+{
+	int i;
+	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+	struct arm_smmu_device *smmu = dev_iommu_priv_get(dev);
+	struct host_arm_smmu_device *host_smmu = smmu_to_host(smmu);
+
+	for (i = 0; i < fwspec->num_ids; i++) {
+		int sid = fwspec->ids[i];
+
+		kvm_call_hyp_nvhe(__pkvm_iommu_disable_dev, host_smmu->id, sid);
+	}
+}
+
+static phys_addr_t kvm_arm_smmu_iova_to_phys(struct iommu_domain *domain,
+					     dma_addr_t iova)
+{
+	return iova;
+}
+
+static int kvm_arm_smmu_attach_dev(struct iommu_domain *domain,
+				   struct device *dev)
+{
+	int i, ret = 0;
+	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+	struct arm_smmu_device *smmu = dev_iommu_priv_get(dev);
+	struct host_arm_smmu_device *host_smmu = smmu_to_host(smmu);
+
+	for (i = 0; i < fwspec->num_ids; i++) {
+		int sid = fwspec->ids[i];
+
+		ret = kvm_call_hyp_nvhe(__pkvm_iommu_enable_dev, host_smmu->id, sid);
+		if (ret)
+			goto out_err;
+	}
+	return ret;
+out_err:
+	while (i--)
+		kvm_call_hyp_nvhe(__pkvm_iommu_disable_dev, host_smmu->id, fwspec->ids[i]);
+
+	return ret;
+}
+
+static struct iommu_domain kvm_arm_smmu_def_domain = {
+	.type = IOMMU_DOMAIN_IDENTITY,
+	.ops = &(const struct iommu_domain_ops) {
+		.attach_dev	= kvm_arm_smmu_attach_dev,
+		.iova_to_phys	= kvm_arm_smmu_iova_to_phys,
+	}
+};
+
+static struct iommu_ops kvm_arm_smmu_ops = {
+	.device_group		= arm_smmu_device_group,
+	.of_xlate		= arm_smmu_of_xlate,
+	.get_resv_regions	= arm_smmu_get_resv_regions,
+	.probe_device		= kvm_arm_smmu_probe_device,
+	.release_device		= kvm_arm_smmu_release_device,
+	.pgsize_bitmap		= -1UL,
+	.owner			= THIS_MODULE,
+	.default_domain 	= &kvm_arm_smmu_def_domain,
+};
+
 static int kvm_arm_smmu_probe(struct platform_device *pdev)
 {
 	int ret;
@@ -170,7 +259,7 @@ static int kvm_arm_smmu_probe(struct platform_device *pdev)
 	hyp_smmu->features = smmu->features;
 	kvm_arm_smmu_cur++;
 
-	return 0;
+	return arm_smmu_register_iommu(smmu, &kvm_arm_smmu_ops, ioaddr);
 }
 
 static void kvm_arm_smmu_remove(struct platform_device *pdev)
@@ -184,6 +273,7 @@ static void kvm_arm_smmu_remove(struct platform_device *pdev)
 	 */
 	arm_smmu_device_disable(smmu);
 	arm_smmu_update_gbpa(smmu, host_smmu->boot_gbpa, GBPA_ABORT);
+	arm_smmu_unregister_iommu(smmu);
 }
 
 static const struct of_device_id arm_smmu_of_match[] = {
-- 
2.50.1.552.g942d659e1b-goog
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Jason Gunthorpe 2 months, 1 week ago
On Mon, Jul 28, 2025 at 05:53:16PM +0000, Mostafa Saleh wrote:
> Register the SMMUv3 through IOMMU ops, that only support identity
> domains. This allows the driver to know which device are currently used
> to properly enable/disable then.
> 
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> ---
>  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c   | 92 ++++++++++++++++++-
>  1 file changed, 91 insertions(+), 1 deletion(-)

Can you split the new iommu subsystem driver out please? I think I
asked this before.

This series is big, reviewing a new iommu driver should be done separately.

Please review all the comments for the verisilicon driver, I think
many of the remarks apply here too:

 - Domain attachment looks questionable. Please do not have
   attach/detach language at all in the hypervisor facing API.
 - Get the ordering and APIs right so replace works. You need this to support RMRs
 - Use a blocking domain not some unclear detach idea
 - Use a blocking domain for release
 - Use the smmu-v3 approach for the fwspec, don't store things in the drvdata.

Jason
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Mostafa Saleh 2 months, 1 week ago
On Wed, Jul 30, 2025 at 11:42:53AM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 28, 2025 at 05:53:16PM +0000, Mostafa Saleh wrote:
> > Register the SMMUv3 through IOMMU ops, that only support identity
> > domains. This allows the driver to know which device are currently used
> > to properly enable/disable then.
> > 
> > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > ---
> >  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c   | 92 ++++++++++++++++++-
> >  1 file changed, 91 insertions(+), 1 deletion(-)
> 
> Can you split the new iommu subysstem driver out please? I think I
> asked this before.

Sorry, maybe I misunderstood: do you mean split this patch into multiple
patches, or split the whole KVM SMMUv3 driver out of this series?

> 
> This series is big, reviewing a new iommu driver should be done separately.
> 
> Please review all the comments for the verisilicon driver, I think
> many of the remarks apply here too:

Thanks, I will go through them.

> 
>  - Domain attachment looks questionable. Please do not have
>    attach/detach language at all in the hypervisor facing API.

I am not sure I understand this one; the hypervisor API has no
attach/detach APIs.
We only notify the hypervisor via “enable/disable” hypercalls when
devices are attached or released (as only the IDENTITY domain is supported),
so it can enable or disable translation.

>  - Get the ordering and APIs right so replace works. You need this to support RMRs

I see, we can’t support bypass for security reasons, but we can enable
the identity map for such devices.

>  - Use a blocking domain not some unclear detatch idea

There is not a single detach in this driver; we only support the attach and
release operations, which just go to the hypervisor as enable/disable identity.

>  - Use a blocking domain for release

I see, I can add a “blocked_domain” similar to “arm_smmu_blocked_domain”,
which just disables the device translation. I will look into it; I am just
concerned it's more code, as this driver only supports a single domain
type.
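
A minimal sketch of what that could look like in this driver, mirroring the
identity domain added in this patch; the disable hypercall is the one already
used by kvm_arm_smmu_release_device() above, while the function and domain
names here are hypothetical:

static int kvm_arm_smmu_attach_dev_blocked(struct iommu_domain *domain,
					   struct device *dev)
{
	int i;
	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
	struct arm_smmu_device *smmu = dev_iommu_priv_get(dev);
	struct host_arm_smmu_device *host_smmu = smmu_to_host(smmu);

	/* Put every stream ID of this device back into the blocked state */
	for (i = 0; i < fwspec->num_ids; i++)
		kvm_call_hyp_nvhe(__pkvm_iommu_disable_dev, host_smmu->id,
				  fwspec->ids[i]);
	return 0;
}

static struct iommu_domain kvm_arm_smmu_blocked_domain = {
	.type = IOMMU_DOMAIN_BLOCKED,
	.ops = &(const struct iommu_domain_ops) {
		.attach_dev	= kvm_arm_smmu_attach_dev_blocked,
	}
};

kvm_arm_smmu_ops would then point .blocked_domain at this, and release_device
could rely on the core attaching the blocked domain instead of issuing the
disable hypercalls itself.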

>  - Use the smmu-v3 approach for the fwspec, don't store things in the drvdata.

This is using “dev_iommu_fwspec_get” similar to the SMMUv3 driver,
am I missing something?

The driver only saves the "struct arm_smmu_device" in drvdata, which is
exactly what the other driver does (and which allows us to re-use a lot of
the code).

Thanks,
Mostafa


> 
> Jason
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Jason Gunthorpe 2 months, 1 week ago
On Wed, Jul 30, 2025 at 03:07:14PM +0000, Mostafa Saleh wrote:
> On Wed, Jul 30, 2025 at 11:42:53AM -0300, Jason Gunthorpe wrote:
> > On Mon, Jul 28, 2025 at 05:53:16PM +0000, Mostafa Saleh wrote:
> > > Register the SMMUv3 through IOMMU ops, that only support identity
> > > domains. This allows the driver to know which device are currently used
> > > to properly enable/disable then.
> > > 
> > > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > > ---
> > >  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c   | 92 ++++++++++++++++++-
> > >  1 file changed, 91 insertions(+), 1 deletion(-)
> > 
> > Can you split the new iommu subysstem driver out please? I think I
> > asked this before.
> 
> Sorry, maybe I misunderstood, do you mean split this patch into multiple
> patches or split all KVM SMMUv3 driver out of this series?

Yes the latter, the iommu driver introduction is best as its own
series

> >  - Domain attachment looks questionable. Please do not have
> >    attach/detach language at all in the hypervisor facing API.
> 
> I am not sure I understand this one, the hypervisor API has no
> attach/detach APIs?
> We only notify the hypervisor via “enable/disable” hypercalls when
> devices are attached or released (as only IDENTITY DOMAIN is supported)
> so it can enable or disable translation.

Same difference, different words.. We've had trouble with this kind of
ambiguous language.

attach identity
attach blocking

Keep it clean and simple

> >  - Get the ordering and APIs right so replace works. You need this to support RMRs
> 
> I see, we can’t support bypass for security reasons, but we can enable
> the identity map for such devices.

You will have a small issue if, at bootup, the pkvm side puts the
device into a blocking translation and then later switches to
"identity".

As far as I understand it, display scan-out buffers have to be
continuously and hitlessly mapped. So you want all the transitions
(bootup bypass, pkvm isolation, "identity", etc.) to continuously and
hitlessly map the RMRs containing the buffers.

I think :)

> >  - Use a blocking domain not some unclear detatch idea
> 
> There is not a single detach in this driver, we only support attach and release
> operation, which just goes to the hypervisor as enable/disable
> identity.

'disable' then.

> >  - Use a blocking domain for release
> 
> I see, I can add “blocked_domain” similar to “arm_smmu_blocked_domain”
> which just disables the device translation, I will look into, I am just
> concerned it's more code as this driver only supports a single domain
> type.

Yes, this is the right thing. There should not be operations in the
driver that change the underlying translation other than attaching
domains. It becomes too hard to maintain.

 
> >  - Use the smmu-v3 approach for the fwspec, don't store things in the drvdata.
> 
> This is using “dev_iommu_fwspec_get” similar to the SMMUv3 driver,
> am I missing something?

Oh I got it confused, you are just calling 

+       .device_group           = arm_smmu_device_group,
+       .of_xlate               = arm_smmu_of_xlate,
+       .get_resv_regions       = arm_smmu_get_resv_regions,

Which is also weird, why does an entirely new driver call 4 random functions
in arm smmu?

Maybe don't, they are all trivial functions, or you don't need
them. For example you don't need MSI_IOVA_BASE if the driver doesn't
support paging.

Jason
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Mostafa Saleh 2 months ago
On Wed, Jul 30, 2025 at 01:47:52PM -0300, Jason Gunthorpe wrote:
> On Wed, Jul 30, 2025 at 03:07:14PM +0000, Mostafa Saleh wrote:
> > On Wed, Jul 30, 2025 at 11:42:53AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jul 28, 2025 at 05:53:16PM +0000, Mostafa Saleh wrote:
> > > > Register the SMMUv3 through IOMMU ops, that only support identity
> > > > domains. This allows the driver to know which device are currently used
> > > > to properly enable/disable then.
> > > > 
> > > > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > > > ---
> > > >  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c   | 92 ++++++++++++++++++-
> > > >  1 file changed, 91 insertions(+), 1 deletion(-)
> > > 
> > > Can you split the new iommu subysstem driver out please? I think I
> > > asked this before.
> > 
> > Sorry, maybe I misunderstood, do you mean split this patch into multiple
> > patches or split all KVM SMMUv3 driver out of this series?
> 
> Yes the latter, the iommu driver introduction is best as its own
> series

I thought about that, but I was worried the maintainers wouldn't like
introducing the infrastructure in the hypervisor first without a user.
I am open to splitting this, but let’s see what they think.

> 
> > >  - Domain attachment looks questionable. Please do not have
> > >    attach/detach language at all in the hypervisor facing API.
> > 
> > I am not sure I understand this one, the hypervisor API has no
> > attach/detach APIs?
> > We only notify the hypervisor via “enable/disable” hypercalls when
> > devices are attached or released (as only IDENTITY DOMAIN is supported)
> > so it can enable or disable translation.
> 
> Same difference, different words.. We've had trouble with this kind of
> ambiguous language.
> 
> attach identity
> attach blocking
> 
> Keep it clean and simple

Makes sense; from the kernel point of view it will be attached to
identity/blocking domains, but the hypervisor API is just an enable/disable
HVC, as it doesn’t know what a domain is. If terminology is really a problem,
I can make it one hypercall, “set_state”, with on/off or identity/blocking.

> 
> > >  - Get the ordering and APIs right so replace works. You need this to support RMRs
> > 
> > I see, we can’t support bypass for security reasons, but we can enable
> > the identity map for such devices.
> 
> You will have a small issue if the bootup has the pkvm side put the
> device into a blocking translation then later switches to an
> "identity".
> 
> As far as I understand it display scan out buffers have to be
> continuously hitlessly mapped. So you want all the transitions from
> bootup bypass, pkvm isolation, "identity", etc to continuously
> hitlessly map the RMRs containing the buffers.
> 
> I think :)

I think that would be hard with pKVM; maybe it’s possible to achieve
that seamlessly, but we would need to export such devices to pKVM at
boot, before de-privileging.

TBH, I am not sure what hardware does that. So another option is to fail
gracefully if an RMR exists (which falls back to the current driver), and then
pKVM would run without DMA isolation, which is the status quo.

And then RMR support for pKVM can be added later if there is interest.

> 
> > >  - Use a blocking domain not some unclear detatch idea
> > 
> > There is not a single detach in this driver, we only support attach and release
> > operation, which just goes to the hypervisor as enable/disable
> > identity.
> 
> 'disable' then.
> 
> > >  - Use a blocking domain for release
> > 
> > I see, I can add “blocked_domain” similar to “arm_smmu_blocked_domain”
> > which just disables the device translation, I will look into, I am just
> > concerned it's more code as this driver only supports a single domain
> > type.
> 
> Yes, this is the right thing. There should not be operations in the
> driver that change the underyling translation that are anything other
> than attaching domains. It becomes too hard to maintain.

Will do as mentioned above.

> 
>  
> > >  - Use the smmu-v3 approach for the fwspec, don't store things in the drvdata.
> > 
> > This is using “dev_iommu_fwspec_get” similar to the SMMUv3 driver,
> > am I missing something?
> 
> Oh I got it confused, you are just calling 
> 
> +       .device_group           = arm_smmu_device_group,
> +       .of_xlate               = arm_smmu_of_xlate,
> +       .get_resv_regions       = arm_smmu_get_resv_regions,
> 
> Which is also weird, why does an entirely new driver call 4 random functions
> in arm smmu?
> 
> Maybe don't, they are all trivial functions, or you don't need
> them. For example you don't need MSI_IOVA_BASE if the driver doesn't
> support paging.
> 

They are not random; as part of this series the SMMUv3 driver is split so
that some of the code goes to “arm-smmu-v3-common.c”, which is used by
both drivers. This reduces a lot of duplication.

I am not sure if we need get_resv_regions, maybe it's useful for sysfs
"/sys/kernel/iommu_groups/reserved_regions"? I will double check.

Thanks,
Mostafa

> Jason
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Jason Gunthorpe 2 months ago
On Thu, Jul 31, 2025 at 02:17:17PM +0000, Mostafa Saleh wrote:
> On Wed, Jul 30, 2025 at 01:47:52PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jul 30, 2025 at 03:07:14PM +0000, Mostafa Saleh wrote:
> > > On Wed, Jul 30, 2025 at 11:42:53AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Jul 28, 2025 at 05:53:16PM +0000, Mostafa Saleh wrote:
> > > > > Register the SMMUv3 through IOMMU ops, that only support identity
> > > > > domains. This allows the driver to know which device are currently used
> > > > > to properly enable/disable then.
> > > > > 
> > > > > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > > > > ---
> > > > >  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c   | 92 ++++++++++++++++++-
> > > > >  1 file changed, 91 insertions(+), 1 deletion(-)
> > > > 
> > > > Can you split the new iommu subysstem driver out please? I think I
> > > > asked this before.
> > > 
> > > Sorry, maybe I misunderstood, do you mean split this patch into multiple
> > > patches or split all KVM SMMUv3 driver out of this series?
> > 
> > Yes the latter, the iommu driver introduction is best as its own
> > series
> 
> I thought about that but I was worried the maintainers wouldn't like
> introducing the infrastructure first in the hypervisor without a user.
> I am open to split this, but let’s see what they think.

You can merge both series at the same time

> Makes sense, from the kernel point of view it will be attached to
> identity/blocking domains, but the hypervisor api is just enable/disable HVC
> as it doesn’t know what is a domain. If terminology is really a problem,
> I can make it one hypercall as “set_state” with on/off or identity/blocking

I would call it set_state with states IDENTITY/BLOCKING. That is
clear. enable/disable is ambiguous.
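
A rough sketch of what the kernel side of that single hypercall could look
like, reusing the host_arm_smmu_device/kvm_call_hyp_nvhe pieces from this
patch; the __pkvm_iommu_set_state HVC and the enum are hypothetical names,
not taken from the series:

/* Hypothetical per-SID states replacing the enable/disable pair */
enum kvm_arm_smmu_dev_state {
	KVM_ARM_SMMU_DEV_BLOCKED,
	KVM_ARM_SMMU_DEV_IDENTITY,
};

static int kvm_arm_smmu_set_dev_state(struct host_arm_smmu_device *host_smmu,
				      u32 sid,
				      enum kvm_arm_smmu_dev_state state)
{
	/* One HVC with an explicit state instead of enable/disable */
	return kvm_call_hyp_nvhe(__pkvm_iommu_set_state, host_smmu->id,
				 sid, state);
}

Both the identity and blocked attach paths would then funnel through this
helper with an explicit state argument.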

> TBH, I am not sure what hardware does that. So, another option is to fail
> gracefully if RMR exists (which falls back to the current driver) and then
> pKVM would run with DMA isolation, which is the status quo.

iGPUs either access the DRAM through the iommu or they use some OS
invisible side band channel.

The ones that use the iommu have this quirk.

> They are not random, as part of this series the SMMUv3 driver is split
> where some of the code goes to “arm-smmu-v3-common.c” which is used by
> both drivers, this reduces a lot of duplication.

I find it very confusing.

It made sense to factor some of the code out so that pKVM can have
its own smmuv3 HW driver, sure.

But I don't understand why a paravirtualized iommu driver for pKVM has
any relation to smmuv3. Shouldn't it just be calling some hypercalls
to set IDENTITY/BLOCKING?

> I am not sure if we need get_resv_regions, maybe it's useful for sysfs
> "/sys/kernel/iommu_groups/reserved_regions"? I will double check.

It is important to get this info from the FW..

Jason
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Mostafa Saleh 2 months ago
On Thu, Jul 31, 2025 at 01:57:57PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 31, 2025 at 02:17:17PM +0000, Mostafa Saleh wrote:
> > On Wed, Jul 30, 2025 at 01:47:52PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jul 30, 2025 at 03:07:14PM +0000, Mostafa Saleh wrote:
> > > > On Wed, Jul 30, 2025 at 11:42:53AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Jul 28, 2025 at 05:53:16PM +0000, Mostafa Saleh wrote:
> > > > > > Register the SMMUv3 through IOMMU ops, that only support identity
> > > > > > domains. This allows the driver to know which device are currently used
> > > > > > to properly enable/disable then.
> > > > > > 
> > > > > > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > > > > > ---
> > > > > >  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c   | 92 ++++++++++++++++++-
> > > > > >  1 file changed, 91 insertions(+), 1 deletion(-)
> > > > > 
> > > > > Can you split the new iommu subysstem driver out please? I think I
> > > > > asked this before.
> > > > 
> > > > Sorry, maybe I misunderstood, do you mean split this patch into multiple
> > > > patches or split all KVM SMMUv3 driver out of this series?
> > > 
> > > Yes the latter, the iommu driver introduction is best as its own
> > > series
> > 
> > I thought about that but I was worried the maintainers wouldn't like
> > introducing the infrastructure first in the hypervisor without a user.
> > I am open to split this, but let’s see what they think.
> 
> You can merge both series at the same time

Ok, I can split the last 12 patches which are the SMMUv3 driver, if the
maintainers are ok with that.

> 
> > Makes sense, from the kernel point of view it will be attached to
> > identity/blocking domains, but the hypervisor api is just enable/disable HVC
> > as it doesn’t know what is a domain. If terminology is really a problem,
> > I can make it one hypercall as “set_state” with on/off or identity/blocking
> 
> I would call it set_state with states IDENTITY/BLOCKING. That is
> clear. enable/disable is ambiguous.

Ok, will do that.

> 
> > TBH, I am not sure what hardware does that. So, another option is to fail
> > gracefully if RMR exists (which falls back to the current driver) and then
> > pKVM would run with DMA isolation, which is the status quo.
> 
> iGPUs either access the DRAM through the iommu or they use some OS
> invisible side band channel.
> 
> The ones that use the iommu have this quirk.

I see; I think that can be added later, and these devices can keep using the
current SMMU_V3 driver as is. I can add a check to skip registering this
driver if the platform has this quirk, so the other driver can probe the
SMMUs afterwards.

> 
> > They are not random, as part of this series the SMMUv3 driver is split
> > where some of the code goes to “arm-smmu-v3-common.c” which is used by
> > both drivers, this reduces a lot of duplication.
> 
> I find it very confusing.
> 
> It made sense to factor some of the code out so that pKVM can have
> it's own smmv3 HW driver, sure.
> 
> But I don't understand why a paravirtualized iommu driver for pKVM has
> any relation to smmuv3. Shouldn't it just be calling some hypercalls
> to set IDENTITY/BLOCKING?

Well, it’s not really “paravirtualized” like virtio-iommu; this is an SMMUv3
driver (it uses the same binding as smmu-v3).
It re-uses the same probe code, fw/hw parsing and so on (inside the kernel),
and also re-uses the same structs to make that possible. The only difference is
that the page tables and STEs are managed by the hypervisor.
In part-2[1] I add event queue parsing, which reuses 90% of the irq/evtq and
insert/remove_master logic; otherwise we would have to duplicate all of that logic.

So, I think it makes sense to re-use as much logic as possible, as both drivers
are smmu-v3 drivers with one caveat about HW table management

As mentioned in the cover letter, we can also still build nesting on top of
this driver, and I plan to post an RFC for that, once this one is sorted.

[1] https://android-kvm.googlesource.com/linux/+log/refs/heads/for-upstream/pkvm-smmu-v3-part-2


> 
> > I am not sure if we need get_resv_regions, maybe it's useful for sysfs
> > "/sys/kernel/iommu_groups/reserved_regions"? I will double check.
> 
> It is important to get this info from the FW..
>
Yes, I think that should remain.

Thanks,
Mostafa
> Jason
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Jason Gunthorpe 2 months ago
On Thu, Jul 31, 2025 at 05:44:55PM +0000, Mostafa Saleh wrote:
> > > They are not random, as part of this series the SMMUv3 driver is split
> > > where some of the code goes to “arm-smmu-v3-common.c” which is used by
> > > both drivers, this reduces a lot of duplication.
> > 
> > I find it very confusing.
> > 
> > It made sense to factor some of the code out so that pKVM can have
> > it's own smmv3 HW driver, sure.
> > 
> > But I don't understand why a paravirtualized iommu driver for pKVM has
> > any relation to smmuv3. Shouldn't it just be calling some hypercalls
> > to set IDENTITY/BLOCKING?
> 
> Well it’s not really “paravirtualized” as virtio-iommu, this is an SMMUv3
> driver (it uses the same binding a the smmu-v3)

> It re-use the same probe code, fw/hw parsing and so on (inside the kernel),
> also re-use the same structs to make that possible. 

I think this is not quite true; I think you have some part of the smmu driver
bootstrap the pkvm protected driver.

But then the pkvm takes over all the registers and the command queue.

Are you saying the event queue is left behind for the kernel? How does
that work if it doesn't have access to the registers?

So what is left of the actual *iommu subsystem* driver is just some
pkvm hypercalls?

It seems more sensible to me to have a pkvm HW driver for SMMUv3 that
is split between pkvm and kernel, that operates the HW - but is NOT an
iommu subsystem driver

Then an iommu subsystem driver that does the hypercalls, that is NOT
connected to SMMUv3 at all.

In other words, you have two cleanly separate concerns here: a "pkvm
iommu subsystem" that lets pkvm control iommu HW, and the current
"iommu subsystem" that lets the kernel control iommu HW. The same
driver should not register with both.

> As mentioned in the cover letter, we can also still build nesting on top of
> this driver, and I plan to post an RFC for that, once this one is sorted.

I would expect nesting to present an actual paravirtualized SMMUv3
though, with a proper normal SMMUv3 IOMMU subsystem driver. This is how
the ARM architecture is built to work; why mess it up?

So my advice above seems cleaner: you have the pkvm iommu HW driver
that turns around and presents a completely normal SMMUv3 HW API, which
is bound by the ordinary SMMUv3 iommu subsystem driver.

Jason
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Mostafa Saleh 2 months ago
On Fri, Aug 01, 2025 at 03:59:30PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 31, 2025 at 05:44:55PM +0000, Mostafa Saleh wrote:
> > > > They are not random, as part of this series the SMMUv3 driver is split
> > > > where some of the code goes to “arm-smmu-v3-common.c” which is used by
> > > > both drivers, this reduces a lot of duplication.
> > > 
> > > I find it very confusing.
> > > 
> > > It made sense to factor some of the code out so that pKVM can have
> > > it's own smmv3 HW driver, sure.
> > > 
> > > But I don't understand why a paravirtualized iommu driver for pKVM has
> > > any relation to smmuv3. Shouldn't it just be calling some hypercalls
> > > to set IDENTITY/BLOCKING?
> > 
> > Well it’s not really “paravirtualized” as virtio-iommu, this is an SMMUv3
> > driver (it uses the same binding a the smmu-v3)
> 
> > It re-use the same probe code, fw/hw parsing and so on (inside the kernel),
> > also re-use the same structs to make that possible. 
> 
> I think this is not quite true, I think you have some part of the smmu driver
> bootstrap the pkvm protected driver.
> 
> But then the pkvm takes over all the registers and the command queue.
> 
> Are you saying the event queue is left behind for the kernel? How does
> that work if it doesn't have access to the registers?

The evtq itself will be owned by the kernel; however, MMIO accesses would be
trapped and emulated. Here is the PoC for part-2 of this series (as mentioned in
the cover letter); this is very close to how nesting will work:
https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-smmu-v3-part-2/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c#744


> 
> So what is left of the actual *iommu subsystem* driver is just some
> pkvm hypercalls?

Yes, at the moment there are only 2 hypercalls and one hypervisor callback
to shadow the page table. When we go to nesting, the hypercalls will be
removed and there will only be a data abort callback for MMIO emulation.

> 
> It seems more sensible to me to have a pkvm HW driver for SMMUv3 that
> is split between pkvm and kernel, that operates the HW - but is NOT an
> iommu subsystem driver
> 
> Then an iommu subsystem driver that does the hypercalls, that is NOT
> connected to SMMUv3 at all.
> 
> In other words you have two cleanly seperate concerns here, an "pkvm
> iommu subsystem" that lets pkvm control iommu HW - and the current
> "iommu subsystem" that lets the kernel control iommu HW. The same
> driver should not register to both.
> 

I am not sure how that would work exactly; for example, how would probe_device,
xlate, etc. work in a generic way? The same goes for other ops. We could put
some of these functions (hypercall wrappers) in a separate file. Also, I am not
sure how that looks from the kernel perspective (do we have 2 struct devices per SMMU?).

But, TBH, I’d prefer to drop the iommu_ops entirely; see my answer below.

> > As mentioned in the cover letter, we can also still build nesting on top of
> > this driver, and I plan to post an RFC for that, once this one is sorted.
> 
> I would expect nesting to present an actual paravirtualized SMMUv3
> though, with a proper normal SMMUv3 IOMMU subystem driver. This is how
> ARM architecture is built to work, why mess it up?
> 
> So my advice above seems cleaner, you have the pkvm iommu HW driver
> that turns around and presents a completely normal SMMUv3 HW API which
> is bound by the ordinately SMMUv3 iommu subsystem driver.
> 

I think we are on the same page about how that will look at the end.
For nesting there will be a pKVM driver (as mentioned in the cover letter)
to probe and register the SMMUs; then it will unbind itself to let the current
(ARM_SMMU_V3) driver probe the SMMUs, and that driver can run unmodified, which
will be fully transparent.

Then the hypervisor driver will use trap-and-emulate to handle SMMU MMIO
accesses, providing an architecturally accurate SMMUv3 emulation, and
it will not register iommu_ops.
Nor will it use any hypercalls, as the main reason I added those is to tell
the hypervisor which SIDs are used in identity while others remain blocked;
enabling the whole SID space not only requires a lot of memory but also
doesn't feel secure.
With nesting, we don’t need those, as the hypervisor will trap CFGI and will
know which SIDs to shadow.

However, based on the feedback on my v2 series, it was better to split pKVM
support so that the initial series only establishes DMA isolation. Then, later,
we can enable full translating domains (either nesting or PV, which is another
discussion).

So, I will happily drop the hypercalls and the iommu_ops from this series
if there is a better way to enlighten the hypervisor about the SIDs that
should be in identity.

Otherwise I can’t see any other way to move forward other than going back to
posting large series.

Thanks,
Mostafa


> Jason
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Jason Gunthorpe 2 months ago
On Mon, Aug 04, 2025 at 02:41:31PM +0000, Mostafa Saleh wrote:
> > Are you saying the event queue is left behind for the kernel? How does
> > that work if it doesn't have access to the registers?
> 
> The evtq itself will be owned by the kernel, However, MMIO access would be
> trapped and emulated, here the PoC for part-2 of this series (as mentioned in
> the cover letter) This is very close to how nesting will work.
> https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-smmu-v3-part-2/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c#744

Oh weird, but Ok.

> > In other words you have two cleanly seperate concerns here, an "pkvm
> > iommu subsystem" that lets pkvm control iommu HW - and the current
> > "iommu subsystem" that lets the kernel control iommu HW. The same
> > driver should not register to both.
> > 
> 
> I am not sure how that would work exactly, for example how would probe_device
> work, xlate... in a generic way?

Well, I think it is not so bad actually.

You just need to call iommu_device_register

	ret = iommu_device_register(&smmu->iommu, &arm_smmu_ops, dev);

Where 'dev' is always the smmu struct device, even if your current
probing driver is not the smmu device. That will capture all the iommu
activity the ACPI/DT says is linked to that dev.

From there you just make a normal small iommu driver with no
connection back to the SMMUv3.

Eg you could spawn an aux device from smmuv3 to do this with a far
cleaner separation.

xlate is just calling iommu_fwspec_add_ids(); it is one line.
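
For reference, the shared SMMUv3 xlate is essentially that one line; a sketch
(the exact prototype may differ slightly between kernel versions):

static int arm_smmu_of_xlate(struct device *dev,
			     const struct of_phandle_args *args)
{
	/* Record the stream IDs from the DT "iommus" phandle arguments */
	return iommu_fwspec_add_ids(dev, args->args, 1);
}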

> same for other ops. We can make some of these
> functions (hypercalls wrappers) in a separate file. 

What other ops are you worried about?

static struct iommu_ops arm_smmu_ops = {
	.identity_domain	= &arm_smmu_identity_domain,
	.blocked_domain		= &arm_smmu_blocked_domain,
	.release_device		= arm_smmu_release_device,
	.domain_alloc_paging_flags = arm_smmu_domain_alloc_paging_flags,

	^^ Those are all your new domain type that does the hypercalls

	.probe_device		= arm_smmu_probe_device,
	.get_resv_regions	= arm_smmu_get_resv_regions,

  	^^ These are pretty empty

	.device_group		= arm_smmu_device_group,
	.of_xlate		= arm_smmu_of_xlate,

        ^^ Common code

Don't need these:

	.capable		= arm_smmu_capable,
	.hw_info		= arm_smmu_hw_info,
	.domain_alloc_sva       = arm_smmu_sva_domain_alloc,
	.page_response		= arm_smmu_page_response,
	.def_domain_type	= arm_smmu_def_domain_type,
	.get_viommu_size	= arm_smmu_get_viommu_size,
	.viommu_init		= arm_vsmmu_init,
	.user_pasid_table	= 1,
	.pgsize_bitmap		= -1UL, /* Restricted during device attach */
	.owner			= THIS_MODULE,

> Also I am not sure how that
> looks from the kernel perspective (do we have 2 struct devices per SMMU?)

I think you'd want to have pkvm bound to the physical struct device
and then spawn new faux, aux or similar devices for the virtualized
IOMMUs that probe the new paravirt driver. This driver would be fully
self-contained.

> I think we are on the same page about how that will look at the end.
> For nesting there will be a pKVM driver (as mentioned in the cover letter)
> to probe register the SMMUs, then it will unbind itself to let the
> current

That sounds a little freaky to me. I guess it can be made to work, and
is maybe the best option, but if this is the longterm plan then
starting out with a non-iommu subsystem pkvm driver seems like a good
idea.

Today the pkvm driver would do enough to boot up pkvm in this limited
mode, and maybe you have some small code duplications with the iommu
driver. It forks off an aux device to create the para-virt pkvm iommu
subsystem driver.

Tomorrow you further shrink down the pkvm driver and remove the
para-virt driver, then just somehow bind/unbind the pkvm one at early
boot.

Minimal changes to the existing smmu driver in either case.

> Then the hypervisor driver will use trap and emulate to handle SMMU access
> to the MMIO, providing an architecturally accurate SMMUv3 emualation, and
> it will not register iommu_ops.

Well, it registers normal smmuv3 ops only, I think you mean.

> So, I will happily drop the hypercalls and the iommu_ops from this series,
> if there is a better way to enlighten the hypervisor about the SIDs to be
> in identity.

The iommu ops are a reasonable and appropriate way to do this.

But since pkvm is all so special anyhow maybe you could hook
pci_enable_device() and do your identity hypercall there? Do you
already trap the config write to catch Bus Master Enable? Can pkvm
just set identity when BME=1 and ABORT when BME=0?

Jason
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Mostafa Saleh 2 months ago
On Tue, Aug 05, 2025 at 02:57:53PM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 04, 2025 at 02:41:31PM +0000, Mostafa Saleh wrote:
> > > Are you saying the event queue is left behind for the kernel? How does
> > > that work if it doesn't have access to the registers?
> > 
> > The evtq itself will be owned by the kernel, However, MMIO access would be
> > trapped and emulated, here the PoC for part-2 of this series (as mentioned in
> > the cover letter) This is very close to how nesting will work.
> > https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-smmu-v3-part-2/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c#744
> 
> Oh weird, but Ok.
> 
> > > In other words you have two cleanly seperate concerns here, an "pkvm
> > > iommu subsystem" that lets pkvm control iommu HW - and the current
> > > "iommu subsystem" that lets the kernel control iommu HW. The same
> > > driver should not register to both.
> > > 
> > 
> > I am not sure how that would work exactly, for example how would probe_device
> > work, xlate... in a generic way?
> 
> Well, I think it is not so bad actually.
> 
> You just need to call iommu_device_register
> 
> 	ret = iommu_device_register(&smmu->iommu, &arm_smmu_ops, dev);
> 
> Where 'dev' is always the smmu struct device, even if your current
> probing driver is not the smmu device. That will capture all the iommu
> activity the ACPI/DT says is linked to that dev.
> 
> From there you just make a normal small iommu driver with no
> connection back to the SMMUv3.
> 
> Eg you could spawn an aux device from smmuv3 to do this with a far
> cleaner separation.
> 
> xlate is just calling iommu_fwspec_add_ids() it is one line.

I am not sure I understand: the SMMU driver will register its IOMMU
ops to probe the devices, and then the faux devices would also register
IOMMU ops for the KVM HVCs?
But that means that all DMA operations would still go through the
SMMU one?

> 
> > same for other ops. We can make some of these
> > functions (hypercalls wrappers) in a separate file. 
> 
> What other ops are you worried about?
> 
> static struct iommu_ops arm_smmu_ops = {
> 	.identity_domain	= &arm_smmu_identity_domain,
> 	.blocked_domain		= &arm_smmu_blocked_domain,
> 	.release_device		= arm_smmu_release_device,
> 	.domain_alloc_paging_flags = arm_smmu_domain_alloc_paging_flags,
> 
> 	^^ Those are all your new domain type that does the hypercalls
> 
> 	.probe_device		= arm_smmu_probe_device,
> 	.get_resv_regions	= arm_smmu_get_resv_regions,
> 
>   	^^ These are pretty empty
> 
> 	.device_group		= arm_smmu_device_group,
> 	.of_xlate		= arm_smmu_of_xlate,
> 
>         ^^ Common code
> 
> Don't need these:
> 
> 	.capable		= arm_smmu_capable,
> 	.hw_info		= arm_smmu_hw_info,
> 	.domain_alloc_sva       = arm_smmu_sva_domain_alloc,
> 	.page_response		= arm_smmu_page_response,
> 	.def_domain_type	= arm_smmu_def_domain_type,
> 	.get_viommu_size	= arm_smmu_get_viommu_size,
> 	.viommu_init		= arm_vsmmu_init,
> 	.user_pasid_table	= 1,
> 	.pgsize_bitmap		= -1UL, /* Restricted during device attach */
> 	.owner			= THIS_MODULE,

Makes sense, thanks for the detailed explanation.

> 
> > Also I am not sure how that
> > looks from the kernel perspective (do we have 2 struct devices per SMMU?)
> 
> I think you'd want to have pkvm bound to the physical struct device
> and then spawn new faux, aux or something devices for the virtualized
> IOMMUs that probes the new paravirt driver. This driver would be fully
> self contained.

I think it’s hard to reason about this as 2 devices; from my point of view it
seems that the pKVM HVCs are a library that can be part of a separate common
file and then called from drivers (with common ops), instead of having the
extra complexity of 2 drivers (KVM and IOMMU PV).
However, I can see the value of that, as it totally abstracts the iommu ops
outside the device-specific code; I will give it more thought.
But it feels like that might be more suitable for a full-fledged PV
implementation (as in RFC v1 and v2).

> 
> > I think we are on the same page about how that will look at the end.
> > For nesting there will be a pKVM driver (as mentioned in the cover letter)
> > to probe register the SMMUs, then it will unbind itself to let the
> > current
> 
> That sounds a little freaky to me. I guess it can be made to work, and
> is maybe the best option, but if this is the longterm plan then
> starting out with a non-iommu subsystem pkvm driver seems like a good
> idea.
> 
> Today the pkvm driver would do enough to boot up pkvm in this limited
> mode, and maybe you have some small code duplications with the iommu
> driver. It forks off a aux device to create the para-virt pkvm iommu
> subsystem driver.
> 
> Tomorrow you further shrink down the pkvm driver and remove the
> para-virt driver, then just somehow bind/unbind the pkvm one at early
> boot.
> 
> Minimal changes to the existing smmu driver in either case.
> 
> > Then the hypervisor driver will use trap and emulate to handle SMMU access
> > to the MMIO, providing an architecturally accurate SMMUv3 emualation, and
> > it will not register iommu_ops.
> 
> Well, it registers normal smmuv3 ops only, I think you mean.

I mean that when nesting is added, arm-smmu-v3-kvm will not call
“iommu_device_register”.

> 
> > So, I will happily drop the hypercalls and the iommu_ops from this series,
> > if there is a better way to enlighten the hypervisor about the SIDs to be
> > in identity.
> 
> The iommu ops are a reasonable and appropriate way to do this.
> 
> But since pkvm is all so special anyhow maybe you could hook
> pci_enable_device() and do your identity hypercall there? Do you
> already trap the config write to catch Bus Master Enable? Can pkvm
> just set identity when BME=1 and ABORT when BME=0?

pKVM doesn’t trap actual device accesses, and we also have to support
platform devices, so it would be better to rely on existing
abstractions such as “probe_device”.


I had an offline discussion with Will and Robin and they believe it might
be better if we get rid of the kernel KVM SMMUv3 driver altogether and just
rely on ARM_SMMU_V3 + extra hooks, so there is a single driver managing
the SMMUs in the system.
This way we don’t need to split the current SMMUv3 driver or have different
IOMMU ops; it reduces some of the duplication and also avoids the need for a
fake device.

Then we have an extra file for KVM with some of the hooks (similar to the
hooks in arm_smmu_impl_ops we have for tegra).

And that might be more suitable for nesting as well, as it avoids the bind/unbind flow.
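
A very rough sketch of the kind of hook structure meant here, in the spirit
of the tegra241-cmdqv impl ops; all names and signatures below are
hypothetical, purely to illustrate the shape:

/* Hypothetical KVM-mode hooks inside the single SMMUv3 driver */
struct kvm_arm_smmu_impl {
	/* Hand the SMMU (registers, queues) over to the hypervisor */
	int (*donate_to_hyp)(struct arm_smmu_device *smmu);
	/* Tell the hypervisor a stream ID moved to identity or blocked */
	int (*set_dev_state)(struct arm_smmu_device *smmu, u32 sid,
			     bool identity);
};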

I will investigate that, and if feasible I will send v4 (hopefully
shortly) based on this idea; otherwise I will see if we can separate the
KVM code and the SMMU bootstrap code.


Thanks,
Mostafa

> 
> Jason
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Jason Gunthorpe 1 month, 3 weeks ago
On Wed, Aug 06, 2025 at 02:10:35PM +0000, Mostafa Saleh wrote:
> I am not sure I understand, the SMMU driver will register its IOMMU
> ops to probe the devices

You couldn't do this. But why do you need the iommu subsystem to help
you do probing for the pKVM driver? Today SMMU starts all devices in
ABORT mode except for some it scans manually from the fw tables.

They switch to identity when the iommu subsystem attaches devices; you
can continue to do that by having the paravirt driver tell pkvm when
it attaches.

What is wrong with this approach?

> > > Also I am not sure how that
> > > looks from the kernel perspective (do we have 2 struct devices per SMMU?)
> > 
> > I think you'd want to have pkvm bound to the physical struct device
> > and then spawn new faux, aux or something devices for the virtualized
> > IOMMUs that probes the new paravirt driver. This driver would be fully
> > self contained.
> 
> I think it’s hard to reason about this as 2 devices, from my pov it seems
> that the pKVM HVCs are a library that can be part of separate common file,
> then called from drivers. (with common ops)
> Instead of having extra complexity of 2 drivers (KVM and IOMMU PV).
> However, I can see the value of that as it totally abstracts the iommu ops
> outside the device specific code, I will give it more thought.
> But it feels that might be more suitable for a full fledged PV
> implementation (as in RFC v1 and v2).

Maybe, but I'm feeling sensitive here about not messing up the ARM SMMU
driver with this stuff; honestly, it is getting harder and harder to
understand what it is trying to do...

If you can keep the pkvm enablement to three drivers:
 - A pKVM SMMU driver sharing some header files
 - The untrusted half of the above driver
 - A para virt IOMMU driver

And not further change the smmu driver beyond making some code
sharable, it sure would be nice from a maintenance perspective.

> I had an offline discussion with Will and Robin and they believe it might
> be better if we get rid of the kernel KVM SMMUv3 driver at all, and just
> rely on ARM_SMMU_V3 + extra hooks, so there is a single driver managing
> the SMMUs in the system.

> This way we don’t need to split current SMMUv3 or have different IOMMU ops,
> and reduces some of the duplication, also that avoids the need for a fake device.
> 
> Then we have an extra file for KVM with some of the hooks (similar to the
> hooks in arm_smmu_impl_ops we have for tegra)
> 
> And that might be more suitable for nesting also, to avoid the bind/unbind flow.
> 
> I will investigate that and if feasible I will send v4 (hopefully
> shortly) based on this idea, otherwise I will see if we can separate
> KVM code and SMMU bootstrap code.

Maybe, not sure what exactly you imagine here.. You still have your
para virt driver, yes?

This especially is what bothers me, I don't think you should have a
para virt driver for pkvm hidden inside the smmu driver at all.

And if we have an smmu driver that optionally doesn't register with the
iommu subsystem at all, that seems unwise.

Jason
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Mostafa Saleh 1 month, 3 weeks ago
On Mon, Aug 11, 2025 at 03:55:23PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 06, 2025 at 02:10:35PM +0000, Mostafa Saleh wrote:
> > I am not sure I understand, the SMMU driver will register its IOMMU
> > ops to probe the devices
> 
> You couldn't do this. But why do you need the iommu subsystem to help
> you do probing for the pKVM driver? Today SMMU starts all devices in
> ABORT mode except for some it scans manually from the fw tables.
> 
> They switch to identity when the iommu subsystem attaches devices, you
> can continue to do that by having the paravirt driver tell pkvm when
> it attaches.
> 
> What is wrong with this approach?
> 

My confusion is that in this proposal we have 2 drivers:
- arm-smmu-v3-kvm: registers arm_smmu_ops(?) and binds to the SMMUs
- pkvm-iommu: registers pkvm_iommu_ops and binds to faux devices.
So how do attach/detach... (the rest of the iommu_ops) work? In that case we need
the pkvm driver to handle those. So why do we need to have iommu_ops for the
kvm one?

> > > > Also I am not sure how that
> > > > looks from the kernel perspective (do we have 2 struct devices per SMMU?)
> > > 
> > > I think you'd want to have pkvm bound to the physical struct device
> > > and then spawn new faux, aux or something devices for the virtualized
> > > IOMMUs that probes the new paravirt driver. This driver would be fully
> > > self contained.
> > 
> > I think it’s hard to reason about this as 2 devices, from my pov it seems
> > that the pKVM HVCs are a library that can be part of separate common file,
> > then called from drivers. (with common ops)
> > Instead of having extra complexity of 2 drivers (KVM and IOMMU PV).
> > However, I can see the value of that as it totally abstracts the iommu ops
> > outside the device specific code, I will give it more thought.
> > But it feels that might be more suitable for a full fledged PV
> > implementation (as in RFC v1 and v2).
> 
> Maybe, but I'm feeling sensitive here to not mess up the ARM SMMU
> driver with this stuff that is honestly looking harder and harder to
> understand what it is trying to do...
> 
> If you can keep the pkvm enablement to three drivers:
>  - A pKVM SMMU driver sharing some header files
>  - A the untrusted half of the above driver
>  - A para virt IOMMU driver
> 
> And not further change the smmu driver beyond making some code
> sharable it sure would be nice from a maintenance perspective.

I am almost done with v4, which relies on a single driver; I don’t think
it’s that complicated, as it adds a few impl_ops plus some small re-works.

I think that is much simpler than having 3 drivers.
It is also better for the current SMMUv3 driver's maintainability to have the
KVM driver as a mode, where all the KVM logic is implemented in a new file
that relies on a few ops, similar to “tegra241-cmdqv.c”.

I will post this version, and then it would be easier to compare both approaches.

> 
> > I had an offline discussion with Will and Robin and they believe it might
> > be better if we get rid of the kernel KVM SMMUv3 driver at all, and just
> > rely on ARM_SMMU_V3 + extra hooks, so there is a single driver managing
> > the SMMUs in the system.
> 
> > This way we don’t need to split current SMMUv3 or have different IOMMU ops,
> > and reduces some of the duplication, also that avoids the need for a fake device.
> > 
> > Then we have an extra file for KVM with some of the hooks (similar to the
> > hooks in arm_smmu_impl_ops we have for tegra)
> > 
> > And that might be more suitable for nesting also, to avoid the bind/unbind flow.
> > 
> > I will investigate that and if feasible I will send v4 (hopefully
> > shortly) based on this idea, otherwise I will see if we can separate
> > KVM code and SMMU bootstrap code.
> 
> Maybe, not sure what exactly you imagine here.. You still have your
> para virt driver, yes?
> 
> This especially is what bothers me, I don't think you should have a
> para virt driver for pkvm hidden inside the smmu driver at all.
> 
> And if we have a smmu driver that optionally doesn't register with the
> iommu subsystem at all - that seems unwise..

I was imagining just splitting all the KVM-specific code out of the
SMMU code, but not as a driver; it would be a library which “arm-smmu-v3-kvm”
calls into.

Thanks,
Mostafa

> 
> Jason
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Jason Gunthorpe 1 month, 3 weeks ago
On Tue, Aug 12, 2025 at 10:29:38AM +0000, Mostafa Saleh wrote:
> On Mon, Aug 11, 2025 at 03:55:23PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 06, 2025 at 02:10:35PM +0000, Mostafa Saleh wrote:
> > > I am not sure I understand, the SMMU driver will register its IOMMU
> > > ops to probe the devices
> > 
> > You couldn't do this. But why do you need the iommu subsystem to help
> > you do probing for the pKVM driver? Today SMMU starts all devices in
> > ABORT mode except for some it scans manually from the fw tables.
> > 
> > They switch to identity when the iommu subsystem attaches devices, you
> > can continue to do that by having the paravirt driver tell pkvm when
> > it attaches.
> > 
> > What is wrong with this approach?
> > 
> 
> My confusion is that in this proposal we have 2 drivers:
> - arm-smmu-v3-kvm: Register arm_smmu_ops? binds to the SMMUs

No, I don't mean two iommu subsystem drivers. You have only the
pkvm-iommu driver. Whatever you bind to the arm-smmu does not register
with the iommu subsystem.

> I am almost done with v4, which relies on a single driver, I don’t think
> it’s that complicated, it adds a few impl_ops + some few re-works.
> 
> I think that is much simpler than having 3 drivers.
> Also better for the current SMMUv3 driver maintainability to have the KVM driver
> as mode, where all the KVM logic is implemented in a new file which relies on few
> ops, similar to “tegra241-cmdqv.c”

I don't understand how you can do this; it is fundamentally not an
iommu subsystem driver if pkvm is in control of the HW.

And I still strongly do not want to see a para virt iommu driver hidden
inside the smmu driver. It should be its own thing.

The tegra ops were for customizing the iommu subsystem behavior of the
arm iommu driver, not to turn it into a wrapper for a different
paravirt driver!!

Jason
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Mostafa Saleh 1 month, 3 weeks ago
On Tue, Aug 12, 2025 at 09:10:56AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 12, 2025 at 10:29:38AM +0000, Mostafa Saleh wrote:
> > On Mon, Aug 11, 2025 at 03:55:23PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Aug 06, 2025 at 02:10:35PM +0000, Mostafa Saleh wrote:
> > > > I am not sure I understand, the SMMU driver will register its IOMMU
> > > > ops to probe the devices
> > > 
> > > You couldn't do this. But why do you need the iommu subsystem to help
> > > you do probing for the pKVM driver? Today SMMU starts all devices in
> > > ABORT mode except for some it scans manually from the fw tables.
> > > 
> > > They switch to identity when the iommu subsystem attaches devices, you
> > > can continue to do that by having the paravirt driver tell pkvm when
> > > it attaches.
> > > 
> > > What is wrong with this approach?
> > > 
> > 
> > My confusion is that in this proposal we have 2 drivers:
> > - arm-smmu-v3-kvm: Register arm_smmu_ops? binds to the SMMUs
> 
> No, I don't mean two iommu subsystem drivers. You have only the
> pkvm-iommu driver. Whatever you bind to the arm-smmu does not register
> with the iommu subsystem.

I see.

> 
> > I am almost done with v4, which relies on a single driver, I don’t think
> > it’s that complicated, it adds a few impl_ops + some few re-works.
> > 
> > I think that is much simpler than having 3 drivers.
> > Also better for the current SMMUv3 driver maintainability to have the KVM driver
> > as mode, where all the KVM logic is implemented in a new file which relies on few
> > ops, similar to “tegra241-cmdqv.c”
> 
> I don't understand how you can do this, it is fundamentally not an
> iommu subsystem driver if pkvm is in control of the HW.
> 
> And I still strongly do not want to see a para virt iommu driver hidden
> inside the smmu driver. It should be its own thing.
> 
> The tegra ops were for customizing the iommu subsystem behavior of the
> arm iommu driver, not to turn it into a wrapper for a different
> paravirt driver!!

I see, but most of the code in KVM mode is exactly the same as in the
current driver, as the driver is not HW agnostic (unlike virtio-iommu).
In fact it does almost everything, and just delegates
attach_identity/blocked to the hypervisor.

IMO, having a full-fledged KVM IOMMU driver + faux devices + moving
all the shared SMMUv3 code, just for this driver to implement a handful of
lines of code, is overkill, especially since most of this logic
won’t be needed in the future.

In addition, with no standard iommu-binding, this driver has to be
enlightened somehow about how to deal with device operations.

As mentioned before, when nesting is added, many of the hooks will be
removed anyway as KVM would rely on trap and emulate instead of HVCs.

Otherwise, we can skip this series and I can post nesting directly
(which would be a relatively bigger series); that would probably rely
on the same concept of the driver bootstrapping the hypervisor driver.

I am generally open to any path to move this forward, as Robin and
Will originally suggested the KVM-mode-in-the-upstream-driver approach;
what do you think?

Thanks,
Mostafa

> 
> Jason
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Jason Gunthorpe 1 month, 3 weeks ago
On Tue, Aug 12, 2025 at 12:37:08PM +0000, Mostafa Saleh wrote:
> I see, but most of the code in KVM mode is exactly the same as in the
> current driver, as the driver is not HW agnostic (unlike virtio-iommu).
> In fact it does almost everything, and just delegates
> attach_identity/blocked to the hypervisor.

How is that possible? The kvm driver for smmuv3 should be very
different from the iommu subsystem driver. That seemed to be what this
series was showing. Even the memory allocations and page table code
were all modified?

This delta seems to only get bigger as you move on toward having full
emulation in the hypervisor.

So, I'm confused: what is being shared here, and why are we trying so
hard to force two different things to share some unclear amount of
code?

> In addition, with no standard iommu-binding, this driver has to be
> enlightened somehow about how to deal with device operations.
> 
> As mentioned before, when nesting is added, many of the hooks will be
> removed anyway as KVM would rely on trap and emulate instead of HVCs.
> 
> Otherwise, we can skip this series and I can post nesting directly
> (which would be a relatively bigger one), that probably would rely
> on the same concept of the driver bootstrapping the hypervisor driver.

I think you end up with the same design I am suggesting here, it is
nonsense to have one smmu driver when it is actually split into two
instances where one is running inside the protected world and one is
the normal iommu subsystem driver. That's not just bootstrapping, that
is two full instances running in parallel that are really very
different things.

> I am generally open to any path to move this forward, as Robin and
> Will originally suggested the KVM mode in the upstream driver approach,
> what do you think?

Well, I'd like to see you succeed here too, but these series are
very big and seem pretty invasive. I'm very keen that we are still able
to support the smmuv3 driver in all the non-pkvm configurations
without hassle, and I don't want to become blocked on inscrutable pkvm
problems because we have decided to share a few lines of code.

Are you sure there isn't some in-between approach you can take that allows
getting to full emulation of smmuv3 without so much churn to the existing
code?

This is why I prefer adding new stuff that you then just erase vs
mangling the existing drivers potentially forever..

Jason
Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
Posted by Mostafa Saleh 1 month, 3 weeks ago
On Tue, Aug 12, 2025 at 10:48:08AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 12, 2025 at 12:37:08PM +0000, Mostafa Saleh wrote:
> > I see, but most of the code in KVM mode is exactly the same as in the
> > current driver, as the driver is not HW agnostic (unlike virtio-iommu).
> > In fact it does almost everything, and just delegates
> > attach_identity/blocked to the hypervisor.
> 
> How is that possible? the kvm driver for smmuv3 should be very
> different than the iommu subsystem driver. That seemed to be what this
> series was showing.. Even the memory allocations and page table code
> were all modified?
> 
> This delta seems to only get bigger as you move on toward having full
> emulation in the hypervisor.
> 
> So, I'm confused what is being shared here and why are we trying so
> hard to force two different things to share some unclear amount of
> code?
> 

Originally, this series (from v1 to v3, currently) just split the common
code (SMMUv3 and io-pgtable) into separate files and added a new driver,
so there was a clear boundary.

But I started re-working the series, as I mentioned, based on the feedback
to have KVM as a mode in the driver.

IMO, there is value in either approach (single or 2 drivers); a
single driver is less intrusive to the current driver, as it doesn't do
any splitting and just adds a few hooks.

As I mentioned, I am open to both, not trying to force anything :)

> > In addition, with no standard iommu-binding, this driver has to be
> > enlightened somehow about how to deal with device operations.
> > 
> > As mentioned before, when nesting is added, many of the hooks will be
> > removed anyway as KVM would rely on trap and emulate instead of HVCs.
> > 
> > Otherwise, we can skip this series and I can post nesting directly
> > (which would be a relatively bigger one), that probably would rely
> > on the same concept of the driver bootstrapping the hypervisor driver.
> 
> I think you end up with the same design I am suggesting here, it is
> nonsense to have one smmu driver when it is actually split into two
> instances where one is running inside the protected world and one is
> the normal iommu subsystem driver. That's not just bootstrapping, that
> is two full instances running in parallel that are really very
> different things.

But they don’t do the same thing; they collaborate to configure the IOMMUs.
That’s how the kernel boots: it starts as one object and then splits between
EL1 and EL2, which run in parallel. At run time KVM will decide whether it
runs fully in EL2 or in split mode, and either does function calls or
hypercalls. It makes sense for KVM drivers to follow the same model.
See kvm_call_* for example.

> 
> > I am generally open to any path to move this forward, as Robin and
> > Will originally suggested the KVM mode in the upstream driver approach,
> > what do you think?
> 
> Well, I'd like to see you succeed here too, but it these series are
> very big and seem a pretty invasive. I'm very keen we are still able
> to support the smmuv3 driver in all the non-pkvm configurations
> without hassle, and I don't want to become blocked on inscrutible pkvm
> problems because we have decided to share a few lines of code..

Makes sense, but making changes is part of evolving the driver; for
example, attaching devices looks nothing like it did 3 years ago
when I started working on this.
Much of the cmdq code has been reworked for tegra, which is not part of
the SMMUv3 architecture.

IMHO, we can’t just reject changes because they are invasive, or because
some people don't care about them.

Instead, it should be based on the maintainability of the new changes and
whether they make sense or not.

> 
> Are you sure there isn't some inbetween you can do that allows getting
> to the full emulation of smmuv3 without so much churn to existing
> code?
> 
> This is why I prefer adding new stuff that you then just erase vs
> mangling the existing drivers potentially forever..
> 

What I can do is hold back v4 and revive my nesting patches; that
doesn’t need any hooks in iommu_ops nor any hypercalls.

However, the tricky part would be the probe, as the hypervisor has to know
about the SMMUs before the kernel does anything significant with them, but
at the same time we don't want to repeat the probe path.

If that goes well, I can post something next week with nesting instead
of the current approach.

Thanks,
Mostafa

> Jason