Controlled by the IOMMU driver, ATS is usually enabled "on demand", when a
device requests a translation service from its associated IOMMU HW running
on the channel of a given PASID. This works even when the device has no
translation on its RID, i.e. the RID is IOMMU-bypassed.

On the other hand, certain PCIe devices require non-PASID ATS when their
RID stream is IOMMU-bypassed. Call this "always on".

For instance, the CXL spec notes in "3.2.5.13 Memory Type on CXL.cache":
"To source requests on CXL.cache, devices need to get the Host Physical
Address (HPA) from the Host by means of an ATS request on CXL.io."
In other words, the CXL.cache capability relies on ATS; without it the
device has no access to host physical memory.

Introduce a new pci_ats_always_on() helper for the IOMMU driver to scan a
PCI device and shift the ATS policy between "on demand" and "always on".

Add support for CXL.cache devices first. Non-CXL devices can be added
later via quirks.c.

Suggested-by: Vikram Sethi <vsethi@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/pci-ats.h | 3 +++
include/uapi/linux/pci_regs.h | 5 ++++
drivers/pci/ats.c | 44 +++++++++++++++++++++++++++++++++++
3 files changed, 52 insertions(+)
diff --git a/include/linux/pci-ats.h b/include/linux/pci-ats.h
index 75c6c86cf09d..d14ba727d38b 100644
--- a/include/linux/pci-ats.h
+++ b/include/linux/pci-ats.h
@@ -12,6 +12,7 @@ int pci_prepare_ats(struct pci_dev *dev, int ps);
void pci_disable_ats(struct pci_dev *dev);
int pci_ats_queue_depth(struct pci_dev *dev);
int pci_ats_page_aligned(struct pci_dev *dev);
+bool pci_ats_always_on(struct pci_dev *dev);
#else /* CONFIG_PCI_ATS */
static inline bool pci_ats_supported(struct pci_dev *d)
{ return false; }
@@ -24,6 +25,8 @@ static inline int pci_ats_queue_depth(struct pci_dev *d)
{ return -ENODEV; }
static inline int pci_ats_page_aligned(struct pci_dev *dev)
{ return 0; }
+static inline bool pci_ats_always_on(struct pci_dev *dev)
+{ return false; }
#endif /* CONFIG_PCI_ATS */
#ifdef CONFIG_PCI_PRI
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 3add74ae2594..84da6d7645a3 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1258,6 +1258,11 @@
#define PCI_DVSEC_CXL_PORT_CTL 0x0c
#define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR 0x00000001
+/* CXL 2.0 8.1.3: PCIe DVSEC for CXL Device */
+#define CXL_DVSEC_PCIE_DEVICE 0
+#define CXL_DVSEC_CAP_OFFSET 0xA
+#define CXL_DVSEC_CACHE_CAPABLE BIT(0)
+
/* Integrity and Data Encryption Extended Capability */
#define PCI_IDE_CAP 0x04
#define PCI_IDE_CAP_LINK 0x1 /* Link IDE Stream Supported */
diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index ec6c8dbdc5e9..1795131f0697 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -205,6 +205,50 @@ int pci_ats_page_aligned(struct pci_dev *pdev)
return 0;
}
+/*
+ * CXL r4.0, sec 3.2.5.13 Memory Type on CXL.cache notes: to source requests on
+ * CXL.cache, devices need to get the Host Physical Address (HPA) from the Host
+ * by means of an ATS request on CXL.io.
+ *
+ * In other words, CXL.cache devices cannot access physical memory without ATS.
+ */
+static bool pci_cxl_ats_always_on(struct pci_dev *pdev)
+{
+ int offset;
+ u16 cap;
+
+ offset = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
+ CXL_DVSEC_PCIE_DEVICE);
+ if (!offset)
+ return false;
+
+ pci_read_config_word(pdev, offset + CXL_DVSEC_CAP_OFFSET, &cap);
+ if (cap & CXL_DVSEC_CACHE_CAPABLE)
+ return true;
+
+ return false;
+}
+
+/**
+ * pci_ats_always_on - Whether the PCI device requires ATS to be always enabled
+ * @pdev: the PCI device
+ *
+ * Returns true if the PCI device requires non-PASID ATS function on an IOMMU
+ * bypassed configuration.
+ */
+bool pci_ats_always_on(struct pci_dev *pdev)
+{
+ if (pci_ats_disabled() || !pci_ats_supported(pdev))
+ return false;
+
+ /* A VF inherits its PF's requirement for ATS function */
+ if (pdev->is_virtfn)
+ pdev = pci_physfn(pdev);
+
+ return pci_cxl_ats_always_on(pdev);
+}
+EXPORT_SYMBOL_GPL(pci_ats_always_on);
+
#ifdef CONFIG_PCI_PRI
void pci_pri_init(struct pci_dev *pdev)
{
--
2.43.0
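
(For illustration only, not part of the series: a minimal sketch of how an
IOMMU driver might consume the new helper when deciding the ATS policy for a
RID-bypassed device. Only pci_ats_always_on(), pci_enable_ats() and pci_err()
are existing interfaces here; the caller and its place in the attach path are
hypothetical.)
-----------------------------------------------------------------
#include <linux/pci.h>
#include <linux/pci-ats.h>

/*
 * Hypothetical caller: pick the ATS policy for a device whose RID
 * translation is IDENTITY (IOMMU-bypassed).
 */
static void example_set_rid_bypass_ats(struct pci_dev *pdev, int ps)
{
	/* "On demand" policy: leave ATS off for a plain 1:1 RID translation */
	if (!pci_ats_always_on(pdev))
		return;

	/*
	 * "Always on" policy: e.g. a CXL.cache device cannot reach host
	 * physical memory without ATS, so enable it up front.
	 */
	if (pci_enable_ats(pdev, ps))
		pci_err(pdev, "failed to enable always-on ATS\n");
}
-----------------------------------------------------------------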
+Dan. I recalled an offline discussion in which he raised a concern
about having the kernel blindly enable ATS for cxl.cache devices
instead of creating a knob for the admin to configure from userspace
(in case security is viewed as more important than functionality,
given that it allows DMA to read data out of CPU caches)...
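
(Purely as an illustration of the kind of knob being suggested, not proposed
code: a per-device sysfs attribute that defaults to off, which an IOMMU driver
could consult before honouring the always-on policy. All names here are made
up; only the sysfs/device-attribute helpers are real kernel APIs.)
-----------------------------------------------------------------
#include <linux/device.h>
#include <linux/kstrtox.h>
#include <linux/sysfs.h>

/* Made-up per-device flag; real code would keep this in per-device state */
static bool example_cxl_cache_ats_opt_in;

static ssize_t cxl_cache_ats_show(struct device *dev,
				  struct device_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%d\n", example_cxl_cache_ats_opt_in);
}

static ssize_t cxl_cache_ats_store(struct device *dev,
				   struct device_attribute *attr,
				   const char *buf, size_t count)
{
	bool val;

	if (kstrtobool(buf, &val))
		return -EINVAL;
	/* Admin opt-in: only then would the IOMMU driver enable ATS */
	example_cxl_cache_ats_opt_in = val;
	return count;
}
static DEVICE_ATTR_RW(cxl_cache_ats);
-----------------------------------------------------------------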
On Wed, 21 Jan 2026 08:01:36 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:
> +Dan. I recalled an offline discussion in which he raised a concern
> about having the kernel blindly enable ATS for cxl.cache devices
> instead of creating a knob for the admin to configure from userspace
> (in case security is viewed as more important than functionality,
> given that it allows DMA to read data out of CPU caches)...
>
+CC Linux-cxl
Jonathan
On Wed, Jan 21, 2026 at 10:03:07AM +0000, Jonathan Cameron wrote:
> On Wed, 21 Jan 2026 08:01:36 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
> > +Dan. I recalled an offline discussion in which he raised a concern
> > about having the kernel blindly enable ATS for cxl.cache devices
> > instead of creating a knob for the admin to configure from userspace
> > (in case security is viewed as more important than functionality,
> > given that it allows DMA to read data out of CPU caches)...
> >
> +CC Linux-cxl

A cxl.cache device supporting ATS will automatically enable ATS today
if the kernel option to enable translation is set.

Even if the device is marked untrusted by the PCI layer (e.g. an
external port).

Yes, this is effectively a security issue, but it is not really a
CXL-specific problem.

We might prefer to not enable ATS for untrusted devices and then fail
to load drivers for the "ats always on" cases.

Or maybe we can enable one of the ATS security features someday,
though I wonder if those work for CXL..

Jason
On 1/21/26 13:03, Jason Gunthorpe wrote:
> On Wed, Jan 21, 2026 at 10:03:07AM +0000, Jonathan Cameron wrote:
>> On Wed, 21 Jan 2026 08:01:36 +0000
>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>
>>> +Dan. I recalled an offline discussion in which he raised a concern
>>> about having the kernel blindly enable ATS for cxl.cache devices
>>> instead of creating a knob for the admin to configure from userspace
>>> (in case security is viewed as more important than functionality,
>>> given that it allows DMA to read data out of CPU caches)...
>>>
>> +CC Linux-cxl
> A cxl.cache device supporting ATS will automatically enable ATS today
> if the kernel option to enable translation is set.
>
> Even if the device is marked untrusted by the PCI layer (e.g. an
> external port).
>
> Yes, this is effectively a security issue, but it is not really a
> CXL-specific problem.
>
> We might prefer to not enable ATS for untrusted devices and then fail
> to load drivers for the "ats always on" cases.
>
> Or maybe we can enable one of the ATS security features someday,
> though I wonder if those work for CXL..

I raised my concerns about CXL.cache and virtualization at LPC:

https://lpc.events/event/19/contributions/2173/attachments/1842/3940/LPC_2025_CXL_CACHE.pdf

I expose some concerns there, although I admit some could be due to my
twisted understanding of what the CXL specs state, but regarding
IOMMU/ATS, my view is that ATS is not safe enough ... which I guess is a
matter of opinion (a device trusted on the basis of the vendor confirming
it is an "official" device is not enough for paranoid mode, with vendors
subject to government "agencies" actions/pressures).

But this links to what I think Jason points out about ATS security
features, where the IOMMU hardware can be configured to check those
translated PCIe accesses as well, if the host owner/admin's paranoid mind
so decides. With CXL.cache that is not possible, since the route is
through a different link and, AFAIK, there is no support for something
like this in current implementations. I think it could be implemented
without impacting the gains from CXL.cache, but that is another story.

So, FWIW, I think it should not be enabled by default.

Thank you,
Alejandro

> Jason
>
Jason Gunthorpe wrote:
> On Wed, Jan 21, 2026 at 10:03:07AM +0000, Jonathan Cameron wrote:
> > On Wed, 21 Jan 2026 08:01:36 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >
> > > +Dan. I recalled an offline discussion in which he raised a concern
> > > about having the kernel blindly enable ATS for cxl.cache devices
> > > instead of creating a knob for the admin to configure from userspace
> > > (in case security is viewed as more important than functionality,
> > > given that it allows DMA to read data out of CPU caches)...
> > >
> > +CC Linux-cxl
>
> A cxl.cache device supporting ATS will automatically enable ATS today
> if the kernel option to enable translation is set.
>
> Even if the device is marked untrusted by the PCI layer (e.g. an
> external port).
>
> Yes, this is effectively a security issue, but it is not really a
> CXL-specific problem.

My contention is that it is a worse or at least different problem in the
CXL case because now you have a new toolkit in an attack that wants to
exfiltrate data from CPU caches.

> We might prefer to not enable ATS for untrusted devices and then fail
> to load drivers for the "ats always on" cases.

The current PCI untrusted flag is not fit for purpose in this new age of
PCI device authentication and CXL.cache capable devices.

> Or maybe we can enable one of the ATS security features someday,
> though I wonder if those work for CXL..

It should work, but before that I do not see the justification to say,
effectively:

"We have a less than perfect legacy way (PCI untrusted flag) to nod at
ATS security problems. Let us ignore even that for a new class of
devices that advertise they can trigger all the old security problems
plus new ones."

I do not immediately see what is wrong with requiring userspace policy
opt-in. That naturally gets replaced by installing the device's
certificate (for native PCI CMA), authenticating the device with the
TSM (for PCI IDE), or obviated by secure-ATS if that arrives.
On Wed, Jan 21, 2026 at 09:44:32PM -0800, dan.j.williams@intel.com wrote:
> Jason Gunthorpe wrote:
> > On Wed, Jan 21, 2026 at 10:03:07AM +0000, Jonathan Cameron wrote:
> > > On Wed, 21 Jan 2026 08:01:36 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > +Dan. I recalled an offline discussion in which he raised a concern
> > > > about having the kernel blindly enable ATS for cxl.cache devices
> > > > instead of creating a knob for the admin to configure from userspace
> > > > (in case security is viewed as more important than functionality,
> > > > given that it allows DMA to read data out of CPU caches)...
> > > >
> > > +CC Linux-cxl
> >
> > A cxl.cache device supporting ATS will automatically enable ATS today
> > if the kernel option to enable translation is set.
> >
> > Even if the device is marked untrusted by the PCI layer (e.g. an
> > external port).
> >
> > Yes, this is effectively a security issue, but it is not really a
> > CXL-specific problem.
>
> My contention is that it is a worse or at least different problem in the
> CXL case because now you have a new toolkit in an attack that wants to
> exfiltrate data from CPU caches.

?? I don't see CXL as meaningfully different than PCI in terms of what
data can be accessed with Translated requests. If the IOMMU doesn't
block Translated requests the whole system is open. CXL doesn't make
it more open.

> "We have a less than perfect legacy way (PCI untrusted flag) to nod at
> ATS security problems. Let us ignore even that for a new class of
> devices that advertise they can trigger all the old security problems
> plus new ones."

Ah, I missed that we are already force disabling ATS in this untrusted
case, so we should ensure that continues to be the case here too.
Nicolin, does it need a change?

> I do not immediately see what is wrong with requiring userspace policy
> opt-in. That naturally gets replaced by installing the device's
> certificate (for native PCI CMA), authenticating the device with the
> TSM (for PCI IDE), or obviated by secure-ATS if that arrives.

I think that goes back to the discussion about not loading drivers
before validating the device.

It would also make a lot of sense to leave the IOMMU blocking until the
driver is loaded for these secure situations. The blocking translation
should block ATS too.

Then the flow you are describing will work well:

1) At pre-boot the IOMMU will block all DMA including Translated.
2) The OS activates the IOMMU driver and keeps blocking.
3) Instead of immediately binding a default domain the IOMMU core
leaves the translation blocking.
4) The OS defers loading the driver to userspace.
5) Userspace measures the device and "accepts" it by loading the
driver
6) IOMMU core attaches a non-blocking default domain and activates ATS

Jason
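
(Side note for readers following along: the "force disabling" referred to
above is the untrusted check inside pci_ats_supported(), which the new
pci_ats_always_on() inherits through its early return. Sketched from memory,
so treat the body as approximate rather than a verbatim quote of
drivers/pci/ats.c.)
-----------------------------------------------------------------
/* Approximate shape of the existing gate (from memory) */
bool pci_ats_supported(struct pci_dev *dev)
{
	if (!dev->ats_cap)
		return false;

	/* Devices marked untrusted (e.g. behind an external port) never get ATS */
	return dev->untrusted == 0;
}
-----------------------------------------------------------------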
Jason Gunthorpe wrote:
> On Wed, Jan 21, 2026 at 09:44:32PM -0800, dan.j.williams@intel.com wrote:
> > Jason Gunthorpe wrote:
> > > On Wed, Jan 21, 2026 at 10:03:07AM +0000, Jonathan Cameron wrote:
> > > > On Wed, 21 Jan 2026 08:01:36 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >
> > > > > +Dan. I recalled an offline discussion in which he raised a concern
> > > > > about having the kernel blindly enable ATS for cxl.cache devices
> > > > > instead of creating a knob for the admin to configure from userspace
> > > > > (in case security is viewed as more important than functionality,
> > > > > given that it allows DMA to read data out of CPU caches)...
> > > > >
> > > > +CC Linux-cxl
> > >
> > > A cxl.cache device supporting ATS will automatically enable ATS today
> > > if the kernel option to enable translation is set.
> > >
> > > Even if the device is marked untrusted by the PCI layer (e.g. an
> > > external port).
> > >
> > > Yes, this is effectively a security issue, but it is not really a
> > > CXL-specific problem.
> >
> > My contention is that it is a worse or at least different problem in the
> > CXL case because now you have a new toolkit in an attack that wants to
> > exfiltrate data from CPU caches.
>
> ?? I don't see CXL as meaningfully different than PCI in terms of what
> data can be accessed with Translated requests. If the IOMMU doesn't
> block Translated requests the whole system is open. CXL doesn't make
> it more open.

Right, the game is mostly over in the current case, but CXL.cache still
deserves to be treated carefully. Consider a world where we do have
limitations against requests to HPAs that were never translated for the
device. In that scenario the device can help side-channel the contents
of HPAs to which it does not otherwise have access by messing with
aliased lines to which it does have access.

At a minimum CXL.cache is not improving the security story, so no time
like the present to put a policy mechanism in place that improves upon
the PCI untrusted flag.

> > "We have a less than perfect legacy way (PCI untrusted flag) to nod at
> > ATS security problems. Let us ignore even that for a new class of
> > devices that advertise they can trigger all the old security problems
> > plus new ones."
>
> Ah, I missed that we are already force disabling ATS in this untrusted
> case, so we should ensure that continues to be the case here too.
> Nicolin, does it need a change?
>
> > I do not immediately see what is wrong with requiring userspace policy
> > opt-in. That naturally gets replaced by installing the device's
> > certificate (for native PCI CMA), authenticating the device with the
> > TSM (for PCI IDE), or obviated by secure-ATS if that arrives.
>
> I think that goes back to the discussion about not loading drivers
> before validating the device.
>
> It would also make a lot of sense to leave the IOMMU blocking until the
> driver is loaded for these secure situations. The blocking translation
> should block ATS too.
>
> Then the flow you are describing will work well:
>
> 1) At pre-boot the IOMMU will block all DMA including Translated.
> 2) The OS activates the IOMMU driver and keeps blocking.
> 3) Instead of immediately binding a default domain the IOMMU core
> leaves the translation blocking.
> 4) The OS defers loading the driver to userspace.
> 5) Userspace measures the device and "accepts" it by loading the
> driver
> 6) IOMMU core attaches a non-blocking default domain and activates ATS

That works for me. Give the paranoid the ability to have a point where
they can be assured that the shields were not lowered prematurely.
> From: Williams, Dan J <dan.j.williams@intel.com>
> Sent: Friday, January 23, 2026 3:46 AM
>
> Jason Gunthorpe wrote:
> > On Wed, Jan 21, 2026 at 09:44:32PM -0800, dan.j.williams@intel.com
> wrote:
> > > I do not immediately see what is wrong with requiring userspace policy
> > > opt-in. That naturally gets replaced by installing the device's
> > > certificate (for native PCI CMA), authenticating the device with the
> > > TSM (for PCI IDE), or obviated by secure-ATS if that arrives.
> >
> > I think that goes back to the discussion about not loading drivers
> > before validating the device.
> >
> > It would also make a lot of sense to leave the IOMMU blocking until the
> > driver is loaded for these secure situations. The blocking translation
> > should block ATS too.
> >
> > Then the flow you are describing will work well:
> >
> > 1) At pre-boot the IOMMU will block all DMA including Translated.
> > 2) The OS activates the IOMMU driver and keeps blocking.
> > 3) Instead of immediately binding a default domain the IOMMU core
> > leaves the translation blocking.
> > 4) The OS defers loading the driver to userspace.
> > 5) Userspace measures the device and "accepts" it by loading the
> > driver
> > 6) IOMMU core attaches a non-blocking default domain and activates ATS
>
> That works for me. Give the paranoid the ability to have a point where
> they can be assured that the shields were not lowered prematurely.

Jason described the flow as "for these secure situations", i.e. not a general
requirement for cxl.cache, but iiuc Dan may instead want userspace policy
opt-in to be default (and with CMA/TSM etc. it gets easier)?

Better to clarify the agreement here, as the outcome decides whether to
continue what this series tries to do...

At a glance cxl.cache devices have gained ATS enabled automatically in
most cases (same as for all other ats-capable PCI devices):

- ARM: ATS is enabled automatically when attaching the default domain
to the device in certain configurations, and this series tries to auto
enable it in a missing configuration
- AMD: ATS is enabled at domain attach time
- Intel: ATS is enabled when a device is probed by intel-iommu driver
(incompatible with the suggested flow)

Given above already shipped in distributions, probably we have to keep
them for compatibility (implying this series makes sense to fix a gap
in existing policy), then treat the suggested flow as an enhancement
for future?
On Tue, Jan 27, 2026 at 08:10:06AM +0000, Tian, Kevin wrote:
> > From: Williams, Dan J <dan.j.williams@intel.com>
> > Sent: Friday, January 23, 2026 3:46 AM
> >
> > Jason Gunthorpe wrote:
> > > On Wed, Jan 21, 2026 at 09:44:32PM -0800, dan.j.williams@intel.com
> > wrote:
> > > > I do not immediately see what is wrong with requiring userspace policy
> > > > opt-in. That naturally gets replaced by installing the device's
> > > > certificate (for native PCI CMA), authenticating the device with the
> > > > TSM (for PCI IDE), or obviated by secure-ATS if that arrives.
> > >
> > > I think that goes back to the discussion about not loading drivers
> > > before validating the device.
> > >
> > > It would also make a lot of sense to leave the IOMMU blocking until the
> > > driver is loaded for these secure situations. The blocking translation
> > > should block ATS too.
> > >
> > > Then the flow you are describing will work well:
> > >
> > > 1) At pre-boot the IOMMU will block all DMA including Translated.
> > > 2) The OS activates the IOMMU driver and keeps blocking.
> > > 3) Instead of immediately binding a default domain the IOMMU core
> > > leaves the translation blocking.
> > > 4) The OS defers loading the driver to userspace.
> > > 5) Userspace measures the device and "accepts" it by loading the
> > > driver
> > > 6) IOMMU core attaches a non-blocking default domain and activates ATS
> >
> > That works for me. Give the paranoid the ability to have a point where
> > they can be assured that the shields were not lowered prematurely.
>
> Jason described the flow as "for these secure situations", i.e. not a general
> requirement for cxl.cache, but iiuc Dan may instead want userspace policy
> opt-in to be default (and with CMA/TSM etc. it gets easier)?

I think the general strategy has been to push userspace to do security
decisions before binding drivers. So we have a plan for confidential
compute VMs, and if there is interest then we can probably re-use that
plan in all other cases.

> At a glance cxl.cache devices have gained ATS enabled automatically in
> most cases (same as for all other ats-capable PCI devices):

Yes.

> - ARM: ATS is enabled automatically when attaching the default domain
> to the device in certain configurations, and this series tries to auto
> enable it in a missing configuration

Yes, ARM took the position that ATS should be left disabled for
IDENTITY both because of SMMU constraints and also because it made
some sense that you wouldn't want ATS overhead just to get a 1:1
translation.

> - AMD: ATS is enabled at domain attach time

I'd argue this is an error and it should work like ARM

> - Intel: ATS is enabled when a device is probed by intel-iommu driver
> (incompatible with the suggested flow)

This is definitely not a good choice :)

IMHO it is security required that the IOMMU driver block Translated
requests while a BLOCKED domain is attached, and while the IOMMU is
refusing ATS then device's ATS enable should be disabled.

> Given above already shipped in distributions, probably we have to keep
> them for compatibility (implying this series makes sense to fix a gap
> in existing policy), then treat the suggested flow as an enhancement
> for future?

I don't think we have a compatibility issue here, just a security
one.

Drivers need to ensure that ATS is disabled at PCI and Translated
requests blocked in IOMMU HW while a BLOCKED domain is attached.

Drivers can choose if they want to enable ATS for IDENTITY or not,
(recommend not for performance and consistency).

Jason
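
(A rough sketch of that rule as it might look in an IOMMU driver's attach
path; everything except the PCI ATS helpers, the iommu_domain types and the
pci_dev fields is hypothetical.)
-----------------------------------------------------------------
#include <linux/iommu.h>
#include <linux/pci.h>
#include <linux/pci-ats.h>

/* Hypothetical attach hook illustrating the BLOCKED/IDENTITY ATS rules */
static int example_attach_dev(struct iommu_domain *domain, struct device *dev)
{
	struct pci_dev *pdev;

	if (!dev_is_pci(dev))
		return -EINVAL;
	pdev = to_pci_dev(dev);

	switch (domain->type) {
	case IOMMU_DOMAIN_BLOCKED:
		/* BLOCKED must refuse Translated requests too, so ATS goes off */
		if (pdev->ats_enabled)
			pci_disable_ats(pdev);
		break;
	case IOMMU_DOMAIN_IDENTITY:
		/* Optional for 1:1; only the "always on" devices really need it */
		if (pci_ats_always_on(pdev) && pci_enable_ats(pdev, PAGE_SHIFT))
			return -ENODEV;
		break;
	default:
		/* Paging domains: keep enabling ATS "on demand" as today */
		break;
	}

	/* ...program the actual HW translation for 'domain' here... */
	return 0;
}
-----------------------------------------------------------------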
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, January 27, 2026 11:05 PM
>
> On Tue, Jan 27, 2026 at 08:10:06AM +0000, Tian, Kevin wrote:
> > > From: Williams, Dan J <dan.j.williams@intel.com>
> > > Sent: Friday, January 23, 2026 3:46 AM
> > >
> > > Jason Gunthorpe wrote:
> > > > On Wed, Jan 21, 2026 at 09:44:32PM -0800, dan.j.williams@intel.com
> > > wrote:
> > > > > I do not immediately see what is wrong with requiring userspace
> policy
> > > > > opt-in. That naturally gets replaced by installing the device's
> > > > > certificate (for native PCI CMA), authenticating the device with the
> > > > > TSM (for PCI IDE), or obviated by secure-ATS if that arrives.
> > > >
> > > > I think that goes back to the discussion about not loading drivers
> > > > before validating the device.
> > > >
> > > > It would also make alot of sense to leave the IOMMU blocking until the
> > > > driver is loaded for these secure situations. The blocking translation
> > > > should block ATS too.
> > > >
> > > > Then the flow you are describing will work well:
> > > >
> > > > 1) At pre-boot the IOMMU will block all DMA including Translated.
> > > > 2) The OS activates the IOMMU driver and keeps blocking.
> > > > 3) Instead of immediately binding a default domain the IOMMU core
> > > > leaves the translation blocking.
> > > > 4) The OS defers loading the driver to userspace.
> > > > 5) Userspace measures the device and "accepts" it by loading the
> > > > driver
> > > > 6) IOMMU core attaches a non-blocking default domain and activates
> ATS
> > >
> > > That works for me. Give the paranoid the ability to have a point where
> they
> > > can
> > > be assured that the shields were not lowered prematurely.
> >
> > Jason described the flow as "for these secure situations", i.e. not a general
> > requirement for cxl.cache, but iiuc Dan may instead want userspace policy
> > opt-in to be default (and with CMA/TSM etc. it gets easier)?
>
> I think the general strategy has been to push userspace to do security
> decisions before binding drivers. So we have a plan for confidential
> compute VMs, and if there is interest then we can probably re-use that
> plan in all other cases.
Makes sense.
>
> > At a glance cxl.cache devices have gained ATS enabled automatically in
> > most cases (same as for all other ats-capable PCI devices):
>
> Yes.
>
> > - ARM: ATS is enabled automatically when attaching the default domain
> > to the device in certain configurations, and this series tries to auto
> > enable it in a missing configuration
>
> Yes, ARM took the position that ATS should be left disabled for
> IDENTITY both because of SMMU constraints and also because it made
> some sense that you wouldn't want ATS overhead just to get a 1:1
> translation.
>
> > - AMD: ATS is enabled at domain attach time
>
> I'd argue this is an error and it should work like ARM
>
> > - Intel: ATS is enabled when a device is probed by intel-iommu driver
> > (incompatible with the suggested flow)
>
> This is definitely not a good choice :)
>
> IMHO it is security required that the IOMMU driver block Translated
> requests while a BLOCKED domain is attached, and while the IOMMU is
> refusing ATS then device's ATS enable should be disabled.
It was made that way by commit 5518f239aff1 ("iommu/vt-d: Move
scalable mode ATS enablement to probe path "). The old policy was
same as AMD side, and changed to current way so domain change
in RID won't affect the ATS requirement for PASIDs.
But I agree BLOCKED is special. Ideally there is no reason to have
RID attached to a BLOCKED domain while PASIDs are not
blocked... oh, wait
there is one scenario, e.g. VFIO allows domains attached to RID and
PASIDs being changed independently. It's a sane situation to have
userspace change RID domain via attach/detach/re-attach, while
PASID domains are still active. 'detach' will attach RID to a BLOCKED
domain, then disabling ATS in that window may break PASIDs.
How does ARM address this scenario? Is it more suitable to have
a new interface specific for driver bind/unbind to enable/disable
ATS instead of toggling it based on BLOCKED?
>
> > Given above already shipped in distributions, probably we have to keep
> > them for compatibility (implying this series makes sense to fix a gap
> > in existing policy), then treat the suggested flow as an enhancement
> > for future?
>
> I don't think we have a compatibility issue here, just a security
> one.
'compatibility issue' if auto-enabling is completely removed with
userspace opt-in as the only way. But since it's not the case, then
yes it's more a security one.
>
> Drivers need to ensure that ATS is disabled at PCI and Translated
> requests blocked in IOMMU HW while a BLOCKED domain is attached.
>
> Drivers can choose if they want to enable ATS for IDENTITY or not,
> (recommend not for performance and consistency).
>
> Jason
On Wed, Jan 28, 2026 at 12:57:59AM +0000, Tian, Kevin wrote:
> > > - Intel: ATS is enabled when a device is probed by intel-iommu driver
> > > (incompatible with the suggested flow)
> >
> > This is definitely not a good choice :)
> >
> > IMHO it is security required that the IOMMU driver block Translated
> > requests while a BLOCKED domain is attached, and while the IOMMU is
> > refusing ATS then device's ATS enable should be disabled.
>
> It was made that way by commit 5518f239aff1 ("iommu/vt-d: Move
> scalable mode ATS enablement to probe path "). The old policy was
> same as AMD side, and changed to current way so domain change
> in RID won't affect the ATS requirement for PASIDs.
That's a legitimate thing, but always-on is a heavy-handed solution.
The driver should track what is going on with the PASID and enable ATS
if required.
Which also solves this:
> there is one scenario, e.g. VFIO allows domains attached to RID and
> PASIDs being changed independently. It's a sane situation to have
> userspace change RID domain via attach/detach/re-attach, while
> PASID domains are still active. 'detach' will attach RID to a BLOCKED
> domain, then disabling ATS in that window may break PASIDs.
> How does ARM address this scenario? Is it more suitable to have
> a new interface specific for driver bind/unbind to enable/disable
> ATS instead of toggling it based on BLOCKED?
And that is what SMMUv3 is doing already. With an IDENTITY translation on
the RID it starts out with ATS disabled and switches to IDENTITY with
ATS enabled when the first PASID appears. Switches back when the PASID
goes away.
Jason
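
(Not the actual arm-smmu-v3 code, just a sketch of the idea: track how many
PASID domains are attached and toggle ATS on the IDENTITY RID translation on
the first/last one. The structure and helpers are invented; only the PCI ATS
calls are real.)
-----------------------------------------------------------------
#include <linux/pci.h>
#include <linux/pci-ats.h>

/* Invented per-device state, standing in for the driver's master struct */
struct example_master {
	struct pci_dev *pdev;
	unsigned int nr_attached_pasids;
	bool rid_is_identity;
};

/* First PASID domain: IDENTITY RID switches to IDENTITY + ATS enabled */
static int example_pasid_attach(struct example_master *m)
{
	if (m->rid_is_identity && m->nr_attached_pasids == 0) {
		int ret = pci_enable_ats(m->pdev, PAGE_SHIFT);

		if (ret)
			return ret;
	}
	m->nr_attached_pasids++;
	return 0;
}

/* Last PASID domain gone: switch back to IDENTITY without ATS */
static void example_pasid_detach(struct example_master *m)
{
	if (m->rid_is_identity && --m->nr_attached_pasids == 0)
		pci_disable_ats(m->pdev);
}
-----------------------------------------------------------------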
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, January 28, 2026 9:12 PM
>
> On Wed, Jan 28, 2026 at 12:57:59AM +0000, Tian, Kevin wrote:
> > > > - Intel: ATS is enabled when a device is probed by intel-iommu driver
> > > > (incompatible with the suggested flow)
> > >
> > > This is definitely not a good choice :)
> > >
> > > IMHO it is security required that the IOMMU driver block Translated
> > > requests while a BLOCKED domain is attached, and while the IOMMU is
> > > refusing ATS then device's ATS enable should be disabled.
> >
> > It was made that way by commit 5518f239aff1 ("iommu/vt-d: Move
> > scalable mode ATS enablement to probe path "). The old policy was
> > same as AMD side, and changed to current way so domain change
> > in RID won't affect the ATS requirement for PASIDs.
>
> That's a legitimate thing, but always-on is a heavy-handed solution.
Yes. At that point the rationale was made purely based on functionality
instead of security.
>
> The driver should track what is going on with the PASID and enable ATS
> if required.
>
> Which also solves this:
>
> > there is one scenario, e.g. VFIO allows domains attached to RID and
> > PASIDs being changed independently. It's a sane situation to have
> > userspace change RID domain via attach/detach/re-attach, while
> > PASID domains are still active. 'detach' will attach RID to a BLOCKED
> > domain, then disabling ATS in that window may break PASIDs.
>
> > How does ARM address this scenario? Is it more suitable to have
> > a new interface specific for driver bind/unbind to enable/disable
> > ATS instead of toggling it based on BLOCKED?
>
> And that is what SMMUv3 is doing already. With an IDENTITY translation on
> the RID it starts out with ATS disabled and switches to IDENTITY with
> ATS enabled when the first PASID appears. Switches back when the PASID
> goes away.
>
Yes, that should work. In that way the driver binding flow is covered
automatically.
Jason Gunthorpe wrote:
[..]
> > Jason described the flow as "for these secure situations", i.e. not a general
> > requirement for cxl.cache, but iiuc Dan may instead want userspace policy
> > opt-in to be default (and with CMA/TSM etc. it gets easier)?
>
> I think the general strategy has been to push userspace to do security
> decisions before binding drivers. So we have a plan for confidential
> compute VMs, and if there is interest then we can probably re-use that
> plan in all other cases.

Right, if you want to configure a kernel to automatically enable ATS
that is a choice. But, as distros get more security conscious about
devices for confidential compute, it would be nice to be able to rely on
the same opt-in model for other security concerns like ATS security.

> > At a glance cxl.cache devices have gained ATS enabled automatically in
> > most cases (same as for all other ats-capable PCI devices):
>
> Yes.
>
> > - ARM: ATS is enabled automatically when attaching the default domain
> > to the device in certain configurations, and this series tries to auto
> > enable it in a missing configuration
>
> Yes, ARM took the position that ATS should be left disabled for
> IDENTITY both because of SMMU constraints and also because it made
> some sense that you wouldn't want ATS overhead just to get a 1:1
> translation.

Does this mean that ARM already today does not enable ATS until driver
attach, or is incremental work needed for that capability?

> > - AMD: ATS is enabled at domain attach time
>
> I'd argue this is an error and it should work like ARM
>
> > - Intel: ATS is enabled when a device is probed by intel-iommu driver
> > (incompatible with the suggested flow)
>
> This is definitely not a good choice :)
>
> IMHO it is security required that the IOMMU driver block Translated
> requests while a BLOCKED domain is attached, and while the IOMMU is
> refusing ATS then device's ATS enable should be disabled.
>
> > Given above already shipped in distributions, probably we have to keep
> > them for compatibility (implying this series makes sense to fix a gap
> > in existing policy), then treat the suggested flow as an enhancement
> > for future?
>
> I don't think we have a compatibility issue here, just a security
> one.
>
> Drivers need to ensure that ATS is disabled at PCI and Translated
> requests blocked in IOMMU HW while a BLOCKED domain is attached.

"Drivers" here meaning IOMMU drivers, right?

> Drivers can choose if they want to enable ATS for IDENTITY or not,
> (recommend not for performance and consistency).
>
> Jason
On Tue, Jan 27, 2026 at 04:49:07PM -0800, dan.j.williams@intel.com wrote:
> > Yes, ARM took the position that ATS should be left disabled for
> > IDENTITY both because of SMMU constraints and also because it made
> > some sense that you wouldn't want ATS overhead just to get a 1:1
> > translation.
>
> Does this mean that ARM already today does not enable ATS until driver
> attach, or is incremental work needed for that capability?

All of the iommu drivers setup an iommu translation and enable ATS
before any driver is bound.

We would need to do more work in the core to leave the translation
blocked when there is no driver. I don't think it is that difficult

> > Drivers need to ensure that ATS is disabled at PCI and Translated
> > requests blocked in IOMMU HW while a BLOCKED domain is attached.
>
> "Drivers" here meaning IOMMU drivers, right?

Yes

Jason
On Wed, Jan 28, 2026 at 09:05:20AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 27, 2026 at 04:49:07PM -0800, dan.j.williams@intel.com wrote:
> > > Yes, ARM took the position that ATS should be left disabled for
> > > IDENTITY both because of SMMU constraints and also because it made
> > > some sense that you wouldn't want ATS overhead just to get a 1:1
> > > translation.
> >
> > Does this mean that ARM already today does not enable ATS until driver
> > attach, or is incremental work needed for that capability?
>
> All of the iommu drivers setup an iommu translation and enable ATS
> before any driver is bound.
>
> We would need to do more work in the core to leave the translation
> blocked when there is no driver. I don't think it is that difficult
Hmm, not sure if we could use group->domain=NULL as "blocked"..
Otherwise, I made a draft:
-----------------------------------------------------------------
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 349f31bedfa17..8ed15d5ea1f51 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -437,8 +437,6 @@ static int driver_sysfs_add(struct device *dev)
{
int ret;
- bus_notify(dev, BUS_NOTIFY_BIND_DRIVER);
-
ret = sysfs_create_link(&dev->driver->p->kobj, &dev->kobj,
kobject_name(&dev->kobj));
if (ret)
@@ -638,10 +636,12 @@ static int really_probe(struct device *dev, const struct device_driver *drv)
if (ret)
goto pinctrl_bind_failed;
+ bus_notify(dev, BUS_NOTIFY_BIND_DRIVER);
+
if (dev->bus->dma_configure) {
ret = dev->bus->dma_configure(dev);
if (ret)
- goto pinctrl_bind_failed;
+ goto bus_notify_bind_failed;
}
ret = driver_sysfs_add(dev);
@@ -717,9 +717,10 @@ static int really_probe(struct device *dev, const struct device_driver *drv)
probe_failed:
driver_sysfs_remove(dev);
sysfs_failed:
- bus_notify(dev, BUS_NOTIFY_DRIVER_NOT_BOUND);
if (dev->bus && dev->bus->dma_cleanup)
dev->bus->dma_cleanup(dev);
+bus_notify_bind_failed:
+ bus_notify(dev, BUS_NOTIFY_DRIVER_NOT_BOUND);
pinctrl_bind_failed:
device_links_no_driver(dev);
device_unbind_cleanup(dev);
@@ -1275,8 +1276,6 @@ static void __device_release_driver(struct device *dev, struct device *parent)
driver_sysfs_remove(dev);
- bus_notify(dev, BUS_NOTIFY_UNBIND_DRIVER);
-
pm_runtime_put_sync(dev);
device_remove(dev);
@@ -1284,6 +1283,8 @@ static void __device_release_driver(struct device *dev, struct device *parent)
if (dev->bus && dev->bus->dma_cleanup)
dev->bus->dma_cleanup(dev);
+ bus_notify(dev, BUS_NOTIFY_UNBIND_DRIVER);
+
device_unbind_cleanup(dev);
device_links_driver_cleanup(dev);
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 2ca990dfbb884..af53dce00e29b 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -106,6 +106,7 @@ static int __iommu_attach_group(struct iommu_domain *domain,
static struct iommu_domain *__iommu_paging_domain_alloc_flags(struct device *dev,
unsigned int type,
unsigned int flags);
+static int __iommu_group_alloc_blocking_domain(struct iommu_group *group);
enum {
IOMMU_SET_DOMAIN_MUST_SUCCEED = 1 << 0,
@@ -618,12 +619,6 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
ret = iommu_init_device(dev);
if (ret)
return ret;
- /*
- * And if we do now see any replay calls, they would indicate someone
- * misusing the dma_configure path outside bus code.
- */
- if (dev->driver)
- dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n");
group = dev->iommu_group;
gdev = iommu_group_alloc_device(group, dev);
@@ -641,6 +636,15 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
WARN_ON(group->default_domain && !group->domain);
if (group->default_domain)
iommu_create_device_direct_mappings(group->default_domain, dev);
+
+ /* Block translation requests from a device without driver */
+ if (!dev->driver) {
+ ret = __iommu_group_alloc_blocking_domain(group);
+ if (ret)
+ goto err_remove_gdev;
+ group->domain = group->blocking_domain;
+ }
+
if (group->domain) {
ret = __iommu_device_set_domain(group, dev, group->domain, NULL,
0);
@@ -1781,19 +1785,70 @@ static int probe_iommu_group(struct device *dev, void *data)
return ret;
}
+static int iommu_attach_default_domain(struct device *dev)
+{
+ struct iommu_group *group = iommu_group_get(dev);
+ int ret = 0;
+
+ if (!group)
+ return 0;
+
+ mutex_lock(&group->mutex);
+
+ if (group->blocking_domain) {
+ if (!group->default_domain) {
+ ret = iommu_setup_default_domain(group, 0);
+ if (!ret)
+ iommu_setup_dma_ops(dev);
+ } else if (group->domain == group->blocking_domain) {
+ ret = __iommu_group_set_domain(
+ group, group->default_domain);
+ }
+ }
+
+ mutex_unlock(&group->mutex);
+ iommu_group_put(group);
+ return ret;
+}
+
+static void iommu_detach_default_domain(struct device *dev)
+{
+ struct iommu_group *group = iommu_group_get(dev);
+
+ if (!group)
+ return;
+
+ mutex_lock(&group->mutex);
+
+ if (group->blocking_domain && group->domain != group->blocking_domain) {
+ __iommu_attach_device(group->blocking_domain, dev,
+ group->domain);
+ group->domain = group->blocking_domain;
+ }
+
+ mutex_unlock(&group->mutex);
+ iommu_group_put(group);
+}
+
static int iommu_bus_notifier(struct notifier_block *nb,
unsigned long action, void *data)
{
struct device *dev = data;
+ int ret;
if (action == BUS_NOTIFY_ADD_DEVICE) {
- int ret;
-
ret = iommu_probe_device(dev);
return (ret) ? NOTIFY_DONE : NOTIFY_OK;
} else if (action == BUS_NOTIFY_REMOVED_DEVICE) {
iommu_release_device(dev);
return NOTIFY_OK;
+ } else if (action == BUS_NOTIFY_BIND_DRIVER) {
+ ret = iommu_attach_default_domain(dev);
+ return ret ? NOTIFY_DONE : NOTIFY_OK;
+ } else if (action == BUS_NOTIFY_UNBOUND_DRIVER ||
+ action == BUS_NOTIFY_DRIVER_NOT_BOUND) {
+ iommu_detach_default_domain(dev);
+ return NOTIFY_OK;
}
return 0;
-----------------------------------------------------------------
Thanks
Nicolin
On Mon, Feb 02, 2026 at 09:13:50PM -0800, Nicolin Chen wrote:
> On Wed, Jan 28, 2026 at 09:05:20AM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 27, 2026 at 04:49:07PM -0800, dan.j.williams@intel.com wrote:
> > > > Yes, ARM took the position that ATS should be left disabled for
> > > > IDENTITY both because of SMMU constraints and also because it made
> > > > some sense that you wouldn't want ATS overhead just to get a 1:1
> > > > translation.
> > >
> > > Does this mean that ARM already today does not enable ATS until driver
> > > attach, or is incremental work needed for that capability?
> >
> > All of the iommu drivers setup an iommu translation and enable ATS
> > before any driver is bound.
> >
> > We would need to do more work in the core to leave the translation
> > blocked when there is no driver. I don't think it is that difficult
>
> Hmm, not sure if we could use group->domain=NULL as "blocked"..
Definitely not, we need to use a proper blocked domain.
> @@ -437,8 +437,6 @@ static int driver_sysfs_add(struct device *dev)
> {
> int ret;
>
> - bus_notify(dev, BUS_NOTIFY_BIND_DRIVER);
> -
> ret = sysfs_create_link(&dev->driver->p->kobj, &dev->kobj,
> kobject_name(&dev->kobj));
> if (ret)
> @@ -638,10 +636,12 @@ static int really_probe(struct device *dev, const struct device_driver *drv)
> if (ret)
> goto pinctrl_bind_failed;
>
> + bus_notify(dev, BUS_NOTIFY_BIND_DRIVER);
> +
> if (dev->bus->dma_configure) {
> ret = dev->bus->dma_configure(dev);
> if (ret)
> - goto pinctrl_bind_failed;
> + goto bus_notify_bind_failed;
> }
We shouldn't need any of these? The dma_configure callback already
gets into the iommu code to validate the domain and restrict VFIO,
no further callbacks should be needed.
When the iommu driver is probed to the device we can assume no driver
is bound and immediately attach the blocked domain.
Jason
On Tue, Feb 03, 2026 at 10:33:48AM -0400, Jason Gunthorpe wrote:
> On Mon, Feb 02, 2026 at 09:13:50PM -0800, Nicolin Chen wrote:
> > On Wed, Jan 28, 2026 at 09:05:20AM -0400, Jason Gunthorpe wrote:
> > > On Tue, Jan 27, 2026 at 04:49:07PM -0800, dan.j.williams@intel.com wrote:
> > > > > Yes, ARM took the position that ATS should be left disabled for
> > > > > IDENTITY both because of SMMU constraints and also because it made
> > > > > some sense that you wouldn't want ATS overhead just to get a 1:1
> > > > > translation.
> > > >
> > > > Does this mean that ARM already today does not enable ATS until driver
> > > > attach, or is incremental work needed for that capability?
> > >
> > > All of the iommu drivers setup an iommu translation and enable ATS
> > > before any driver is bound.
> > >
> > > We would need to do more work in the core to leave the translation
> > > blocked when there is no driver. I don't think it is that difficult
> >
> > Hmm, not sure if we could use group->domain=NULL as "blocked"..
>
> Definitely not, we need to use a proper blocked domain.
Yea, I suspected so.
> > @@ -437,8 +437,6 @@ static int driver_sysfs_add(struct device *dev)
> > {
> > int ret;
> >
> > - bus_notify(dev, BUS_NOTIFY_BIND_DRIVER);
> > -
> > ret = sysfs_create_link(&dev->driver->p->kobj, &dev->kobj,
> > kobject_name(&dev->kobj));
> > if (ret)
> > @@ -638,10 +636,12 @@ static int really_probe(struct device *dev, const struct device_driver *drv)
> > if (ret)
> > goto pinctrl_bind_failed;
> >
> > + bus_notify(dev, BUS_NOTIFY_BIND_DRIVER);
> > +
> > if (dev->bus->dma_configure) {
> > ret = dev->bus->dma_configure(dev);
> > if (ret)
> > - goto pinctrl_bind_failed;
> > + goto bus_notify_bind_failed;
> > }
>
> We shouldn't need any of these? The dma_configure callback already
> gets into the iommu code to validate the domain and restrict VFIO,
> no further callbacks should be needed.
>
> When the iommu driver is probed to the device we can assume no driver
> is bound and immediately attach the blocked domain.
I was trying to use dev->driver that gets set before dma_configure()
and unset after dma_cleanup(). But it looks like we could just keep
track of group->owner_cnt in iommu_device_use/unuse_default_domain().
Btw, attaching to IOMMU_DOMAIN_BLOCKED/group->blocking_domain is not
allowed in general if require_direct=true. I assume this case can be
an exception since there's no point in allowing a device that has no
driver yet to access any reserved region?
Thanks
Nicolin
On Tue, Feb 03, 2026 at 09:45:17AM -0800, Nicolin Chen wrote:
> Btw, attaching to IOMMU_DOMAIN_BLOCKED/group->blocking_domain is not
> allowed in general if require_direct=true. I assume this case can be
> an exception since there's no point in allowing a device that has no
> driver yet to access any reserved region?

If require_direct is set then we have to disable this mechanism..

I'm not sure exactly what to do about this as the require_direct comes
from the hypervisor in a CC VM and we probably don't want to give the
hypervisor this kind of escape hatch.

Perhaps we need to lock off to failure on CC VMs if this ever
happens..

But baremetal should just keep working how it always worked in this
case..

Jason
On 2026-02-03 5:55 pm, Jason Gunthorpe wrote:
> On Tue, Feb 03, 2026 at 09:45:17AM -0800, Nicolin Chen wrote:
>> Btw, attaching to IOMMU_DOMAIN_BLOCKED/group->blocking_domain is not
>> allowed in general if require_direct=true. I assume this case can be
>> an exception since there's no point in allowing a device that has no
>> driver yet to access any reserved region?

No, the point of RMRs in general is that the device can be assumed to
already be accessing them, and that access must be preserved, regardless
of whether an OS driver may or may not take over the device later. In
fact RMRs with the "Remapping Permitted" flag are only strictly needed
*until* an OS driver has taken control over whatever it was that
firmware left them doing.

> If require_direct is set then we have to disable this mechanism..
>
> I'm not sure exactly what to do about this as the require_direct comes
> from the hypervisor in a CC VM and we probably don't want to give the
> hypervisor this kind of escape hatch.
>
> Perhaps we need to lock off to failure on CC VMs if this ever
> happens..
>
> But baremetal should just keep working how it always worked in this
> case..

Realistically this combination cannot exist bare-metal, since if the device
requires to send ATS TT's to access an RMR then the SMMU would have to be
enabled pre-boot, so then the RMR means we cannot ever disable it to
reconfigure, so we'd be stuffed from the start...

Even though it's potentially a little more plausible in a VM where the
underlying S2 would satisfy ATS, for much the same reason it's still
going to look highly suspect from the VM's point of view to be presented
with a device whose apparent ability to perform ATS traffic through a
supposedly-disabled S1 SMMU must not be disrupted.

However I think there would be no point exposing the ATS details to
the VM to begin with. It's the host's decision to trust the device
to play in the translated PA space and system cache coherency
protocol, and no guest would be allowed to mess with those aspects
either way, so there seems no obvious good reason for them to know
at all.

Thanks,
Robin.
On Tue, Feb 03, 2026 at 06:59:35PM +0000, Robin Murphy wrote:
> Realistically this combination cannot exist bare-metal, since if the device
> requires to send ATS TT's to access an RMR then the SMMU would have to be
> enabled pre-boot, so then the RMR means we cannot ever disable it to
> reconfigure, so we'd be stuffed from the start...

This thread has gotten mixed up..

First this series as it is has nothing to do with RMRs.

What the latter part is discussing is a future series to implement
what I think MS calls "boot DMA security". Meaning we don't get into a
position of allowing a device access to OS memory, even through ATS
translated requests, until after userspace has approved the device.

This is something that should combine with Dynamic Root of Trust for
Measurement, as DRTM is much less useful if DMA can mutate the OS code
after the DRTM returns.

It is also meaningful for systems with encrypted PCI where the OS can
measure the PCI device before permitting it access to anything.

So... When we do implement this new security mode, what should it do
if FW attempts to attack the kernel with these nonsensical RMR
configurations? With DRTM we explicitly don't trust the FW for
security anymore, so it is a problem.

I strongly suspect the answer is that RMR has to be ignored in this
more secure mode.

> However I think there would be no point exposing the ATS details to
> the VM to begin with. It's the host's decision to trust the device
> to play in the translated PA space and system cache coherency
> protocol, and no guest would be allowed to mess with those aspects
> either way, so there seems no obvious good reason for them to know
> at all.

If the vSMMU is presented then the guest must be aware of the ATS
because only the guest can generate the ATC invalidations for changes
in the S1.

Without a vSMMU the ATS can be hidden from the guest.

Jason
On 2026-02-03 11:16 pm, Jason Gunthorpe wrote:
> On Tue, Feb 03, 2026 at 06:59:35PM +0000, Robin Murphy wrote:
>> Realistically this combination cannot exist bare-metal, since if the device
>> requires to send ATS TT's to access an RMR then the SMMU would have to be
>> enabled pre-boot, so then the RMR means we cannot ever disable it to
>> reconfigure, so we'd be stuffed from the start...
>
> This thread has gotten mixed up..
>
> First this series as it is has nothing to do with RMRs.

I know, but you brought up require_direct so I figured it was worth
clarifying that it should not in fact impact ATS decisions, since the
combination of a device requiring ATS while *also* requiring an RMR would
be essentially impossible to support given the SMMU architecture, thus we
can reasonably assume nobody will do such a thing (or at least do it with
any expectation of it ever working).

> What the latter part is discussing is a future series to implement
> what I think MS calls "boot DMA security". Meaning we don't get into a
> position of allowing a device access to OS memory, even through ATS
> translated requests, until after userspace has approved the device.

Pre-boot protection is in the same boat, as things currently stand - an
OS could either preserve security (by using GBPA to keep traffic blocked
while reconfiguring the rest of the SMMU), *or* have ongoing DMA for a
splash screen or whatever; it cannot realistically have both.

> This is something that should combine with Dynamic Root of Trust for
> Measurement, as DRTM is much less useful if DMA can mutate the OS code
> after the DRTM returns.
>
> It is also meaningful for systems with encrypted PCI where the OS can
> measure the PCI device before permitting it access to anything.
>
> So... When we do implement this new security mode, what should it do
> if FW attempts to attack the kernel with these nonsensical RMR
> configurations? With DRTM we explicitly don't trust the FW for
> security anymore, so it is a problem.
>
> I strongly suspect the answer is that RMR has to be ignored in this
> more secure mode.

Yes, I think the only valid case for having an RMR and expecting it to work
in combination with these other things is if the device has some firmware or
preloaded configuration in memory which it will still need to access at that
address once an OS driver starts using it, but does not need to access
*during* the boot-time handover. Thus it seems fair to still honour the
reserved regions upon attaching to a default domain, but not worry too much
about being in a transient blocking state in the interim if it's unavoidable
for other reasons (at worst maybe just log a warning that we're
doing so).

>> However I think there would be no point exposing the ATS details to
>> the VM to begin with. It's the host's decision to trust the device
>> to play in the translated PA space and system cache coherency
>> protocol, and no guest would be allowed to mess with those aspects
>> either way, so there seems no obvious good reason for them to know
>> at all.
>
> If the vSMMU is presented then the guest must be aware of the ATS
> because only the guest can generate the ATC invalidations for changes
> in the S1.

Only if you assume DVM or some other mechanism for the guest to issue S1
invalidations directly to the hardware - with an emulated CMDQ we can do
whatever we like.

And in fact, I think we actually *have* to if the host has enabled ATS
itself, since we cannot assume that a guest is going to choose to use it,
thus we cannot rely on the guest issuing ATCIs in order to get the correct
behaviour it expects unless and until we've seen it set EATS appropriately
in all the corresponding vSTEs. So we would have to forbid DVM, and only
allow other direct mechanisms that can be dynamically enabled for as long
as the guest configuration matches... Fun.

Thanks,
Robin.
On Wed, Feb 04, 2026 at 12:18:15PM +0000, Robin Murphy wrote:
> > I strongly suspect the answer is that RMR has to be ignored in this
> > more secure mode.
>
> Yes, I think the only valid case for having an RMR and expecting it to work
> in combination with these other things is if the device has some firmware or
> preloaded configuration in memory which it will still need to access at that
> address once an OS driver starts using it, but does not need to access
> *during* the boot-time handover.

Splash screens are the most obvious case here where the framebuffer may
be in DMA'able memory and must go through the iommu.. At least we are
already shipping products where the GPU has a DRAM-based framebuffer,
the GPU requires ATS for a lot of functions, but the framebuffer scan
out does not use ATS. Sigh. So that will be exciting to make work at
some point.

> Thus it seems fair to still honour the
> reserved regions upon attaching to a default domain, but not worry too much
> about being in a transient blocking state in the interim if it's unavoidable
> for other reasons (at worst maybe just log a warning that we're
> doing so).

The interest in the blocking state was to disable ATS.

Maybe another approach would be to have a "RMR blocking" domain which is
a paging domain that tells the driver explicitly not to enable ATS for
it. Then we could validate the RMR range is OK and install this special
domain and still have security against translated TLPs..

> > > However I think there would be no point exposing the ATS details to
> > > the VM to begin with. It's the host's decision to trust the device
> > > to play in the translated PA space and system cache coherency
> > > protocol, and no guest would be allowed to mess with those aspects
> > > either way, so there seems no obvious good reason for them to know
> > > at all.
> >
> > If the vSMMU is presented then the guest must be aware of the ATS
> > because only the guest can generate the ATC invalidations for changes
> > in the S1.
>
> Only if you assume DVM or some other mechanism for the guest to issue S1
> invalidations directly to the hardware - with an emulated CMDQ we can do
> whatever we like.

With a lot of work yes, but that is not the model that is implemented
today.

If the hypervisor has to generate an ATC invalidation from an IOTLB
invalidation then it also needs a map of ASID to RID&PASID, which it can
only build by inspecting all the CD tables. The VMMs in nesting mode
don't read the CD tables at all today, so they don't implement this
option.

> And in fact, I think we actually *have* to if the host has enabled ATS
> itself, since we cannot assume that a guest is going to choose to use it,
> thus we cannot rely on the guest issuing ATCIs in order to get the correct
> behaviour it expects unless and until we've seen it set EATS appropriately
> in all the corresponding vSTEs.

Due to the above we've done the reverse, the host does not get to
unilaterally decide ATS policy, it follows the guest's vEATS setting so
that we never have a situation where the hypervisor has to generate ATC
invalidations.

The kernel offers the VMM the freedom to do it either way, but today all
the VMMs I'm aware of choose the above path.

Jason
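To make the "RMR blocking" idea above a bit more concrete, here is a rough
sketch under heavy assumptions: the helper name is invented, the error
handling is minimal, and the crucial "do not enable ATS" hint does not exist
anywhere yet - adding that hint is the whole point of the proposal.

static struct iommu_domain *alloc_rmr_blocking_domain(struct device *dev)
{
	struct iommu_resv_region *region;
	struct iommu_domain *domain;
	LIST_HEAD(resv_regions);
	int ret = 0;

	domain = iommu_paging_domain_alloc(dev);
	if (IS_ERR(domain))
		return domain;

	/* Identity-map only the firmware-described direct-mapped windows */
	iommu_get_resv_regions(dev, &resv_regions);
	list_for_each_entry(region, &resv_regions, list) {
		if (region->type != IOMMU_RESV_DIRECT &&
		    region->type != IOMMU_RESV_DIRECT_RELAXABLE)
			continue;
		ret = iommu_map(domain, region->start, region->start,
				region->length, region->prot, GFP_KERNEL);
		if (ret)
			break;
	}
	iommu_put_resv_regions(dev, &resv_regions);

	if (ret) {
		iommu_domain_free(domain);
		return ERR_PTR(ret);
	}

	/*
	 * Still missing: a way to tell the IOMMU driver "do not enable ATS
	 * for devices attached to this domain" -- that hint is exactly what
	 * the proposal above would have to add.
	 */
	return domain;
}

Whether such a hint would live on the domain itself or be a property of the
attach path is the first design question such a series would have to settle.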
On Tue, Feb 03, 2026 at 06:59:35PM +0000, Robin Murphy wrote:
> On 2026-02-03 5:55 pm, Jason Gunthorpe wrote:
> > On Tue, Feb 03, 2026 at 09:45:17AM -0800, Nicolin Chen wrote:
> > > Btw, attaching to IOMMU_DOMAIN_BLOCKED/group->blocking_domain is not
> > > allowed in general if require_direct=true. I assume this case can be
> > > an exception since there's no point in allowing a device that has no
> > > driver yet to access any reserved region?
>
> No, the point of RMRs in general is that the device can be assumed to
> already be accessing them, and that access must be preserved, regardless of
> whether an OS driver may or may not take over the device later.

I see. Thanks for the input.

> In fact RMRs
> with the "Remapping Permitted" flag are only strictly needed *until* an OS
> driver has taken control over whatever it was that firmware left them doing.

Yes. I see that doesn't set require_direct.

Nicolin
On Tue, Feb 03, 2026 at 01:55:40PM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 03, 2026 at 09:45:17AM -0800, Nicolin Chen wrote:
> > Btw, attaching to IOMMU_DOMAIN_BLOCKED/group->blocking_domain is not
> > allowed in general if require_direct=true. I assume this case can be
> > an exception since there's no point in allowing a device that has no
> > driver yet to access any reserved region?
>
> If require_direct is set then we have to disable this mechanism..
>
> I'm not sure exactly what to do about this as the require_direct comes
> from the hypervisor in a CC VM and we probably don't want to give the
> hypervisor this kind of escape hatch.
>
> Perhaps we need to lock off to failure on CC VMs if this ever
> happens..
>
> But baremetal should just keep working how it always worked in this
> case..
OK. I will put a note in the patch, since it would literally skip
any VM case at this moment.
I just realized a corner case, as iommu_probe_device() may attach
the device to group->domain if it's set:
https://lore.kernel.org/all/9-v5-1b99ae392328+44574-iommu_err_unwind_jgg@nvidia.com/
I am not sure about the use case, but I assume we should skip the
blocking_domain as well in this case?
Then, this makes the condition be:
+ if (!dev->driver && !group->domain && !dev->iommu->require_direct) {
+ ret = __iommu_group_alloc_blocking_domain(group);
+ if (ret)
+ goto err_remove_gdev;
+ group->domain = group->blocking_domain;
+ }
Thanks
Nicolin
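Spelled out with comments, the proposed condition amounts to the following
(annotations added here for clarity; the helper and the error label come
from the work-in-progress tree referenced above, not necessarily mainline):

	/* Park a driverless device in a blocking domain only when that is safe */
	if (!dev->driver &&			/* no driver bound yet */
	    !group->domain &&			/* group not already attached elsewhere */
	    !dev->iommu->require_direct) {	/* no RMR-style direct mappings to preserve */
		ret = __iommu_group_alloc_blocking_domain(group);
		if (ret)
			goto err_remove_gdev;
		group->domain = group->blocking_domain;
	}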
On Tue, Feb 03, 2026 at 10:50:39AM -0800, Nicolin Chen wrote:
> On Tue, Feb 03, 2026 at 01:55:40PM -0400, Jason Gunthorpe wrote:
> > On Tue, Feb 03, 2026 at 09:45:17AM -0800, Nicolin Chen wrote:
> > > Btw, attaching to IOMMU_DOMAIN_BLOCKED/group->blocking_domain is not
> > > allowed in general if require_direct=true. I assume this case can be
> > > an exception since there's no point in allowing a device that has no
> > > driver yet to access any reserved region?
> >
> > If require_direct is set then we have to disable this mechanism..
> >
> > I'm not sure exactly what to do about this as the require_direct comes
> > from the hypervisor in a CC VM and we probably don't want to give the
> > hypervisor this kind of escape hatch.
> >
> > Perhaps we need to lock off to failure on CC VMs if this ever
> > happens..
> >
> > But baremetal should just keep working how it always worked in this
> > case..
>
> OK. I will put a note in the patch, since it would literally skip
> any VM case at this moment.
>
> I just realized a corner case, as iommu_probe_device() may attach
> the device to group->domain if it's set:
> https://lore.kernel.org/all/9-v5-1b99ae392328+44574-iommu_err_unwind_jgg@nvidia.com/
>
> I am not sure about the use case, but I assume we should skip the
> blocking_domain as well in this case?
>
> Then, this makes the condition be:
> + if (!dev->driver && !group->domain && !dev->iommu->require_direct) {
> + ret = __iommu_group_alloc_blocking_domain(group);
> + if (ret)
> + goto err_remove_gdev;
> + group->domain = group->blocking_domain;
> + }
Just to be clear this is some other project "DMA boot security" and
IDK if we need to do it until the CC patches land for user space
device binding policy, or someone seriously implements DRTM..
Jason
On Thu, Jan 22, 2026 at 09:14:32AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 21, 2026 at 09:44:32PM -0800, dan.j.williams@intel.com wrote:
> > "We have a less than perfect legacy way (PCI untrusted flag) to nod at
> > ATS security problems. Let us ignore even that for a new class of
> > devices that advertise they can trigger all the old security problems
> > plus new ones."
>
> Ah, I missed that we are already force disabling ATS in this untrusted
> case, so we should ensure that continues to be the case here
> too. Nicolin does it need a change?

pci_ats_always_on() validates against !pci_ats_supported(pdev), so
we ensured that untrusted devices would not be always on.

Perhaps we should highlight in the commit message, as it's a topic?

Nicolin
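In code, the ordering being described here is roughly the following (a
sketch of the patch's intent rather than a verbatim copy of it):

bool pci_ats_always_on(struct pci_dev *pdev)
{
	/* Covers both a missing ATS capability and dev->untrusted devices */
	if (!pci_ats_supported(pdev))
		return false;

	/* Only then consult the CXL.cache capability bit in the DVSEC */
	return pci_cxl_ats_always_on(pdev);
}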
On Thu, Jan 22, 2026 at 08:29:10AM -0800, Nicolin Chen wrote:
> On Thu, Jan 22, 2026 at 09:14:32AM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 21, 2026 at 09:44:32PM -0800, dan.j.williams@intel.com wrote:
> > > "We have a less than perfect legacy way (PCI untrusted flag) to nod at
> > > ATS security problems. Let us ignore even that for a new class of
> > > devices that advertise they can trigger all the old security problems
> > > plus new ones."
> >
> > Ah, I missed that we are already force disabling ATS in this untrusted
> > case, so we should ensure that continues to be the case here
> > too. Nicolin does it need a change?
>
> pci_ats_always_on() validates against !pci_ats_supported(pdev), so
> we ensured that untrusted devices would not be always on.
>
> Perhaps we should highlight in the commit message, as it's a topic?

Yes

Jason
On 1/21/26 21:03, Jason Gunthorpe wrote:
> On Wed, Jan 21, 2026 at 10:03:07AM +0000, Jonathan Cameron wrote:
>> On Wed, 21 Jan 2026 08:01:36 +0000
>> "Tian, Kevin"<kevin.tian@intel.com> wrote:
>>
>>> +Dan. I recalled an offline discussion in which he raised concern on
>>> having the kernel blindly enable ATS for cxl.cache device instead of
>>> creating a knob for admin to configure from userspace (in case
>>> security is viewed more important than functionality, upon allowing
>>> DMA to read data out of CPU caches)...
>>>
>> +CC Linux-cxl
> A cxl.cache device supporting ATS will automatically enable ATS today
> if the kernel option to enable translation is set.
>
> Even if the device is marked untrusted by the PCI layer (eg an
> external port).
I don't follow here. The untrusted check is now in pci_ats_supported():
/**
* pci_ats_supported - check if the device can use ATS
* @dev: the PCI device
*
* Returns true if the device supports ATS and is allowed to use it, false
* otherwise.
*/
bool pci_ats_supported(struct pci_dev *dev)
{
if (!dev->ats_cap)
return false;
return (dev->untrusted == 0);
}
EXPORT_SYMBOL_GPL(pci_ats_supported);
The iommu drivers (intel/amd/arm-smmuv3) all call pci_ats_supported()
before enabling ATS on a device. Anything I missed?
>
> Yes this is effectively a security issue, but it is not really a CXL
> specific problem.
>
> We might prefer to not enable ATS for untrusted devices and then fail to
> load drivers for "ats always on" cases.
>
> Or maybe we can enable one of the ATS security features someday,
> though I wonder if those work for CXL..
Thanks,
baolu
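For reference, the call pattern those drivers share looks roughly like the
sketch below (simplified; the function name is invented and no single
driver is copied here):

static int driver_enable_ats(struct device *dev)
{
	struct pci_dev *pdev;

	if (!dev_is_pci(dev))
		return -ENODEV;
	pdev = to_pci_dev(dev);

	/*
	 * pci_ats_supported() refuses both a missing ATS capability and
	 * dev->untrusted devices (e.g. behind an external-facing port),
	 * so untrusted devices never reach the enable below.
	 */
	if (!pci_ats_supported(pdev))
		return -EINVAL;

	/* Smallest Translation Unit expressed as a page shift */
	return pci_enable_ats(pdev, PAGE_SHIFT);
}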
On Thu, Jan 22, 2026 at 09:17:27AM +0800, Baolu Lu wrote:
> On 1/21/26 21:03, Jason Gunthorpe wrote:
> > On Wed, Jan 21, 2026 at 10:03:07AM +0000, Jonathan Cameron wrote:
> > > On Wed, 21 Jan 2026 08:01:36 +0000
> > > "Tian, Kevin"<kevin.tian@intel.com> wrote:
> > >
> > > > +Dan. I recalled an offline discussion in which he raised concern on
> > > > having the kernel blindly enable ATS for cxl.cache device instead of
> > > > creating a knob for admin to configure from userspace (in case
> > > > security is viewed more important than functionality, upon allowing
> > > > DMA to read data out of CPU caches)...
> > > >
> > > +CC Linux-cxl
> > A cxl.cache device supporting ATS will automatically enable ATS today
> > if the kernel option to enable translation is set.
> >
> > Even if the device is marked untrusted by the PCI layer (eg an
> > external port).
>
> I don't follow here. The untrusted check is now in pci_ats_supported():
>
> /**
> * pci_ats_supported - check if the device can use ATS
> * @dev: the PCI device
> *
> * Returns true if the device supports ATS and is allowed to use it, false
> * otherwise.
> */
> bool pci_ats_supported(struct pci_dev *dev)
> {
> if (!dev->ats_cap)
> return false;
>
> return (dev->untrusted == 0);
> }
> EXPORT_SYMBOL_GPL(pci_ats_supported);
>
> The iommu drivers (intel/amd/arm-smmuv3) all call pci_ats_supported()
> before enabling ATS on a device. Anything I missed?
No, not at all, I forgot about this!
Jason
On Fri, Jan 16, 2026 at 08:56:40PM -0800, Nicolin Chen wrote:
> Controlled by the IOMMU driver, ATS is usually enabled "on demand", when a
> device requests a translation service from its associated IOMMU HW running
> on the channel of a given PASID. This is working even when a device has no
> translation on its RID, i.e. RID is IOMMU bypassed.

I would add here that this is done to allow optimizing devices running in
IDENTITY translation, as there is no point in using ATS to return the same
value the device already has.

> For instance, the CXL spec notes in "3.2.5.13 Memory Type on CXL.cache":
> "To source requests on CXL.cache, devices need to get the Host Physical
> Address (HPA) from the Host by means of an ATS request on CXL.io."
> In other word, the CXL.cache capability relies on ATS. Otherwise, it won't
> have access to the host physical memory.
>
> Introduce a new pci_ats_always_on() for IOMMU driver to scan a PCI device,
> to shift ATS policies between "on demand" and "always on".
>
> Add the support for CXL.cache devices first. Non-CXL devices will be added
> in quirks.c file.
>
> Suggested-by: Vikram Sethi <vsethi@nvidia.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  include/linux/pci-ats.h       |  3 +++
>  include/uapi/linux/pci_regs.h |  5 ++++
>  drivers/pci/ats.c             | 44 +++++++++++++++++++++++++++++++++++
>  3 files changed, 52 insertions(+)

This implementation looks OK to me

Thanks,
Jason