Currently, we enforce the use of bounce buffers to ensure that memory
accessed by non-secure devices is explicitly shared with the host [1].
However, for secure devices, this approach must be avoided.
To achieve this, we introduce a device flag that controls whether a
bounce buffer allocation is required for the device. Additionally, this flag is
used to manage the top IPA bit assignment for setting up
protected/unprotected IPA aliases.
[1] commit fbf979a01375 ("arm64: Enforce bounce buffers for realm DMA")
based on changes from Alexey Kardashevskiy <aik@amd.com>
Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
---
arch/arm64/include/asm/mem_encrypt.h | 6 +-----
arch/arm64/mm/mem_encrypt.c | 10 ++++++++++
drivers/pci/tsm.c | 6 ++++++
include/linux/device.h | 1 +
include/linux/swiotlb.h | 4 ++++
5 files changed, 22 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/include/asm/mem_encrypt.h b/arch/arm64/include/asm/mem_encrypt.h
index 314b2b52025f..d77c10cd5b79 100644
--- a/arch/arm64/include/asm/mem_encrypt.h
+++ b/arch/arm64/include/asm/mem_encrypt.h
@@ -15,14 +15,10 @@ int arm64_mem_crypt_ops_register(const struct arm64_mem_crypt_ops *ops);
 
 int set_memory_encrypted(unsigned long addr, int numpages);
 int set_memory_decrypted(unsigned long addr, int numpages);
+bool force_dma_unencrypted(struct device *dev);
 
 int realm_register_memory_enc_ops(void);
 
-static inline bool force_dma_unencrypted(struct device *dev)
-{
-	return is_realm_world();
-}
-
 /*
  * For Arm CCA guests, canonical addresses are "encrypted", so no changes
  * required for dma_addr_encrypted().
diff --git a/arch/arm64/mm/mem_encrypt.c b/arch/arm64/mm/mem_encrypt.c
index ee3c0ab04384..279696a8af3f 100644
--- a/arch/arm64/mm/mem_encrypt.c
+++ b/arch/arm64/mm/mem_encrypt.c
@@ -17,6 +17,7 @@
 #include <linux/compiler.h>
 #include <linux/err.h>
 #include <linux/mm.h>
+#include <linux/device.h>
 
 #include <asm/mem_encrypt.h>
 
@@ -48,3 +49,12 @@ int set_memory_decrypted(unsigned long addr, int numpages)
 	return crypt_ops->decrypt(addr, numpages);
 }
 EXPORT_SYMBOL_GPL(set_memory_decrypted);
+
+bool force_dma_unencrypted(struct device *dev)
+{
+	if (dev->tdi_enabled)
+		return false;
+
+	return is_realm_world();
+}
+EXPORT_SYMBOL_GPL(force_dma_unencrypted);
diff --git a/drivers/pci/tsm.c b/drivers/pci/tsm.c
index e4a3b5b37939..60f50d57a725 100644
--- a/drivers/pci/tsm.c
+++ b/drivers/pci/tsm.c
@@ -120,6 +120,7 @@ static int pci_tsm_disconnect(struct pci_dev *pdev)
 
 	tsm_ops->disconnect(pdev);
 	tsm->state = PCI_TSM_INIT;
+	pdev->dev.tdi_enabled = false;
 
 	return 0;
 }
@@ -199,6 +200,8 @@ static int pci_tsm_accept(struct pci_dev *pdev)
 	if (rc)
 		return rc;
 	tsm->state = PCI_TSM_ACCEPT;
+	pdev->dev.tdi_enabled = true;
+
 	return 0;
 }
 
@@ -557,6 +560,9 @@ static void __pci_tsm_init(struct pci_dev *pdev)
 	default:
 		break;
 	}
+
+	/* FIXME!! should this be default true and switch to false for TEE capable device */
+	pdev->dev.tdi_enabled = false;
 }
 
 void pci_tsm_init(struct pci_dev *pdev)
diff --git a/include/linux/device.h b/include/linux/device.h
index 4940db137fff..d62e0dd9d8ee 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -688,6 +688,7 @@ struct device {
 #ifdef CONFIG_IOMMU_DMA
 	bool dma_iommu:1;
 #endif
+	bool tdi_enabled:1;
 };
 
 /**
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 3dae0f592063..61e7cff7768b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -173,6 +173,10 @@ static inline bool is_swiotlb_force_bounce(struct device *dev)
 {
 	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
 
+	if (dev->tdi_enabled) {
+		dev_warn_once(dev, "(TIO) Disable SWIOTLB");
+		return false;
+	}
 	return mem && mem->force_bounce;
 }
 
--
2.43.0
Hi Aneesh,

On Mon, Jul 28, 2025 at 07:21:41PM +0530, Aneesh Kumar K.V (Arm) wrote:
> Currently, we enforce the use of bounce buffers to ensure that memory
> accessed by non-secure devices is explicitly shared with the host [1].
> However, for secure devices, this approach must be avoided.

Sorry this might be a basic question, I just started looking into this.
I see that “force_dma_unencrypted” and “is_swiotlb_force_bounce” are only
used from DMA-direct, but it seems in your case it involves an IOMMU.
How does it influence bouncing in that case?

Thanks,
Mostafa
Mostafa Saleh <smostafa@google.com> writes:

> Sorry this might be a basic question, I just started looking into this.
> I see that “force_dma_unencrypted” and “is_swiotlb_force_bounce” are only
> used from DMA-direct, but it seems in your case it involves an IOMMU.
> How does it influence bouncing in that case?

With the current patchset, the guest does not have an assigned IOMMU (no
Stage1 SMMU), so guest DMA operations use DMA-direct.

For non-secure devices:
- Streaming DMA uses swiotlb, which is a shared pool with the hypervisor.
- Non-streaming DMA uses DMA-direct, and the attributes of the allocated
  memory are updated with dma_set_decrypted().

For secure devices, neither of these mechanisms is needed.

-aneesh
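To make the bounce decision above concrete, here is a small stand-alone
model (user-space, not kernel code) of the streaming-DMA choice implied
by the new flag; the struct and helper names are illustrative
assumptions, not the actual kernel APIs, and the real
is_swiotlb_force_bounce() additionally checks the io_tlb_mem state:

#include <stdbool.h>
#include <stdio.h>

struct dev_model {
	bool realm_world;	/* guest runs as a CCA realm */
	bool tdi_enabled;	/* device accepted via TSM (T=1) */
};

/* Loosely mirrors is_swiotlb_force_bounce() with this patch applied. */
static bool must_bounce(const struct dev_model *d)
{
	if (d->tdi_enabled)
		return false;		/* secure device: no bounce buffer */
	return d->realm_world;		/* non-secure device in a realm: bounce */
}

int main(void)
{
	struct dev_model nonsecure = { .realm_world = true, .tdi_enabled = false };
	struct dev_model secure    = { .realm_world = true, .tdi_enabled = true };

	printf("non-secure device bounces: %s\n", must_bounce(&nonsecure) ? "yes" : "no");
	printf("secure device bounces:     %s\n", must_bounce(&secure) ? "yes" : "no");
	return 0;
}

Built as a normal user-space program, it prints that only the non-secure
device bounces, which matches the behaviour described above.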
On Tue, Sep 16, 2025 at 09:45:18AM +0530, Aneesh Kumar K.V wrote:
> With the current patchset, the guest does not have an assigned IOMMU (no
> Stage1 SMMU), so guest DMA operations use DMA-direct.
>
> For non-secure devices:
> - Streaming DMA uses swiotlb, which is a shared pool with the hypervisor.
> - Non-streaming DMA uses DMA-direct, and the attributes of the allocated
>   memory are updated with dma_set_decrypted().
>
> For secure devices, neither of these mechanisms is needed.

I see, thanks for the explanation!

Thanks,
Mostafa
On Mon, Jul 28, 2025 at 07:21:41PM +0530, Aneesh Kumar K.V (Arm) wrote:
> Currently, we enforce the use of bounce buffers to ensure that memory
> accessed by non-secure devices is explicitly shared with the host [1].
> However, for secure devices, this approach must be avoided.

> @@ -557,6 +560,9 @@ static void __pci_tsm_init(struct pci_dev *pdev)
>  	default:
>  		break;
>  	}
> +
> +	/* FIXME!! should this be default true and switch to false for TEE capable device */

Fix whatever this is, or make it a real sentence and wrap to fit in 80
columns.

> +	pdev->dev.tdi_enabled = false;
>  }
On Mon, Jul 28, 2025 at 07:21:41PM +0530, Aneesh Kumar K.V (Arm) wrote:
> @@ -48,3 +49,12 @@ int set_memory_decrypted(unsigned long addr, int numpages)
>  	return crypt_ops->decrypt(addr, numpages);
>  }
>  EXPORT_SYMBOL_GPL(set_memory_decrypted);
> +
> +bool force_dma_unencrypted(struct device *dev)
> +{
> +	if (dev->tdi_enabled)
> +		return false;

Is this OK? I see code like this:

static inline dma_addr_t phys_to_dma_direct(struct device *dev,
		phys_addr_t phys)
{
	if (force_dma_unencrypted(dev))
		return phys_to_dma_unencrypted(dev, phys);
	return phys_to_dma(dev, phys);

What are the ARM rules for generating dma addresses?

1) Device is T=0, memory is unencrypted, call dma_addr_unencrypted()
   and do "top bit IPA set"

2) Device is T=1, memory is encrypted, use the phys_to_dma() normally

3) Device is T=1, memory is unencrypted, use the phys_to_dma()
   normally??? Seems odd, I would have guessed the DMA address should
   be the same as case #1?

Can you document this in a comment?

> diff --git a/include/linux/device.h b/include/linux/device.h
> index 4940db137fff..d62e0dd9d8ee 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -688,6 +688,7 @@ struct device {
>  #ifdef CONFIG_IOMMU_DMA
>  	bool dma_iommu:1;
>  #endif
> +	bool tdi_enabled:1;
>  };

I would give the dev->tdi_enabled a clearer name, maybe
dev->encrypted_dma_supported?

Also need to think carefully about whether a bitfield is OK here, we can't
locklessly change a bitfield so need to audit that all members are set
under, probably, the device lock or some other single threaded hand
waving. It seems believable it is like that but should be checked out,
and add a lockdep if it relies on the device lock.

Jason
On 28/7/25 20:03, Jason Gunthorpe wrote:
> On Mon, Jul 28, 2025 at 07:21:41PM +0530, Aneesh Kumar K.V (Arm) wrote:
>> +bool force_dma_unencrypted(struct device *dev)
>> +{
>> +	if (dev->tdi_enabled)
>> +		return false;
>
> Is this OK? I see code like this:
>
> static inline dma_addr_t phys_to_dma_direct(struct device *dev,
> 		phys_addr_t phys)
> {
> 	if (force_dma_unencrypted(dev))
> 		return phys_to_dma_unencrypted(dev, phys);
> 	return phys_to_dma(dev, phys);

I did write this in the first place so I'll comment :)

> What are the ARM rules for generating dma addresses?
>
> 1) Device is T=0, memory is unencrypted, call dma_addr_unencrypted()
>    and do "top bit IPA set"
>
> 2) Device is T=1, memory is encrypted, use the phys_to_dma() normally
>
> 3) Device is T=1, memory is unencrypted, use the phys_to_dma()
>    normally??? Seems odd, I would have guessed the DMA address should
>    be the same as case #1?
>
> Can you document this in a comment?

On AMD, T=1 only encrypts the PCIe traffic; when a DMA request hits the
IOMMU, the IOMMU decrypts it and then decides whether to encrypt it with
a memory key: if there is a secure vIOMMU - it will do what the Cbit says
in the guest IOMMU table (this is in the works) oooor just always set the
Cbit without a guest vIOMMU (which is a big knob per device and this is
what my patches do now).

And with a vIOMMU, I'd expect phys_to_dma_direct() not to be called as
this one is in a direct map path.

>> +	bool tdi_enabled:1;
>>  };
>
> I would give the dev->tdi_enabled a clearer name, maybe
> dev->encrypted_dma_supported?

Maybe, but "_enabled", not "_supported". And, ideally, with a vIOMMU, at
least AMD won't be needing it.

> Also need to think carefully about whether a bitfield is OK here, we can't
> locklessly change a bitfield so need to audit that all members are set
> under, probably, the device lock or some other single threaded hand
> waving. It seems believable it is like that but should be checked out,
> and add a lockdep if it relies on the device lock.

True.

> Jason

--
Alexey
On Tue, Aug 05, 2025 at 08:22:10PM +1000, Alexey Kardashevskiy wrote:
>> static inline dma_addr_t phys_to_dma_direct(struct device *dev,
>> 		phys_addr_t phys)
>> {
>> 	if (force_dma_unencrypted(dev))
>> 		return phys_to_dma_unencrypted(dev, phys);
>> 	return phys_to_dma(dev, phys);

On AMD what is the force_dma_unencrypted() for? I thought AMD had only
one IOMMU and effectively one S2 mapping. Why does it need to change the
phys depending on it being shared or private?

> On AMD, T=1 only encrypts the PCIe traffic; when a DMA request hits the
> IOMMU, the IOMMU decrypts it and then decides whether to encrypt it with
> a memory key: if there is a secure vIOMMU - it will do what the Cbit says
> in the guest IOMMU table (this is in the works) oooor just always set the
> Cbit without a guest vIOMMU (which is a big knob per device and this is
> what my patches do now).

AMD doesn't have the split IOMMU design that something like ARM has, so
it is a bit different..

On ARM the T=1 IOMMU should map the entire CPU address space, so any
IOVA with any address should just work. So I'd expect AMD and ARM to be
the same here.

For the T=0 IOMMU, ARM (I think) will only map the shared pages to the
shared IPA alias, so the guest VM has to ensure the shared physical
alias is used. Then it sounds like the CPU will sometimes accept the
private physical alias, and Linux will sometimes prefer the private
alias for the shared memory too, so Linux gets things muddled.

IMHO ARM probably should fix this much higher up the stack when it has
more information to tell if the phys_addr is actually the private alias
of a shared page.

>>> +	bool tdi_enabled:1;
>>>  };
>>
>> I would give the dev->tdi_enabled a clearer name, maybe
>> dev->encrypted_dma_supported?
>
> Maybe, but "_enabled", not "_supported". And, ideally, with a vIOMMU, at
> least AMD won't be needing it.

Yes

Jason
Jason Gunthorpe <jgg@ziepe.ca> writes:

> Is this OK? I see code like this:
>
> static inline dma_addr_t phys_to_dma_direct(struct device *dev,
> 		phys_addr_t phys)
> {
> 	if (force_dma_unencrypted(dev))
> 		return phys_to_dma_unencrypted(dev, phys);
> 	return phys_to_dma(dev, phys);
>
> What are the ARM rules for generating dma addresses?
>
> 1) Device is T=0, memory is unencrypted, call dma_addr_unencrypted()
>    and do "top bit IPA set"
>
> 2) Device is T=1, memory is encrypted, use the phys_to_dma() normally
>
> 3) Device is T=1, memory is unencrypted, use the phys_to_dma()
>    normally??? Seems odd, I would have guessed the DMA address should
>    be the same as case #1?
>
> Can you document this in a comment?

If a device is operating in secure mode (T=1), it is currently assumed
that only access to private (encrypted) memory is supported. It is
unclear whether devices would need to perform DMA to shared
(unencrypted) memory while operating in this mode, as TLPs with T=1 are
generally expected to target private memory.

Based on this assumption, T=1 devices will always access
private/encrypted memory, while T=0 devices will be restricted to
shared/unencrypted memory.

> I would give the dev->tdi_enabled a clearer name, maybe
> dev->encrypted_dma_supported?
>
> Also need to think carefully about whether a bitfield is OK here, we can't
> locklessly change a bitfield so need to audit that all members are set
> under, probably, the device lock or some other single threaded hand
> waving. It seems believable it is like that but should be checked out,
> and add a lockdep if it relies on the device lock.

Will check and update the patch.

-aneesh
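Read together with the three cases listed above, the stated assumption
reduces to a single rule keyed off the T bit alone. A small illustrative
user-space sketch of that rule; PROT_NS_SHARED, its bit position, and the
function name here are assumptions made purely for illustration:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PROT_NS_SHARED (1ULL << 52)	/* illustrative top IPA bit */

enum mem_kind { MEM_ENCRYPTED, MEM_UNENCRYPTED };

static uint64_t model_dma_addr(uint64_t phys, bool t_bit, enum mem_kind mem)
{
	(void)mem;	/* under the assumption above, only the T bit matters */
	if (!t_bit)
		return phys | PROT_NS_SHARED;	/* case 1: unprotected IPA alias */
	return phys;				/* cases 2 and 3: protected alias */
}

int main(void)
{
	uint64_t phys = 0x80000000ULL;

	printf("T=0, unencrypted: 0x%llx\n",
	       (unsigned long long)model_dma_addr(phys, false, MEM_UNENCRYPTED));
	printf("T=1, encrypted:   0x%llx\n",
	       (unsigned long long)model_dma_addr(phys, true, MEM_ENCRYPTED));
	printf("T=1, unencrypted: 0x%llx\n",
	       (unsigned long long)model_dma_addr(phys, true, MEM_UNENCRYPTED));
	return 0;
}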
On Tue, Jul 29, 2025 at 01:53:10PM +0530, Aneesh Kumar K.V wrote:
> If a device is operating in secure mode (T=1), it is currently assumed
> that only access to private (encrypted) memory is supported.

No, this is not how the PCI specs were written as far as I understand.
The XT bit thing is supposed to add more fine-grained device side
control over what memory the DMA can target. T alone does not do that.

> It is unclear whether devices would need to perform DMA to shared
> (unencrypted) memory while operating in this mode, as TLPs with T=1
> are generally expected to target private memory.

PCI SIG supports it, kernel should support it.

Jason
Jason Gunthorpe <jgg@ziepe.ca> writes:

>> It is unclear whether devices would need to perform DMA to shared
>> (unencrypted) memory while operating in this mode, as TLPs with T=1
>> are generally expected to target private memory.
>
> PCI SIG supports it, kernel should support it.

Would we also need a separate DMA allocation API for allocating
addresses intended to be shared with the non-secure hypervisor?

Are there any existing drivers in the kernel that already require such
shared allocations, which I could use as a reference?

-aneesh
On Sat, Aug 02, 2025 at 02:14:20PM +0530, Aneesh Kumar K.V wrote:
> Would we also need a separate DMA allocation API for allocating
> addresses intended to be shared with the non-secure hypervisor?
>
> Are there any existing drivers in the kernel that already require such
> shared allocations, which I could use as a reference?

The most likely case in the near term is PCI P2P to shared MMIO.

I don't know any way to allocate shared memory in a driver??

At the bare minimum this patch should be documenting the correct
architecture and outlining any gaps in the current implementation.

I also don't really understand what the above code is even doing..
Isn't the design on ARM that the IPA always encodes the shared/private
in the top bit?

How do we get a shared page that does not already have a phys_addr_t in
the shared IPA? Shouldn't the kernel have switched to the shared IPA
alias when it returned the swiotlb buffer? eg why do we need to do:

#define dma_addr_unencrypted(x) ((x) | PROT_NS_SHARED)

At all?

Suzuki?

Jason
Jason Gunthorpe <jgg@ziepe.ca> writes:

> I also don't really understand what the above code is even doing..
> Isn't the design on ARM that the IPA always encodes the shared/private
> in the top bit?
>
> How do we get a shared page that does not already have a phys_addr_t in
> the shared IPA? Shouldn't the kernel have switched to the shared IPA
> alias when it returned the swiotlb buffer? eg why do we need to do:
>
> #define dma_addr_unencrypted(x) ((x) | PROT_NS_SHARED)

The swiotlb virtual address is updated in the direct map page table so
that it has the correct attributes. For example,
swiotlb_update_mem_attributes() uses set_memory_decrypted() to mark the
memory as shared:

	set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);

However, when mapping swiotlb regions to obtain a `dma_addr_t`, we still
need to explicitly convert the physical address:

	swiotlb_map()
		swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, 0, dir, attrs);
		...
		/* Ensure that the address returned is DMA'ble */
		dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);

Note that we don't update the phys_addr_t to set the top bit. For
reference:

	tlb_addr = slot_addr(pool->start, index) + offset;

-aneesh
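In other words, the bounce-slot address stays a plain physical address on
the private alias, and only the generated dma_addr_t carries the shared
bit. A stand-alone sketch of that flow (user-space, illustrative only):
the pool base, the PROT_NS_SHARED value, and the helper names below are
assumptions, though the 2 KiB slot size matches swiotlb's IO_TLB_SHIFT:

#include <stdint.h>
#include <stdio.h>

#define PROT_NS_SHARED	(1ULL << 52)	/* illustrative top IPA bit */
#define IO_TLB_SHIFT	11		/* 2 KiB swiotlb slots */

/* Physical address of a slot inside the bounce pool (no top bit). */
static uint64_t slot_addr(uint64_t pool_start, unsigned int index)
{
	return pool_start + ((uint64_t)index << IO_TLB_SHIFT);
}

/* Models the CCA dma_addr_unencrypted() step: OR in the shared bit. */
static uint64_t dma_addr_unencrypted_model(uint64_t phys)
{
	return phys | PROT_NS_SHARED;
}

int main(void)
{
	uint64_t pool_start = 0x88000000ULL;	/* assumed pool base */
	uint64_t tlb_addr = slot_addr(pool_start, 3);

	printf("tlb_addr (phys, no top bit):   0x%llx\n",
	       (unsigned long long)tlb_addr);
	printf("dma_addr handed to the device: 0x%llx\n",
	       (unsigned long long)dma_addr_unencrypted_model(tlb_addr));
	return 0;
}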
On Mon, Aug 04, 2025 at 12:28:33PM +0530, Aneesh Kumar K.V wrote:
> Note that we don't update the phys_addr_t to set the top bit. For
> reference:
>
> 	tlb_addr = slot_addr(pool->start, index) + offset;

This seems unfortunate. So you end up with the private IPA space having
shared pages in it, so *sometimes* you have to force the unencrypted
bit?

Seems to me we should insist the phys_addr is canonicalised before
reaching the DMA API, i.e. that the swiotlb/etc code will set the right
IPA bit.

Jason
On 29/07/2025 15:33, Jason Gunthorpe wrote:
> On Tue, Jul 29, 2025 at 01:53:10PM +0530, Aneesh Kumar K.V wrote:
>> It is unclear whether devices would need to perform DMA to shared
>> (unencrypted) memory while operating in this mode, as TLPs with T=1
>> are generally expected to target private memory.
>
> PCI SIG supports it, kernel should support it.

ACK. On Arm CCA, the device can access shared IPA with a T=1 transaction
as long as the mapping is active in the Stage 2 managed by the RMM.

Rather than mapping the entire memory from the host, it would be ideal
if the CoCo VMs have some sort of a callback to "make sure the DMA
wouldn't fault for a device". E.g., it could be as simple as touching
the page in Arm CCA (GFP_ZERO could do the trick, well, one byte per
Granule is good), or an ACCEPT of a given page.

Is this a problem for AMD SEV-SNP / Intel TDX?

Suzuki

> Jason
On Wed, Jul 30, 2025 at 11:09:35AM +0100, Suzuki K Poulose wrote:
>>> It is unclear whether devices would need to perform DMA to shared
>>> (unencrypted) memory while operating in this mode, as TLPs with T=1
>>> are generally expected to target private memory.
>>
>> PCI SIG supports it, kernel should support it.
>
> ACK. On Arm CCA, the device can access shared IPA with a T=1 transaction
> as long as the mapping is active in the Stage 2 managed by the RMM.

Right, I expect that the T=0 SMMU S2 translation is a perfect subset of
the T=1 RMM S2 translation. At most, pages that are not available to
T=0 should be removed when making the subset.

I'm not sure what the plan is here on ARM though: do you expect to
pre-load the entire T=0 SMMU S2 with the shared IPA aliases and rely on
the GPT for protection, or will the hypervisor dynamically change the
T=0 SMMU S2 after each shared/private change? Same question for the RMM
S2?

The first option sounds fairly appealing, IMHO.

> Rather than mapping the entire memory from the host, it would be ideal
> if the CoCo VMs have some sort of a callback to "make sure the DMA
> wouldn't fault for a device".

Isn't that a different topic? For right now we expect that all pages
are pinned and loaded into both S2s. Upon any private/shared conversion
the pages should be reloaded into the appropriate S2s if required. The
VM never needs to tell the hypervisor that it wants to do DMA.

There are all sorts of options here to relax this, but exploring them
is an entirely different project than CCA, IMHO.

Jason
On 31/07/2025 13:17, Jason Gunthorpe wrote:
>> ACK. On Arm CCA, the device can access shared IPA with a T=1 transaction
>> as long as the mapping is active in the Stage 2 managed by the RMM.
>
> Right, I expect that the T=0 SMMU S2 translation is a perfect subset of
> the T=1 RMM S2 translation. At most, pages that are not available to
> T=0 should be removed when making the subset.

Yes, this is what the VMM is supposed to do today, see [0] & [1].

> I'm not sure what the plan is here on ARM though: do you expect to
> pre-load the entire T=0 SMMU S2 with the shared IPA aliases and rely on
> the GPT for protection, or will the hypervisor dynamically change the
> T=0 SMMU S2 after each shared/private change? Same question for

Yes, shared/private transitions do go all the way back to the VMM and it
is supposed to make the necessary changes to the SMMU S2 (as in [1]).

> the RMM S2?

As for the RMM S2, the current plan is to re-use the CPU S2 managed by
the RMM.

> The first option sounds fairly appealing, IMHO.
>
>> Rather than mapping the entire memory from the host, it would be ideal
>> if the CoCo VMs have some sort of a callback to "make sure the DMA
>> wouldn't fault for a device".
>
> Isn't that a different topic? For right now we expect that all pages
> are pinned and loaded into both S2s. Upon any private/shared conversion
> the pages should be reloaded into the appropriate S2s if required. The
> VM never needs to tell the hypervisor that it wants to do DMA.

Actually it is. But it might solve the problem for confidential VMs,
where the S2 mapping is kind of pinned.

Population of S2 is a bit tricky for CVMs, as there are restrictions
due to:

 1) Pre-boot measurements
 2) Restrictions on modifying the S2 (at least on CCA).

Thus, "the preload S2 and pin" must be done after the "initial images
are loaded". And that becomes tricky from the hypervisor (thinking of
this, the VMM may be able to do this properly, as long as it remembers
which areas were loaded).

Filling in the S2, with an already populated S2, is complicated for CCA
(costly, but not impossible). But the easier way is for the Realm to
fault in the pages before they are used for DMA (and S2 mappings can be
pinned by the hyp as default). Hence that suggestion.

Suzuki

[0] https://gitlab.arm.com/linux-arm/kvmtool-cca/-/commit/7c34972ddc
[1] https://gitlab.arm.com/linux-arm/kvmtool-cca/-/commit/ab4e654c4
On Thu, Jul 31, 2025 at 02:48:23PM +0100, Suzuki K Poulose wrote:
>> Right, I expect that the T=0 SMMU S2 translation is a perfect subset of
>> the T=1 RMM S2 translation. At most, pages that are not available to
>> T=0 should be removed when making the subset.
>
> Yes, this is what the VMM is supposed to do today, see [0] & [1].

Okay great!

>> I'm not sure what the plan is here on ARM though: do you expect to
>> pre-load the entire T=0 SMMU S2 with the shared IPA aliases and rely on
>> the GPT for protection, or will the hypervisor dynamically change the
>> T=0 SMMU S2 after each shared/private change? Same question for
>
> Yes, shared/private transitions do go all the way back to the VMM and it
> is supposed to make the necessary changes to the SMMU S2 (as in [1]).

Okay, it works, but also why?

From a hypervisor perspective when using VFIO I'd like the guestmemfd
to fix all the physical memory immediately, so the entire physical map
is fixed and known. Backed by 1G huge pages most likely.

Is there a reason not to just dump that into the T=0 SMMU using 1G huge
pages and never touch it again? The GPT provides protection?

Sure sounds appealing..

> As for the RMM S2, the current plan is to re-use the CPU S2 managed by
> the RMM.

Yes, but my question is if the CPU S2 will be prepopulated.

> Actually it is. But it might solve the problem for confidential VMs,
> where the S2 mapping is kind of pinned.

Not kind of pinned, it is pinned in the hypervisor..

> Population of S2 is a bit tricky for CVMs, as there are restrictions
> due to:
>
>  1) Pre-boot measurements
>  2) Restrictions on modifying the S2 (at least on CCA).

I haven't dug into any of this, but I'd challenge you to try to make it
run fast if the guestmemfd has a full fixed address map in 1G pages and
could just dump them into the RMM efficiently once during boot.

Perhaps there are ways to optimize the measurements for huge amounts of
zero'd memory.

> Filling in the S2, with an already populated S2, is complicated for CCA
> (costly, but not impossible). But the easier way is for the Realm to
> fault in the pages before they are used for DMA (and S2 mappings can be
> pinned by the hyp as default). Hence that suggestion.

I guess, but it's weird, kinda slow, and the RMM can never unfault them..

How will you reconstruct the 1G huge pages in the S2 if you are only
populating on faults? Can you really fault the entire 1G page? If so,
why can't it be prepopulated?

Jason
On 31/07/2025 17:44, Jason Gunthorpe wrote:
> Okay, it works, but also why?
>
> From a hypervisor perspective when using VFIO I'd like the guestmemfd
> to fix all the physical memory immediately, so the entire physical map
> is fixed and known. Backed by 1G huge pages most likely.
>
> Is there a reason not to just dump that into the T=0 SMMU using 1G huge
> pages and never touch it again? The GPT provides protection?

That is possible, once we get guest_memfd mmap support merged upstream.
The GPT does provide protection. The only caveat is, does guest_memfd
support this at all? I.e., shared->private transitions with a shared
mapping in place (though this is in the SMMU only, not the host CPU
page tables).

> Sure sounds appealing..
>
> I haven't dug into any of this, but I'd challenge you to try to make it
> run fast if the guestmemfd has a full fixed address map in 1G pages and
> could just dump them into the RMM efficiently once during boot.
>
> Perhaps there are ways to optimize the measurements for huge amounts of
> zero'd memory.

There is. We (the VMM) can choose not to "measure" the zero'd pages.

> How will you reconstruct the 1G huge pages in the S2 if you are only
> populating on faults? Can you really fault the entire 1G page? If so,
> why can't it be prepopulated?

It is tricky to prepopulate the 1G page, as parts of the pages may be
"populated" with contents.

We can recreate the 1G block mapping by "FOLD"ing the leaf level tables
all the way up to 1G, after the mappings are created. We have to do that
anyway for CCA.

I think we can go ahead with the VMM pre-populating the entire DRAM and
keeping it pinned for DA. Rather than doing this from the vfio kernel,
it could be done by the VMM, as it has better knowledge of the populated
contents and can map the rest as "unmeasured" zeros.

Suzuki
On Fri, Aug 01, 2025 at 10:30:35AM +0100, Suzuki K Poulose wrote:
>> Is there a reason not to just dump that into the T=0 SMMU using 1G huge
>> pages and never touch it again? The GPT provides protection?
>
> That is possible, once we get guest_memfd mmap support merged upstream.
> The GPT does provide protection. The only caveat is, does guest_memfd
> support this at all? I.e., shared->private transitions with a shared
> mapping in place (though this is in the SMMU only, not the host CPU
> page tables).

I don't know, we haven't got to the guestmemfd/IOMMU integration yet,
which is why I ask the questions.

I think AMD and ARM would both be interested in guestmemfd <-> iommu
working this way, at least.

> I think we can go ahead with the VMM pre-populating the entire DRAM and
> keeping it pinned for DA. Rather than doing this from the vfio kernel,
> it could be done by the VMM, as it has better knowledge of the populated
> contents and can map the rest as "unmeasured" zeros.

Yes, if done it should be done by the VMM and run through
guestmemfd/kvm however that is agreed to.

Jason