This patch series represents a rebase of an older patch series implemented and
submitted by Rahul Singh as an RFC:
https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/
Original patch series content is aligned with the latest xen structure in
terms of common/arch-specific code structuring.
Some minor bugfixes are also applied:
- Sanity checks / error handling
- Non-pci devices support for emulated iommu

Overall description of stage-1 support is available in the original
patch series cover letter. Original commits structure with detailed
explanation for each commit functionality is maintained.

Patch series testing is performed in qemu arm environment. Additionally,
stage-1 translation for non-pci devices is verified on a Renesas platform.

Jean-Philippe Brucker (1):
  xen/arm: smmuv3: Maintain a SID->device structure

Rahul Singh (19):
  xen/arm: smmuv3: Add support for stage-1 and nested stage translation
  xen/arm: smmuv3: Alloc io_domain for each device
  xen/arm: vIOMMU: add generic vIOMMU framework
  xen/arm: vsmmuv3: Add dummy support for virtual SMMUv3 for guests
  xen/domctl: Add XEN_DOMCTL_CONFIG_VIOMMU_* and viommu config param
  xen/arm: vIOMMU: Add cmdline boot option "viommu = <boolean>"
  xen/arm: vsmmuv3: Add support for registers emulation
  xen/arm: vsmmuv3: Add support for cmdqueue handling
  xen/arm: vsmmuv3: Add support for command CMD_CFGI_STE
  xen/arm: vsmmuv3: Attach Stage-1 configuration to SMMUv3 hardware
  xen/arm: vsmmuv3: Add support for event queue and global error
  xen/arm: vsmmuv3: Add "iommus" property node for dom0 devices
  xen/arm: vIOMMU: IOMMU device tree node for dom0
  xen/arm: vsmmuv3: Emulated SMMUv3 device tree node for dom0less
  arm/libxl: vsmmuv3: Emulated SMMUv3 device tree node in libxl
  xen/arm: vsmmuv3: Alloc virq for virtual SMMUv3
  xen/arm: vsmmuv3: Add support to send stage-1 event to guest
  libxl/arm: vIOMMU: Modify the partial device tree for iommus
  xen/arm: vIOMMU: Modify the partial device tree for dom0less

 docs/man/xl.cfg.5.pod.in                |  13 +
 docs/misc/xen-command-line.pandoc       |   7 +
 tools/golang/xenlight/helpers.gen.go    |   2 +
 tools/golang/xenlight/types.gen.go      |   1 +
 tools/include/libxl.h                   |   5 +
 tools/libs/light/libxl_arm.c            | 123 +++-
 tools/libs/light/libxl_types.idl        |   6 +
 tools/xl/xl_parse.c                     |  10 +
 xen/arch/arm/dom0less-build.c           |  72 ++
 xen/arch/arm/domain.c                   |  26 +
 xen/arch/arm/domain_build.c             | 103 ++-
 xen/arch/arm/include/asm/domain.h       |   4 +
 xen/arch/arm/include/asm/viommu.h       | 102 +++
 xen/common/device-tree/dom0less-build.c |  31 +-
 xen/drivers/passthrough/Kconfig         |  14 +
 xen/drivers/passthrough/arm/Makefile    |   2 +
 xen/drivers/passthrough/arm/smmu-v3.c   | 369 +++++++++-
 xen/drivers/passthrough/arm/smmu-v3.h   |  49 +-
 xen/drivers/passthrough/arm/viommu.c    |  87 +++
 xen/drivers/passthrough/arm/vsmmu-v3.c  | 895 ++++++++++++++++++++++++
 xen/drivers/passthrough/arm/vsmmu-v3.h  |  32 +
 xen/include/public/arch-arm.h           |  14 +-
 xen/include/public/device_tree_defs.h   |   1 +
 xen/include/xen/iommu.h                 |  14 +
 24 files changed, 1935 insertions(+), 47 deletions(-)
 create mode 100644 xen/arch/arm/include/asm/viommu.h
 create mode 100644 xen/drivers/passthrough/arm/viommu.c
 create mode 100644 xen/drivers/passthrough/arm/vsmmu-v3.c
 create mode 100644 xen/drivers/passthrough/arm/vsmmu-v3.h

-- 
2.43.0
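The "add generic vIOMMU framework" commit in the list above suggests an
ops-style split between common vIOMMU code and the vSMMUv3 backend. The
following is only an illustrative sketch of what such an interface can look
like; the type and function names (viommu_ops, viommu_register_type, and so
on) are hypothetical and do not necessarily match the series:

    /*
     * Illustrative sketch only: one ops table per emulated IOMMU type,
     * selected for a domain by the common vIOMMU code. Names are
     * hypothetical and are not taken from the posted patches.
     */
    #include <stdint.h>

    struct domain;                      /* Xen domain, opaque here */

    struct viommu_ops {
        const char *name;               /* e.g. "vsmmuv3" */
        int  (*domain_init)(struct domain *d);
        void (*domain_destroy)(struct domain *d);
        int  (*mmio_read)(struct domain *d, uint64_t addr, uint32_t size,
                          uint64_t *val);
        int  (*mmio_write)(struct domain *d, uint64_t addr, uint32_t size,
                           uint64_t val);
    };

    /* Keeping a single registered backend keeps a first iteration simple. */
    static const struct viommu_ops *cur_viommu;

    int viommu_register_type(const struct viommu_ops *ops)
    {
        if ( cur_viommu )
            return -1;                  /* only one backend supported */
        cur_viommu = ops;
        return 0;
    }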
Hi Milan,

On Thu, 7 Aug 2025 at 17:55, Milan Djokic <milan_djokic@epam.com> wrote:
> This patch series represents a rebase of an older patch series implemented
> and submitted by Rahul Singh as an RFC:
> https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/
> Original patch series content is aligned with the latest xen structure in
> terms of common/arch-specific code structuring.
> Some minor bugfixes are also applied:
> - Sanity checks / error handling
> - Non-pci devices support for emulated iommu
>
> Overall description of stage-1 support is available in the original
> patch series cover letter. Original commits structure with detailed
> explanation for each commit functionality is maintained.

I am a bit surprised not much has changed. Last time we asked for a document
to explain the overall design of the vSMMU including some details on the
security posture. I can’t remember if this was ever posted.

If not, then you need to start with that. Otherwise, it is going to be
pretty difficult to review this series.

Cheers,

> Patch series testing is performed in qemu arm environment. Additionally,
> stage-1 translation for non-pci devices is verified on a Renesas platform.
>
> Jean-Philippe Brucker (1):
>   xen/arm: smmuv3: Maintain a SID->device structure
>
> Rahul Singh (19):
>   xen/arm: smmuv3: Add support for stage-1 and nested stage translation
>   xen/arm: smmuv3: Alloc io_domain for each device
>   xen/arm: vIOMMU: add generic vIOMMU framework
>   xen/arm: vsmmuv3: Add dummy support for virtual SMMUv3 for guests
>   xen/domctl: Add XEN_DOMCTL_CONFIG_VIOMMU_* and viommu config param
>   xen/arm: vIOMMU: Add cmdline boot option "viommu = <boolean>"
>   xen/arm: vsmmuv3: Add support for registers emulation
>   xen/arm: vsmmuv3: Add support for cmdqueue handling
>   xen/arm: vsmmuv3: Add support for command CMD_CFGI_STE
>   xen/arm: vsmmuv3: Attach Stage-1 configuration to SMMUv3 hardware
>   xen/arm: vsmmuv3: Add support for event queue and global error
>   xen/arm: vsmmuv3: Add "iommus" property node for dom0 devices
>   xen/arm: vIOMMU: IOMMU device tree node for dom0
>   xen/arm: vsmmuv3: Emulated SMMUv3 device tree node for dom0less
>   arm/libxl: vsmmuv3: Emulated SMMUv3 device tree node in libxl
>   xen/arm: vsmmuv3: Alloc virq for virtual SMMUv3
>   xen/arm: vsmmuv3: Add support to send stage-1 event to guest
>   libxl/arm: vIOMMU: Modify the partial device tree for iommus
>   xen/arm: vIOMMU: Modify the partial device tree for dom0less
>
>  docs/man/xl.cfg.5.pod.in                |  13 +
>  docs/misc/xen-command-line.pandoc       |   7 +
>  tools/golang/xenlight/helpers.gen.go    |   2 +
>  tools/golang/xenlight/types.gen.go      |   1 +
>  tools/include/libxl.h                   |   5 +
>  tools/libs/light/libxl_arm.c            | 123 +++-
>  tools/libs/light/libxl_types.idl        |   6 +
>  tools/xl/xl_parse.c                     |  10 +
>  xen/arch/arm/dom0less-build.c           |  72 ++
>  xen/arch/arm/domain.c                   |  26 +
>  xen/arch/arm/domain_build.c             | 103 ++-
>  xen/arch/arm/include/asm/domain.h       |   4 +
>  xen/arch/arm/include/asm/viommu.h       | 102 +++
>  xen/common/device-tree/dom0less-build.c |  31 +-
>  xen/drivers/passthrough/Kconfig         |  14 +
>  xen/drivers/passthrough/arm/Makefile    |   2 +
>  xen/drivers/passthrough/arm/smmu-v3.c   | 369 +++++++++-
>  xen/drivers/passthrough/arm/smmu-v3.h   |  49 +-
>  xen/drivers/passthrough/arm/viommu.c    |  87 +++
>  xen/drivers/passthrough/arm/vsmmu-v3.c  | 895 ++++++++++++++++++++++++
>  xen/drivers/passthrough/arm/vsmmu-v3.h  |  32 +
>  xen/include/public/arch-arm.h           |  14 +-
>  xen/include/public/device_tree_defs.h   |   1 +
>  xen/include/xen/iommu.h                 |  14 +
>  24 files changed, 1935 insertions(+), 47 deletions(-)
>  create mode 100644 xen/arch/arm/include/asm/viommu.h
>  create mode 100644 xen/drivers/passthrough/arm/viommu.c
>  create mode 100644 xen/drivers/passthrough/arm/vsmmu-v3.c
>  create mode 100644 xen/drivers/passthrough/arm/vsmmu-v3.h
>
> --
> 2.43.0
On 8/7/25 19:58, Julien Grall wrote:
> Hi Milan,
>
> On Thu, 7 Aug 2025 at 17:55, Milan Djokic <milan_djokic@epam.com
> <mailto:milan_djokic@epam.com>> wrote:
>
> This patch series represents a rebase of an older patch series
> implemented and submitted by Rahul Singh as an RFC:
> https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/
> Original patch series content is aligned with the latest xen
> structure in terms of common/arch-specific code structuring.
> Some minor bugfixes are also applied:
> - Sanity checks / error handling
> - Non-pci devices support for emulated iommu
>
> Overall description of stage-1 support is available in the original
> patch series cover letter. Original commits structure with detailed
> explanation for each commit functionality is maintained.
>
> I am a bit surprised not much has changed. Last time we asked for a document
> to explain the overall design of the vSMMU including some details on the
> security posture. I can’t remember if this was ever posted.
>
> If not, then you need to start with that. Otherwise, it is going to be
> pretty difficult to review this series.
>
> Cheers,

Hello Julien,

We have prepared a design document and it will be part of the updated
patch series (added in docs/design). I'll also extend cover letter with
details on implementation structure to make review easier.

Following is the design document content which will be provided in
updated patch series:

Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================

Author: Milan Djokic <milan_djokic@epam.com>
Date: 2025-08-07
Status: Draft

Introduction
------------

The SMMUv3 supports two stages of translation. Each stage of translation
can be independently enabled. An incoming address is logically
translated from VA to IPA in stage 1, then the IPA is input to stage 2
which translates the IPA to the output PA. Stage 1 translation support
is required to provide isolation between different devices within the OS.

Xen already supports Stage 2 translation but there is no support for
Stage 1 translation. This design proposal outlines the introduction of
Stage-1 SMMUv3 support in Xen for ARM guests.

Motivation
----------

ARM systems utilizing SMMUv3 require Stage-1 address translation to
ensure correct and secure DMA behavior inside guests.
This feature enables:
- Stage-1 translation in guest domain
- Safe device passthrough under secure memory translation

Design Overview
---------------

These changes provide emulated SMMUv3 support:

- SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
SMMUv3 driver
- vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
- Register/Command Emulation: SMMUv3 register emulation and command
queue handling
- Device Tree Extensions: adds iommus and virtual SMMUv3 nodes to
device trees for dom0 and dom0less scenarios
- Runtime Configuration: introduces a 'viommu' boot parameter for
dynamic enablement

Security Considerations
------------------------

viommu security benefits:
- Stage-1 translation ensures guest devices cannot perform unauthorized
DMA
- Emulated SMMUv3 for domains removes dependency on host hardware while
maintaining isolation

Observations and Potential Risks
--------------------------------

1. Observation:
Support for Stage-1 translation introduces new data structures
(s1_cfg and s2_cfg) and logic to write both Stage-1 and Stage-2 entries
in the Stream Table Entry (STE), including an abort field for partial
config states.

Risk:
A partially applied Stage-1 configuration might leave guest DMA
mappings in an inconsistent state, enabling unauthorized access or
cross-domain interference.

Mitigation (Handled by design):
Both s1_cfg and s2_cfg are written atomically. The abort field ensures
Stage-1 config is only used when fully applied. Incomplete configs are
ignored by the hypervisor.

2. Observation:
Guests can now issue Stage-1 cache invalidations.

Risk:
Failure to propagate invalidations could leave stale mappings, enabling
data leakage or misrouting.

Mitigation (Handled by design):
Guest invalidations are forwarded to the hardware to ensure IOMMU
coherency.

3. Observation:
The feature introduces large functional changes including the vIOMMU
framework, vsmmuv3 devices, command queues, event queues, domain
handling, and Device Tree modifications.

Risk:
Increased attack surface with risk of race conditions, malformed
commands, or misconfiguration via the device tree.

Mitigation:
- Improved sanity checks and error handling
- Feature is marked as Tech Preview and self-contained to reduce risk
to unrelated code

4. Observation:
The implementation supports nested and standard translation modes,
using guest command queues (e.g. CMD_CFGI_STE) and events.

Risk:
Malicious commands could bypass validation and corrupt SMMUv3 state or
destabilize dom0.

Mitigation (Handled by design):
Command queues are validated, and only permitted configuration changes
are accepted. Handled in vsmmuv3 and cmdqueue logic.

5. Observation:
Device Tree changes inject iommus and vsmmuv3 nodes via libxl.

Risk:
Malicious or incorrect DT fragments could result in wrong device
assignments or hardware access.

Mitigation:
Only vetted and sanitized DT fragments are allowed. libxl limits what
guests can inject.

6. Observation:
The feature is enabled per-guest via viommu setting.

Risk:
Guests without viommu may behave differently, potentially causing
confusion, privilege drift, or accidental exposure.

Mitigation:
Ensure downgrade paths are safe. Perform isolation audits in
multi-guest environments to ensure correct behavior.

Performance Impact
------------------

Hardware-managed translations are expected to have minimal overhead.
Emulated vIOMMU may introduce some latency during initialization or
event processing.
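To make the "Runtime Configuration" item above concrete, a minimal, assumed
configuration sketch follows. The authoritative syntax is whatever the series
adds to docs/man/xl.cfg.5.pod.in and docs/misc/xen-command-line.pandoc; the
"smmuv3" value string is carried over from the original RFC rather than
confirmed here:

    # Xen command line: boolean switch added by this series to enable the
    # vIOMMU support globally.
    viommu=true

    # xl guest configuration: per-guest selection, disabled by default
    # (viommu=""). The accepted value is assumed to be "smmuv3".
    viommu = "smmuv3"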
Testing
-------

- QEMU-based testing for Stage-1 and nested translation
- Hardware testing on Renesas SMMUv3-enabled ARM systems
- Unit tests for translation accuracy (not yet implemented)

Migration and Compatibility
---------------------------

This feature is optional and disabled by default (viommu="") to ensure
backward compatibility.

References
----------

- Original implementation by Rahul Singh:
  https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/
- ARM SMMUv3 architecture documentation
- Existing vIOMMU code in Xen

BR,
Milan
On 13/08/2025 11:04, Milan Djokic wrote:
> Hello Julien,

Hi Milan,

>
> We have prepared a design document and it will be part of the updated
> patch series (added in docs/design). I'll also extend cover letter with
> details on implementation structure to make review easier.

I would suggest to just iterate on the design document for now.

> Following is the design document content which will be provided in
> updated patch series:
>
> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
> ==========================================================
>
> Author: Milan Djokic <milan_djokic@epam.com>
> Date: 2025-08-07
> Status: Draft
>
> Introduction
> ------------
>
> The SMMUv3 supports two stages of translation. Each stage of translation
> can be independently enabled. An incoming address is logically
> translated from VA to IPA in stage 1, then the IPA is input to stage 2
> which translates the IPA to the output PA. Stage 1 translation support
> is required to provide isolation between different devices within the OS.
>
> Xen already supports Stage 2 translation but there is no support for
> Stage 1 translation. This design proposal outlines the introduction of
> Stage-1 SMMUv3 support in Xen for ARM guests.
>
> Motivation
> ----------
>
> ARM systems utilizing SMMUv3 require Stage-1 address translation to
> ensure correct and secure DMA behavior inside guests.

Can you clarify what you mean by "correct"? DMA would still work without
stage-1.

>
> This feature enables:
> - Stage-1 translation in guest domain
> - Safe device passthrough under secure memory translation
>
> Design Overview
> ---------------
>
> These changes provide emulated SMMUv3 support:
>
> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
> SMMUv3 driver
> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling

So what are you planning to expose to a guest? Is it one vIOMMU per
pIOMMU? Or a single one?

Have you considered the pros/cons for both?

> - Register/Command Emulation: SMMUv3 register emulation and command
> queue handling

For each pSMMU, we have a single command queue that will receive commands
from all the guests. How do you plan to prevent a guest hogging the
command queue?

In addition to that, AFAIU, the size of the virtual command queue is
fixed by the guest rather than Xen. If a guest is filling up the queue
with commands before notifying Xen, how do you plan to ensure we don't
spend too much time in Xen (which is not preemptible)?

Lastly, what do you plan to expose? Is it a full vIOMMU (including event
forwarding)?

> - Device Tree Extensions: adds iommus and virtual SMMUv3 nodes to
> device trees for dom0 and dom0less scenarios
> - Runtime Configuration: introduces a 'viommu' boot parameter for
> dynamic enablement
>
> Security Considerations
> ------------------------
>
> viommu security benefits:
> - Stage-1 translation ensures guest devices cannot perform unauthorized
> DMA
> - Emulated SMMUv3 for domains removes dependency on host hardware while
> maintaining isolation

I don't understand this sentence.

>
> Observations and Potential Risks
> --------------------------------
>
> 1. Observation:
> Support for Stage-1 translation introduces new data structures
> (s1_cfg and s2_cfg) and logic to write both Stage-1 and Stage-2 entries
> in the Stream Table Entry (STE), including an abort field for partial
> config states.
>
> Risk:
> A partially applied Stage-1 configuration might leave guest DMA
> mappings in an inconsistent state, enabling unauthorized access or
> cross-domain interference.

I don't understand how a misconfigured stage-1 could lead to
cross-domain interference. Can you clarify?

>
> Mitigation (Handled by design):
> Both s1_cfg and s2_cfg are written atomically. The abort field ensures
> Stage-1 config is only used when fully applied. Incomplete configs are
> ignored by the hypervisor.
>
> 2. Observation:
> Guests can now issue Stage-1 cache invalidations.
>
> Risk:
> Failure to propagate invalidations could leave stale mappings, enabling
> data leakage or misrouting.

This is a risk from the guest PoV, right? IOW, this would not open up a
security hole in Xen.

>
> Mitigation (Handled by design):
> Guest invalidations are forwarded to the hardware to ensure IOMMU
> coherency.
>
> 3. Observation:
> The feature introduces large functional changes including the vIOMMU
> framework, vsmmuv3 devices, command queues, event queues, domain
> handling, and Device Tree modifications.
>
> Risk:
> Increased attack surface with risk of race conditions, malformed
> commands, or misconfiguration via the device tree.
>
> Mitigation:
> - Improved sanity checks and error handling
> - Feature is marked as Tech Preview and self-contained to reduce risk
> to unrelated code

Surely, you will want to use the code in production... No?

>
> 4. Observation:
> The implementation supports nested and standard translation modes,
> using guest command queues (e.g. CMD_CFGI_STE) and events.
>
> Risk:
> Malicious commands could bypass validation and corrupt SMMUv3 state or
> destabilize dom0.
>
> Mitigation (Handled by design):
> Command queues are validated, and only permitted configuration changes
> are accepted. Handled in vsmmuv3 and cmdqueue logic.

I didn't mention anything in observation 1 but now I have to say it...
The observations you wrote are what I would expect to be handled in any
submission to our code base. This is the bare minimum to have the code
secure. But you don't seem to address the more subtle ones which are
more related to scheduling issues (see some above). They require some
design and discussion.

>
> 5. Observation:
> Device Tree changes inject iommus and vsmmuv3 nodes via libxl.
>
> Risk:
> Malicious or incorrect DT fragments could result in wrong device
> assignments or hardware access.
>
> Mitigation:
> Only vetted and sanitized DT fragments are allowed. libxl limits what
> guests can inject.

Today, libxl doesn't do any sanitisation on the DT. In fact, this is
pretty much impossible because libfdt expects trusted DT. Is this
something you are planning to change?

>
> 6. Observation:
> The feature is enabled per-guest via viommu setting.
>
> Risk:
> Guests without viommu may behave differently, potentially causing
> confusion, privilege drift, or accidental exposure.
>
> Mitigation:
> Ensure downgrade paths are safe. Perform isolation audits in
> multi-guest environments to ensure correct behavior.
>
> Performance Impact
> ------------------
>
> Hardware-managed translations are expected to have minimal overhead.
> Emulated vIOMMU may introduce some latency during initialization or
> event processing.

Latency to who? We still expect isolation between guests and a guest
will not go over its time slice.

For the guest itself, the main performance impact will be TLB flushes
because they are commands that will end up to be emulated by Xen.
Depending on your Linux configuration (I haven't checked others), this
will either happen on every unmap operation or they will be batched. The
performance of the latter will be the worse one.

Have you done any benchmark to confirm the impact? Just to note, I would
not gate the work for virtual SMMUv3 based on the performance. I think
it is ok to offer the support if the user wants extra security and
doesn't care about performance. But it would be good to outline them as
I expect them to be pretty bad...

Cheers,

-- 
Julien Grall
Hello Julien,

On 8/13/25 14:11, Julien Grall wrote:
> On 13/08/2025 11:04, Milan Djokic wrote:
>> Hello Julien,
>
> Hi Milan,
>
>>
>> We have prepared a design document and it will be part of the updated
>> patch series (added in docs/design). I'll also extend cover letter with
>> details on implementation structure to make review easier.
>
> I would suggest to just iterate on the design document for now.
>
>> Following is the design document content which will be provided in
>> updated patch series:
>>
>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>> ==========================================================
>>
>> Author: Milan Djokic <milan_djokic@epam.com>
>> Date: 2025-08-07
>> Status: Draft
>>
>> Introduction
>> ------------
>>
>> The SMMUv3 supports two stages of translation. Each stage of translation
>> can be independently enabled. An incoming address is logically
>> translated from VA to IPA in stage 1, then the IPA is input to stage 2
>> which translates the IPA to the output PA. Stage 1 translation support
>> is required to provide isolation between different devices within the OS.
>>
>> Xen already supports Stage 2 translation but there is no support for
>> Stage 1 translation. This design proposal outlines the introduction of
>> Stage-1 SMMUv3 support in Xen for ARM guests.
>>
>> Motivation
>> ----------
>>
>> ARM systems utilizing SMMUv3 require Stage-1 address translation to
>> ensure correct and secure DMA behavior inside guests.
>
> Can you clarify what you mean by "correct"? DMA would still work
> without stage-1.

Correct in terms of working with guest managed I/O space. I'll rephrase
this statement, it seems ambiguous.

>>
>> This feature enables:
>> - Stage-1 translation in guest domain
>> - Safe device passthrough under secure memory translation
>>
>> Design Overview
>> ---------------
>>
>> These changes provide emulated SMMUv3 support:
>>
>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>> SMMUv3 driver
>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
>
> So what are you planning to expose to a guest? Is it one vIOMMU per
> pIOMMU? Or a single one?

Single vIOMMU model is used in this design.

> Have you considered the pros/cons for both?
>> - Register/Command Emulation: SMMUv3 register emulation and command
>> queue handling
>

That's a point for consideration.
Single vIOMMU prevails in terms of less complex implementation and a
simple guest iommu model - single vIOMMU node, one interrupt path,
event queue, single set of trap handlers for emulation, etc.
Cons for a single vIOMMU model could be less accurate hw representation
and a potential bottleneck with one emulated queue and interrupt path.
On the other hand, vIOMMU per pIOMMU provides more accurate hw modeling
and offers better scalability in case of many IOMMUs in the system, but
this comes with more complex emulation logic and device tree, also
handling multiple vIOMMUs on guest side.
IMO, single vIOMMU model seems like a better option mostly because it's
less complex, easier to maintain and debug. Of course, this decision can
and should be discussed.

> For each pSMMU, we have a single command queue that will receive commands
> from all the guests. How do you plan to prevent a guest hogging the
> command queue?
>
> In addition to that, AFAIU, the size of the virtual command queue is
> fixed by the guest rather than Xen.
> If a guest is filling up the queue with commands before notifying Xen,
> how do you plan to ensure we don't spend too much time in Xen (which is
> not preemptible)?

We'll have to do a detailed analysis on these scenarios, they are not
covered by the design (as well as some others which is clear after your
comments). I'll come back with an updated design.

> Lastly, what do you plan to expose? Is it a full vIOMMU (including event
> forwarding)?

Yes, implementation provides full vIOMMU functionality, with stage-1
event forwarding to guest.

>> - Device Tree Extensions: adds iommus and virtual SMMUv3 nodes to
>> device trees for dom0 and dom0less scenarios
>> - Runtime Configuration: introduces a 'viommu' boot parameter for
>> dynamic enablement
>>
>> Security Considerations
>> ------------------------
>>
>> viommu security benefits:
>> - Stage-1 translation ensures guest devices cannot perform unauthorized
>> DMA
>> - Emulated SMMUv3 for domains removes dependency on host hardware while
>> maintaining isolation
>
> I don't understand this sentence.

Current implementation emulates IOMMU with predefined capabilities,
exposed as a single vIOMMU to guest. That's where "removes dependency on
host hardware" came from. I'll rephrase this part to be more clear.

>>
>> Observations and Potential Risks
>> --------------------------------
>>
>> 1. Observation:
>> Support for Stage-1 translation introduces new data structures
>> (s1_cfg and s2_cfg) and logic to write both Stage-1 and Stage-2 entries
>> in the Stream Table Entry (STE), including an abort field for partial
>> config states.
>>
>> Risk:
>> A partially applied Stage-1 configuration might leave guest DMA
>> mappings in an inconsistent state, enabling unauthorized access or
>> cross-domain interference.
>
> I don't understand how a misconfigured stage-1 could lead to
> cross-domain interference. Can you clarify?

For stage-1 support, SID-to-device mapping and per-device io_domain
allocation is introduced in the Xen smmu driver, and we have to take
care that these mappings are valid all the time. If these are not
properly managed, structures and SIDs could be mapped to the wrong
device (and consequently the wrong guest) in some extreme cases. This is
covered by the design, but listed as a risk anyway for eventual future
updates in this area.

>>
>> Mitigation (Handled by design):
>> Both s1_cfg and s2_cfg are written atomically. The abort field ensures
>> Stage-1 config is only used when fully applied. Incomplete configs are
>> ignored by the hypervisor.
>>
>> 2. Observation:
>> Guests can now issue Stage-1 cache invalidations.
>>
>> Risk:
>> Failure to propagate invalidations could leave stale mappings, enabling
>> data leakage or misrouting.
>
> This is a risk from the guest PoV, right? IOW, this would not open up a
> security hole in Xen.

Yes, this is guest PoV, although still related to vIOMMU.

>>
>> Mitigation (Handled by design):
>> Guest invalidations are forwarded to the hardware to ensure IOMMU
>> coherency.
>>
>> 3. Observation:
>> The feature introduces large functional changes including the vIOMMU
>> framework, vsmmuv3 devices, command queues, event queues, domain
>> handling, and Device Tree modifications.
>>
>> Risk:
>> Increased attack surface with risk of race conditions, malformed
>> commands, or misconfiguration via the device tree.
>>
>> Mitigation:
>> - Improved sanity checks and error handling
>> - Feature is marked as Tech Preview and self-contained to reduce risk
>> to unrelated code
>
> Surely, you will want to use the code in production... No?

Yes, it is planned for production usage. At the moment, it is optionally
enabled (grouped under unsupported features), needs community feedback,
complete security analysis and performance benchmarking/optimization.
That's the reason it's marked as a Tech Preview at this point.

>>
>> 4. Observation:
>> The implementation supports nested and standard translation modes,
>> using guest command queues (e.g. CMD_CFGI_STE) and events.
>>
>> Risk:
>> Malicious commands could bypass validation and corrupt SMMUv3 state or
>> destabilize dom0.
>>
>> Mitigation (Handled by design):
>> Command queues are validated, and only permitted configuration changes
>> are accepted. Handled in vsmmuv3 and cmdqueue logic.
>
> I didn't mention anything in observation 1 but now I have to say it...
> The observations you wrote are what I would expect to be handled in any
> submission to our code base. This is the bare minimum to have the code
> secure. But you don't seem to address the more subtle ones which are
> more related to scheduling issues (see some above). They require some
> design and discussion.

Yes, it's clear to me after your comments that some important
observations are missing. We'll do additional analysis and come back
with a more complete design.

>>
>> 5. Observation:
>> Device Tree changes inject iommus and vsmmuv3 nodes via libxl.
>>
>> Risk:
>> Malicious or incorrect DT fragments could result in wrong device
>> assignments or hardware access.
>>
>> Mitigation:
>> Only vetted and sanitized DT fragments are allowed. libxl limits what
>> guests can inject.
>
> Today, libxl doesn't do any sanitisation on the DT. In fact, this is
> pretty much impossible because libfdt expects trusted DT. Is this
> something you are planning to change?

I've referred to libxl parsing only supported fragments/nodes from DT,
but yes, that's not actual sanitization. I'll update these statements.

>>
>> 6. Observation:
>> The feature is enabled per-guest via viommu setting.
>>
>> Risk:
>> Guests without viommu may behave differently, potentially causing
>> confusion, privilege drift, or accidental exposure.
>>
>> Mitigation:
>> Ensure downgrade paths are safe. Perform isolation audits in
>> multi-guest environments to ensure correct behavior.
>>
>> Performance Impact
>> ------------------
>>
>> Hardware-managed translations are expected to have minimal overhead.
>> Emulated vIOMMU may introduce some latency during initialization or
>> event processing.
>
> Latency to who? We still expect isolation between guests and a guest
> will not go over its time slice.

This is more related to comparison of emulated vs hw translation, and
overall overhead introduced with these mechanisms. I'll rephrase this
part to be more clear.

> For the guest itself, the main performance impact will be TLB flushes
> because they are commands that will end up to be emulated by Xen.
> Depending on your Linux configuration (I haven't checked others), this
> will either happen on every unmap operation or they will be batched. The
> performance of the latter will be the worse one.
>
> Have you done any benchmark to confirm the impact? Just to note, I would
> not gate the work for virtual SMMUv3 based on the performance. I think
> it is ok to offer the support if the user wants extra security and
> doesn't care about performance.
> But it would be good to outline them as I expect them to be pretty
> bad...

We haven't performed detailed benchmarking, just a measurement of boot
time and our domU application execution rate with and without viommu. We
could perform some measurements for viommu operations and add results in
this section.

Thank you for your feedback, I'll come back with an updated design
document for further review.

BR,
Milan
Hi Milan,

Milan Djokic <milan_djokic@epam.com> writes:

> Hello Julien,
>
> On 8/13/25 14:11, Julien Grall wrote:
>> On 13/08/2025 11:04, Milan Djokic wrote:
>>> Hello Julien,
>> Hi Milan,
>>
>>>
>>> We have prepared a design document and it will be part of the updated
>>> patch series (added in docs/design). I'll also extend cover letter with
>>> details on implementation structure to make review easier.
>> I would suggest to just iterate on the design document for now.
>>
>>> Following is the design document content which will be provided in
>>> updated patch series:
>>>
>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>>> ==========================================================
>>>
>>> Author: Milan Djokic <milan_djokic@epam.com>
>>> Date: 2025-08-07
>>> Status: Draft
>>>
>>> Introduction
>>> ------------
>>>
>>> The SMMUv3 supports two stages of translation. Each stage of translation
>>> can be independently enabled. An incoming address is logically
>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2
>>> which translates the IPA to the output PA. Stage 1 translation support
>>> is required to provide isolation between different devices within the OS.
>>>
>>> Xen already supports Stage 2 translation but there is no support for
>>> Stage 1 translation. This design proposal outlines the introduction of
>>> Stage-1 SMMUv3 support in Xen for ARM guests.
>>>
>>> Motivation
>>> ----------
>>>
>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to
>>> ensure correct and secure DMA behavior inside guests.
>> Can you clarify what you mean by "correct"? DMA would still work
>> without stage-1.
>
> Correct in terms of working with guest managed I/O space. I'll
> rephrase this statement, it seems ambiguous.
>
>>>
>>> This feature enables:
>>> - Stage-1 translation in guest domain
>>> - Safe device passthrough under secure memory translation
>>>
>>> Design Overview
>>> ---------------
>>>
>>> These changes provide emulated SMMUv3 support:
>>>
>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>>> SMMUv3 driver
>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
>> So what are you planning to expose to a guest? Is it one vIOMMU per
>> pIOMMU? Or a single one?
>
> Single vIOMMU model is used in this design.
>
>> Have you considered the pros/cons for both?
>>> - Register/Command Emulation: SMMUv3 register emulation and command
>>> queue handling
>>
>
> That's a point for consideration.
> Single vIOMMU prevails in terms of less complex implementation and a
> simple guest iommu model - single vIOMMU node, one interrupt path,
> event queue, single set of trap handlers for emulation, etc.
> Cons for a single vIOMMU model could be less accurate hw
> representation and a potential bottleneck with one emulated queue and
> interrupt path.
> On the other hand, vIOMMU per pIOMMU provides more accurate hw
> modeling and offers better scalability in case of many IOMMUs in the
> system, but this comes with more complex emulation logic and device
> tree, also handling multiple vIOMMUs on guest side.
> IMO, single vIOMMU model seems like a better option mostly because
> it's less complex, easier to maintain and debug. Of course, this
> decision can and should be discussed.
>

Well, I am not sure that this is possible, because of StreamID
allocation. The biggest offender is of course PCI, as each Root PCI
bridge will require its own SMMU instance with its own StreamID space.
But even without PCI you'll need some mechanism to map vStreamID to
<pSMMU, pStreamID>, because there will be overlaps in SID space.

Actually, PCI/vPCI with vSMMU is its own can of worms...

>> For each pSMMU, we have a single command queue that will receive commands
>> from all the guests. How do you plan to prevent a guest hogging the
>> command queue?
>> In addition to that, AFAIU, the size of the virtual command queue is
>> fixed by the guest rather than Xen. If a guest is filling up the queue
>> with commands before notifying Xen, how do you plan to ensure we don't
>> spend too much time in Xen (which is not preemptible)?
>>
> We'll have to do a detailed analysis on these scenarios, they are not
> covered by the design (as well as some others which is clear after
> your comments). I'll come back with an updated design.

I think that can be handled akin to hypercall continuation, which is
used in similar places, like P2M code

[...]

-- 
WBR, Volodymyr
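A minimal sketch of the continuation-style handling suggested above, assuming
the emulated command queue is drained in bounded batches and the saved
consumer index lets the remaining work resume on a later entry into the
emulator. All names and the batch size are hypothetical, not taken from the
series:

    #include <stdbool.h>
    #include <stdint.h>

    #define CMDQ_BATCH 64            /* commands handled per entry */

    struct vcmdq {
        uint32_t prod;               /* producer index, written by the guest */
        uint32_t cons;               /* consumer index, owned by the emulator */
        uint32_t mask;               /* queue size - 1 (power of two) */
    };

    /* Stub: decode and emulate one command; false means an illegal command. */
    static bool vsmmu_handle_one_cmd(struct vcmdq *q, uint32_t idx)
    {
        (void)q;
        (void)idx;
        return true;
    }

    /* Returns true when work remains, i.e. a continuation must be scheduled. */
    bool vsmmu_drain_cmdq(struct vcmdq *q)
    {
        unsigned int budget = CMDQ_BATCH;

        while ( q->cons != q->prod && budget-- )
        {
            if ( !vsmmu_handle_one_cmd(q, q->cons & q->mask) )
                break;               /* report an error to the guest instead */
            q->cons++;
        }

        return q->cons != q->prod;   /* caller re-schedules the remainder */
    }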
Hello Julien, Volodymyr

On 8/27/25 01:28, Volodymyr Babchuk wrote:
>
> Hi Milan,
>
> Milan Djokic <milan_djokic@epam.com> writes:
>
>> Hello Julien,
>>
>> On 8/13/25 14:11, Julien Grall wrote:
>>> On 13/08/2025 11:04, Milan Djokic wrote:
>>>> Hello Julien,
>>> Hi Milan,
>>>
>>>>
>>>> We have prepared a design document and it will be part of the updated
>>>> patch series (added in docs/design). I'll also extend cover letter with
>>>> details on implementation structure to make review easier.
>>> I would suggest to just iterate on the design document for now.
>>>
>>>> Following is the design document content which will be provided in
>>>> updated patch series:
>>>>
>>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>>>> ==========================================================
>>>>
>>>> Author: Milan Djokic <milan_djokic@epam.com>
>>>> Date: 2025-08-07
>>>> Status: Draft
>>>>
>>>> Introduction
>>>> ------------
>>>>
>>>> The SMMUv3 supports two stages of translation. Each stage of translation
>>>> can be independently enabled. An incoming address is logically
>>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2
>>>> which translates the IPA to the output PA. Stage 1 translation support
>>>> is required to provide isolation between different devices within the OS.
>>>>
>>>> Xen already supports Stage 2 translation but there is no support for
>>>> Stage 1 translation. This design proposal outlines the introduction of
>>>> Stage-1 SMMUv3 support in Xen for ARM guests.
>>>>
>>>> Motivation
>>>> ----------
>>>>
>>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to
>>>> ensure correct and secure DMA behavior inside guests.
>>> Can you clarify what you mean by "correct"? DMA would still work
>>> without stage-1.
>>
>> Correct in terms of working with guest managed I/O space. I'll
>> rephrase this statement, it seems ambiguous.
>>
>>>>
>>>> This feature enables:
>>>> - Stage-1 translation in guest domain
>>>> - Safe device passthrough under secure memory translation
>>>>
>>>> Design Overview
>>>> ---------------
>>>>
>>>> These changes provide emulated SMMUv3 support:
>>>>
>>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>>>> SMMUv3 driver
>>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
>>> So what are you planning to expose to a guest? Is it one vIOMMU per
>>> pIOMMU? Or a single one?
>>
>> Single vIOMMU model is used in this design.
>>
>>> Have you considered the pros/cons for both?
>>>> - Register/Command Emulation: SMMUv3 register emulation and command
>>>> queue handling
>>>
>>
>> That's a point for consideration.
>> Single vIOMMU prevails in terms of less complex implementation and a
>> simple guest iommu model - single vIOMMU node, one interrupt path,
>> event queue, single set of trap handlers for emulation, etc.
>> Cons for a single vIOMMU model could be less accurate hw
>> representation and a potential bottleneck with one emulated queue and
>> interrupt path.
>> On the other hand, vIOMMU per pIOMMU provides more accurate hw
>> modeling and offers better scalability in case of many IOMMUs in the
>> system, but this comes with more complex emulation logic and device
>> tree, also handling multiple vIOMMUs on guest side.
>>
>
> Well, I am not sure that this is possible, because of StreamID
> allocation. The biggest offender is of course PCI, as each Root PCI
> bridge will require its own SMMU instance with its own StreamID space.
> But even without PCI you'll need some mechanism to map vStreamID to
> <pSMMU, pStreamID>, because there will be overlaps in SID space.
>
> Actually, PCI/vPCI with vSMMU is its own can of worms...
>
>>> For each pSMMU, we have a single command queue that will receive commands
>>> from all the guests. How do you plan to prevent a guest hogging the
>>> command queue?
>>> In addition to that, AFAIU, the size of the virtual command queue is
>>> fixed by the guest rather than Xen. If a guest is filling up the queue
>>> with commands before notifying Xen, how do you plan to ensure we don't
>>> spend too much time in Xen (which is not preemptible)?
>>>
>>
>> We'll have to do a detailed analysis on these scenarios, they are not
>> covered by the design (as well as some others which is clear after
>> your comments). I'll come back with an updated design.
>
> I think that can be handled akin to hypercall continuation, which is
> used in similar places, like P2M code
>
> [...]
>

I have updated vIOMMU design document with additional security topics
covered and performance impact results. Also added some additional
explanations for vIOMMU components following your comments.

Updated document content:

===========================================================
Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
===========================================================

:Author: Milan Djokic <milan_djokic@epam.com>
:Date: 2025-08-07
:Status: Draft

Introduction
============

The SMMUv3 supports two stages of translation. Each stage of translation
can be independently enabled. An incoming address is logically translated
from VA to IPA in stage 1, then the IPA is input to stage 2 which
translates the IPA to the output PA. Stage 1 translation support is
required to provide isolation between different devices within the OS.
XEN already supports Stage 2 translation but there is no support for
Stage 1 translation.
This design proposal outlines the introduction of Stage-1 SMMUv3 support
in Xen for ARM guests.

Motivation
==========

ARM systems utilizing SMMUv3 require stage-1 address translation to
ensure secure DMA and guest managed I/O memory mappings.

This feature enables:

- Stage-1 translation in guest domain
- Safe device passthrough under secure memory translation

Design Overview
===============

These changes provide emulated SMMUv3 support:

- **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support
  in SMMUv3 driver.
- **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1
  handling.
- **Register/Command Emulation**: SMMUv3 register emulation and command
  queue handling.
- **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to
  device trees for dom0 and dom0less scenarios.
- **Runtime Configuration**: Introduces a `viommu` boot parameter for
  dynamic enablement.

vIOMMU is exposed to guest as a single device with predefined
capabilities and commands supported. Single vIOMMU model abstracts the
details of an actual IOMMU hardware, simplifying usage from the guest
point of view. Guest OS handles only a single IOMMU, even if multiple
IOMMU units are available on the host system.

Security Considerations
=======================

**viommu security benefits:**

- Stage-1 translation ensures guest devices cannot perform unauthorized
  DMA.
- Emulated IOMMU removes guest dependency on IOMMU hardware while
  maintaining domain isolation.

1. Observation:
---------------
Support for Stage-1 translation in SMMUv3 introduces new data structures
(`s1_cfg` alongside `s2_cfg`) and logic to write both Stage-1 and Stage-2
entries in the Stream Table Entry (STE), including an `abort` field to
handle partial configuration states.

**Risk:**
Without proper handling, a partially applied Stage-1 configuration might
leave guest DMA mappings in an inconsistent state, potentially enabling
unauthorized access or causing cross-domain interference.

**Mitigation:** *(Handled by design)*
This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to
STE and manages the `abort` field, only considering Stage-1
configuration if fully attached. This ensures incomplete or invalid
guest configurations are safely ignored by the hypervisor.

2. Observation:
---------------
Guests can now invalidate Stage-1 caches; invalidation needs forwarding
to SMMUv3 hardware to maintain coherence.

**Risk:**
Failing to propagate cache invalidation could allow stale mappings,
enabling access to old mappings and possibly data leakage or misrouting.

**Mitigation:** *(Handled by design)*
This feature ensures that guest-initiated invalidations are correctly
forwarded to the hardware, preserving IOMMU coherency.

3. Observation:
---------------
This design introduces substantial new functionality, including the
`vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command queues,
event queues, domain management, and Device Tree modifications (e.g.,
`iommus` nodes and `libxl` integration).

**Risk:**
Large feature expansions increase the attack surface: potential for race
conditions, unchecked command inputs, or Device Tree-based
misconfigurations.

**Mitigation:**
- Sanity checks and error-handling improvements have been introduced in
  this feature.
- Further audits have to be performed for this feature and its
  dependencies in this area. Currently, the feature is marked as *Tech
  Preview* and is self-contained, reducing the risk to unrelated
  components.

4. Observation:
---------------
The code includes transformations to handle nested translation versus
standard modes and uses guest-configured command queues (e.g.,
`CMD_CFGI_STE`) and event notifications.

**Risk:**
Malicious or malformed queue commands from guests could bypass
validation, manipulate SMMUv3 state, or cause Dom0 instability.

**Mitigation:** *(Handled by design)*
Built-in validation of command queue entries and sanitization mechanisms
ensure only permitted configurations are applied. This is supported via
additions in `vsmmuv3` and `cmdqueue` handling code.

5. Observation:
---------------
Device Tree modifications enable device assignment and configuration:
guest DT fragments (e.g., `iommus`) are added via `libxl`.

**Risk:**
Erroneous or malicious Device Tree injection could result in device
misbinding or guest access to unauthorized hardware.

**Mitigation:**
- `libxl` performs checks of the guest configuration and parses only
  predefined DT fragments and nodes, reducing risk.
- The system integrator must ensure correct resource mapping in the
  guest Device Tree (DT) fragments.

6. Observation:
---------------
Introducing optional per-guest enabled features (`viommu` argument in
the xl guest config) means some guests may opt out.

**Risk:**
Differences between guests with and without `viommu` may cause
unexpected behavior or privilege drift.
**Mitigation:**
Verify that downgrade paths are safe and well-isolated; ensure missing
support doesn't cause security issues. Additional audits on emulation
paths and domain interference need to be performed in a multi-guest
environment.

7. Observation:
---------------
Guests have the ability to issue Stage-1 IOMMU commands like cache
invalidation, stream table entry configuration, etc. An adversarial
guest may issue a high volume of commands in rapid succession.

**Risk:**
Excessive command requests can cause high hypervisor CPU consumption and
disrupt scheduling, leading to degraded system responsiveness and
potential denial-of-service scenarios.

**Mitigation:**
- The Xen credit scheduler limits guest vCPU execution time, securing
  basic guest rate-limiting.
- Batch multiple commands of the same type to reduce overhead on the
  virtual SMMUv3 hardware emulation.
- Implement vIOMMU command execution restart and continuation support.

8. Observation:
---------------
Some guest commands issued towards vIOMMU are propagated to the pIOMMU
command queue (e.g. TLB invalidate). For each pIOMMU, only one command
queue is available for all domains.

**Risk:**
Excessive command requests from an abusive guest can cause flooding of
the physical IOMMU command queue, leading to degraded pIOMMU
responsiveness on commands issued from other guests.

**Mitigation:**
- The Xen credit scheduler limits guest vCPU execution time, securing
  basic guest rate-limiting.
- Batch commands which should be propagated towards the pIOMMU command
  queue and enable support for batch execution pause/continuation.
- If possible, implement domain penalization by adding a per-domain cost
  counter for vIOMMU/pIOMMU usage.

9. Observation:
---------------
The vIOMMU feature includes an event queue used for forwarding IOMMU
events to the guest (e.g. translation faults, invalid stream IDs,
permission errors). A malicious guest can misconfigure its SMMU state or
intentionally trigger faults with high frequency.

**Risk:**
Occurrence of IOMMU events with high frequency can cause Xen to flood
the event queue and disrupt scheduling with high hypervisor CPU load for
event handling.

**Mitigation:**
- Implement a fail-safe state by disabling event forwarding when faults
  occur with high frequency and are not processed by the guest.
- Batch multiple events of the same type to reduce overhead on the
  virtual SMMUv3 hardware emulation.
- Consider disabling the event queue for untrusted guests.

Performance Impact
==================

With IOMMU stage-1 and nested translation included, performance overhead
is introduced compared to the existing, stage-2-only usage in Xen. Once
mappings are established, translations should not introduce significant
overhead. Emulated paths may introduce moderate overhead, primarily
affecting device initialization and event handling.

The performance impact highly depends on target CPU capabilities.
Testing was performed on a Cortex-A53 based platform. Performance is
mostly impacted by emulated vIOMMU operations; results are shown in the
following table.
+-------------------------------+---------------------------------+
| vIOMMU Operation              | Execution time in guest         |
+===============================+=================================+
| Reg read                      | median: 30μs, worst-case: 250μs |
+-------------------------------+---------------------------------+
| Reg write                     | median: 35μs, worst-case: 280μs |
+-------------------------------+---------------------------------+
| Invalidate TLB                | median: 90μs, worst-case: 1ms+  |
+-------------------------------+---------------------------------+
| Invalidate STE                | median: 450μs, worst-case: 7ms+ |
+-------------------------------+---------------------------------+

With vIOMMU exposed to the guest, the guest OS has to initialize the
IOMMU device and configure stage-1 mappings for the devices attached to
it. The following table shows the initialization stages which impact
stage-1 enabled guest boot time and compares them with a stage-1
disabled guest.

NOTE: Device probe execution time varies significantly depending on
device complexity. virtio-gpu was selected as a test case due to its
extensive use of dynamic DMA allocations and IOMMU mappings, making it a
suitable candidate for benchmarking stage-1 vIOMMU behavior.

+---------------------+-----------------------+------------------------+
| Stage               | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
+=====================+=======================+========================+
| IOMMU Init          | ~25ms                 | /                      |
+---------------------+-----------------------+------------------------+
| Dev Attach / Mapping| ~220ms                | ~200ms                 |
+---------------------+-----------------------+------------------------+

For devices configured with dynamic DMA mappings, the performance of DMA
allocate/map/unmap operations is also impacted on stage-1 enabled
guests. A dynamic DMA mapping operation issues emulated IOMMU operations
such as MMIO writes/reads and TLB invalidations. As a reference, the
following table shows performance results for runtime DMA operations for
the virtio-gpu device.

+---------------+-------------------------+----------------------------+
| DMA Op        | Stage-1 Enabled Guest   | Stage-1 Disabled Guest     |
+===============+=========================+============================+
| dma_alloc     | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs|
+---------------+-------------------------+----------------------------+
| dma_free      | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs |
+---------------+-------------------------+----------------------------+
| dma_map       | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs|
+---------------+-------------------------+----------------------------+
| dma_unmap     | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs |
+---------------+-------------------------+----------------------------+

Testing
=======

- QEMU-based ARM system tests for Stage-1 translation and nested
  virtualization.
- Actual hardware validation on platforms such as Renesas to ensure
  compatibility with real SMMUv3 implementations.
- Unit/Functional tests validating correct translations (not
  implemented).

Migration and Compatibility
===========================

This optional feature defaults to disabled (`viommu=""`) for backward
compatibility.

References
==========

- Original feature implemented by Rahul Singh:
  https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/
- SMMUv3 architecture documentation
- Existing vIOMMU code patterns
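To make the event-queue mitigation in observation 9 above concrete, here is a
hedged sketch of per-guest fault-rate gating: forwarding is disabled for the
rest of a measurement window once a guest exceeds a fault budget. Names and
thresholds are illustrative only and are not part of the posted series:

    #include <stdbool.h>
    #include <stdint.h>

    #define EVT_WINDOW_MS   1000    /* rate-measurement window */
    #define EVT_MAX_PER_WIN 256     /* faults tolerated per window */

    struct vsmmu_evt_limiter {
        uint64_t window_start_ms;   /* start of the current window */
        unsigned int count;         /* events forwarded in this window */
        bool gated;                 /* forwarding disabled (fail-safe) */
    };

    /* Returns true if the event may be forwarded to the guest's event queue. */
    bool vsmmu_evt_allow(struct vsmmu_evt_limiter *l, uint64_t now_ms)
    {
        if ( now_ms - l->window_start_ms >= EVT_WINDOW_MS )
        {
            l->window_start_ms = now_ms;
            l->count = 0;
            l->gated = false;       /* re-enable once the storm subsides */
        }

        if ( l->gated || ++l->count > EVT_MAX_PER_WIN )
        {
            l->gated = true;        /* drop further events this window */
            return false;
        }

        return true;
    }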
Hi Milan,

Thanks, the "Security Considerations" section looks really good. But I
have more questions.

Milan Djokic <milan_djokic@epam.com> writes:

> Hello Julien, Volodymyr
>
> On 8/27/25 01:28, Volodymyr Babchuk wrote:
>> Hi Milan,
>> Milan Djokic <milan_djokic@epam.com> writes:
>>
>>> Hello Julien,
>>>
>>> On 8/13/25 14:11, Julien Grall wrote:
>>>> On 13/08/2025 11:04, Milan Djokic wrote:
>>>>> Hello Julien,
>>>> Hi Milan,
>>>>
>>>>>
>>>>> We have prepared a design document and it will be part of the updated
>>>>> patch series (added in docs/design). I'll also extend cover letter with
>>>>> details on implementation structure to make review easier.
>>>> I would suggest to just iterate on the design document for now.
>>>>
>>>>> Following is the design document content which will be provided in
>>>>> updated patch series:
>>>>>
>>>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>>>>> ==========================================================
>>>>>
>>>>> Author: Milan Djokic <milan_djokic@epam.com>
>>>>> Date: 2025-08-07
>>>>> Status: Draft
>>>>>
>>>>> Introduction
>>>>> ------------
>>>>>
>>>>> The SMMUv3 supports two stages of translation. Each stage of translation
>>>>> can be independently enabled. An incoming address is logically
>>>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2
>>>>> which translates the IPA to the output PA. Stage 1 translation support
>>>>> is required to provide isolation between different devices within the OS.
>>>>>
>>>>> Xen already supports Stage 2 translation but there is no support for
>>>>> Stage 1 translation. This design proposal outlines the introduction of
>>>>> Stage-1 SMMUv3 support in Xen for ARM guests.
>>>>>
>>>>> Motivation
>>>>> ----------
>>>>>
>>>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to
>>>>> ensure correct and secure DMA behavior inside guests.
>>>> Can you clarify what you mean by "correct"? DMA would still work
>>>> without stage-1.
>>>
>>> Correct in terms of working with guest managed I/O space. I'll
>>> rephrase this statement, it seems ambiguous.
>>>
>>>>>
>>>>> This feature enables:
>>>>> - Stage-1 translation in guest domain
>>>>> - Safe device passthrough under secure memory translation
>>>>>
>>>>> Design Overview
>>>>> ---------------
>>>>>
>>>>> These changes provide emulated SMMUv3 support:
>>>>>
>>>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>>>>> SMMUv3 driver
>>>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
>>>> So what are you planning to expose to a guest? Is it one vIOMMU per
>>>> pIOMMU? Or a single one?
>>>
>>> Single vIOMMU model is used in this design.
>>>
>>>> Have you considered the pros/cons for both?
>>>>> - Register/Command Emulation: SMMUv3 register emulation and command
>>>>> queue handling
>>>>
>>>
>>> That's a point for consideration.
>>> Single vIOMMU prevails in terms of less complex implementation and a
>>> simple guest iommu model - single vIOMMU node, one interrupt path,
>>> event queue, single set of trap handlers for emulation, etc.
>>> Cons for a single vIOMMU model could be less accurate hw
>>> representation and a potential bottleneck with one emulated queue and
>>> interrupt path.
>>> On the other hand, vIOMMU per pIOMMU provides more accurate hw
>>> modeling and offers better scalability in case of many IOMMUs in the
>>> system, but this comes with more complex emulation logic and device
>>> tree, also handling multiple vIOMMUs on guest side.
>>> IMO, single vIOMMU model seems like a better option mostly because >>> it's less complex, easier to maintain and debug. Of course, this >>> decision can and should be discussed. >>> >> Well, I am not sure that this is possible, because of StreamID >> allocation. The biggest offender is of course PCI, as each Root PCI >> bridge will require own SMMU instance with own StreamID space. But even >> without PCI you'll need some mechanism to map vStremID to >> <pSMMU, pStreamID>, because there will be overlaps in SID space. >> Actually, PCI/vPCI with vSMMU is its own can of worms... >> >>>> For each pSMMU, we have a single command queue that will receive command >>>> from all the guests. How do you plan to prevent a guest hogging the >>>> command queue? >>>> In addition to that, AFAIU, the size of the virtual command queue is >>>> fixed by the guest rather than Xen. If a guest is filling up the queue >>>> with commands before notifying Xen, how do you plan to ensure we don't >>>> spend too much time in Xen (which is not preemptible)? >>>> >>> >>> We'll have to do a detailed analysis on these scenarios, they are not >>> covered by the design (as well as some others which is clear after >>> your comments). I'll come back with an updated design. >> I think that can be handled akin to hypercall continuation, which is >> used in similar places, like P2M code >> [...] >> > > I have updated vIOMMU design document with additional security topics > covered and performance impact results. Also added some additional > explanations for vIOMMU components following your comments. > Updated document content: > > =============================================== > Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests > =============================================== > > :Author: Milan Djokic <milan_djokic@epam.com> > :Date: 2025-08-07 > :Status: Draft > > Introduction > ======== > > The SMMUv3 supports two stages of translation. Each stage of > translation can be > independently enabled. An incoming address is logically translated > from VA to > IPA in stage 1, then the IPA is input to stage 2 which translates the IPA to > the output PA. Stage 1 translation support is required to provide > isolation between different > devices within OS. XEN already supports Stage 2 translation but there is no > support for Stage 1 translation. > This design proposal outlines the introduction of Stage-1 SMMUv3 > support in Xen for ARM guests. > > Motivation > ========== > > ARM systems utilizing SMMUv3 require stage-1 address translation to > ensure secure DMA and guest managed I/O memory mappings. It is unclear for my what you mean by "guest manged IO memory mappings", could you please provide an example? > This feature enables: > > - Stage-1 translation in guest domain > - Safe device passthrough under secure memory translation > As I see it, ARM specs use "secure" mostly when referring to Secure mode (S-EL1, S-EL2, EL3) and associated secure counterparts of architectural devices, like secure GIC, secure Timer, etc. So I'd probably don't use this word here to reduce confusion > Design Overview > =============== > > These changes provide emulated SMMUv3 support: > > - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation > support in SMMUv3 driver. "Nested translation" as in "nested virtualization"? Or is this something else? > - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1 > handling. I think, this is the big topic. 
You see, apart from SMMU, there is at least Renesas IP-MMU, which uses completely different API. And probably there are other IO-MMU implementations possible. Right now vIOMMU framework handles only SMMU, which is okay, but probably we should design it in a such way, that other IO-MMUs will be supported as well. Maybe even IO-MMUs for other architectures (RISC V maybe?). > - **Register/Command Emulation**: SMMUv3 register emulation and > command queue handling. Continuing previous paragraph: what about other IO-MMUs? For example, if platform provides only Renesas IO-MMU, will vIOMMU framework still emulate SMMUv3 registers and queue handling? > - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes > to device trees for dom0 and dom0less scenarios. > - **Runtime Configuration**: Introduces a `viommu` boot parameter for > dynamic enablement. > > vIOMMU is exposed to guest as a single device with predefined > capabilities and commands supported. Single vIOMMU model abstracts the > details of an actual IOMMU hardware, simplifying usage from the guest > point of view. Guest OS handles only a single IOMMU, even if multiple > IOMMU units are available on the host system. In the previous email I asked how are you planning to handle potential SID overlaps, especially in PCI use case. I want to return to this topic. I am not saying that this is impossible, but I'd like to see this covered in the design document. > > Security Considerations > ======================= > > **viommu security benefits:** > > - Stage-1 translation ensures guest devices cannot perform unauthorized DMA. > - Emulated IOMMU removes guest dependency on IOMMU hardware while > maintaining domains isolation. I am not sure that I got this paragraph. > > > 1. Observation: > --------------- > Support for Stage-1 translation in SMMUv3 introduces new data > structures (`s1_cfg` alongside `s2_cfg`) and logic to write both > Stage-1 and Stage-2 entries in the Stream Table Entry (STE), including > an `abort` field to handle partial configuration states. > > **Risk:** > Without proper handling, a partially applied Stage-1 configuration > might leave guest DMA mappings in an inconsistent state, potentially > enabling unauthorized access or causing cross-domain interference. > > **Mitigation:** *(Handled by design)* > This feature introduces logic that writes both `s1_cfg` and `s2_cfg` > to STE and manages the `abort` field-only considering Stage-1 > configuration if fully attached. This ensures incomplete or invalid > guest configurations are safely ignored by the hypervisor. > > 2. Observation: > --------------- > Guests can now invalidate Stage-1 caches; invalidation needs > forwarding to SMMUv3 hardware to maintain coherence. > > **Risk:** > Failing to propagate cache invalidation could allow stale mappings, > enabling access to old mappings and possibly data leakage or > misrouting. > > **Mitigation:** *(Handled by design)* > This feature ensures that guest-initiated invalidations are correctly > forwarded to the hardware, preserving IOMMU coherency. > > 3. Observation: > --------------- > This design introduces substantial new functionality, including the > `vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command > queues, event queues, domain management, and Device Tree modifications > (e.g., `iommus` nodes and `libxl` integration). > > **Risk:** > Large feature expansions increase the attack surface—potential for > race conditions, unchecked command inputs, or Device Tree-based > misconfigurations. 
> > **Mitigation:** > > - Sanity checks and error-handling improvements have been introduced > in this feature. > - Further audits have to be performed for this feature and its > dependencies in this area. Currently, feature is marked as *Tech > Preview* and is self-contained, reducing the risk to unrelated > components. > > 4. Observation: > --------------- > The code includes transformations to handle nested translation versus > standard modes and uses guest-configured command queues (e.g., > `CMD_CFGI_STE`) and event notifications. > > **Risk:** > Malicious or malformed queue commands from guests could bypass > validation, manipulate SMMUv3 state, or cause Dom0 instability. Only Dom0? > > **Mitigation:** *(Handled by design)* > Built-in validation of command queue entries and sanitization > mechanisms ensure only permitted configurations are applied. This is > supported via additions in `vsmmuv3` and `cmdqueue` handling code. > > 5. Observation: > --------------- > Device Tree modifications enable device assignment and > configuration—guest DT fragments (e.g., `iommus`) are added via > `libxl`. > > **Risk:** > Erroneous or malicious Device Tree injection could result in device > misbinding or guest access to unauthorized hardware. > > **Mitigation:** > > - `libxl` perform checks of guest configuration and parse only > predefined dt fragments and nodes, reducing risc. > - The system integrator must ensure correct resource mapping in the > guest Device Tree (DT) fragments. > > 6. Observation: > --------------- > Introducing optional per-guest enabled features (`viommu` argument in > xl guest config) means some guests may opt-out. > > **Risk:** > Differences between guests with and without `viommu` may cause > unexpected behavior or privilege drift. > > **Mitigation:** > Verify that downgrade paths are safe and well-isolated; ensure missing > support doesn't cause security issues. Additional audits on emulation > paths and domains interference need to be performed in a multi-guest > environment. > > 7. Observation: > --------------- > Guests have the ability to issue Stage-1 IOMMU commands like cache > invalidation, stream table entries configuration, etc. An adversarial > guest may issue a high volume of commands in rapid succession. > > **Risk** > Excessive commands requests can cause high hypervisor CPU consumption > and disrupt scheduling, leading to degraded system responsiveness and > potential denial-of-service scenarios. > > **Mitigation** > > - Xen credit scheduler limits guest vCPU execution time, securing > basic guest rate-limiting. I don't thing that this feature available only in credit schedulers, AFAIK, all schedulers except null scheduler will limit vCPU execution time. > - Batch multiple commands of same type to reduce overhead on the > virtual SMMUv3 hardware emulation. > - Implement vIOMMU commands execution restart and continuation support So, something like "hypercall continuation"? > > 8. Observation: > --------------- > Some guest commands issued towards vIOMMU are propagated to pIOMMU > command queue (e.g. TLB invalidate). For each pIOMMU, only one command > queue is > available for all domains. > > **Risk** > Excessive commands requests from abusive guest can cause flooding of > physical IOMMU command queue, leading to degraded pIOMMU responsivness > on commands issued from other guests. > > **Mitigation** > > - Xen credit scheduler limits guest vCPU execution time, securing > basic guest rate-limiting. 
> - Batch commands which should be propagated towards pIOMMU cmd queue > and enable support for batch execution pause/continuation > - If possible, implement domain penalization by adding a per-domain > cost counter for vIOMMU/pIOMMU usage. > > 9. Observation: > --------------- > vIOMMU feature includes event queue used for forwarding IOMMU events > to guest (e.g. translation faults, invalid stream IDs, permission > errors). A malicious guest can misconfigure its SMMU state or > intentionally trigger faults with high frequency. > > **Risk** > Occurance of IOMMU events with high frequency can cause Xen to flood > the event queue and disrupt scheduling with high hypervisor CPU load > for events handling. > > **Mitigation** > > - Implement fail-safe state by disabling events forwarding when faults > are occured with high frequency and not processed by guest. > - Batch multiple events of same type to reduce overhead on the virtual > SMMUv3 hardware emulation. > - Consider disabling event queue for untrusted guests > > Performance Impact > ================== > > With iommu stage-1 and nested translation inclusion, performance > overhead is introduced comparing to existing, stage-2 only usage in > Xen. > Once mappings are established, translations should not introduce > significant overhead. > Emulated paths may introduce moderate overhead, primarily affecting > device initialization and event handling. > Performance impact highly depends on target CPU capabilities. Testing > is performed on cortex-a53 based platform. Which platform exactly? While QEMU emulates SMMU to some extent, we are observing somewhat different SMMU behavior on real HW platforms (mostly due to cache coherence problems). Also, according to MMU-600 errata, it can have lower than expected performance in some use-cases. > Performance is mostly impacted by emulated vIOMMU operations, results > shown in the following table. > > +-------------------------------+---------------------------------+ > | vIOMMU Operation | Execution time in guest | > +===============================+=================================+ > | Reg read | median: 30μs, worst-case: 250μs | > +-------------------------------+---------------------------------+ > | Reg write | median: 35μs, worst-case: 280μs | > +-------------------------------+---------------------------------+ > | Invalidate TLB | median: 90μs, worst-case: 1ms+ | > +-------------------------------+---------------------------------+ > | Invalidate STE | median: 450μs worst_case: 7ms+ | > +-------------------------------+---------------------------------+ > > With vIOMMU exposed to guest, guest OS has to initialize IOMMU device > and configure stage-1 mappings for devices attached to it. > Following table shows initialization stages which impact stage-1 > enabled guest boot time and compares it with stage-1 disabled guest. > > "NOTE: Device probe execution time varies significantly depending on > device complexity. virtio-gpu was selected as a test case due to its > extensive use of dynamic DMA allocations and IOMMU mappings, making it > a suitable candidate for benchmarking stage-1 vIOMMU behavior." 
> > +---------------------+-----------------------+------------------------+ > | Stage | Stage-1 Enabled Guest | Stage-1 Disabled Guest | > +=====================+=======================+========================+ > | IOMMU Init | ~25ms | / | > +---------------------+-----------------------+------------------------+ > | Dev Attach / Mapping| ~220ms | ~200ms | > +---------------------+-----------------------+------------------------+ > > For devices configured with dynamic DMA mappings, DMA > allocate/map/unmap operations performance is also impacted on stage-1 > enabled guests. > Dynamic DMA mapping operation issues emulated IOMMU functions like > mmio write/read and TLB invalidations. > As a reference, following table shows performance results for runtime > dma operations for virtio-gpu device. > > +---------------+-------------------------+----------------------------+ > | DMA Op | Stage-1 Enabled Guest | Stage-1 Disabled Guest | > +===============+=========================+============================+ > | dma_alloc | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs| > +---------------+-------------------------+----------------------------+ > | dma_free | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs | > +---------------+-------------------------+----------------------------+ > | dma_map | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs| > +---------------+-------------------------+----------------------------+ > | dma_unmap | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs | > +---------------+-------------------------+----------------------------+ > > Testing > ============ > > - QEMU-based ARM system tests for Stage-1 translation and nested > virtualization. > - Actual hardware validation on platforms such as Renesas to ensure > compatibility with real SMMUv3 implementations. > - Unit/Functional tests validating correct translations (not implemented). > > Migration and Compatibility > =========================== > > This optional feature defaults to disabled (`viommu=""`) for backward > compatibility. > -- WBR, Volodymyr
Hi Volodymyr, On 8/29/25 18:27, Volodymyr Babchuk wrote: > Hi Milan, > > Thanks, "Security Considerations" sections looks really good. But I have > more questions. > > Milan Djokic <milan_djokic@epam.com> writes: > >> Hello Julien, Volodymyr >> >> On 8/27/25 01:28, Volodymyr Babchuk wrote: >>> Hi Milan, >>> Milan Djokic <milan_djokic@epam.com> writes: >>> >>>> Hello Julien, >>>> >>>> On 8/13/25 14:11, Julien Grall wrote: >>>>> On 13/08/2025 11:04, Milan Djokic wrote: >>>>>> Hello Julien, >>>>> Hi Milan, >>>>> >>>>>> >>>>>> We have prepared a design document and it will be part of the updated >>>>>> patch series (added in docs/design). I'll also extend cover letter with >>>>>> details on implementation structure to make review easier. >>>>> I would suggest to just iterate on the design document for now. >>>>> >>>>>> Following is the design document content which will be provided in >>>>>> updated patch series: >>>>>> >>>>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests >>>>>> ========================================================== >>>>>> >>>>>> Author: Milan Djokic <milan_djokic@epam.com> >>>>>> Date: 2025-08-07 >>>>>> Status: Draft >>>>>> >>>>>> Introduction >>>>>> ------------ >>>>>> >>>>>> The SMMUv3 supports two stages of translation. Each stage of translation >>>>>> can be independently enabled. An incoming address is logically >>>>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2 >>>>>> which translates the IPA to the output PA. Stage 1 translation support >>>>>> is required to provide isolation between different devices within the OS. >>>>>> >>>>>> Xen already supports Stage 2 translation but there is no support for >>>>>> Stage 1 translation. This design proposal outlines the introduction of >>>>>> Stage-1 SMMUv3 support in Xen for ARM guests. >>>>>> >>>>>> Motivation >>>>>> ---------- >>>>>> >>>>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to >>>>>> ensure correct and secure DMA behavior inside guests. >>>>> Can you clarify what you mean by "correct"? DMA would still work >>>>> without >>>>> stage-1. >>>> >>>> Correct in terms of working with guest managed I/O space. I'll >>>> rephrase this statement, it seems ambiguous. >>>> >>>>>> >>>>>> This feature enables: >>>>>> - Stage-1 translation in guest domain >>>>>> - Safe device passthrough under secure memory translation >>>>>> >>>>>> Design Overview >>>>>> --------------- >>>>>> >>>>>> These changes provide emulated SMMUv3 support: >>>>>> >>>>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in >>>>>> SMMUv3 driver >>>>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling >>>>> So what are you planning to expose to a guest? Is it one vIOMMU per >>>>> pIOMMU? Or a single one? >>>> >>>> Single vIOMMU model is used in this design. >>>> >>>>> Have you considered the pros/cons for both? >>>>>> - Register/Command Emulation: SMMUv3 register emulation and command >>>>>> queue handling >>>>> >>>> >>>> That's a point for consideration. >>>> single vIOMMU prevails in terms of less complex implementation and a >>>> simple guest iommmu model - single vIOMMU node, one interrupt path, >>>> event queue, single set of trap handlers for emulation, etc. >>>> Cons for a single vIOMMU model could be less accurate hw >>>> representation and a potential bottleneck with one emulated queue and >>>> interrupt path. 
>>>> On the other hand, vIOMMU per pIOMMU provides more accurate hw >>>> modeling and offers better scalability in case of many IOMMUs in the >>>> system, but this comes with more complex emulation logic and device >>>> tree, also handling multiple vIOMMUs on guest side. >>>> IMO, single vIOMMU model seems like a better option mostly because >>>> it's less complex, easier to maintain and debug. Of course, this >>>> decision can and should be discussed. >>>> >>> Well, I am not sure that this is possible, because of StreamID >>> allocation. The biggest offender is of course PCI, as each Root PCI >>> bridge will require own SMMU instance with own StreamID space. But even >>> without PCI you'll need some mechanism to map vStremID to >>> <pSMMU, pStreamID>, because there will be overlaps in SID space. >>> Actually, PCI/vPCI with vSMMU is its own can of worms... >>> >>>>> For each pSMMU, we have a single command queue that will receive command >>>>> from all the guests. How do you plan to prevent a guest hogging the >>>>> command queue? >>>>> In addition to that, AFAIU, the size of the virtual command queue is >>>>> fixed by the guest rather than Xen. If a guest is filling up the queue >>>>> with commands before notifying Xen, how do you plan to ensure we don't >>>>> spend too much time in Xen (which is not preemptible)? >>>>> >>>> >>>> We'll have to do a detailed analysis on these scenarios, they are not >>>> covered by the design (as well as some others which is clear after >>>> your comments). I'll come back with an updated design. >>> I think that can be handled akin to hypercall continuation, which is >>> used in similar places, like P2M code >>> [...] >>> >> >> I have updated vIOMMU design document with additional security topics >> covered and performance impact results. Also added some additional >> explanations for vIOMMU components following your comments. >> Updated document content: >> >> =============================================== >> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests >> =============================================== >> >> :Author: Milan Djokic <milan_djokic@epam.com> >> :Date: 2025-08-07 >> :Status: Draft >> >> Introduction >> ======== >> >> The SMMUv3 supports two stages of translation. Each stage of >> translation can be >> independently enabled. An incoming address is logically translated >> from VA to >> IPA in stage 1, then the IPA is input to stage 2 which translates the IPA to >> the output PA. Stage 1 translation support is required to provide >> isolation between different >> devices within OS. XEN already supports Stage 2 translation but there is no >> support for Stage 1 translation. >> This design proposal outlines the introduction of Stage-1 SMMUv3 >> support in Xen for ARM guests. >> >> Motivation >> ========== >> >> ARM systems utilizing SMMUv3 require stage-1 address translation to >> ensure secure DMA and guest managed I/O memory mappings. > > It is unclear for my what you mean by "guest manged IO memory mappings", > could you please provide an example? > Basically enabling stage-1 translation means that the guest is responsible for managing IOVA to IPA mappings through its own IOMMU driver. Guest manages its own stage-1 page tables and TLB. For example, when a guest driver wants to perform DMA mapping (e.g. with dma_map_single()), it will request mapping of its buffer physical address to IOVA through guest IOMMU driver. Guest IOMMU driver will further issue mapping commands emulated by Xen which translate it into stage-2 mappings. 
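As an illustration (not part of the patch series), here is a minimal sketch of
that flow from the guest driver's point of view, using the standard Linux DMA
API; the function name and device handling are made up for the example:

    /*
     * Illustrative sketch only: a guest driver using the standard Linux DMA
     * API. With a stage-1 capable vIOMMU exposed to the guest, dma_map_single()
     * ends up in the guest IOMMU driver, which creates the IOVA->IPA mapping in
     * its own stage-1 page tables and issues the related SMMUv3 commands
     * (e.g. TLB invalidations) that Xen traps and emulates.
     */
    #include <linux/dma-mapping.h>
    #include <linux/device.h>
    #include <linux/errno.h>

    static int example_do_dma(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t iova;

        /* Guest IOMMU driver allocates an IOVA and maps it to the buffer. */
        iova = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, iova))
            return -ENOMEM;

        /* ... program the device with 'iova' and run the transfer ... */

        /* Unmap triggers a stage-1 TLB invalidation, emulated by Xen. */
        dma_unmap_single(dev, iova, len, DMA_TO_DEVICE);
        return 0;
    }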
>> This feature enables: >> >> - Stage-1 translation in guest domain >> - Safe device passthrough under secure memory translation >> > > As I see it, ARM specs use "secure" mostly when referring to Secure mode > (S-EL1, S-EL2, EL3) and associated secure counterparts of architectural > devices, like secure GIC, secure Timer, etc. So I'd probably don't use > this word here to reduce confusion > Sure, secure in terms of isolation is the topic here. I'll rephrase this >> Design Overview >> =============== >> >> These changes provide emulated SMMUv3 support: >> >> - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation >> support in SMMUv3 driver. > > "Nested translation" as in "nested virtualization"? Or is this something else? > No, this refers to 2-stage translation IOVA->IPA->PA as a nested translation. Although with this feature, nested virtualization is also enabled since guest can emulate its own IOMMU e.g. when kvm is run in guest. >> - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1 >> handling. > > I think, this is the big topic. You see, apart from SMMU, there is > at least Renesas IP-MMU, which uses completely different API. And > probably there are other IO-MMU implementations possible. Right now > vIOMMU framework handles only SMMU, which is okay, but probably we > should design it in a such way, that other IO-MMUs will be supported as > well. Maybe even IO-MMUs for other architectures (RISC V maybe?). > I think that it is already designed in such manner. We have a generic vIOMMU framework and a backend implementation for target IOMMU as separate components. And the backend implements supported commands/mechanisms which are specific for target IOMMU type. At this point, only SMMUv3 is supported, but it is possible to implement other IOMMU types support under the same generic framework. AFAIK, RISC-V IOMMU stage-2 is still in early development stage, but I do believe that it will be also compatible with vIOMMU framework. >> - **Register/Command Emulation**: SMMUv3 register emulation and >> command queue handling. > > Continuing previous paragraph: what about other IO-MMUs? For example, if > platform provides only Renesas IO-MMU, will vIOMMU framework still > emulate SMMUv3 registers and queue handling? > Yes, this is not supported in current implementation. To support other IOMMU than SMMUv3, stage-1 emulation backend needs to be implemented for target IOMMU and probably Xen driver for target IOMMU has to be updated to handle stage-1 configuration. I will elaborate this part in the design, to make clear that we have a generic vIOMMU framework, but only SMMUv3 backend exists atm. >> - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes >> to device trees for dom0 and dom0less scenarios. >> - **Runtime Configuration**: Introduces a `viommu` boot parameter for >> dynamic enablement. >> >> vIOMMU is exposed to guest as a single device with predefined >> capabilities and commands supported. Single vIOMMU model abstracts the >> details of an actual IOMMU hardware, simplifying usage from the guest >> point of view. Guest OS handles only a single IOMMU, even if multiple >> IOMMU units are available on the host system. > > In the previous email I asked how are you planning to handle potential > SID overlaps, especially in PCI use case. I want to return to this > topic. I am not saying that this is impossible, but I'd like to see this > covered in the design document. > Sorry, I've missed this part in the previous mail. 
This is a valid point, SID overlapping would be an issue for a single vIOMMU model. To prevent it, design will have to be extended with SID namespace virtualization, introducing a remapping layer which will make sure that guest virtual SIDs are unique and maintain proper mappings of vSIDs to pSIDs. For PCI case, we need to have an extended remapping logic where iommu-map property will be also patched in the guest device tree since we need a range of unique vSIDs for every RC assigned to guest. Alternative approach would be to switch to vIOMMU per pIOMMU model. Since both approaches require major updates, I'll have to do a detailed analysis and come back with an updated design which would address this issue. >> >> Security Considerations >> ======================= >> >> **viommu security benefits:** >> >> - Stage-1 translation ensures guest devices cannot perform unauthorized DMA. >> - Emulated IOMMU removes guest dependency on IOMMU hardware while >> maintaining domains isolation. > > I am not sure that I got this paragraph. > First one refers to guest controlled DMA access. Only IOVA->IPA mappings created by guest are usable by the device when stage-1 is enabled. On the other hand, with stage-2 only enabled, device could access to complete IOVA->PA mapping created by Xen for guest. Since the guest has no control over device IOVA accesses, a malicious guest kernel could potentially access memory regions it shouldn't be allowed to, e.g. if stage-2 mappings are stale. With stage-1 enabled, guest device driver has to explicitly map IOVAs and this request is propagated through emulated IOMMU, making sure that IOVA mappings are valid all the time. Second claim means that with emulated IOMMU, guests don’t need direct access to physical IOMMU hardware. The hypervisor emulates IOMMU behavior for the guest, while still ensuring that memory access by devices remains properly isolated between guests, just like it would with real IOMMU hardware. >> >> >> 1. Observation: >> --------------- >> Support for Stage-1 translation in SMMUv3 introduces new data >> structures (`s1_cfg` alongside `s2_cfg`) and logic to write both >> Stage-1 and Stage-2 entries in the Stream Table Entry (STE), including >> an `abort` field to handle partial configuration states. >> >> **Risk:** >> Without proper handling, a partially applied Stage-1 configuration >> might leave guest DMA mappings in an inconsistent state, potentially >> enabling unauthorized access or causing cross-domain interference. >> >> **Mitigation:** *(Handled by design)* >> This feature introduces logic that writes both `s1_cfg` and `s2_cfg` >> to STE and manages the `abort` field-only considering Stage-1 >> configuration if fully attached. This ensures incomplete or invalid >> guest configurations are safely ignored by the hypervisor. >> >> 2. Observation: >> --------------- >> Guests can now invalidate Stage-1 caches; invalidation needs >> forwarding to SMMUv3 hardware to maintain coherence. >> >> **Risk:** >> Failing to propagate cache invalidation could allow stale mappings, >> enabling access to old mappings and possibly data leakage or >> misrouting. >> >> **Mitigation:** *(Handled by design)* >> This feature ensures that guest-initiated invalidations are correctly >> forwarded to the hardware, preserving IOMMU coherency. >> >> 3. 
Observation: >> --------------- >> This design introduces substantial new functionality, including the >> `vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command >> queues, event queues, domain management, and Device Tree modifications >> (e.g., `iommus` nodes and `libxl` integration). >> >> **Risk:** >> Large feature expansions increase the attack surface—potential for >> race conditions, unchecked command inputs, or Device Tree-based >> misconfigurations. >> >> **Mitigation:** >> >> - Sanity checks and error-handling improvements have been introduced >> in this feature. >> - Further audits have to be performed for this feature and its >> dependencies in this area. Currently, feature is marked as *Tech >> Preview* and is self-contained, reducing the risk to unrelated >> components. >> >> 4. Observation: >> --------------- >> The code includes transformations to handle nested translation versus >> standard modes and uses guest-configured command queues (e.g., >> `CMD_CFGI_STE`) and event notifications. >> >> **Risk:** >> Malicious or malformed queue commands from guests could bypass >> validation, manipulate SMMUv3 state, or cause Dom0 instability. > > Only Dom0? > This is a mistake, the whole system could be affected. I'll fix this. >> >> **Mitigation:** *(Handled by design)* >> Built-in validation of command queue entries and sanitization >> mechanisms ensure only permitted configurations are applied. This is >> supported via additions in `vsmmuv3` and `cmdqueue` handling code. >> >> 5. Observation: >> --------------- >> Device Tree modifications enable device assignment and >> configuration—guest DT fragments (e.g., `iommus`) are added via >> `libxl`. >> >> **Risk:** >> Erroneous or malicious Device Tree injection could result in device >> misbinding or guest access to unauthorized hardware. >> >> **Mitigation:** >> >> - `libxl` perform checks of guest configuration and parse only >> predefined dt fragments and nodes, reducing risc. >> - The system integrator must ensure correct resource mapping in the >> guest Device Tree (DT) fragments. >> >> 6. Observation: >> --------------- >> Introducing optional per-guest enabled features (`viommu` argument in >> xl guest config) means some guests may opt-out. >> >> **Risk:** >> Differences between guests with and without `viommu` may cause >> unexpected behavior or privilege drift. >> >> **Mitigation:** >> Verify that downgrade paths are safe and well-isolated; ensure missing >> support doesn't cause security issues. Additional audits on emulation >> paths and domains interference need to be performed in a multi-guest >> environment. >> >> 7. Observation: >> --------------- >> Guests have the ability to issue Stage-1 IOMMU commands like cache >> invalidation, stream table entries configuration, etc. An adversarial >> guest may issue a high volume of commands in rapid succession. >> >> **Risk** >> Excessive commands requests can cause high hypervisor CPU consumption >> and disrupt scheduling, leading to degraded system responsiveness and >> potential denial-of-service scenarios. >> >> **Mitigation** >> >> - Xen credit scheduler limits guest vCPU execution time, securing >> basic guest rate-limiting. > > I don't thing that this feature available only in credit schedulers, > AFAIK, all schedulers except null scheduler will limit vCPU execution time. > I was not aware of that. I'll rephrase this part. >> - Batch multiple commands of same type to reduce overhead on the >> virtual SMMUv3 hardware emulation. 
>> - Implement vIOMMU commands execution restart and continuation support > > So, something like "hypercall continuation"? > Yes >> >> 8. Observation: >> --------------- >> Some guest commands issued towards vIOMMU are propagated to pIOMMU >> command queue (e.g. TLB invalidate). For each pIOMMU, only one command >> queue is >> available for all domains. >> >> **Risk** >> Excessive commands requests from abusive guest can cause flooding of >> physical IOMMU command queue, leading to degraded pIOMMU responsivness >> on commands issued from other guests. >> >> **Mitigation** >> >> - Xen credit scheduler limits guest vCPU execution time, securing >> basic guest rate-limiting. >> - Batch commands which should be propagated towards pIOMMU cmd queue >> and enable support for batch execution pause/continuation >> - If possible, implement domain penalization by adding a per-domain >> cost counter for vIOMMU/pIOMMU usage. >> >> 9. Observation: >> --------------- >> vIOMMU feature includes event queue used for forwarding IOMMU events >> to guest (e.g. translation faults, invalid stream IDs, permission >> errors). A malicious guest can misconfigure its SMMU state or >> intentionally trigger faults with high frequency. >> >> **Risk** >> Occurance of IOMMU events with high frequency can cause Xen to flood >> the event queue and disrupt scheduling with high hypervisor CPU load >> for events handling. >> >> **Mitigation** >> >> - Implement fail-safe state by disabling events forwarding when faults >> are occured with high frequency and not processed by guest. >> - Batch multiple events of same type to reduce overhead on the virtual >> SMMUv3 hardware emulation. >> - Consider disabling event queue for untrusted guests >> >> Performance Impact >> ================== >> >> With iommu stage-1 and nested translation inclusion, performance >> overhead is introduced comparing to existing, stage-2 only usage in >> Xen. >> Once mappings are established, translations should not introduce >> significant overhead. >> Emulated paths may introduce moderate overhead, primarily affecting >> device initialization and event handling. >> Performance impact highly depends on target CPU capabilities. Testing >> is performed on cortex-a53 based platform. > > Which platform exactly? While QEMU emulates SMMU to some extent, we are > observing somewhat different SMMU behavior on real HW platforms (mostly > due to cache coherence problems). Also, according to MMU-600 errata, it > can have lower than expected performance in some use-cases. > Performance measurement are done on QEMU emulated Renesas platform. I'll add some details for this. >> Performance is mostly impacted by emulated vIOMMU operations, results >> shown in the following table. 
>> >> +-------------------------------+---------------------------------+ >> | vIOMMU Operation | Execution time in guest | >> +===============================+=================================+ >> | Reg read | median: 30μs, worst-case: 250μs | >> +-------------------------------+---------------------------------+ >> | Reg write | median: 35μs, worst-case: 280μs | >> +-------------------------------+---------------------------------+ >> | Invalidate TLB | median: 90μs, worst-case: 1ms+ | >> +-------------------------------+---------------------------------+ >> | Invalidate STE | median: 450μs worst_case: 7ms+ | >> +-------------------------------+---------------------------------+ >> >> With vIOMMU exposed to guest, guest OS has to initialize IOMMU device >> and configure stage-1 mappings for devices attached to it. >> Following table shows initialization stages which impact stage-1 >> enabled guest boot time and compares it with stage-1 disabled guest. >> >> "NOTE: Device probe execution time varies significantly depending on >> device complexity. virtio-gpu was selected as a test case due to its >> extensive use of dynamic DMA allocations and IOMMU mappings, making it >> a suitable candidate for benchmarking stage-1 vIOMMU behavior." >> >> +---------------------+-----------------------+------------------------+ >> | Stage | Stage-1 Enabled Guest | Stage-1 Disabled Guest | >> +=====================+=======================+========================+ >> | IOMMU Init | ~25ms | / | >> +---------------------+-----------------------+------------------------+ >> | Dev Attach / Mapping| ~220ms | ~200ms | >> +---------------------+-----------------------+------------------------+ >> >> For devices configured with dynamic DMA mappings, DMA >> allocate/map/unmap operations performance is also impacted on stage-1 >> enabled guests. >> Dynamic DMA mapping operation issues emulated IOMMU functions like >> mmio write/read and TLB invalidations. >> As a reference, following table shows performance results for runtime >> dma operations for virtio-gpu device. >> >> +---------------+-------------------------+----------------------------+ >> | DMA Op | Stage-1 Enabled Guest | Stage-1 Disabled Guest | >> +===============+=========================+============================+ >> | dma_alloc | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs| >> +---------------+-------------------------+----------------------------+ >> | dma_free | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs | >> +---------------+-------------------------+----------------------------+ >> | dma_map | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs| >> +---------------+-------------------------+----------------------------+ >> | dma_unmap | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs | >> +---------------+-------------------------+----------------------------+ >> >> Testing >> ============ >> >> - QEMU-based ARM system tests for Stage-1 translation and nested >> virtualization. >> - Actual hardware validation on platforms such as Renesas to ensure >> compatibility with real SMMUv3 implementations. >> - Unit/Functional tests validating correct translations (not implemented). >> >> Migration and Compatibility >> =========================== >> >> This optional feature defaults to disabled (`viommu=""`) for backward >> compatibility. >> > BR, Milan
On 9/1/25 13:06, Milan Djokic wrote: > Hi Volodymyr, > > On 8/29/25 18:27, Volodymyr Babchuk wrote: >> Hi Milan, >> >> Thanks, "Security Considerations" sections looks really good. But I have >> more questions. >> >> Milan Djokic <milan_djokic@epam.com> writes: >> >>> Hello Julien, Volodymyr >>> >>> On 8/27/25 01:28, Volodymyr Babchuk wrote: >>>> Hi Milan, >>>> Milan Djokic <milan_djokic@epam.com> writes: >>>> >>>>> Hello Julien, >>>>> >>>>> On 8/13/25 14:11, Julien Grall wrote: >>>>>> On 13/08/2025 11:04, Milan Djokic wrote: >>>>>>> Hello Julien, >>>>>> Hi Milan, >>>>>> >>>>>>> >>>>>>> We have prepared a design document and it will be part of the updated >>>>>>> patch series (added in docs/design). I'll also extend cover letter with >>>>>>> details on implementation structure to make review easier. >>>>>> I would suggest to just iterate on the design document for now. >>>>>> >>>>>>> Following is the design document content which will be provided in >>>>>>> updated patch series: >>>>>>> >>>>>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests >>>>>>> ========================================================== >>>>>>> >>>>>>> Author: Milan Djokic <milan_djokic@epam.com> >>>>>>> Date: 2025-08-07 >>>>>>> Status: Draft >>>>>>> >>>>>>> Introduction >>>>>>> ------------ >>>>>>> >>>>>>> The SMMUv3 supports two stages of translation. Each stage of translation >>>>>>> can be independently enabled. An incoming address is logically >>>>>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2 >>>>>>> which translates the IPA to the output PA. Stage 1 translation support >>>>>>> is required to provide isolation between different devices within the OS. >>>>>>> >>>>>>> Xen already supports Stage 2 translation but there is no support for >>>>>>> Stage 1 translation. This design proposal outlines the introduction of >>>>>>> Stage-1 SMMUv3 support in Xen for ARM guests. >>>>>>> >>>>>>> Motivation >>>>>>> ---------- >>>>>>> >>>>>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to >>>>>>> ensure correct and secure DMA behavior inside guests. >>>>>> Can you clarify what you mean by "correct"? DMA would still work >>>>>> without >>>>>> stage-1. >>>>> >>>>> Correct in terms of working with guest managed I/O space. I'll >>>>> rephrase this statement, it seems ambiguous. >>>>> >>>>>>> >>>>>>> This feature enables: >>>>>>> - Stage-1 translation in guest domain >>>>>>> - Safe device passthrough under secure memory translation >>>>>>> >>>>>>> Design Overview >>>>>>> --------------- >>>>>>> >>>>>>> These changes provide emulated SMMUv3 support: >>>>>>> >>>>>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in >>>>>>> SMMUv3 driver >>>>>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling >>>>>> So what are you planning to expose to a guest? Is it one vIOMMU per >>>>>> pIOMMU? Or a single one? >>>>> >>>>> Single vIOMMU model is used in this design. >>>>> >>>>>> Have you considered the pros/cons for both? >>>>>>> - Register/Command Emulation: SMMUv3 register emulation and command >>>>>>> queue handling >>>>>> >>>>> >>>>> That's a point for consideration. >>>>> single vIOMMU prevails in terms of less complex implementation and a >>>>> simple guest iommmu model - single vIOMMU node, one interrupt path, >>>>> event queue, single set of trap handlers for emulation, etc. >>>>> Cons for a single vIOMMU model could be less accurate hw >>>>> representation and a potential bottleneck with one emulated queue and >>>>> interrupt path. 
>>>>> On the other hand, vIOMMU per pIOMMU provides more accurate hw >>>>> modeling and offers better scalability in case of many IOMMUs in the >>>>> system, but this comes with more complex emulation logic and device >>>>> tree, also handling multiple vIOMMUs on guest side. >>>>> IMO, single vIOMMU model seems like a better option mostly because >>>>> it's less complex, easier to maintain and debug. Of course, this >>>>> decision can and should be discussed. >>>>> >>>> Well, I am not sure that this is possible, because of StreamID >>>> allocation. The biggest offender is of course PCI, as each Root PCI >>>> bridge will require own SMMU instance with own StreamID space. But even >>>> without PCI you'll need some mechanism to map vStremID to >>>> <pSMMU, pStreamID>, because there will be overlaps in SID space. >>>> Actually, PCI/vPCI with vSMMU is its own can of worms... >>>> >>>>>> For each pSMMU, we have a single command queue that will receive command >>>>>> from all the guests. How do you plan to prevent a guest hogging the >>>>>> command queue? >>>>>> In addition to that, AFAIU, the size of the virtual command queue is >>>>>> fixed by the guest rather than Xen. If a guest is filling up the queue >>>>>> with commands before notifying Xen, how do you plan to ensure we don't >>>>>> spend too much time in Xen (which is not preemptible)? >>>>>> >>>>> >>>>> We'll have to do a detailed analysis on these scenarios, they are not >>>>> covered by the design (as well as some others which is clear after >>>>> your comments). I'll come back with an updated design. >>>> I think that can be handled akin to hypercall continuation, which is >>>> used in similar places, like P2M code >>>> [...] >>>> >>> >>> I have updated vIOMMU design document with additional security topics >>> covered and performance impact results. Also added some additional >>> explanations for vIOMMU components following your comments. >>> Updated document content: >>> >>> =============================================== >>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests >>> =============================================== >>> >>> :Author: Milan Djokic <milan_djokic@epam.com> >>> :Date: 2025-08-07 >>> :Status: Draft >>> >>> Introduction >>> ======== >>> >>> The SMMUv3 supports two stages of translation. Each stage of >>> translation can be >>> independently enabled. An incoming address is logically translated >>> from VA to >>> IPA in stage 1, then the IPA is input to stage 2 which translates the IPA to >>> the output PA. Stage 1 translation support is required to provide >>> isolation between different >>> devices within OS. XEN already supports Stage 2 translation but there is no >>> support for Stage 1 translation. >>> This design proposal outlines the introduction of Stage-1 SMMUv3 >>> support in Xen for ARM guests. >>> >>> Motivation >>> ========== >>> >>> ARM systems utilizing SMMUv3 require stage-1 address translation to >>> ensure secure DMA and guest managed I/O memory mappings. >> >> It is unclear for my what you mean by "guest manged IO memory mappings", >> could you please provide an example? >> > > Basically enabling stage-1 translation means that the guest is > responsible for managing IOVA to IPA mappings through its own IOMMU > driver. Guest manages its own stage-1 page tables and TLB. > For example, when a guest driver wants to perform DMA mapping (e.g. with > dma_map_single()), it will request mapping of its buffer physical > address to IOVA through guest IOMMU driver. 
Guest IOMMU driver will > further issue mapping commands emulated by Xen which translate it into > stage-2 mappings. > >>> This feature enables: >>> >>> - Stage-1 translation in guest domain >>> - Safe device passthrough under secure memory translation >>> >> >> As I see it, ARM specs use "secure" mostly when referring to Secure mode >> (S-EL1, S-EL2, EL3) and associated secure counterparts of architectural >> devices, like secure GIC, secure Timer, etc. So I'd probably don't use >> this word here to reduce confusion >> > > Sure, secure in terms of isolation is the topic here. I'll rephrase this > >>> Design Overview >>> =============== >>> >>> These changes provide emulated SMMUv3 support: >>> >>> - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation >>> support in SMMUv3 driver. >> >> "Nested translation" as in "nested virtualization"? Or is this something else? >> > > No, this refers to 2-stage translation IOVA->IPA->PA as a nested > translation. Although with this feature, nested virtualization is also > enabled since guest can emulate its own IOMMU e.g. when kvm is run in guest. > > >>> - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1 >>> handling. >> >> I think, this is the big topic. You see, apart from SMMU, there is >> at least Renesas IP-MMU, which uses completely different API. And >> probably there are other IO-MMU implementations possible. Right now >> vIOMMU framework handles only SMMU, which is okay, but probably we >> should design it in a such way, that other IO-MMUs will be supported as >> well. Maybe even IO-MMUs for other architectures (RISC V maybe?). >> > > I think that it is already designed in such manner. We have a generic > vIOMMU framework and a backend implementation for target IOMMU as > separate components. And the backend implements supported > commands/mechanisms which are specific for target IOMMU type. At this > point, only SMMUv3 is supported, but it is possible to implement other > IOMMU types support under the same generic framework. AFAIK, RISC-V > IOMMU stage-2 is still in early development stage, but I do believe that > it will be also compatible with vIOMMU framework. > >>> - **Register/Command Emulation**: SMMUv3 register emulation and >>> command queue handling. >> >> Continuing previous paragraph: what about other IO-MMUs? For example, if >> platform provides only Renesas IO-MMU, will vIOMMU framework still >> emulate SMMUv3 registers and queue handling? >> > > Yes, this is not supported in current implementation. To support other > IOMMU than SMMUv3, stage-1 emulation backend needs to be implemented for > target IOMMU and probably Xen driver for target IOMMU has to be updated > to handle stage-1 configuration. I will elaborate this part in the > design, to make clear that we have a generic vIOMMU framework, but only > SMMUv3 backend exists atm. > >>> - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes >>> to device trees for dom0 and dom0less scenarios. >>> - **Runtime Configuration**: Introduces a `viommu` boot parameter for >>> dynamic enablement. >>> >>> vIOMMU is exposed to guest as a single device with predefined >>> capabilities and commands supported. Single vIOMMU model abstracts the >>> details of an actual IOMMU hardware, simplifying usage from the guest >>> point of view. Guest OS handles only a single IOMMU, even if multiple >>> IOMMU units are available on the host system. 
>> >> In the previous email I asked how are you planning to handle potential >> SID overlaps, especially in PCI use case. I want to return to this >> topic. I am not saying that this is impossible, but I'd like to see this >> covered in the design document. >> > > Sorry, I've missed this part in the previous mail. This is a valid point, > SID overlapping would be an issue for a single vIOMMU model. To prevent > it, design will have to be extended with SID namespace virtualization, > introducing a remapping layer which will make sure that guest virtual > SIDs are unique and maintain proper mappings of vSIDs to pSIDs. > For PCI case, we need to have an extended remapping logic where > iommu-map property will be also patched in the guest device tree since > we need a range of unique vSIDs for every RC assigned to guest. > Alternative approach would be to switch to vIOMMU per pIOMMU model. > Since both approaches require major updates, I'll have to do a detailed > analysis and come back with an updated design which would address this > issue. > > >>> >>> Security Considerations >>> ======================= >>> >>> **viommu security benefits:** >>> >>> - Stage-1 translation ensures guest devices cannot perform unauthorized DMA. >>> - Emulated IOMMU removes guest dependency on IOMMU hardware while >>> maintaining domains isolation. >> >> I am not sure that I got this paragraph. >> > > First one refers to guest controlled DMA access. Only IOVA->IPA mappings > created by guest are usable by the device when stage-1 is enabled. On > the other hand, with stage-2 only enabled, device could access to > complete IOVA->PA mapping created by Xen for guest. Since the guest has > no control over device IOVA accesses, a malicious guest kernel could > potentially access memory regions it shouldn't be allowed to, e.g. if > stage-2 mappings are stale. With stage-1 enabled, guest device driver > has to explicitly map IOVAs and this request is propagated through > emulated IOMMU, making sure that IOVA mappings are valid all the time. > > Second claim means that with emulated IOMMU, guests don’t need direct > access to physical IOMMU hardware. The hypervisor emulates IOMMU > behavior for the guest, while still ensuring that memory access by > devices remains properly isolated between guests, just like it would > with real IOMMU hardware. > >>> >>> >>> 1. Observation: >>> --------------- >>> Support for Stage-1 translation in SMMUv3 introduces new data >>> structures (`s1_cfg` alongside `s2_cfg`) and logic to write both >>> Stage-1 and Stage-2 entries in the Stream Table Entry (STE), including >>> an `abort` field to handle partial configuration states. >>> >>> **Risk:** >>> Without proper handling, a partially applied Stage-1 configuration >>> might leave guest DMA mappings in an inconsistent state, potentially >>> enabling unauthorized access or causing cross-domain interference. >>> >>> **Mitigation:** *(Handled by design)* >>> This feature introduces logic that writes both `s1_cfg` and `s2_cfg` >>> to STE and manages the `abort` field-only considering Stage-1 >>> configuration if fully attached. This ensures incomplete or invalid >>> guest configurations are safely ignored by the hypervisor. >>> >>> 2. Observation: >>> --------------- >>> Guests can now invalidate Stage-1 caches; invalidation needs >>> forwarding to SMMUv3 hardware to maintain coherence. 
>>> >>> **Risk:** >>> Failing to propagate cache invalidation could allow stale mappings, >>> enabling access to old mappings and possibly data leakage or >>> misrouting. >>> >>> **Mitigation:** *(Handled by design)* >>> This feature ensures that guest-initiated invalidations are correctly >>> forwarded to the hardware, preserving IOMMU coherency. >>> >>> 3. Observation: >>> --------------- >>> This design introduces substantial new functionality, including the >>> `vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command >>> queues, event queues, domain management, and Device Tree modifications >>> (e.g., `iommus` nodes and `libxl` integration). >>> >>> **Risk:** >>> Large feature expansions increase the attack surface—potential for >>> race conditions, unchecked command inputs, or Device Tree-based >>> misconfigurations. >>> >>> **Mitigation:** >>> >>> - Sanity checks and error-handling improvements have been introduced >>> in this feature. >>> - Further audits have to be performed for this feature and its >>> dependencies in this area. Currently, feature is marked as *Tech >>> Preview* and is self-contained, reducing the risk to unrelated >>> components. >>> >>> 4. Observation: >>> --------------- >>> The code includes transformations to handle nested translation versus >>> standard modes and uses guest-configured command queues (e.g., >>> `CMD_CFGI_STE`) and event notifications. >>> >>> **Risk:** >>> Malicious or malformed queue commands from guests could bypass >>> validation, manipulate SMMUv3 state, or cause Dom0 instability. >> >> Only Dom0? >> > > This is a mistake, the whole system could be affected. I'll fix this. > >>> >>> **Mitigation:** *(Handled by design)* >>> Built-in validation of command queue entries and sanitization >>> mechanisms ensure only permitted configurations are applied. This is >>> supported via additions in `vsmmuv3` and `cmdqueue` handling code. >>> >>> 5. Observation: >>> --------------- >>> Device Tree modifications enable device assignment and >>> configuration—guest DT fragments (e.g., `iommus`) are added via >>> `libxl`. >>> >>> **Risk:** >>> Erroneous or malicious Device Tree injection could result in device >>> misbinding or guest access to unauthorized hardware. >>> >>> **Mitigation:** >>> >>> - `libxl` perform checks of guest configuration and parse only >>> predefined dt fragments and nodes, reducing risc. >>> - The system integrator must ensure correct resource mapping in the >>> guest Device Tree (DT) fragments. >>> >>> 6. Observation: >>> --------------- >>> Introducing optional per-guest enabled features (`viommu` argument in >>> xl guest config) means some guests may opt-out. >>> >>> **Risk:** >>> Differences between guests with and without `viommu` may cause >>> unexpected behavior or privilege drift. >>> >>> **Mitigation:** >>> Verify that downgrade paths are safe and well-isolated; ensure missing >>> support doesn't cause security issues. Additional audits on emulation >>> paths and domains interference need to be performed in a multi-guest >>> environment. >>> >>> 7. Observation: >>> --------------- >>> Guests have the ability to issue Stage-1 IOMMU commands like cache >>> invalidation, stream table entries configuration, etc. An adversarial >>> guest may issue a high volume of commands in rapid succession. >>> >>> **Risk** >>> Excessive commands requests can cause high hypervisor CPU consumption >>> and disrupt scheduling, leading to degraded system responsiveness and >>> potential denial-of-service scenarios. 
>>> >>> **Mitigation** >>> >>> - Xen credit scheduler limits guest vCPU execution time, securing >>> basic guest rate-limiting. >> >> I don't thing that this feature available only in credit schedulers, >> AFAIK, all schedulers except null scheduler will limit vCPU execution time. >> > > I was not aware of that. I'll rephrase this part. > >>> - Batch multiple commands of same type to reduce overhead on the >>> virtual SMMUv3 hardware emulation. >>> - Implement vIOMMU commands execution restart and continuation support >> >> So, something like "hypercall continuation"? >> > > Yes > >>> >>> 8. Observation: >>> --------------- >>> Some guest commands issued towards vIOMMU are propagated to pIOMMU >>> command queue (e.g. TLB invalidate). For each pIOMMU, only one command >>> queue is >>> available for all domains. >>> >>> **Risk** >>> Excessive commands requests from abusive guest can cause flooding of >>> physical IOMMU command queue, leading to degraded pIOMMU responsivness >>> on commands issued from other guests. >>> >>> **Mitigation** >>> >>> - Xen credit scheduler limits guest vCPU execution time, securing >>> basic guest rate-limiting. >>> - Batch commands which should be propagated towards pIOMMU cmd queue >>> and enable support for batch execution pause/continuation >>> - If possible, implement domain penalization by adding a per-domain >>> cost counter for vIOMMU/pIOMMU usage. >>> >>> 9. Observation: >>> --------------- >>> vIOMMU feature includes event queue used for forwarding IOMMU events >>> to guest (e.g. translation faults, invalid stream IDs, permission >>> errors). A malicious guest can misconfigure its SMMU state or >>> intentionally trigger faults with high frequency. >>> >>> **Risk** >>> Occurance of IOMMU events with high frequency can cause Xen to flood >>> the event queue and disrupt scheduling with high hypervisor CPU load >>> for events handling. >>> >>> **Mitigation** >>> >>> - Implement fail-safe state by disabling events forwarding when faults >>> are occured with high frequency and not processed by guest. >>> - Batch multiple events of same type to reduce overhead on the virtual >>> SMMUv3 hardware emulation. >>> - Consider disabling event queue for untrusted guests >>> >>> Performance Impact >>> ================== >>> >>> With iommu stage-1 and nested translation inclusion, performance >>> overhead is introduced comparing to existing, stage-2 only usage in >>> Xen. >>> Once mappings are established, translations should not introduce >>> significant overhead. >>> Emulated paths may introduce moderate overhead, primarily affecting >>> device initialization and event handling. >>> Performance impact highly depends on target CPU capabilities. Testing >>> is performed on cortex-a53 based platform. >> >> Which platform exactly? While QEMU emulates SMMU to some extent, we are >> observing somewhat different SMMU behavior on real HW platforms (mostly >> due to cache coherence problems). Also, according to MMU-600 errata, it >> can have lower than expected performance in some use-cases. >> > > Performance measurement are done on QEMU emulated Renesas platform. I'll > add some details for this. > >>> Performance is mostly impacted by emulated vIOMMU operations, results >>> shown in the following table. 
>>> >>> +-------------------------------+---------------------------------+ >>> | vIOMMU Operation | Execution time in guest | >>> +===============================+=================================+ >>> | Reg read | median: 30μs, worst-case: 250μs | >>> +-------------------------------+---------------------------------+ >>> | Reg write | median: 35μs, worst-case: 280μs | >>> +-------------------------------+---------------------------------+ >>> | Invalidate TLB | median: 90μs, worst-case: 1ms+ | >>> +-------------------------------+---------------------------------+ >>> | Invalidate STE | median: 450μs worst_case: 7ms+ | >>> +-------------------------------+---------------------------------+ >>> >>> With vIOMMU exposed to guest, guest OS has to initialize IOMMU device >>> and configure stage-1 mappings for devices attached to it. >>> Following table shows initialization stages which impact stage-1 >>> enabled guest boot time and compares it with stage-1 disabled guest. >>> >>> "NOTE: Device probe execution time varies significantly depending on >>> device complexity. virtio-gpu was selected as a test case due to its >>> extensive use of dynamic DMA allocations and IOMMU mappings, making it >>> a suitable candidate for benchmarking stage-1 vIOMMU behavior." >>> >>> +---------------------+-----------------------+------------------------+ >>> | Stage | Stage-1 Enabled Guest | Stage-1 Disabled Guest | >>> +=====================+=======================+========================+ >>> | IOMMU Init | ~25ms | / | >>> +---------------------+-----------------------+------------------------+ >>> | Dev Attach / Mapping| ~220ms | ~200ms | >>> +---------------------+-----------------------+------------------------+ >>> >>> For devices configured with dynamic DMA mappings, DMA >>> allocate/map/unmap operations performance is also impacted on stage-1 >>> enabled guests. >>> Dynamic DMA mapping operation issues emulated IOMMU functions like >>> mmio write/read and TLB invalidations. >>> As a reference, following table shows performance results for runtime >>> dma operations for virtio-gpu device. >>> >>> +---------------+-------------------------+----------------------------+ >>> | DMA Op | Stage-1 Enabled Guest | Stage-1 Disabled Guest | >>> +===============+=========================+============================+ >>> | dma_alloc | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs| >>> +---------------+-------------------------+----------------------------+ >>> | dma_free | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs | >>> +---------------+-------------------------+----------------------------+ >>> | dma_map | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs| >>> +---------------+-------------------------+----------------------------+ >>> | dma_unmap | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs | >>> +---------------+-------------------------+----------------------------+ >>> >>> Testing >>> ============ >>> >>> - QEMU-based ARM system tests for Stage-1 translation and nested >>> virtualization. >>> - Actual hardware validation on platforms such as Renesas to ensure >>> compatibility with real SMMUv3 implementations. >>> - Unit/Functional tests validating correct translations (not implemented). >>> >>> Migration and Compatibility >>> =========================== >>> >>> This optional feature defaults to disabled (`viommu=""`) for backward >>> compatibility. >>> >> > > BR, > Milan > Hello Volodymyr, Julien Sorry for the delayed follow-up on this topic. 
We have changed the vIOMMU design from a 1-N to an N-N mapping between vIOMMU
and pIOMMU. Considering the single-vIOMMU model limitation pointed out by
Volodymyr (SID overlaps), the vIOMMU-per-pIOMMU model turned out to be the
only proper solution.
The updated design document follows.
I have added additional details to the design and performance impact sections,
and also indicated future improvements. The security considerations section is
unchanged apart from some minor details according to review comments.
Let me know what you think about the updated design. Once approved, I will
send the updated vIOMMU patch series.

==========================================================
Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================

:Author: Milan Djokic <milan_djokic@epam.com>
:Date: 2025-11-03
:Status: Draft

Introduction
============

The SMMUv3 supports two stages of translation. Each stage of translation can
be independently enabled. An incoming address is logically translated from VA
to IPA in stage 1, then the IPA is input to stage 2, which translates the IPA
to the output PA. Stage-1 translation support is required to provide isolation
between different devices within an OS. Xen already supports Stage-2
translation, but there is no support for Stage-1 translation.
This design proposal outlines the introduction of Stage-1 SMMUv3 support in
Xen for ARM guests.

Motivation
==========

ARM systems utilizing SMMUv3 require stage-1 address translation to ensure
secure DMA and guest-managed I/O memory mappings.
With stage-1 enabled, the guest manages IOVA to IPA mappings through its own
IOMMU driver.

This feature enables:

- Stage-1 translation in the guest domain
- Safe device passthrough with a per-device address translation table

Design Overview
===============

These changes provide emulated SMMUv3 support:

- **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support in
  the SMMUv3 driver.
- **vIOMMU Abstraction**: virtual IOMMU framework for guest stage-1 handling.
- **Register/Command Emulation**: SMMUv3 register emulation and command queue
  handling.
- **Device Tree Extensions**: adds `iommus` and virtual SMMUv3 nodes to device
  trees for dom0 and dom0less scenarios.
- **Runtime Configuration**: introduces a `viommu` boot parameter for dynamic
  enablement.

A separate vIOMMU device is exposed to the guest for every physical IOMMU in
the system.
The vIOMMU feature is split into two components: a generic vIOMMU framework
and a backend implementation for the target IOMMU.
The backend implementation contains the IOMMU-specific structures and command
handling (only SMMUv3 is currently supported).
This structure allows potential reuse of the stage-1 feature for other IOMMU
types.
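To make the framework/backend split more concrete, the following is a minimal
C sketch of how the generic core could dispatch guest accesses to an SMMUv3
backend. It is illustrative only: the structure and function names
(`viommu_ops`, `viommu_register_type`, `viommu_create_instance`) are
assumptions made for this document, not the interface of the actual patches::

    /* Illustrative sketch only -- not the interface of the patch series. */
    struct viommu_ops {
        int  (*domain_init)(struct domain *d);     /* allocate per-domain vIOMMU state */
        void (*domain_destroy)(struct domain *d);  /* tear it down with the domain */
        int  (*mmio_read)(struct vcpu *v, paddr_t addr, unsigned int len,
                          register_t *out);        /* emulate register reads */
        int  (*mmio_write)(struct vcpu *v, paddr_t addr, unsigned int len,
                           register_t val);        /* emulate register writes */
    };

    struct viommu_desc {
        const char *dt_compat;                     /* e.g. "arm,smmu-v3" for vSMMUv3 */
        const struct viommu_ops *ops;
    };

    /* The generic core keeps one descriptor per supported backend type... */
    int viommu_register_type(const struct viommu_desc *desc);

    /*
     * ...and instantiates one virtual IOMMU per physical IOMMU for a guest,
     * i.e. a system with N pIOMMUs exposes N vIOMMU devices (the N-N model).
     */
    int viommu_create_instance(struct domain *d, const struct viommu_desc *desc,
                               paddr_t base, unsigned int virq);

In the guest configuration, the feature would then be enabled per domain via
the `viommu` xl option (for example `viommu = "smmuv3"`; the exact accepted
values are defined by the series, with `viommu=""` meaning disabled).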
Security Considerations
=======================

**viommu security benefits:**

- Stage-1 translation ensures guest devices cannot perform unauthorized DMA
  (device I/O address mappings are managed by the guest).
- The emulated IOMMU removes the guest's direct dependency on IOMMU hardware,
  while maintaining domain isolation.

1. Observation:
---------------
Support for Stage-1 translation in SMMUv3 introduces new data structures
(`s1_cfg` alongside `s2_cfg`) and logic to write both Stage-1 and Stage-2
entries in the Stream Table Entry (STE), including an `abort` field to handle
partial configuration states.

**Risk:**
Without proper handling, a partially applied Stage-1 configuration might leave
guest DMA mappings in an inconsistent state, potentially enabling unauthorized
access or causing cross-domain interference.

**Mitigation:** *(Handled by design)*
This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to the
STE and manages the `abort` field so that a Stage-1 configuration is only
considered once it is fully attached. This ensures incomplete or invalid guest
configurations are safely ignored by the hypervisor.

2. Observation:
---------------
Guests can now invalidate Stage-1 caches; invalidations need to be forwarded
to the SMMUv3 hardware to maintain coherence.

**Risk:**
Failing to propagate cache invalidations could leave stale mappings in place,
enabling access through old mappings and possibly data leakage or misrouting.

**Mitigation:** *(Handled by design)*
This feature ensures that guest-initiated invalidations are correctly
forwarded to the hardware, preserving IOMMU coherency.

3. Observation:
---------------
This design introduces substantial new functionality, including the `vIOMMU`
framework, virtual SMMUv3 devices (`vsmmuv3`), command queues, event queues,
domain management, and Device Tree modifications (e.g., `iommus` nodes and
`libxl` integration).

**Risk:**
Large feature expansions increase the attack surface, creating potential for
race conditions, unchecked command inputs, or Device Tree-based
misconfigurations.

**Mitigation:**

- Sanity checks and error-handling improvements have been introduced in this
  feature.
- Further audits still have to be performed for this feature and its
  dependencies in this area.

4. Observation:
---------------
The code includes transformations to handle nested translation versus standard
modes and uses guest-configured command queues (e.g., `CMD_CFGI_STE`) and
event notifications.

**Risk:**
Malicious or malformed queue commands from guests could bypass validation,
manipulate SMMUv3 state, or cause system instability.

**Mitigation:** *(Handled by design)*
Built-in validation of command queue entries and sanitization mechanisms
ensure that only permitted configurations are applied. This is supported via
additions in the `vsmmuv3` and `cmdqueue` handling code.
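As an illustration of the kind of command sanitization referred to above, the
sketch below shows how a guest `CMD_CFGI_STE` command could be validated
before any hardware state is derived from it. This is a simplified,
hypothetical example; the field accessors and the `vsmmu_sid_owned_by()` /
`vsmmu_attach_guest_ste()` helpers are assumptions, not code from the patches::

    /* Hypothetical sketch of guest command sanitization -- not the series' code. */

    #define CMDQ_OP_CFGI_STE  0x03            /* SMMUv3 CMD_CFGI_STE opcode */

    struct vsmmu_cmd {
        uint64_t dw[2];                       /* a command is two 64-bit words */
    };

    static inline uint32_t cmd_opcode(const struct vsmmu_cmd *cmd)
    {
        return cmd->dw[0] & 0xff;             /* opcode lives in bits [7:0] */
    }

    static inline uint32_t cmd_sid(const struct vsmmu_cmd *cmd)
    {
        return cmd->dw[0] >> 32;              /* StreamID in bits [63:32] */
    }

    /*
     * Only act on commands the emulation understands, and only for StreamIDs
     * assigned to this domain.  Anything else is rejected (and may be reported
     * back to the guest as a command queue error, e.g. CERROR_ILL).
     */
    static int vsmmu_handle_cfgi_ste(struct domain *d, const struct vsmmu_cmd *cmd)
    {
        uint32_t sid = cmd_sid(cmd);

        if ( cmd_opcode(cmd) != CMDQ_OP_CFGI_STE )
            return -EOPNOTSUPP;

        if ( !vsmmu_sid_owned_by(d, sid) )    /* ownership check, hypothetical */
            return -EPERM;

        /* Fetch the guest STE, validate its Stage-1 fields, then attach s1_cfg. */
        return vsmmu_attach_guest_ste(d, sid);
    }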
5. Observation:
---------------
Device Tree modifications enable device assignment and configuration; guest DT
fragments (e.g., `iommus` properties) are added via `libxl`.

**Risk:**
Erroneous or malicious Device Tree injection could result in device misbinding
or guest access to unauthorized hardware.

**Mitigation:**

- `libxl` performs checks of the guest configuration and parses only
  predefined DT fragments and nodes, reducing risk.
- The system integrator must ensure correct resource mapping in the guest
  Device Tree (DT) fragments.

6. Observation:
---------------
Introducing optional per-guest features (the `viommu` argument in the xl guest
config) means some guests may opt out.

**Risk:**
Differences between guests with and without `viommu` may cause unexpected
behavior or privilege drift.

**Mitigation:**
Verify that downgrade paths are safe and well-isolated; ensure missing support
doesn't cause security issues. Additional audits on emulation paths and domain
interference need to be performed in a multi-guest environment.

7. Observation:
---------------
Guests have the ability to issue Stage-1 IOMMU commands such as cache
invalidation, stream table entry configuration, etc. An adversarial guest may
issue a high volume of commands in rapid succession.

**Risk:**
Excessive command requests can cause high hypervisor CPU consumption and
disrupt scheduling, leading to degraded system responsiveness and potential
denial-of-service scenarios.

**Mitigation:**

- The Xen scheduler limits guest vCPU execution time, securing basic guest
  rate-limiting.
- Batch multiple commands of the same type to reduce overhead on the virtual
  SMMUv3 hardware emulation.
- Implement vIOMMU command execution restart and continuation support (a
  sketch of one possible approach follows the security considerations).

8. Observation:
---------------
Some guest commands issued towards the vIOMMU are propagated to the pIOMMU
command queue (e.g. TLB invalidate).

**Risk:**
Excessive command requests from an abusive guest can cause flooding of the
physical IOMMU command queue, leading to degraded pIOMMU responsiveness for
commands issued from other guests.

**Mitigation:**

- The Xen scheduler limits guest vCPU execution time, securing basic guest
  rate-limiting.
- Batch commands which should be propagated towards the pIOMMU command queue
  and enable support for batch execution pause/continuation.
- If possible, implement domain penalization by adding a per-domain cost
  counter for vIOMMU/pIOMMU usage.

9. Observation:
---------------
The vIOMMU feature includes an event queue used for forwarding IOMMU events to
the guest (e.g. translation faults, invalid stream IDs, permission errors). A
malicious guest can misconfigure its SMMU state or intentionally trigger
faults with high frequency.

**Risk:**
IOMMU events occurring at high frequency can cause Xen to flood the event
queue and disrupt scheduling with high hypervisor CPU load for event handling.

**Mitigation:**

- Implement a fail-safe state by disabling event forwarding when faults occur
  at high frequency and are not processed by the guest.
- Batch multiple events of the same type to reduce overhead on the virtual
  SMMUv3 hardware emulation.
- Consider disabling the event queue for untrusted guests.
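The batching and continuation mitigations in Observations 7 and 8 could, for
example, bound the number of guest command queue entries processed per trap
and remember the consumer index so that processing resumes later instead of
monopolizing a physical CPU. The sketch below illustrates that idea; the
budget value, structure fields and helpers (`vsmmu_read_guest_cmd()`,
`vsmmu_handle_cmd()`, `vsmmu_schedule_continuation()`) are assumptions for
this document, not the implementation in the patches::

    /* Hypothetical sketch of bounded command queue draining -- not the series' code. */

    #define VSMMU_CMDQ_BUDGET  64        /* max commands handled per CMDQ_PROD write */

    struct vsmmu_cmdq {
        uint32_t cons;                   /* emulated consumer index */
        uint32_t prod;                   /* producer index last written by the guest */
        uint32_t log2size;               /* queue size configured by the guest */
    };

    /*
     * Called when the guest updates SMMU_CMDQ_PROD.  At most VSMMU_CMDQ_BUDGET
     * commands are consumed per invocation; any remainder stays in the queue
     * and is handled on a later invocation (e.g. from a tasklet), so a single
     * guest cannot monopolize the CPU or flood the physical command queue.
     * Wrap-bit handling is omitted for brevity.
     */
    static void vsmmu_drain_cmdq(struct domain *d, struct vsmmu_cmdq *q)
    {
        unsigned int handled;

        for ( handled = 0;
              handled < VSMMU_CMDQ_BUDGET && q->cons != q->prod;
              handled++ )
        {
            struct vsmmu_cmd cmd;

            if ( vsmmu_read_guest_cmd(d, q, q->cons, &cmd) )
                break;                   /* unreadable guest memory: stop draining */

            vsmmu_handle_cmd(d, &cmd);   /* validate, then emulate or propagate */
            q->cons = (q->cons + 1) & ((1u << q->log2size) - 1);
        }

        /* Work left over?  Schedule a continuation instead of spinning here. */
        if ( q->cons != q->prod )
            vsmmu_schedule_continuation(d);
    }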
Performance Impact
==================

With IOMMU stage-1 and nested translation included, a performance overhead is
introduced compared to the existing, stage-2-only usage in Xen.
Once mappings are established, translations should not introduce significant
overhead.
Emulated paths may introduce moderate overhead, primarily affecting device
initialization and event handling.
The performance impact depends heavily on the capabilities of the target CPU.
Testing was performed on QEMU virt and Renesas R-Car (QEMU emulated)
platforms.
Performance is mostly impacted by emulated vIOMMU operations; results are
shown in the following table.

+-------------------------------+---------------------------------+
| vIOMMU Operation              | Execution time in guest         |
+===============================+=================================+
| Reg read                      | median: 30μs, worst-case: 250μs |
+-------------------------------+---------------------------------+
| Reg write                     | median: 35μs, worst-case: 280μs |
+-------------------------------+---------------------------------+
| Invalidate TLB                | median: 90μs, worst-case: 1ms+  |
+-------------------------------+---------------------------------+
| Invalidate STE                | median: 450μs, worst-case: 7ms+ |
+-------------------------------+---------------------------------+

With a vIOMMU exposed to the guest, the guest OS has to initialize the IOMMU
device and configure stage-1 mappings for the devices attached to it.
The following table shows the initialization stages which impact the boot time
of a stage-1 enabled guest and compares them with a stage-1 disabled guest.

NOTE: Device probe execution time varies significantly depending on device
complexity. virtio-gpu was selected as a test case due to its extensive use of
dynamic DMA allocations and IOMMU mappings, making it a suitable candidate for
benchmarking stage-1 vIOMMU behavior.

+---------------------+-----------------------+------------------------+
| Stage               | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
+=====================+=======================+========================+
| IOMMU Init          | ~25ms                 | /                      |
+---------------------+-----------------------+------------------------+
| Dev Attach / Mapping| ~220ms                | ~200ms                 |
+---------------------+-----------------------+------------------------+

For devices configured with dynamic DMA mappings, the performance of DMA
allocate/map/unmap operations is also impacted on stage-1 enabled guests.
Dynamic DMA mapping operations trigger emulated IOMMU functions such as MMIO
writes/reads and TLB invalidations.
As a reference, the following table shows performance results for runtime DMA
operations on a virtio-gpu device.

+---------------+-------------------------+----------------------------+
| DMA Op        | Stage-1 Enabled Guest   | Stage-1 Disabled Guest     |
+===============+=========================+============================+
| dma_alloc     | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs|
+---------------+-------------------------+----------------------------+
| dma_free      | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs |
+---------------+-------------------------+----------------------------+
| dma_map       | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs|
+---------------+-------------------------+----------------------------+
| dma_unmap     | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs |
+---------------+-------------------------+----------------------------+

Testing
=======

- QEMU-based ARM system tests for Stage-1 translation.
- Actual hardware validation to ensure compatibility with real SMMUv3
  implementations.
- Unit/Functional tests validating correct translations (not implemented).

Migration and Compatibility
===========================

This optional feature defaults to disabled (`viommu=""`) for backward
compatibility.

Future improvements
===================

- Implement the proposed mitigations for security risks that are not covered
  by the current design (event batching, command execution continuation).
- Support for other IOMMU hardware (Renesas, RISC-V, etc.).
- The current implementation statically defines SPIs and MMIO regions for
  emulated devices, supporting up to 16 vIOMMUs per guest. Future improvements
  would include a configurable number of vIOMMUs or automatic runtime
  resolution for the target platform.

References
==========

- Original feature implemented by Rahul Singh:
  https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/
- SMMUv3 architecture documentation
- Existing vIOMMU code patterns

BR,
Milan
On 03/11/2025 13:16, Milan Djokic wrote: > Hello Volodymyr, Julien Hi Milan, Thanks for the new update. For the future, can you trim your reply? > Sorry for the delayed follow-up on this topic. > We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU and > pIOMMU. Considering single vIOMMU model limitation pointed out by > Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the > only proper solution. I am not sure to fully understand. My assumption with the single vIOMMU is you have a virtual SID that would be mapped to a (pIOMMU, physical SID). Does this means in your solution you will end up with multiple vPCI as well and then map pBDF == vBDF? (this because the SID have to be fixed at boot) > Following is the updated design document. > I have added additional details to the design and performance impact > sections, and also indicated future improvements. Security > considerations section is unchanged apart from some minor details > according to review comments. > Let me know what do you think about updated design. Once approved, I > will send the updated vIOMMU patch series. > > > ========================================================== > Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests > ========================================================== > > :Author: Milan Djokic <milan_djokic@epam.com> > :Date: 2025-11-03 > :Status: Draft > > Introduction > ============ > > The SMMUv3 supports two stages of translation. Each stage of translation > can be > independently enabled. An incoming address is logically translated from > VA to > IPA in stage 1, then the IPA is input to stage 2 which translates the > IPA to > the output PA. Stage 1 translation support is required to provide > isolation between different > devices within OS. XEN already supports Stage 2 translation but there is no > support for Stage 1 translation. > This design proposal outlines the introduction of Stage-1 SMMUv3 support > in Xen for ARM guests. > > Motivation > ========== > > ARM systems utilizing SMMUv3 require stage-1 address translation to > ensure secure DMA and > guest managed I/O memory mappings. > With stage-1 enabed, guest manages IOVA to IPA mappings through its own > IOMMU driver. > > This feature enables: > > - Stage-1 translation in guest domain > - Safe device passthrough with per-device address translation table I find this misleading. Even without this feature, device passthrough is still safe in the sense a device will be isolated (assuming all the DMA goes through the IOMMU) and will not be able to DMA outside of the guest memory. What the stage-1 is doing is providing an extra layer to control what each device can see. This is useful if you don't trust your devices or you want to assign a device to userspace (e.g. for DPDK). > > Design Overview > =============== > > These changes provide emulated SMMUv3 support: If my understanding is correct, there are all some implications in how we create the PCI topology. It would be good to spell them out. > > - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support > in SMMUv3 driver. > - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1 > handling. > - **Register/Command Emulation**: SMMUv3 register emulation and command > queue handling. > - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to > device trees for dom0 and dom0less scenarios. What about ACPI? > - **Runtime Configuration**: Introduces a `viommu` boot parameter for > dynamic enablement. 
> > Separate vIOMMU device is exposed to guest for every physical IOMMU in > the system. > vIOMMU feature is designed in a way to provide a generic vIOMMU > framework and a backend implementation > for target IOMMU as separate components. > Backend implementation contains specific IOMMU structure and commands > handling (only SMMUv3 currently supported). > This structure allows potential reuse of stage-1 feature for other IOMMU > types. > > Security Considerations > ======================= > > **viommu security benefits:** > > - Stage-1 translation ensures guest devices cannot perform unauthorized > DMA (device I/O address mapping managed by guest). > - Emulated IOMMU removes guest direct dependency on IOMMU hardware, > while maintaining domains isolation. Sorry, I don't follow this argument. Are you saying that it would be possible to emulate a SMMUv3 vIOMMU on top of the IPMMU? > 1. Observation: > --------------- > Support for Stage-1 translation in SMMUv3 introduces new data structures > (`s1_cfg` alongside `s2_cfg`) > and logic to write both Stage-1 and Stage-2 entries in the Stream Table > Entry (STE), including an `abort` > field to handle partial configuration states. > > **Risk:** > Without proper handling, a partially applied Stage-1 configuration might > leave guest DMA mappings in an > inconsistent state, potentially enabling unauthorized access or causing > cross-domain interference. How so? Even if you misconfigure the S1, the S2 would still be properly configured (you just mention partially applied stage-1). > > **Mitigation:** *(Handled by design)* > This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to > STE and manages the `abort` field-only > considering Stage-1 configuration if fully attached. This ensures > incomplete or invalid guest configurations > are safely ignored by the hypervisor. Can you clarify what you mean by invalid guest configurations? > > 2. Observation: > --------------- > Guests can now invalidate Stage-1 caches; invalidation needs forwarding > to SMMUv3 hardware to maintain coherence. > > **Risk:** > Failing to propagate cache invalidation could allow stale mappings, > enabling access to old mappings and possibly > data leakage or misrouting. You are referring to data leakage/misrouting between two devices own by the same guest, right? Xen would still be in charge of flush when the stage-2 is updated. > > **Mitigation:** *(Handled by design)* > This feature ensures that guest-initiated invalidations are correctly > forwarded to the hardware, > preserving IOMMU coherency. How is this a mitigation? You have to properly handle commands. If you don't properly handle them, then yes it will break. > > 4. Observation: > --------------- > The code includes transformations to handle nested translation versus > standard modes and uses guest-configured > command queues (e.g., `CMD_CFGI_STE`) and event notifications. > > **Risk:** > Malicious or malformed queue commands from guests could bypass > validation, manipulate SMMUv3 state, > or cause system instability. > > **Mitigation:** *(Handled by design)* > Built-in validation of command queue entries and sanitization mechanisms > ensure only permitted configurations > are applied. This is true as long as we didn't make an mistake in the configurations ;). > This is supported via additions in `vsmmuv3` and `cmdqueue` > handling code. > > 5. 
Observation: > --------------- > Device Tree modifications enable device assignment and configuration > through guest DT fragments (e.g., `iommus`) > are added via `libxl`. > > **Risk:** > Erroneous or malicious Device Tree injection could result in device > misbinding or guest access to unauthorized > hardware. The DT fragment are not security support and will never be at least until you have can a libfdt that is able to detect malformed Device-Tree (I haven't checked if this has changed recently). > > **Mitigation:** > > - `libxl` perform checks of guest configuration and parse only > predefined dt fragments and nodes, reducing risk. > - The system integrator must ensure correct resource mapping in the > guest Device Tree (DT) fragments. > > 6. Observation: > --------------- > Introducing optional per-guest enabled features (`viommu` argument in xl > guest config) means some guests > may opt-out. > > **Risk:** > Differences between guests with and without `viommu` may cause > unexpected behavior or privilege drift. I don't understand this risk. Can you clarify? > > **Mitigation:** > Verify that downgrade paths are safe and well-isolated; ensure missing > support doesn't cause security issues. > Additional audits on emulation paths and domains interference need to be > performed in a multi-guest environment. > > 7. Observation: > --------------- This observation with 7, 8 and 9 are the most important observations but it seems to be missing some details on how this will be implemented. I will try to provide some questions that should help filling the gaps. > Guests have the ability to issue Stage-1 IOMMU commands like cache > invalidation, stream table entries > configuration, etc. An adversarial guest may issue a high volume of > commands in rapid succession. > > **Risk:** > Excessive commands requests can cause high hypervisor CPU consumption > and disrupt scheduling, > leading to degraded system responsiveness and potential denial-of- > service scenarios. > > **Mitigation:** > > - Xen scheduler limits guest vCPU execution time, securing basic guest > rate-limiting. This really depends on your scheduler. Some scheduler (e.g. NULL) will not do any scheduling at all. Furthermore, the scheduler only preempt EL1/EL0. It doesn't preempt EL2, so any long running operation need manual preemption. Therefore, I wouldn't consider this as a mitigation. > - Batch multiple commands of same type to reduce overhead on the virtual > SMMUv3 hardware emulation. The guest can send commands in any order. So can you expand how this would work? Maybe with some example. > - Implement vIOMMU commands execution restart and continuation support This needs a bit more details. How will you decide whether to restart and what would be the action? (I guess it will be re-executing the instruction to write to the CWRITER). > > 8. Observation: > --------------- > Some guest commands issued towards vIOMMU are propagated to pIOMMU > command queue (e.g. TLB invalidate). > > **Risk:** > Excessive commands requests from abusive guest can cause flooding of > physical IOMMU command queue, > leading to degraded pIOMMU responsivness on commands issued from other > guests. > > **Mitigation:** > > - Xen credit scheduler limits guest vCPU execution time, securing basic > guest rate-limiting. Same as above. This mitigation cannot be used. > - Batch commands which should be propagated towards pIOMMU cmd queue and > enable support for batch > execution pause/continuation Can this be expanded? 
> - If possible, implement domain penalization by adding a per-domain cost > counter for vIOMMU/pIOMMU usage. Can this be expanded? > > 9. Observation: > --------------- > vIOMMU feature includes event queue used for forwarding IOMMU events to > guest > (e.g. translation faults, invalid stream IDs, permission errors). > A malicious guest can misconfigure its SMMU state or intentionally > trigger faults with high frequency. > > **Risk:** > Occurance of IOMMU events with high frequency can cause Xen to flood the s/occurance/occurrence/ > event queue and disrupt scheduling with > high hypervisor CPU load for events handling. > > **Mitigation:** > > - Implement fail-safe state by disabling events forwarding when faults > are occured with high frequency and > not processed by guest. I am not sure to understand how this would work. Can you expand? > - Batch multiple events of same type to reduce overhead on the virtual > SMMUv3 hardware emulation. Ditto. > - Consider disabling event queue for untrusted guests My understanding is there is only a single physical event queue. Xen would be responsible to handle the events in the queue and forward to the respective guests. If so, it is not clear what you mean by "disable event queue". > > Performance Impact > ================== > > With iommu stage-1 and nested translation inclusion, performance > overhead is introduced comparing to existing, > stage-2 only usage in Xen. Once mappings are established, translations > should not introduce significant overhead. > Emulated paths may introduce moderate overhead, primarily affecting > device initialization and event handling. > Performance impact highly depends on target CPU capabilities. > Testing is performed on QEMU virt and Renesas R-Car (QEMU emulated) > platforms. I am afraid QEMU is not a reliable platform to do performance testing. Don't you have a real HW with vIOMMU support? [...] > References > ========== > > - Original feature implemented by Rahul Singh: > > https://patchwork.kernel.org/project/xen-devel/cover/ > cover.1669888522.git.rahul.singh@arm.com/ > - SMMUv3 architecture documentation > - Existing vIOMMU code patterns I am not sure what this is referring to? Cheers, -- Julien Grall
Hi Julien, On 11/27/25 11:22, Julien Grall wrote: >> We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU and >> pIOMMU. Considering single vIOMMU model limitation pointed out by >> Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the >> only proper solution. > > I am not sure to fully understand. My assumption with the single vIOMMU > is you have a virtual SID that would be mapped to a (pIOMMU, physical > SID). In the original single vIOMMU implementation, vSID was also equal to pSID, we didn't have SW mapping layer between them. Once SID overlap issue was discovered with this model, I have switched to vIOMMU-per-pIOMMU model. Alternative was to introduce a SW mapping layer and stick with a single vIOMMU model. Imo, vSID->pSID mapping layer would overcomplicate the design, especially for PCI RC streamIDs handling. On the other hand, if even a multi-vIOMMU model introduces problems that I am not aware of yet, adding a complex mapping layer would be the only viable solution. > Does this means in your solution you will end up with multiple > vPCI as well and then map pBDF == vBDF? (this because the SID have to be > fixed at boot) > The important thing which I haven't mentioned here is that our focus is on non-PCI devices for this feature atm. If I'm not mistaken, arm PCI passthrough is still work in progress, so our plan was to implement full vIOMMU PCI support in the future, once PCI passthrough support is complete for arm. Of course, we need to make sure that vIOMMU design provides a suitable infrastructure for PCI. To answer your question, yes we will have multiple vPCI nodes with this model, establishing 1-1 vSID-pSID mapping (same iommu-map range between pPCI-vPCI). For pBDF to vBDF 1-1 mapping, I'm not sure if this is necessary. My understanding is that vBDF->pBDF mapping does not affect vSID->pSID mapping. Am I wrong here? >> ========================================================== >> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests >> ========================================================== >> >> :Author: Milan Djokic <milan_djokic@epam.com> >> :Date: 2025-11-03 >> :Status: Draft >> >> Introduction >> ============ >> >> The SMMUv3 supports two stages of translation. Each stage of translation >> can be >> independently enabled. An incoming address is logically translated from >> VA to >> IPA in stage 1, then the IPA is input to stage 2 which translates the >> IPA to >> the output PA. Stage 1 translation support is required to provide >> isolation between different >> devices within OS. XEN already supports Stage 2 translation but there is no >> support for Stage 1 translation. >> This design proposal outlines the introduction of Stage-1 SMMUv3 support >> in Xen for ARM guests. >> >> Motivation >> ========== >> >> ARM systems utilizing SMMUv3 require stage-1 address translation to >> ensure secure DMA and >> guest managed I/O memory mappings. >> With stage-1 enabed, guest manages IOVA to IPA mappings through its own >> IOMMU driver. >> >> This feature enables: >> >> - Stage-1 translation in guest domain >> - Safe device passthrough with per-device address translation table > > I find this misleading. Even without this feature, device passthrough is > still safe in the sense a device will be isolated (assuming all the DMA > goes through the IOMMU) and will not be able to DMA outside of the guest > memory. What the stage-1 is doing is providing an extra layer to control > what each device can see. 
This is useful if you don't trust your devices > or you want to assign a device to userspace (e.g. for DPDK). > I'll rephrase this. >> >> Design Overview >> =============== >> >> These changes provide emulated SMMUv3 support: > > If my understanding is correct, there are all some implications in how > we create the PCI topology. It would be good to spell them out. > Sure, I will outline them. >> >> - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support >> in SMMUv3 driver. >> - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1 >> handling. >> - **Register/Command Emulation**: SMMUv3 register emulation and command >> queue handling. >> - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to >> device trees for dom0 and dom0less scenarios. > > What about ACPI? > ACPI support is not part of this feature atm. This will be a topic for future updates. >> - **Runtime Configuration**: Introduces a `viommu` boot parameter for >> dynamic enablement. >> >> Separate vIOMMU device is exposed to guest for every physical IOMMU in >> the system. >> vIOMMU feature is designed in a way to provide a generic vIOMMU >> framework and a backend implementation >> for target IOMMU as separate components. >> Backend implementation contains specific IOMMU structure and commands >> handling (only SMMUv3 currently supported). >> This structure allows potential reuse of stage-1 feature for other IOMMU >> types. >> >> Security Considerations >> ======================= >> >> **viommu security benefits:** >> >> - Stage-1 translation ensures guest devices cannot perform unauthorized >> DMA (device I/O address mapping managed by guest). >> - Emulated IOMMU removes guest direct dependency on IOMMU hardware, >> while maintaining domains isolation. > > Sorry, I don't follow this argument. Are you saying that it would be > possible to emulate a SMMUv3 vIOMMU on top of the IPMMU? > No, this would not work. Emulated IOMMU has to match with the pIOMMU type. The argument only points out that we are emulating IOMMU, so the guest does not need direct HW interface for IOMMU functions. >> 1. Observation: >> --------------- >> Support for Stage-1 translation in SMMUv3 introduces new data structures >> (`s1_cfg` alongside `s2_cfg`) >> and logic to write both Stage-1 and Stage-2 entries in the Stream Table >> Entry (STE), including an `abort` >> field to handle partial configuration states. >> >> **Risk:** >> Without proper handling, a partially applied Stage-1 configuration might >> leave guest DMA mappings in an >> inconsistent state, potentially enabling unauthorized access or causing >> cross-domain interference. > > How so? Even if you misconfigure the S1, the S2 would still be properly > configured (you just mention partially applied stage-1). > This could be the case when we have only stage-1. But yes, this is improbable case for xen, stage-2 should be mentioned also, will fix this. >> >> **Mitigation:** *(Handled by design)* >> This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to >> STE and manages the `abort` field-only >> considering Stage-1 configuration if fully attached. This ensures >> incomplete or invalid guest configurations >> are safely ignored by the hypervisor. > > Can you clarify what you mean by invalid guest configurations? > s1 and s2 config will be considered only if configured for the guest device. E.g. if only stage-2 is attached for the guest device, stage-1 configuration will be invalid, but safely ignored. 
I'll change this to "device configuration" instead of ambiguous "guest configuration". >> >> 2. Observation: >> --------------- >> Guests can now invalidate Stage-1 caches; invalidation needs forwarding >> to SMMUv3 hardware to maintain coherence. >> >> **Risk:** >> Failing to propagate cache invalidation could allow stale mappings, >> enabling access to old mappings and possibly >> data leakage or misrouting. > > You are referring to data leakage/misrouting between two devices own by > the same guest, right? Xen would still be in charge of flush when the > stage-2 is updated. > Yes, this risk could affect only guests, not xen. >> >> **Mitigation:** *(Handled by design)* >> This feature ensures that guest-initiated invalidations are correctly >> forwarded to the hardware, >> preserving IOMMU coherency. > > How is this a mitigation? You have to properly handle commands. If you > don't properly handle them, then yes it will break. > Not really a mitigation, will remove it. Guest is responsible for the regular initiation of invalidation requests to mitigate this risk. >> >> 4. Observation: >> --------------- >> The code includes transformations to handle nested translation versus >> standard modes and uses guest-configured >> command queues (e.g., `CMD_CFGI_STE`) and event notifications. >> >> **Risk:** >> Malicious or malformed queue commands from guests could bypass >> validation, manipulate SMMUv3 state, >> or cause system instability. >> >> **Mitigation:** *(Handled by design)* >> Built-in validation of command queue entries and sanitization mechanisms >> ensure only permitted configurations >> are applied. > > This is true as long as we didn't make an mistake in the configurations ;). > Yes, but I don’t see anything we can do to prevent configuration mistakes. > >> This is supported via additions in `vsmmuv3` and `cmdqueue` >> handling code. >> >> 5. Observation: >> --------------- >> Device Tree modifications enable device assignment and configuration >> through guest DT fragments (e.g., `iommus`) >> are added via `libxl`. >> >> **Risk:** >> Erroneous or malicious Device Tree injection could result in device >> misbinding or guest access to unauthorized >> hardware. > > The DT fragment are not security support and will never be at least > until you have can a libfdt that is able to detect malformed Device-Tree > (I haven't checked if this has changed recently). > But this should still be considered a risk? Similar to the previous observation, system integrator should ensure that DT fragments are correct. >> >> **Mitigation:** >> >> - `libxl` perform checks of guest configuration and parse only >> predefined dt fragments and nodes, reducing risk. >> - The system integrator must ensure correct resource mapping in the >> guest Device Tree (DT) fragments. > > > 6. Observation: >> --------------- >> Introducing optional per-guest enabled features (`viommu` argument in xl >> guest config) means some guests >> may opt-out. >> >> **Risk:** >> Differences between guests with and without `viommu` may cause >> unexpected behavior or privilege drift. > > I don't understand this risk. Can you clarify? > This risk is similar to the topics discussed in Observations 8 and 9, but in the context of vIOMMU-disabled guests potentially hogging the command and event queues due to faster processing of iommu requests. I will expand this. >> >> **Mitigation:** >> Verify that downgrade paths are safe and well-isolated; ensure missing >> support doesn't cause security issues. 
>> Additional audits on emulation paths and domains interference need to be >> performed in a multi-guest environment. >> >> 7. Observation: >> --------------- > > This observation with 7, 8 and 9 are the most important observations but > it seems to be missing some details on how this will be implemented. I > will try to provide some questions that should help filling the gaps. > Thanks, I will expand these observations according to comments. >> Guests have the ability to issue Stage-1 IOMMU commands like cache >> invalidation, stream table entries >> configuration, etc. An adversarial guest may issue a high volume of >> commands in rapid succession. >> >> **Risk:** >> Excessive commands requests can cause high hypervisor CPU consumption >> and disrupt scheduling, >> leading to degraded system responsiveness and potential denial-of- >> service scenarios. >> >> **Mitigation:** >> >> - Xen scheduler limits guest vCPU execution time, securing basic guest >> rate-limiting. > > This really depends on your scheduler. Some scheduler (e.g. NULL) will > not do any scheduling at all. Furthermore, the scheduler only preempt > EL1/EL0. It doesn't preempt EL2, so any long running operation need > manual preemption. Therefore, I wouldn't consider this as a mitigation. > >> - Batch multiple commands of same type to reduce overhead on the virtual >> SMMUv3 hardware emulation. > > The guest can send commands in any order. So can you expand how this > would work? Maybe with some example. > >> - Implement vIOMMU commands execution restart and continuation support > > This needs a bit more details. How will you decide whether to restart > and what would be the action? (I guess it will be re-executing the > instruction to write to the CWRITER). > >> >> 8. Observation: >> --------------- >> Some guest commands issued towards vIOMMU are propagated to pIOMMU >> command queue (e.g. TLB invalidate). >> >> **Risk:** >> Excessive commands requests from abusive guest can cause flooding of >> physical IOMMU command queue, >> leading to degraded pIOMMU responsivness on commands issued from other >> guests. >> >> **Mitigation:** >> >> - Xen credit scheduler limits guest vCPU execution time, securing basic >> guest rate-limiting. > > Same as above. This mitigation cannot be used. > > >> - Batch commands which should be propagated towards pIOMMU cmd queue and >> enable support for batch >> execution pause/continuation > > Can this be expanded? > >> - If possible, implement domain penalization by adding a per-domain cost >> counter for vIOMMU/pIOMMU usage. > > Can this be expanded? > >> >> 9. Observation: >> --------------- >> vIOMMU feature includes event queue used for forwarding IOMMU events to >> guest >> (e.g. translation faults, invalid stream IDs, permission errors). >> A malicious guest can misconfigure its SMMU state or intentionally >> trigger faults with high frequency. >> >> **Risk:** >> Occurance of IOMMU events with high frequency can cause Xen to flood the > > s/occurance/occurrence/ > >> event queue and disrupt scheduling with >> high hypervisor CPU load for events handling. >> >> **Mitigation:** >> >> - Implement fail-safe state by disabling events forwarding when faults >> are occured with high frequency and >> not processed by guest. > > I am not sure to understand how this would work. Can you expand? > >> - Batch multiple events of same type to reduce overhead on the virtual >> SMMUv3 hardware emulation. > > Ditto. 
> >> - Consider disabling event queue for untrusted guests > > My understanding is there is only a single physical event queue. Xen > would be responsible to handle the events in the queue and forward to > the respective guests. If so, it is not clear what you mean by "disable > event queue". > I was referring to emulated IOMMU event queue. The idea is to make it optional for guests. When disabled, events won't be propagated to the guest. >> >> Performance Impact >> ================== >> >> With iommu stage-1 and nested translation inclusion, performance >> overhead is introduced comparing to existing, >> stage-2 only usage in Xen. Once mappings are established, translations >> should not introduce significant overhead. >> Emulated paths may introduce moderate overhead, primarily affecting >> device initialization and event handling. >> Performance impact highly depends on target CPU capabilities. >> Testing is performed on QEMU virt and Renesas R-Car (QEMU emulated) >> platforms. > > I am afraid QEMU is not a reliable platform to do performance testing. > Don't you have a real HW with vIOMMU support? > Yes, I will provide performance measurement for Renesas HW also. > [...] > >> References >> ========== >> >> - Original feature implemented by Rahul Singh: >> >> https://patchwork.kernel.org/project/xen-devel/cover/ >> cover.1669888522.git.rahul.singh@arm.com/ >> - SMMUv3 architecture documentation >> - Existing vIOMMU code patterns > > I am not sure what this is referring to? > QEMU and KVM IOMMU emulation patterns were used as a reference. BR, Milan
Hi, On 02/12/2025 22:08, Milan Djokic wrote: > Hi Julien, > > On 11/27/25 11:22, Julien Grall wrote: >>> We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU and >>> pIOMMU. Considering single vIOMMU model limitation pointed out by >>> Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the >>> only proper solution. >> >> I am not sure to fully understand. My assumption with the single vIOMMU >> is you have a virtual SID that would be mapped to a (pIOMMU, physical >> SID). > > In the original single vIOMMU implementation, vSID was also equal to > pSID, we didn't have SW mapping layer between them. Once SID overlap > issue was discovered with this model, I have switched to vIOMMU-per- > pIOMMU model. Alternative was to introduce a SW mapping layer and stick > with a single vIOMMU model. Imo, vSID->pSID mapping layer would > overcomplicate the design, especially for PCI RC streamIDs handling. > On the other hand, if even a multi-vIOMMU model introduces problems that > I am not aware of yet, adding a complex mapping layer would be the only > viable solution. > > > Does this means in your solution you will end up with multiple > > vPCI as well and then map pBDF == vBDF? (this because the SID have to be > > fixed at boot) > > > > The important thing which I haven't mentioned here is that our focus is > on non-PCI devices for this feature atm. If I'm not mistaken, arm PCI > passthrough is still work in progress, so our plan was to implement full > vIOMMU PCI support in the future, once PCI passthrough support is > complete for arm. Of course, we need to make sure that vIOMMU design > provides a suitable infrastructure for PCI. > To answer your question, yes we will have multiple vPCI nodes with this > model, establishing 1-1 vSID-pSID mapping (same iommu-map range between > pPCI-vPCI). > For pBDF to vBDF 1-1 mapping, I'm not sure if this is necessary. My > understanding is that vBDF->pBDF mapping does not affect vSID->pSID > mapping. Am I wrong here? From my understanding, the mapping between a vBDF and vSID is setup at domain creation (as this is described in ACPI/Device-Tree). As PCI devices can be hotplug, if you want to enforce vSID == pSID, then you indirectly need to enforce vBDF == pBDF. [...] >>> - **Runtime Configuration**: Introduces a `viommu` boot parameter for >>> dynamic enablement. >>> >>> Separate vIOMMU device is exposed to guest for every physical IOMMU in >>> the system. >>> vIOMMU feature is designed in a way to provide a generic vIOMMU >>> framework and a backend implementation >>> for target IOMMU as separate components. >>> Backend implementation contains specific IOMMU structure and commands >>> handling (only SMMUv3 currently supported). >>> This structure allows potential reuse of stage-1 feature for other IOMMU >>> types. >>> >>> Security Considerations >>> ======================= >>> >>> **viommu security benefits:** >>> >>> - Stage-1 translation ensures guest devices cannot perform unauthorized >>> DMA (device I/O address mapping managed by guest). >>> - Emulated IOMMU removes guest direct dependency on IOMMU hardware, >>> while maintaining domains isolation. >> >> Sorry, I don't follow this argument. Are you saying that it would be >> possible to emulate a SMMUv3 vIOMMU on top of the IPMMU? >> > > No, this would not work. Emulated IOMMU has to match with the pIOMMU type. > The argument only points out that we are emulating IOMMU, so the guest > does not need direct HW interface for IOMMU functions. 
Sorry, but I am still missing how this is a security benefits. [...] >>> >>> 2. Observation: >>> --------------- >>> Guests can now invalidate Stage-1 caches; invalidation needs forwarding >>> to SMMUv3 hardware to maintain coherence. >>> >>> **Risk:** >>> Failing to propagate cache invalidation could allow stale mappings, >>> enabling access to old mappings and possibly >>> data leakage or misrouting. >> >> You are referring to data leakage/misrouting between two devices own by >> the same guest, right? Xen would still be in charge of flush when the >> stage-2 is updated. >> > > Yes, this risk could affect only guests, not xen. But it would affect a single guest right? IOW, it is not possible for guest A to leak data to guest B even if we don't properly invalidate stage-1. Correct? > >>> >>> **Mitigation:** *(Handled by design)* >>> This feature ensures that guest-initiated invalidations are correctly >>> forwarded to the hardware, >>> preserving IOMMU coherency. >> >> How is this a mitigation? You have to properly handle commands. If you >> don't properly handle them, then yes it will break. >> > > Not really a mitigation, will remove it. Guest is responsible for the > regular initiation of invalidation requests to mitigate this risk. > >>> >>> 4. Observation: >>> --------------- >>> The code includes transformations to handle nested translation versus >>> standard modes and uses guest-configured >>> command queues (e.g., `CMD_CFGI_STE`) and event notifications. >>> >>> **Risk:** >>> Malicious or malformed queue commands from guests could bypass >>> validation, manipulate SMMUv3 state, >>> or cause system instability. >>> >>> **Mitigation:** *(Handled by design)* >>> Built-in validation of command queue entries and sanitization mechanisms >>> ensure only permitted configurations >>> are applied. >> >> This is true as long as we didn't make an mistake in the >> configurations ;). >> > > Yes, but I don’t see anything we can do to prevent configuration mistakes. There is nothing really preventing it. Same for ... > >> >>> This is supported via additions in `vsmmuv3` and `cmdqueue` >>> handling code. >>> >>> 5. Observation: >>> --------------- >>> Device Tree modifications enable device assignment and configuration >>> through guest DT fragments (e.g., `iommus`) >>> are added via `libxl`. >>> >>> **Risk:** >>> Erroneous or malicious Device Tree injection could result in device >>> misbinding or guest access to unauthorized >>> hardware. >> >> The DT fragment are not security support and will never be at least >> until you have can a libfdt that is able to detect malformed Device-Tree >> (I haven't checked if this has changed recently). >> > > But this should still be considered a risk? Similar to the previous > observation, system integrator should ensure that DT fragments are correct. ... this one. I agree they are risks, but they don't provide much input in the design of the vIOMMU. I am a lot more concerned for the scheduling part because the resources are shared. >> My understanding is there is only a single physical event queue. Xen >> would be responsible to handle the events in the queue and forward to >> the respective guests. If so, it is not clear what you mean by "disable >> event queue". >> > > I was referring to emulated IOMMU event queue. The idea is to make it > optional for guests. When disabled, events won't be propagated to the > guest. But Xen will still receive the events, correct? If so, how does it make it better? 
> >>> >>> Performance Impact >>> ================== >>> >>> With iommu stage-1 and nested translation inclusion, performance >>> overhead is introduced comparing to existing, >>> stage-2 only usage in Xen. Once mappings are established, translations >>> should not introduce significant overhead. >>> Emulated paths may introduce moderate overhead, primarily affecting >>> device initialization and event handling. >>> Performance impact highly depends on target CPU capabilities. >>> Testing is performed on QEMU virt and Renesas R-Car (QEMU emulated) >>> platforms. >> >> I am afraid QEMU is not a reliable platform to do performance testing. >> Don't you have a real HW with vIOMMU support? >> > > Yes, I will provide performance measurement for Renesas HW also. FWIW, I don't need to know the performance right now. I am mostly pointing out that if you want to provide performance number, then they should really come from real HW rather than QEMU. Cheers, -- Julien Grall
Hi Julien, On 12/3/25 11:32, Julien Grall wrote: > Hi, > > On 02/12/2025 22:08, Milan Djokic wrote: >> Hi Julien, >> >> On 11/27/25 11:22, Julien Grall wrote: >>>> We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU and >>>> pIOMMU. Considering single vIOMMU model limitation pointed out by >>>> Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the >>>> only proper solution. >> >> > Does this means in your solution you will end up with multiple >> > vPCI as well and then map pBDF == vBDF? (this because the SID have to be >> > fixed at boot) >> > >> >> To answer your question, yes we will have multiple vPCI nodes with this >> model, establishing 1-1 vSID-pSID mapping (same iommu-map range between >> pPCI-vPCI). >> For pBDF to vBDF 1-1 mapping, I'm not sure if this is necessary. My >> understanding is that vBDF->pBDF mapping does not affect vSID->pSID >> mapping. Am I wrong here? > > From my understanding, the mapping between a vBDF and vSID is setup at > domain creation (as this is described in ACPI/Device-Tree). As PCI > devices can be hotplug, if you want to enforce vSID == pSID, then you > indirectly need to enforce vBDF == pBDF. > I was not aware of that. I will have to do a detailed analysis on this and come back with a solution. Right now I'm not sure how and if enumeration will work with multi vIOMMU/vPCI model. If that's not possible, we will have to introduce a mapping layer for vSID->pSID and go back to single vPCI/vIOMMU model. > [...] > >>>> - **Runtime Configuration**: Introduces a `viommu` boot parameter for >>>> dynamic enablement. >>>> >>>> Separate vIOMMU device is exposed to guest for every physical IOMMU in >>>> the system. >>>> vIOMMU feature is designed in a way to provide a generic vIOMMU >>>> framework and a backend implementation >>>> for target IOMMU as separate components. >>>> Backend implementation contains specific IOMMU structure and commands >>>> handling (only SMMUv3 currently supported). >>>> This structure allows potential reuse of stage-1 feature for other IOMMU >>>> types. >>>> >>>> Security Considerations >>>> ======================= >>>> >>>> **viommu security benefits:** >>>> >>>> - Stage-1 translation ensures guest devices cannot perform unauthorized >>>> DMA (device I/O address mapping managed by guest). >>>> - Emulated IOMMU removes guest direct dependency on IOMMU hardware, >>>> while maintaining domains isolation. >>> >>> Sorry, I don't follow this argument. Are you saying that it would be >>> possible to emulate a SMMUv3 vIOMMU on top of the IPMMU? >>> >> >> No, this would not work. Emulated IOMMU has to match with the pIOMMU type. >> The argument only points out that we are emulating IOMMU, so the guest >> does not need direct HW interface for IOMMU functions. > > Sorry, but I am still missing how this is a security benefits. > Yes, this is a mistake. This should be in the design section. > [...] > > >>>> >>>> 2. Observation: >>>> --------------- >>>> Guests can now invalidate Stage-1 caches; invalidation needs forwarding >>>> to SMMUv3 hardware to maintain coherence. >>>> >>>> **Risk:** >>>> Failing to propagate cache invalidation could allow stale mappings, >>>> enabling access to old mappings and possibly >>>> data leakage or misrouting. >>> >>> You are referring to data leakage/misrouting between two devices own by >>> the same guest, right? Xen would still be in charge of flush when the >>> stage-2 is updated. >>> >> >> Yes, this risk could affect only guests, not xen. 
> > But it would affect a single guest right? IOW, it is not possible for > guest A to leak data to guest B even if we don't properly invalidate > stage-1. Correct? > Correct. I don't see any possible scenario for data leakage between different guests, just between 2 devices assigned to the same guest. I will elaborate on this risk to make it clearer. >>>> >>>> 4. Observation: >>>> --------------- >>>> The code includes transformations to handle nested translation versus >>>> standard modes and uses guest-configured >>>> command queues (e.g., `CMD_CFGI_STE`) and event notifications. >>>> >>>> **Risk:** >>>> Malicious or malformed queue commands from guests could bypass >>>> validation, manipulate SMMUv3 state, >>>> or cause system instability. >>>> >>>> **Mitigation:** *(Handled by design)* >>>> Built-in validation of command queue entries and sanitization mechanisms >>>> ensure only permitted configurations >>>> are applied. >>> >>> This is true as long as we didn't make an mistake in the >>> configurations ;). >>> >> >> Yes, but I don’t see anything we can do to prevent configuration mistakes. > > There is nothing really preventing it. Same for ... >> >>> >>>> This is supported via additions in `vsmmuv3` and `cmdqueue` >>>> handling code. >>>> >>>> 5. Observation: >>>> --------------- >>>> Device Tree modifications enable device assignment and configuration >>>> through guest DT fragments (e.g., `iommus`) >>>> are added via `libxl`. >>>> >>>> **Risk:** >>>> Erroneous or malicious Device Tree injection could result in device >>>> misbinding or guest access to unauthorized >>>> hardware. >>> >>> The DT fragment are not security support and will never be at least >>> until you have can a libfdt that is able to detect malformed Device-Tree >>> (I haven't checked if this has changed recently). >>> >> >> But this should still be considered a risk? Similar to the previous >> observation, system integrator should ensure that DT fragments are correct. > > ... this one. I agree they are risks, but they don't provide much input > in the design of the vIOMMU. > I get your point. I can remove them if considered to be overhead in this context. > I am a lot more concerned for the scheduling part because the resources > are shared. > >>> My understanding is there is only a single physical event queue. Xen >>> would be responsible to handle the events in the queue and forward to >>> the respective guests. If so, it is not clear what you mean by "disable >>> event queue". >>> >> >> I was referring to emulated IOMMU event queue. The idea is to make it >> optional for guests. When disabled, events won't be propagated to the >> guest. > > But Xen will still receive the events, correct? If so, how does it make > it better? > You are correct, Xen will still receive events and handle them in pIOMMU driver. This is only a mitigation for the part introduced by vIOMMU design (events emulation), not the complete solution. This risk has more general context and could be related to stage-2 only guests also (e.g. guests that perform DMA to an address they are not allowed to access, causing translation faults). But imo mitigation for the physical event queue flooding should be part of the pIOMMU driver design Best regards, Milan
Hi Milan, Milan Djokic <milan_djokic@epam.com> writes: > On 9/1/25 13:06, Milan Djokic wrote: [...] > > Hello Volodymyr, Julien > > Sorry for the delayed follow-up on this topic. > We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU > and pIOMMU. Considering single vIOMMU model limitation pointed out by > Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the > only proper solution. > Following is the updated design document. > I have added additional details to the design and performance impact > sections, and also indicated future improvements. Security > considerations section is unchanged apart from some minor details > according to review comments. > Let me know what do you think about updated design. Once approved, I > will send the updated vIOMMU patch series. This looks fine for me. I can't see any immediate flaws here. So let's get to patches :) [...] -- WBR, Volodymyr