This patch series represents a rebase of an older patch series implemented and
submitted by Rahul Singh as an RFC:
https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/
Original patch series content is aligned with the latest xen structure in
terms of common/arch-specific code structuring.
Some minor bugfixes are also applied:
- Sanity checks / error handling
- Non-pci devices support for emulated iommu

Overall description of stage-1 support is available in the original
patch series cover letter. Original commits structure with detailed
explanation for each commit functionality is maintained.

Patch series testing is performed in qemu arm environment. Additionally,
stage-1 translation for non-pci devices is verified on a Renesas platform.

Jean-Philippe Brucker (1):
  xen/arm: smmuv3: Maintain a SID->device structure

Rahul Singh (19):
  xen/arm: smmuv3: Add support for stage-1 and nested stage translation
  xen/arm: smmuv3: Alloc io_domain for each device
  xen/arm: vIOMMU: add generic vIOMMU framework
  xen/arm: vsmmuv3: Add dummy support for virtual SMMUv3 for guests
  xen/domctl: Add XEN_DOMCTL_CONFIG_VIOMMU_* and viommu config param
  xen/arm: vIOMMU: Add cmdline boot option "viommu = <boolean>"
  xen/arm: vsmmuv3: Add support for registers emulation
  xen/arm: vsmmuv3: Add support for cmdqueue handling
  xen/arm: vsmmuv3: Add support for command CMD_CFGI_STE
  xen/arm: vsmmuv3: Attach Stage-1 configuration to SMMUv3 hardware
  xen/arm: vsmmuv3: Add support for event queue and global error
  xen/arm: vsmmuv3: Add "iommus" property node for dom0 devices
  xen/arm: vIOMMU: IOMMU device tree node for dom0
  xen/arm: vsmmuv3: Emulated SMMUv3 device tree node for dom0less
  arm/libxl: vsmmuv3: Emulated SMMUv3 device tree node in libxl
  xen/arm: vsmmuv3: Alloc virq for virtual SMMUv3
  xen/arm: vsmmuv3: Add support to send stage-1 event to guest
  libxl/arm: vIOMMU: Modify the partial device tree for iommus
  xen/arm: vIOMMU: Modify the partial device tree for dom0less

 docs/man/xl.cfg.5.pod.in                |  13 +
 docs/misc/xen-command-line.pandoc       |   7 +
 tools/golang/xenlight/helpers.gen.go    |   2 +
 tools/golang/xenlight/types.gen.go      |   1 +
 tools/include/libxl.h                   |   5 +
 tools/libs/light/libxl_arm.c            | 123 +++-
 tools/libs/light/libxl_types.idl        |   6 +
 tools/xl/xl_parse.c                     |  10 +
 xen/arch/arm/dom0less-build.c           |  72 ++
 xen/arch/arm/domain.c                   |  26 +
 xen/arch/arm/domain_build.c             | 103 ++-
 xen/arch/arm/include/asm/domain.h       |   4 +
 xen/arch/arm/include/asm/viommu.h       | 102 +++
 xen/common/device-tree/dom0less-build.c |  31 +-
 xen/drivers/passthrough/Kconfig         |  14 +
 xen/drivers/passthrough/arm/Makefile    |   2 +
 xen/drivers/passthrough/arm/smmu-v3.c   | 369 +++++++++-
 xen/drivers/passthrough/arm/smmu-v3.h   |  49 +-
 xen/drivers/passthrough/arm/viommu.c    |  87 +++
 xen/drivers/passthrough/arm/vsmmu-v3.c  | 895 ++++++++++++++++++++++++
 xen/drivers/passthrough/arm/vsmmu-v3.h  |  32 +
 xen/include/public/arch-arm.h           |  14 +-
 xen/include/public/device_tree_defs.h   |   1 +
 xen/include/xen/iommu.h                 |  14 +
 24 files changed, 1935 insertions(+), 47 deletions(-)
 create mode 100644 xen/arch/arm/include/asm/viommu.h
 create mode 100644 xen/drivers/passthrough/arm/viommu.c
 create mode 100644 xen/drivers/passthrough/arm/vsmmu-v3.c
 create mode 100644 xen/drivers/passthrough/arm/vsmmu-v3.h

-- 
2.43.0
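The "add generic vIOMMU framework" commit in the list above suggests an
ops-style split between common vIOMMU code and the vSMMUv3 backend. The
following is only an illustrative sketch of what such an interface can look
like; the type and function names (viommu_ops, viommu_register_type, and so
on) are hypothetical and do not necessarily match the series:

    /*
     * Illustrative sketch only: one ops table per emulated IOMMU type,
     * selected for a domain by the common vIOMMU code. Names are
     * hypothetical and are not taken from the posted patches.
     */
    #include <stdint.h>

    struct domain;                      /* Xen domain, opaque here */

    struct viommu_ops {
        const char *name;               /* e.g. "vsmmuv3" */
        int  (*domain_init)(struct domain *d);
        void (*domain_destroy)(struct domain *d);
        int  (*mmio_read)(struct domain *d, uint64_t addr, uint32_t size,
                          uint64_t *val);
        int  (*mmio_write)(struct domain *d, uint64_t addr, uint32_t size,
                           uint64_t val);
    };

    /* Keeping a single registered backend keeps a first iteration simple. */
    static const struct viommu_ops *cur_viommu;

    int viommu_register_type(const struct viommu_ops *ops)
    {
        if ( cur_viommu )
            return -1;                  /* only one backend supported */
        cur_viommu = ops;
        return 0;
    }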
Hi Milan,

On Thu, 7 Aug 2025 at 17:55, Milan Djokic <milan_djokic@epam.com> wrote:
> This patch series represents a rebase of an older patch series implemented
> and submitted by Rahul Singh as an RFC:
> https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/
> Original patch series content is aligned with the latest xen structure in
> terms of common/arch-specific code structuring.
> Some minor bugfixes are also applied:
> - Sanity checks / error handling
> - Non-pci devices support for emulated iommu
>
> Overall description of stage-1 support is available in the original
> patch series cover letter. Original commits structure with detailed
> explanation for each commit functionality is maintained.

I am a bit surprised not much has changed. Last time we asked for a document
to explain the overall design of the vSMMU including some details on the
security posture. I can’t remember if this was ever posted.

If not, then you need to start with that. Otherwise, it is going to be
pretty difficult to review this series.

Cheers,

> Patch series testing is performed in qemu arm environment. Additionally,
> stage-1 translation for non-pci devices is verified on a Renesas platform.
>
> Jean-Philippe Brucker (1):
>   xen/arm: smmuv3: Maintain a SID->device structure
>
> Rahul Singh (19):
>   xen/arm: smmuv3: Add support for stage-1 and nested stage translation
>   xen/arm: smmuv3: Alloc io_domain for each device
>   xen/arm: vIOMMU: add generic vIOMMU framework
>   xen/arm: vsmmuv3: Add dummy support for virtual SMMUv3 for guests
>   xen/domctl: Add XEN_DOMCTL_CONFIG_VIOMMU_* and viommu config param
>   xen/arm: vIOMMU: Add cmdline boot option "viommu = <boolean>"
>   xen/arm: vsmmuv3: Add support for registers emulation
>   xen/arm: vsmmuv3: Add support for cmdqueue handling
>   xen/arm: vsmmuv3: Add support for command CMD_CFGI_STE
>   xen/arm: vsmmuv3: Attach Stage-1 configuration to SMMUv3 hardware
>   xen/arm: vsmmuv3: Add support for event queue and global error
>   xen/arm: vsmmuv3: Add "iommus" property node for dom0 devices
>   xen/arm: vIOMMU: IOMMU device tree node for dom0
>   xen/arm: vsmmuv3: Emulated SMMUv3 device tree node for dom0less
>   arm/libxl: vsmmuv3: Emulated SMMUv3 device tree node in libxl
>   xen/arm: vsmmuv3: Alloc virq for virtual SMMUv3
>   xen/arm: vsmmuv3: Add support to send stage-1 event to guest
>   libxl/arm: vIOMMU: Modify the partial device tree for iommus
>   xen/arm: vIOMMU: Modify the partial device tree for dom0less
>
>  docs/man/xl.cfg.5.pod.in                |  13 +
>  docs/misc/xen-command-line.pandoc       |   7 +
>  tools/golang/xenlight/helpers.gen.go    |   2 +
>  tools/golang/xenlight/types.gen.go      |   1 +
>  tools/include/libxl.h                   |   5 +
>  tools/libs/light/libxl_arm.c            | 123 +++-
>  tools/libs/light/libxl_types.idl        |   6 +
>  tools/xl/xl_parse.c                     |  10 +
>  xen/arch/arm/dom0less-build.c           |  72 ++
>  xen/arch/arm/domain.c                   |  26 +
>  xen/arch/arm/domain_build.c             | 103 ++-
>  xen/arch/arm/include/asm/domain.h       |   4 +
>  xen/arch/arm/include/asm/viommu.h       | 102 +++
>  xen/common/device-tree/dom0less-build.c |  31 +-
>  xen/drivers/passthrough/Kconfig         |  14 +
>  xen/drivers/passthrough/arm/Makefile    |   2 +
>  xen/drivers/passthrough/arm/smmu-v3.c   | 369 +++++++++-
>  xen/drivers/passthrough/arm/smmu-v3.h   |  49 +-
>  xen/drivers/passthrough/arm/viommu.c    |  87 +++
>  xen/drivers/passthrough/arm/vsmmu-v3.c  | 895 ++++++++++++++++++++++++
>  xen/drivers/passthrough/arm/vsmmu-v3.h  |  32 +
>  xen/include/public/arch-arm.h           |  14 +-
>  xen/include/public/device_tree_defs.h   |   1 +
>  xen/include/xen/iommu.h                 |  14 +
>  24 files changed, 1935 insertions(+), 47 deletions(-)
>  create mode 100644 xen/arch/arm/include/asm/viommu.h
>  create mode 100644 xen/drivers/passthrough/arm/viommu.c
>  create mode 100644 xen/drivers/passthrough/arm/vsmmu-v3.c
>  create mode 100644 xen/drivers/passthrough/arm/vsmmu-v3.h
>
> --
> 2.43.0
On 8/7/25 19:58, Julien Grall wrote:
> Hi Milan,
>
> On Thu, 7 Aug 2025 at 17:55, Milan Djokic <milan_djokic@epam.com
> <mailto:milan_djokic@epam.com>> wrote:
>
> This patch series represents a rebase of an older patch series
> implemented and submitted by Rahul Singh as an RFC:
> https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/
> Original patch series content is aligned with the latest xen
> structure in terms of common/arch-specific code structuring.
> Some minor bugfixes are also applied:
> - Sanity checks / error handling
> - Non-pci devices support for emulated iommu
>
> Overall description of stage-1 support is available in the original
> patch series cover letter. Original commits structure with detailed
> explanation for each commit functionality is maintained.
>
> I am a bit surprised not much has changed. Last time we asked for a document
> to explain the overall design of the vSMMU including some details on the
> security posture. I can’t remember if this was ever posted.
>
> If not, then you need to start with that. Otherwise, it is going to be
> pretty difficult to review this series.
>
> Cheers,

Hello Julien,

We have prepared a design document and it will be part of the updated
patch series (added in docs/design). I'll also extend cover letter with
details on implementation structure to make review easier.

Following is the design document content which will be provided in
updated patch series:

Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================

Author: Milan Djokic <milan_djokic@epam.com>
Date: 2025-08-07
Status: Draft

Introduction
------------

The SMMUv3 supports two stages of translation. Each stage of translation
can be independently enabled. An incoming address is logically
translated from VA to IPA in stage 1, then the IPA is input to stage 2
which translates the IPA to the output PA. Stage 1 translation support
is required to provide isolation between different devices within the OS.

Xen already supports Stage 2 translation but there is no support for
Stage 1 translation. This design proposal outlines the introduction of
Stage-1 SMMUv3 support in Xen for ARM guests.

Motivation
----------

ARM systems utilizing SMMUv3 require Stage-1 address translation to
ensure correct and secure DMA behavior inside guests.
This feature enables:
- Stage-1 translation in guest domain
- Safe device passthrough under secure memory translation

Design Overview
---------------

These changes provide emulated SMMUv3 support:

- SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
SMMUv3 driver
- vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
- Register/Command Emulation: SMMUv3 register emulation and command
queue handling
- Device Tree Extensions: adds iommus and virtual SMMUv3 nodes to
device trees for dom0 and dom0less scenarios
- Runtime Configuration: introduces a 'viommu' boot parameter for
dynamic enablement

Security Considerations
------------------------

viommu security benefits:
- Stage-1 translation ensures guest devices cannot perform unauthorized
DMA
- Emulated SMMUv3 for domains removes dependency on host hardware while
maintaining isolation

Observations and Potential Risks
--------------------------------

1. Observation:
Support for Stage-1 translation introduces new data structures
(s1_cfg and s2_cfg) and logic to write both Stage-1 and Stage-2 entries
in the Stream Table Entry (STE), including an abort field for partial
config states.

Risk:
A partially applied Stage-1 configuration might leave guest DMA
mappings in an inconsistent state, enabling unauthorized access or
cross-domain interference.

Mitigation (Handled by design):
Both s1_cfg and s2_cfg are written atomically. The abort field ensures
Stage-1 config is only used when fully applied. Incomplete configs are
ignored by the hypervisor.

2. Observation:
Guests can now issue Stage-1 cache invalidations.

Risk:
Failure to propagate invalidations could leave stale mappings, enabling
data leakage or misrouting.

Mitigation (Handled by design):
Guest invalidations are forwarded to the hardware to ensure IOMMU
coherency.

3. Observation:
The feature introduces large functional changes including the vIOMMU
framework, vsmmuv3 devices, command queues, event queues, domain
handling, and Device Tree modifications.

Risk:
Increased attack surface with risk of race conditions, malformed
commands, or misconfiguration via the device tree.

Mitigation:
- Improved sanity checks and error handling
- Feature is marked as Tech Preview and self-contained to reduce risk
to unrelated code

4. Observation:
The implementation supports nested and standard translation modes,
using guest command queues (e.g. CMD_CFGI_STE) and events.

Risk:
Malicious commands could bypass validation and corrupt SMMUv3 state or
destabilize dom0.

Mitigation (Handled by design):
Command queues are validated, and only permitted configuration changes
are accepted. Handled in vsmmuv3 and cmdqueue logic.

5. Observation:
Device Tree changes inject iommus and vsmmuv3 nodes via libxl.

Risk:
Malicious or incorrect DT fragments could result in wrong device
assignments or hardware access.

Mitigation:
Only vetted and sanitized DT fragments are allowed. libxl limits what
guests can inject.

6. Observation:
The feature is enabled per-guest via viommu setting.

Risk:
Guests without viommu may behave differently, potentially causing
confusion, privilege drift, or accidental exposure.

Mitigation:
Ensure downgrade paths are safe. Perform isolation audits in
multi-guest environments to ensure correct behavior.

Performance Impact
------------------

Hardware-managed translations are expected to have minimal overhead.
Emulated vIOMMU may introduce some latency during initialization or
event processing.
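To make the "Runtime Configuration" item above concrete, a minimal, assumed
configuration sketch follows. The authoritative syntax is whatever the series
adds to docs/man/xl.cfg.5.pod.in and docs/misc/xen-command-line.pandoc; the
"smmuv3" value string is carried over from the original RFC rather than
confirmed here:

    # Xen command line: boolean switch added by this series to enable the
    # vIOMMU support globally.
    viommu=true

    # xl guest configuration: per-guest selection, disabled by default
    # (viommu=""). The accepted value is assumed to be "smmuv3".
    viommu = "smmuv3"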
Testing
-------

- QEMU-based testing for Stage-1 and nested translation
- Hardware testing on Renesas SMMUv3-enabled ARM systems
- Unit tests for translation accuracy (not yet implemented)

Migration and Compatibility
---------------------------

This feature is optional and disabled by default (viommu="") to ensure
backward compatibility.

References
----------

- Original implementation by Rahul Singh:
  https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/
- ARM SMMUv3 architecture documentation
- Existing vIOMMU code in Xen

BR,
Milan
On 13/08/2025 11:04, Milan Djokic wrote:
> Hello Julien,

Hi Milan,

>
> We have prepared a design document and it will be part of the updated
> patch series (added in docs/design). I'll also extend cover letter with
> details on implementation structure to make review easier.

I would suggest to just iterate on the design document for now.

> Following is the design document content which will be provided in
> updated patch series:
>
> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
> ==========================================================
>
> Author: Milan Djokic <milan_djokic@epam.com>
> Date: 2025-08-07
> Status: Draft
>
> Introduction
> ------------
>
> The SMMUv3 supports two stages of translation. Each stage of translation
> can be independently enabled. An incoming address is logically
> translated from VA to IPA in stage 1, then the IPA is input to stage 2
> which translates the IPA to the output PA. Stage 1 translation support
> is required to provide isolation between different devices within the OS.
>
> Xen already supports Stage 2 translation but there is no support for
> Stage 1 translation. This design proposal outlines the introduction of
> Stage-1 SMMUv3 support in Xen for ARM guests.
>
> Motivation
> ----------
>
> ARM systems utilizing SMMUv3 require Stage-1 address translation to
> ensure correct and secure DMA behavior inside guests.

Can you clarify what you mean by "correct"? DMA would still work without
stage-1.

>
> This feature enables:
> - Stage-1 translation in guest domain
> - Safe device passthrough under secure memory translation
>
> Design Overview
> ---------------
>
> These changes provide emulated SMMUv3 support:
>
> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
> SMMUv3 driver
> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling

So what are you planning to expose to a guest? Is it one vIOMMU per
pIOMMU? Or a single one?

Have you considered the pros/cons for both?

> - Register/Command Emulation: SMMUv3 register emulation and command
> queue handling

For each pSMMU, we have a single command queue that will receive commands
from all the guests. How do you plan to prevent a guest hogging the
command queue?

In addition to that, AFAIU, the size of the virtual command queue is
fixed by the guest rather than Xen. If a guest is filling up the queue
with commands before notifying Xen, how do you plan to ensure we don't
spend too much time in Xen (which is not preemptible)?

Lastly, what do you plan to expose? Is it a full vIOMMU (including event
forwarding)?

> - Device Tree Extensions: adds iommus and virtual SMMUv3 nodes to
> device trees for dom0 and dom0less scenarios
> - Runtime Configuration: introduces a 'viommu' boot parameter for
> dynamic enablement
>
> Security Considerations
> ------------------------
>
> viommu security benefits:
> - Stage-1 translation ensures guest devices cannot perform unauthorized
> DMA
> - Emulated SMMUv3 for domains removes dependency on host hardware while
> maintaining isolation

I don't understand this sentence.

>
> Observations and Potential Risks
> --------------------------------
>
> 1. Observation:
> Support for Stage-1 translation introduces new data structures
> (s1_cfg and s2_cfg) and logic to write both Stage-1 and Stage-2 entries
> in the Stream Table Entry (STE), including an abort field for partial
> config states.
>
> Risk:
> A partially applied Stage-1 configuration might leave guest DMA
> mappings in an inconsistent state, enabling unauthorized access or
> cross-domain interference.

I don't understand how a misconfigured stage-1 could lead to
cross-domain interference. Can you clarify?

>
> Mitigation (Handled by design):
> Both s1_cfg and s2_cfg are written atomically. The abort field ensures
> Stage-1 config is only used when fully applied. Incomplete configs are
> ignored by the hypervisor.
>
> 2. Observation:
> Guests can now issue Stage-1 cache invalidations.
>
> Risk:
> Failure to propagate invalidations could leave stale mappings, enabling
> data leakage or misrouting.

This is a risk from the guest PoV, right? IOW, this would not open up a
security hole in Xen.

>
> Mitigation (Handled by design):
> Guest invalidations are forwarded to the hardware to ensure IOMMU
> coherency.
>
> 3. Observation:
> The feature introduces large functional changes including the vIOMMU
> framework, vsmmuv3 devices, command queues, event queues, domain
> handling, and Device Tree modifications.
>
> Risk:
> Increased attack surface with risk of race conditions, malformed
> commands, or misconfiguration via the device tree.
>
> Mitigation:
> - Improved sanity checks and error handling
> - Feature is marked as Tech Preview and self-contained to reduce risk
> to unrelated code

Surely, you will want to use the code in production... No?

>
> 4. Observation:
> The implementation supports nested and standard translation modes,
> using guest command queues (e.g. CMD_CFGI_STE) and events.
>
> Risk:
> Malicious commands could bypass validation and corrupt SMMUv3 state or
> destabilize dom0.
>
> Mitigation (Handled by design):
> Command queues are validated, and only permitted configuration changes
> are accepted. Handled in vsmmuv3 and cmdqueue logic.

I didn't mention anything in observation 1 but now I have to say it...
The observations you wrote are what I would expect to be handled in any
submission to our code base. This is the bare minimum to have the code
secure. But you don't seem to address the more subtle ones which are
more related to scheduling issues (see some above). They require some
design and discussion.

>
> 5. Observation:
> Device Tree changes inject iommus and vsmmuv3 nodes via libxl.
>
> Risk:
> Malicious or incorrect DT fragments could result in wrong device
> assignments or hardware access.
>
> Mitigation:
> Only vetted and sanitized DT fragments are allowed. libxl limits what
> guests can inject.

Today, libxl doesn't do any sanitisation on the DT. In fact, this is
pretty much impossible because libfdt expects trusted DT. Is this
something you are planning to change?

>
> 6. Observation:
> The feature is enabled per-guest via viommu setting.
>
> Risk:
> Guests without viommu may behave differently, potentially causing
> confusion, privilege drift, or accidental exposure.
>
> Mitigation:
> Ensure downgrade paths are safe. Perform isolation audits in
> multi-guest environments to ensure correct behavior.
>
> Performance Impact
> ------------------
>
> Hardware-managed translations are expected to have minimal overhead.
> Emulated vIOMMU may introduce some latency during initialization or
> event processing.

Latency to who? We still expect isolation between guests and a guest
will not go over its time slice.

For the guest itself, the main performance impact will be TLB flushes
because they are commands that will end up to be emulated by Xen.
Depending on your Linux configuration (I haven't checked others), this
will either happen on every unmap operation or they will be batched. The
performance of the latter will be the worse one.

Have you done any benchmark to confirm the impact? Just to note, I would
not gate the work for virtual SMMUv3 based on the performance. I think
it is ok to offer the support if the user wants extra security and
doesn't care about performance. But it would be good to outline them as
I expect them to be pretty bad...

Cheers,

-- 
Julien Grall
Hello Julien,

On 8/13/25 14:11, Julien Grall wrote:
> On 13/08/2025 11:04, Milan Djokic wrote:
>> Hello Julien,
>
> Hi Milan,
>
>>
>> We have prepared a design document and it will be part of the updated
>> patch series (added in docs/design). I'll also extend cover letter with
>> details on implementation structure to make review easier.
>
> I would suggest to just iterate on the design document for now.
>
>> Following is the design document content which will be provided in
>> updated patch series:
>>
>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>> ==========================================================
>>
>> Author: Milan Djokic <milan_djokic@epam.com>
>> Date: 2025-08-07
>> Status: Draft
>>
>> Introduction
>> ------------
>>
>> The SMMUv3 supports two stages of translation. Each stage of translation
>> can be independently enabled. An incoming address is logically
>> translated from VA to IPA in stage 1, then the IPA is input to stage 2
>> which translates the IPA to the output PA. Stage 1 translation support
>> is required to provide isolation between different devices within the OS.
>>
>> Xen already supports Stage 2 translation but there is no support for
>> Stage 1 translation. This design proposal outlines the introduction of
>> Stage-1 SMMUv3 support in Xen for ARM guests.
>>
>> Motivation
>> ----------
>>
>> ARM systems utilizing SMMUv3 require Stage-1 address translation to
>> ensure correct and secure DMA behavior inside guests.
>
> Can you clarify what you mean by "correct"? DMA would still work
> without stage-1.

Correct in terms of working with guest managed I/O space. I'll rephrase
this statement, it seems ambiguous.

>>
>> This feature enables:
>> - Stage-1 translation in guest domain
>> - Safe device passthrough under secure memory translation
>>
>> Design Overview
>> ---------------
>>
>> These changes provide emulated SMMUv3 support:
>>
>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>> SMMUv3 driver
>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
>
> So what are you planning to expose to a guest? Is it one vIOMMU per
> pIOMMU? Or a single one?

Single vIOMMU model is used in this design.

> Have you considered the pros/cons for both?
>> - Register/Command Emulation: SMMUv3 register emulation and command
>> queue handling
>

That's a point for consideration.
Single vIOMMU prevails in terms of less complex implementation and a
simple guest iommu model - single vIOMMU node, one interrupt path,
event queue, single set of trap handlers for emulation, etc.
Cons for a single vIOMMU model could be less accurate hw representation
and a potential bottleneck with one emulated queue and interrupt path.
On the other hand, vIOMMU per pIOMMU provides more accurate hw modeling
and offers better scalability in case of many IOMMUs in the system, but
this comes with more complex emulation logic and device tree, also
handling multiple vIOMMUs on guest side.
IMO, single vIOMMU model seems like a better option mostly because it's
less complex, easier to maintain and debug. Of course, this decision can
and should be discussed.

> For each pSMMU, we have a single command queue that will receive commands
> from all the guests. How do you plan to prevent a guest hogging the
> command queue?
>
> In addition to that, AFAIU, the size of the virtual command queue is
> fixed by the guest rather than Xen.
> If a guest is filling up the queue with commands before notifying Xen,
> how do you plan to ensure we don't spend too much time in Xen (which is
> not preemptible)?

We'll have to do a detailed analysis on these scenarios, they are not
covered by the design (as well as some others which is clear after your
comments). I'll come back with an updated design.

> Lastly, what do you plan to expose? Is it a full vIOMMU (including event
> forwarding)?

Yes, implementation provides full vIOMMU functionality, with stage-1
event forwarding to guest.

>> - Device Tree Extensions: adds iommus and virtual SMMUv3 nodes to
>> device trees for dom0 and dom0less scenarios
>> - Runtime Configuration: introduces a 'viommu' boot parameter for
>> dynamic enablement
>>
>> Security Considerations
>> ------------------------
>>
>> viommu security benefits:
>> - Stage-1 translation ensures guest devices cannot perform unauthorized
>> DMA
>> - Emulated SMMUv3 for domains removes dependency on host hardware while
>> maintaining isolation
>
> I don't understand this sentence.

Current implementation emulates IOMMU with predefined capabilities,
exposed as a single vIOMMU to guest. That's where "removes dependency on
host hardware" came from. I'll rephrase this part to be more clear.

>>
>> Observations and Potential Risks
>> --------------------------------
>>
>> 1. Observation:
>> Support for Stage-1 translation introduces new data structures
>> (s1_cfg and s2_cfg) and logic to write both Stage-1 and Stage-2 entries
>> in the Stream Table Entry (STE), including an abort field for partial
>> config states.
>>
>> Risk:
>> A partially applied Stage-1 configuration might leave guest DMA
>> mappings in an inconsistent state, enabling unauthorized access or
>> cross-domain interference.
>
> I don't understand how a misconfigured stage-1 could lead to
> cross-domain interference. Can you clarify?

For stage-1 support, SID-to-device mapping and per-device io_domain
allocation is introduced in the Xen smmu driver, and we have to take
care that these mappings are valid all the time. If these are not
properly managed, structures and SIDs could be mapped to the wrong
device (and consequently the wrong guest) in some extreme cases. This is
covered by the design, but listed as a risk anyway for eventual future
updates in this area.

>>
>> Mitigation (Handled by design):
>> Both s1_cfg and s2_cfg are written atomically. The abort field ensures
>> Stage-1 config is only used when fully applied. Incomplete configs are
>> ignored by the hypervisor.
>>
>> 2. Observation:
>> Guests can now issue Stage-1 cache invalidations.
>>
>> Risk:
>> Failure to propagate invalidations could leave stale mappings, enabling
>> data leakage or misrouting.
>
> This is a risk from the guest PoV, right? IOW, this would not open up a
> security hole in Xen.

Yes, this is guest PoV, although still related to vIOMMU.

>>
>> Mitigation (Handled by design):
>> Guest invalidations are forwarded to the hardware to ensure IOMMU
>> coherency.
>>
>> 3. Observation:
>> The feature introduces large functional changes including the vIOMMU
>> framework, vsmmuv3 devices, command queues, event queues, domain
>> handling, and Device Tree modifications.
>>
>> Risk:
>> Increased attack surface with risk of race conditions, malformed
>> commands, or misconfiguration via the device tree.
>>
>> Mitigation:
>> - Improved sanity checks and error handling
>> - Feature is marked as Tech Preview and self-contained to reduce risk
>> to unrelated code
>
> Surely, you will want to use the code in production... No?

Yes, it is planned for production usage. At the moment, it is optionally
enabled (grouped under unsupported features), needs community feedback,
complete security analysis and performance benchmarking/optimization.
That's the reason it's marked as a Tech Preview at this point.

>>
>> 4. Observation:
>> The implementation supports nested and standard translation modes,
>> using guest command queues (e.g. CMD_CFGI_STE) and events.
>>
>> Risk:
>> Malicious commands could bypass validation and corrupt SMMUv3 state or
>> destabilize dom0.
>>
>> Mitigation (Handled by design):
>> Command queues are validated, and only permitted configuration changes
>> are accepted. Handled in vsmmuv3 and cmdqueue logic.
>
> I didn't mention anything in observation 1 but now I have to say it...
> The observations you wrote are what I would expect to be handled in any
> submission to our code base. This is the bare minimum to have the code
> secure. But you don't seem to address the more subtle ones which are
> more related to scheduling issues (see some above). They require some
> design and discussion.

Yes, it's clear to me after your comments that some important
observations are missing. We'll do additional analysis and come back
with a more complete design.

>>
>> 5. Observation:
>> Device Tree changes inject iommus and vsmmuv3 nodes via libxl.
>>
>> Risk:
>> Malicious or incorrect DT fragments could result in wrong device
>> assignments or hardware access.
>>
>> Mitigation:
>> Only vetted and sanitized DT fragments are allowed. libxl limits what
>> guests can inject.
>
> Today, libxl doesn't do any sanitisation on the DT. In fact, this is
> pretty much impossible because libfdt expects trusted DT. Is this
> something you are planning to change?

I've referred to libxl parsing only supported fragments/nodes from DT,
but yes, that's not actual sanitization. I'll update these statements.

>>
>> 6. Observation:
>> The feature is enabled per-guest via viommu setting.
>>
>> Risk:
>> Guests without viommu may behave differently, potentially causing
>> confusion, privilege drift, or accidental exposure.
>>
>> Mitigation:
>> Ensure downgrade paths are safe. Perform isolation audits in
>> multi-guest environments to ensure correct behavior.
>>
>> Performance Impact
>> ------------------
>>
>> Hardware-managed translations are expected to have minimal overhead.
>> Emulated vIOMMU may introduce some latency during initialization or
>> event processing.
>
> Latency to who? We still expect isolation between guests and a guest
> will not go over its time slice.

This is more related to comparison of emulated vs hw translation, and
overall overhead introduced with these mechanisms. I'll rephrase this
part to be more clear.

> For the guest itself, the main performance impact will be TLB flushes
> because they are commands that will end up to be emulated by Xen.
> Depending on your Linux configuration (I haven't checked others), this
> will either happen on every unmap operation or they will be batched. The
> performance of the latter will be the worse one.
>
> Have you done any benchmark to confirm the impact? Just to note, I would
> not gate the work for virtual SMMUv3 based on the performance. I think
> it is ok to offer the support if the user wants extra security and
> doesn't care about performance.
> But it would be good to outline them as I expect them to be pretty
> bad...

We haven't performed detailed benchmarking, just a measurement of boot
time and our domU application execution rate with and without viommu. We
could perform some measurements for viommu operations and add results in
this section.

Thank you for your feedback, I'll come back with an updated design
document for further review.

BR,
Milan
Hi Milan,

Milan Djokic <milan_djokic@epam.com> writes:

> Hello Julien,
>
> On 8/13/25 14:11, Julien Grall wrote:
>> On 13/08/2025 11:04, Milan Djokic wrote:
>>> Hello Julien,
>> Hi Milan,
>>
>>>
>>> We have prepared a design document and it will be part of the updated
>>> patch series (added in docs/design). I'll also extend cover letter with
>>> details on implementation structure to make review easier.
>> I would suggest to just iterate on the design document for now.
>>
>>> Following is the design document content which will be provided in
>>> updated patch series:
>>>
>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>>> ==========================================================
>>>
>>> Author: Milan Djokic <milan_djokic@epam.com>
>>> Date: 2025-08-07
>>> Status: Draft
>>>
>>> Introduction
>>> ------------
>>>
>>> The SMMUv3 supports two stages of translation. Each stage of translation
>>> can be independently enabled. An incoming address is logically
>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2
>>> which translates the IPA to the output PA. Stage 1 translation support
>>> is required to provide isolation between different devices within the OS.
>>>
>>> Xen already supports Stage 2 translation but there is no support for
>>> Stage 1 translation. This design proposal outlines the introduction of
>>> Stage-1 SMMUv3 support in Xen for ARM guests.
>>>
>>> Motivation
>>> ----------
>>>
>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to
>>> ensure correct and secure DMA behavior inside guests.
>> Can you clarify what you mean by "correct"? DMA would still work
>> without stage-1.
>
> Correct in terms of working with guest managed I/O space. I'll
> rephrase this statement, it seems ambiguous.
>
>>>
>>> This feature enables:
>>> - Stage-1 translation in guest domain
>>> - Safe device passthrough under secure memory translation
>>>
>>> Design Overview
>>> ---------------
>>>
>>> These changes provide emulated SMMUv3 support:
>>>
>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>>> SMMUv3 driver
>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
>> So what are you planning to expose to a guest? Is it one vIOMMU per
>> pIOMMU? Or a single one?
>
> Single vIOMMU model is used in this design.
>
>> Have you considered the pros/cons for both?
>>> - Register/Command Emulation: SMMUv3 register emulation and command
>>> queue handling
>>
>
> That's a point for consideration.
> Single vIOMMU prevails in terms of less complex implementation and a
> simple guest iommu model - single vIOMMU node, one interrupt path,
> event queue, single set of trap handlers for emulation, etc.
> Cons for a single vIOMMU model could be less accurate hw
> representation and a potential bottleneck with one emulated queue and
> interrupt path.
> On the other hand, vIOMMU per pIOMMU provides more accurate hw
> modeling and offers better scalability in case of many IOMMUs in the
> system, but this comes with more complex emulation logic and device
> tree, also handling multiple vIOMMUs on guest side.
> IMO, single vIOMMU model seems like a better option mostly because
> it's less complex, easier to maintain and debug. Of course, this
> decision can and should be discussed.
>

Well, I am not sure that this is possible, because of StreamID
allocation. The biggest offender is of course PCI, as each Root PCI
bridge will require its own SMMU instance with its own StreamID space.
But even without PCI you'll need some mechanism to map vStreamID to
<pSMMU, pStreamID>, because there will be overlaps in SID space.

Actually, PCI/vPCI with vSMMU is its own can of worms...

>> For each pSMMU, we have a single command queue that will receive commands
>> from all the guests. How do you plan to prevent a guest hogging the
>> command queue?
>> In addition to that, AFAIU, the size of the virtual command queue is
>> fixed by the guest rather than Xen. If a guest is filling up the queue
>> with commands before notifying Xen, how do you plan to ensure we don't
>> spend too much time in Xen (which is not preemptible)?
>>
> We'll have to do a detailed analysis on these scenarios, they are not
> covered by the design (as well as some others which is clear after
> your comments). I'll come back with an updated design.

I think that can be handled akin to hypercall continuation, which is
used in similar places, like P2M code

[...]

-- 
WBR, Volodymyr
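A minimal sketch of the continuation-style handling suggested above, assuming
the emulated command queue is drained in bounded batches and the saved
consumer index lets the remaining work resume on a later entry into the
emulator. All names and the batch size are hypothetical, not taken from the
series:

    #include <stdbool.h>
    #include <stdint.h>

    #define CMDQ_BATCH 64            /* commands handled per entry */

    struct vcmdq {
        uint32_t prod;               /* producer index, written by the guest */
        uint32_t cons;               /* consumer index, owned by the emulator */
        uint32_t mask;               /* queue size - 1 (power of two) */
    };

    /* Stub: decode and emulate one command; false means an illegal command. */
    static bool vsmmu_handle_one_cmd(struct vcmdq *q, uint32_t idx)
    {
        (void)q;
        (void)idx;
        return true;
    }

    /* Returns true when work remains, i.e. a continuation must be scheduled. */
    bool vsmmu_drain_cmdq(struct vcmdq *q)
    {
        unsigned int budget = CMDQ_BATCH;

        while ( q->cons != q->prod && budget-- )
        {
            if ( !vsmmu_handle_one_cmd(q, q->cons & q->mask) )
                break;               /* report an error to the guest instead */
            q->cons++;
        }

        return q->cons != q->prod;   /* caller re-schedules the remainder */
    }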
Hello Julien, Volodymyr

On 8/27/25 01:28, Volodymyr Babchuk wrote:
>
> Hi Milan,
>
> Milan Djokic <milan_djokic@epam.com> writes:
>
>> Hello Julien,
>>
>> On 8/13/25 14:11, Julien Grall wrote:
>>> On 13/08/2025 11:04, Milan Djokic wrote:
>>>> Hello Julien,
>>> Hi Milan,
>>>
>>>>
>>>> We have prepared a design document and it will be part of the updated
>>>> patch series (added in docs/design). I'll also extend cover letter with
>>>> details on implementation structure to make review easier.
>>> I would suggest to just iterate on the design document for now.
>>>
>>>> Following is the design document content which will be provided in
>>>> updated patch series:
>>>>
>>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>>>> ==========================================================
>>>>
>>>> Author: Milan Djokic <milan_djokic@epam.com>
>>>> Date: 2025-08-07
>>>> Status: Draft
>>>>
>>>> Introduction
>>>> ------------
>>>>
>>>> The SMMUv3 supports two stages of translation. Each stage of translation
>>>> can be independently enabled. An incoming address is logically
>>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2
>>>> which translates the IPA to the output PA. Stage 1 translation support
>>>> is required to provide isolation between different devices within the OS.
>>>>
>>>> Xen already supports Stage 2 translation but there is no support for
>>>> Stage 1 translation. This design proposal outlines the introduction of
>>>> Stage-1 SMMUv3 support in Xen for ARM guests.
>>>>
>>>> Motivation
>>>> ----------
>>>>
>>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to
>>>> ensure correct and secure DMA behavior inside guests.
>>> Can you clarify what you mean by "correct"? DMA would still work
>>> without stage-1.
>>
>> Correct in terms of working with guest managed I/O space. I'll
>> rephrase this statement, it seems ambiguous.
>>
>>>>
>>>> This feature enables:
>>>> - Stage-1 translation in guest domain
>>>> - Safe device passthrough under secure memory translation
>>>>
>>>> Design Overview
>>>> ---------------
>>>>
>>>> These changes provide emulated SMMUv3 support:
>>>>
>>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>>>> SMMUv3 driver
>>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
>>> So what are you planning to expose to a guest? Is it one vIOMMU per
>>> pIOMMU? Or a single one?
>>
>> Single vIOMMU model is used in this design.
>>
>>> Have you considered the pros/cons for both?
>>>> - Register/Command Emulation: SMMUv3 register emulation and command
>>>> queue handling
>>>
>>
>> That's a point for consideration.
>> Single vIOMMU prevails in terms of less complex implementation and a
>> simple guest iommu model - single vIOMMU node, one interrupt path,
>> event queue, single set of trap handlers for emulation, etc.
>> Cons for a single vIOMMU model could be less accurate hw
>> representation and a potential bottleneck with one emulated queue and
>> interrupt path.
>> On the other hand, vIOMMU per pIOMMU provides more accurate hw
>> modeling and offers better scalability in case of many IOMMUs in the
>> system, but this comes with more complex emulation logic and device
>> tree, also handling multiple vIOMMUs on guest side.
>>
>
> Well, I am not sure that this is possible, because of StreamID
> allocation. The biggest offender is of course PCI, as each Root PCI
> bridge will require its own SMMU instance with its own StreamID space.
> But even without PCI you'll need some mechanism to map vStreamID to
> <pSMMU, pStreamID>, because there will be overlaps in SID space.
>
> Actually, PCI/vPCI with vSMMU is its own can of worms...
>
>>> For each pSMMU, we have a single command queue that will receive commands
>>> from all the guests. How do you plan to prevent a guest hogging the
>>> command queue?
>>> In addition to that, AFAIU, the size of the virtual command queue is
>>> fixed by the guest rather than Xen. If a guest is filling up the queue
>>> with commands before notifying Xen, how do you plan to ensure we don't
>>> spend too much time in Xen (which is not preemptible)?
>>>
>>
>> We'll have to do a detailed analysis on these scenarios, they are not
>> covered by the design (as well as some others which is clear after
>> your comments). I'll come back with an updated design.
>
> I think that can be handled akin to hypercall continuation, which is
> used in similar places, like P2M code
>
> [...]
>

I have updated vIOMMU design document with additional security topics
covered and performance impact results. Also added some additional
explanations for vIOMMU components following your comments.

Updated document content:

===========================================================
Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
===========================================================

:Author: Milan Djokic <milan_djokic@epam.com>
:Date: 2025-08-07
:Status: Draft

Introduction
============

The SMMUv3 supports two stages of translation. Each stage of translation
can be independently enabled. An incoming address is logically translated
from VA to IPA in stage 1, then the IPA is input to stage 2 which
translates the IPA to the output PA. Stage 1 translation support is
required to provide isolation between different devices within the OS.
XEN already supports Stage 2 translation but there is no support for
Stage 1 translation.
This design proposal outlines the introduction of Stage-1 SMMUv3 support
in Xen for ARM guests.

Motivation
==========

ARM systems utilizing SMMUv3 require stage-1 address translation to
ensure secure DMA and guest managed I/O memory mappings.

This feature enables:

- Stage-1 translation in guest domain
- Safe device passthrough under secure memory translation

Design Overview
===============

These changes provide emulated SMMUv3 support:

- **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support
  in SMMUv3 driver.
- **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1
  handling.
- **Register/Command Emulation**: SMMUv3 register emulation and command
  queue handling.
- **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to
  device trees for dom0 and dom0less scenarios.
- **Runtime Configuration**: Introduces a `viommu` boot parameter for
  dynamic enablement.

vIOMMU is exposed to guest as a single device with predefined
capabilities and commands supported. Single vIOMMU model abstracts the
details of an actual IOMMU hardware, simplifying usage from the guest
point of view. Guest OS handles only a single IOMMU, even if multiple
IOMMU units are available on the host system.

Security Considerations
=======================

**viommu security benefits:**

- Stage-1 translation ensures guest devices cannot perform unauthorized
  DMA.
- Emulated IOMMU removes guest dependency on IOMMU hardware while
  maintaining domain isolation.

1. Observation:
---------------
Support for Stage-1 translation in SMMUv3 introduces new data structures
(`s1_cfg` alongside `s2_cfg`) and logic to write both Stage-1 and Stage-2
entries in the Stream Table Entry (STE), including an `abort` field to
handle partial configuration states.

**Risk:**
Without proper handling, a partially applied Stage-1 configuration might
leave guest DMA mappings in an inconsistent state, potentially enabling
unauthorized access or causing cross-domain interference.

**Mitigation:** *(Handled by design)*
This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to
STE and manages the `abort` field, only considering Stage-1
configuration if fully attached. This ensures incomplete or invalid
guest configurations are safely ignored by the hypervisor.

2. Observation:
---------------
Guests can now invalidate Stage-1 caches; invalidation needs forwarding
to SMMUv3 hardware to maintain coherence.

**Risk:**
Failing to propagate cache invalidation could allow stale mappings,
enabling access to old mappings and possibly data leakage or misrouting.

**Mitigation:** *(Handled by design)*
This feature ensures that guest-initiated invalidations are correctly
forwarded to the hardware, preserving IOMMU coherency.

3. Observation:
---------------
This design introduces substantial new functionality, including the
`vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command queues,
event queues, domain management, and Device Tree modifications (e.g.,
`iommus` nodes and `libxl` integration).

**Risk:**
Large feature expansions increase the attack surface: potential for race
conditions, unchecked command inputs, or Device Tree-based
misconfigurations.

**Mitigation:**
- Sanity checks and error-handling improvements have been introduced in
  this feature.
- Further audits have to be performed for this feature and its
  dependencies in this area. Currently, the feature is marked as *Tech
  Preview* and is self-contained, reducing the risk to unrelated
  components.

4. Observation:
---------------
The code includes transformations to handle nested translation versus
standard modes and uses guest-configured command queues (e.g.,
`CMD_CFGI_STE`) and event notifications.

**Risk:**
Malicious or malformed queue commands from guests could bypass
validation, manipulate SMMUv3 state, or cause Dom0 instability.

**Mitigation:** *(Handled by design)*
Built-in validation of command queue entries and sanitization mechanisms
ensure only permitted configurations are applied. This is supported via
additions in `vsmmuv3` and `cmdqueue` handling code.

5. Observation:
---------------
Device Tree modifications enable device assignment and configuration:
guest DT fragments (e.g., `iommus`) are added via `libxl`.

**Risk:**
Erroneous or malicious Device Tree injection could result in device
misbinding or guest access to unauthorized hardware.

**Mitigation:**
- `libxl` performs checks of the guest configuration and parses only
  predefined DT fragments and nodes, reducing risk.
- The system integrator must ensure correct resource mapping in the
  guest Device Tree (DT) fragments.

6. Observation:
---------------
Introducing optional per-guest enabled features (`viommu` argument in
the xl guest config) means some guests may opt out.

**Risk:**
Differences between guests with and without `viommu` may cause
unexpected behavior or privilege drift.
**Mitigation:**
Verify that downgrade paths are safe and well-isolated; ensure missing
support doesn't cause security issues. Additional audits on emulation
paths and domain interference need to be performed in a multi-guest
environment.

7. Observation:
---------------
Guests have the ability to issue Stage-1 IOMMU commands like cache
invalidation, stream table entry configuration, etc. An adversarial
guest may issue a high volume of commands in rapid succession.

**Risk:**
Excessive command requests can cause high hypervisor CPU consumption and
disrupt scheduling, leading to degraded system responsiveness and
potential denial-of-service scenarios.

**Mitigation:**
- The Xen credit scheduler limits guest vCPU execution time, securing
  basic guest rate-limiting.
- Batch multiple commands of the same type to reduce overhead on the
  virtual SMMUv3 hardware emulation.
- Implement vIOMMU command execution restart and continuation support.

8. Observation:
---------------
Some guest commands issued towards vIOMMU are propagated to the pIOMMU
command queue (e.g. TLB invalidate). For each pIOMMU, only one command
queue is available for all domains.

**Risk:**
Excessive command requests from an abusive guest can cause flooding of
the physical IOMMU command queue, leading to degraded pIOMMU
responsiveness on commands issued from other guests.

**Mitigation:**
- The Xen credit scheduler limits guest vCPU execution time, securing
  basic guest rate-limiting.
- Batch commands which should be propagated towards the pIOMMU command
  queue and enable support for batch execution pause/continuation.
- If possible, implement domain penalization by adding a per-domain cost
  counter for vIOMMU/pIOMMU usage.

9. Observation:
---------------
The vIOMMU feature includes an event queue used for forwarding IOMMU
events to the guest (e.g. translation faults, invalid stream IDs,
permission errors). A malicious guest can misconfigure its SMMU state or
intentionally trigger faults with high frequency.

**Risk:**
Occurrence of IOMMU events with high frequency can cause Xen to flood
the event queue and disrupt scheduling with high hypervisor CPU load for
event handling.

**Mitigation:**
- Implement a fail-safe state by disabling event forwarding when faults
  occur with high frequency and are not processed by the guest.
- Batch multiple events of the same type to reduce overhead on the
  virtual SMMUv3 hardware emulation.
- Consider disabling the event queue for untrusted guests.

Performance Impact
==================

With IOMMU stage-1 and nested translation included, performance overhead
is introduced compared to the existing, stage-2-only usage in Xen. Once
mappings are established, translations should not introduce significant
overhead. Emulated paths may introduce moderate overhead, primarily
affecting device initialization and event handling.

The performance impact highly depends on target CPU capabilities.
Testing was performed on a Cortex-A53 based platform. Performance is
mostly impacted by emulated vIOMMU operations; results are shown in the
following table.
+-------------------------------+---------------------------------+
| vIOMMU Operation              | Execution time in guest         |
+===============================+=================================+
| Reg read                      | median: 30μs, worst-case: 250μs |
+-------------------------------+---------------------------------+
| Reg write                     | median: 35μs, worst-case: 280μs |
+-------------------------------+---------------------------------+
| Invalidate TLB                | median: 90μs, worst-case: 1ms+  |
+-------------------------------+---------------------------------+
| Invalidate STE                | median: 450μs, worst-case: 7ms+ |
+-------------------------------+---------------------------------+

With vIOMMU exposed to the guest, the guest OS has to initialize the
IOMMU device and configure stage-1 mappings for the devices attached to
it. The following table shows the initialization stages which impact
stage-1 enabled guest boot time and compares them with a stage-1
disabled guest.

NOTE: Device probe execution time varies significantly depending on
device complexity. virtio-gpu was selected as a test case due to its
extensive use of dynamic DMA allocations and IOMMU mappings, making it a
suitable candidate for benchmarking stage-1 vIOMMU behavior.

+---------------------+-----------------------+------------------------+
| Stage               | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
+=====================+=======================+========================+
| IOMMU Init          | ~25ms                 | /                      |
+---------------------+-----------------------+------------------------+
| Dev Attach / Mapping| ~220ms                | ~200ms                 |
+---------------------+-----------------------+------------------------+

For devices configured with dynamic DMA mappings, the performance of DMA
allocate/map/unmap operations is also impacted on stage-1 enabled
guests. A dynamic DMA mapping operation issues emulated IOMMU operations
such as MMIO writes/reads and TLB invalidations. As a reference, the
following table shows performance results for runtime DMA operations for
the virtio-gpu device.

+---------------+-------------------------+----------------------------+
| DMA Op        | Stage-1 Enabled Guest   | Stage-1 Disabled Guest     |
+===============+=========================+============================+
| dma_alloc     | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs|
+---------------+-------------------------+----------------------------+
| dma_free      | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs |
+---------------+-------------------------+----------------------------+
| dma_map       | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs|
+---------------+-------------------------+----------------------------+
| dma_unmap     | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs |
+---------------+-------------------------+----------------------------+

Testing
=======

- QEMU-based ARM system tests for Stage-1 translation and nested
  virtualization.
- Actual hardware validation on platforms such as Renesas to ensure
  compatibility with real SMMUv3 implementations.
- Unit/Functional tests validating correct translations (not
  implemented).

Migration and Compatibility
===========================

This optional feature defaults to disabled (`viommu=""`) for backward
compatibility.

References
==========

- Original feature implemented by Rahul Singh:
  https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/
- SMMUv3 architecture documentation
- Existing vIOMMU code patterns
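To make the event-queue mitigation in observation 9 above concrete, here is a
hedged sketch of per-guest fault-rate gating: forwarding is disabled for the
rest of a measurement window once a guest exceeds a fault budget. Names and
thresholds are illustrative only and are not part of the posted series:

    #include <stdbool.h>
    #include <stdint.h>

    #define EVT_WINDOW_MS   1000    /* rate-measurement window */
    #define EVT_MAX_PER_WIN 256     /* faults tolerated per window */

    struct vsmmu_evt_limiter {
        uint64_t window_start_ms;   /* start of the current window */
        unsigned int count;         /* events forwarded in this window */
        bool gated;                 /* forwarding disabled (fail-safe) */
    };

    /* Returns true if the event may be forwarded to the guest's event queue. */
    bool vsmmu_evt_allow(struct vsmmu_evt_limiter *l, uint64_t now_ms)
    {
        if ( now_ms - l->window_start_ms >= EVT_WINDOW_MS )
        {
            l->window_start_ms = now_ms;
            l->count = 0;
            l->gated = false;       /* re-enable once the storm subsides */
        }

        if ( l->gated || ++l->count > EVT_MAX_PER_WIN )
        {
            l->gated = true;        /* drop further events this window */
            return false;
        }

        return true;
    }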
Hi Milan,

Thanks, the "Security Considerations" section looks really good. But I
have more questions.

Milan Djokic <milan_djokic@epam.com> writes:

> Hello Julien, Volodymyr
>
> On 8/27/25 01:28, Volodymyr Babchuk wrote:
>> Hi Milan,
>> Milan Djokic <milan_djokic@epam.com> writes:
>>
>>> Hello Julien,
>>>
>>> On 8/13/25 14:11, Julien Grall wrote:
>>>> On 13/08/2025 11:04, Milan Djokic wrote:
>>>>> Hello Julien,
>>>> Hi Milan,
>>>>
>>>>>
>>>>> We have prepared a design document and it will be part of the updated
>>>>> patch series (added in docs/design). I'll also extend cover letter with
>>>>> details on implementation structure to make review easier.
>>>> I would suggest to just iterate on the design document for now.
>>>>
>>>>> Following is the design document content which will be provided in
>>>>> updated patch series:
>>>>>
>>>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
>>>>> ==========================================================
>>>>>
>>>>> Author: Milan Djokic <milan_djokic@epam.com>
>>>>> Date: 2025-08-07
>>>>> Status: Draft
>>>>>
>>>>> Introduction
>>>>> ------------
>>>>>
>>>>> The SMMUv3 supports two stages of translation. Each stage of translation
>>>>> can be independently enabled. An incoming address is logically
>>>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2
>>>>> which translates the IPA to the output PA. Stage 1 translation support
>>>>> is required to provide isolation between different devices within the OS.
>>>>>
>>>>> Xen already supports Stage 2 translation but there is no support for
>>>>> Stage 1 translation. This design proposal outlines the introduction of
>>>>> Stage-1 SMMUv3 support in Xen for ARM guests.
>>>>>
>>>>> Motivation
>>>>> ----------
>>>>>
>>>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to
>>>>> ensure correct and secure DMA behavior inside guests.
>>>> Can you clarify what you mean by "correct"? DMA would still work
>>>> without stage-1.
>>>
>>> Correct in terms of working with guest managed I/O space. I'll
>>> rephrase this statement, it seems ambiguous.
>>>
>>>>>
>>>>> This feature enables:
>>>>> - Stage-1 translation in guest domain
>>>>> - Safe device passthrough under secure memory translation
>>>>>
>>>>> Design Overview
>>>>> ---------------
>>>>>
>>>>> These changes provide emulated SMMUv3 support:
>>>>>
>>>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
>>>>> SMMUv3 driver
>>>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
>>>> So what are you planning to expose to a guest? Is it one vIOMMU per
>>>> pIOMMU? Or a single one?
>>>
>>> Single vIOMMU model is used in this design.
>>>
>>>> Have you considered the pros/cons for both?
>>>>> - Register/Command Emulation: SMMUv3 register emulation and command
>>>>> queue handling
>>>>
>>>
>>> That's a point for consideration.
>>> Single vIOMMU prevails in terms of less complex implementation and a
>>> simple guest iommu model - single vIOMMU node, one interrupt path,
>>> event queue, single set of trap handlers for emulation, etc.
>>> Cons for a single vIOMMU model could be less accurate hw
>>> representation and a potential bottleneck with one emulated queue and
>>> interrupt path.
>>> On the other hand, vIOMMU per pIOMMU provides more accurate hw
>>> modeling and offers better scalability in case of many IOMMUs in the
>>> system, but this comes with more complex emulation logic and device
>>> tree, also handling multiple vIOMMUs on guest side.
>>> IMO, single vIOMMU model seems like a better option mostly because >>> it's less complex, easier to maintain and debug. Of course, this >>> decision can and should be discussed. >>> >> Well, I am not sure that this is possible, because of StreamID >> allocation. The biggest offender is of course PCI, as each Root PCI >> bridge will require own SMMU instance with own StreamID space. But even >> without PCI you'll need some mechanism to map vStremID to >> <pSMMU, pStreamID>, because there will be overlaps in SID space. >> Actually, PCI/vPCI with vSMMU is its own can of worms... >> >>>> For each pSMMU, we have a single command queue that will receive command >>>> from all the guests. How do you plan to prevent a guest hogging the >>>> command queue? >>>> In addition to that, AFAIU, the size of the virtual command queue is >>>> fixed by the guest rather than Xen. If a guest is filling up the queue >>>> with commands before notifying Xen, how do you plan to ensure we don't >>>> spend too much time in Xen (which is not preemptible)? >>>> >>> >>> We'll have to do a detailed analysis on these scenarios, they are not >>> covered by the design (as well as some others which is clear after >>> your comments). I'll come back with an updated design. >> I think that can be handled akin to hypercall continuation, which is >> used in similar places, like P2M code >> [...] >> > > I have updated vIOMMU design document with additional security topics > covered and performance impact results. Also added some additional > explanations for vIOMMU components following your comments. > Updated document content: > > =============================================== > Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests > =============================================== > > :Author: Milan Djokic <milan_djokic@epam.com> > :Date: 2025-08-07 > :Status: Draft > > Introduction > ======== > > The SMMUv3 supports two stages of translation. Each stage of > translation can be > independently enabled. An incoming address is logically translated > from VA to > IPA in stage 1, then the IPA is input to stage 2 which translates the IPA to > the output PA. Stage 1 translation support is required to provide > isolation between different > devices within OS. XEN already supports Stage 2 translation but there is no > support for Stage 1 translation. > This design proposal outlines the introduction of Stage-1 SMMUv3 > support in Xen for ARM guests. > > Motivation > ========== > > ARM systems utilizing SMMUv3 require stage-1 address translation to > ensure secure DMA and guest managed I/O memory mappings. It is unclear for my what you mean by "guest manged IO memory mappings", could you please provide an example? > This feature enables: > > - Stage-1 translation in guest domain > - Safe device passthrough under secure memory translation > As I see it, ARM specs use "secure" mostly when referring to Secure mode (S-EL1, S-EL2, EL3) and associated secure counterparts of architectural devices, like secure GIC, secure Timer, etc. So I'd probably don't use this word here to reduce confusion > Design Overview > =============== > > These changes provide emulated SMMUv3 support: > > - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation > support in SMMUv3 driver. "Nested translation" as in "nested virtualization"? Or is this something else? > - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1 > handling. I think, this is the big topic. 
You see, apart from SMMU, there is at least Renesas IP-MMU, which uses completely different API. And probably there are other IO-MMU implementations possible. Right now vIOMMU framework handles only SMMU, which is okay, but probably we should design it in a such way, that other IO-MMUs will be supported as well. Maybe even IO-MMUs for other architectures (RISC V maybe?). > - **Register/Command Emulation**: SMMUv3 register emulation and > command queue handling. Continuing previous paragraph: what about other IO-MMUs? For example, if platform provides only Renesas IO-MMU, will vIOMMU framework still emulate SMMUv3 registers and queue handling? > - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes > to device trees for dom0 and dom0less scenarios. > - **Runtime Configuration**: Introduces a `viommu` boot parameter for > dynamic enablement. > > vIOMMU is exposed to guest as a single device with predefined > capabilities and commands supported. Single vIOMMU model abstracts the > details of an actual IOMMU hardware, simplifying usage from the guest > point of view. Guest OS handles only a single IOMMU, even if multiple > IOMMU units are available on the host system. In the previous email I asked how are you planning to handle potential SID overlaps, especially in PCI use case. I want to return to this topic. I am not saying that this is impossible, but I'd like to see this covered in the design document. > > Security Considerations > ======================= > > **viommu security benefits:** > > - Stage-1 translation ensures guest devices cannot perform unauthorized DMA. > - Emulated IOMMU removes guest dependency on IOMMU hardware while > maintaining domains isolation. I am not sure that I got this paragraph. > > > 1. Observation: > --------------- > Support for Stage-1 translation in SMMUv3 introduces new data > structures (`s1_cfg` alongside `s2_cfg`) and logic to write both > Stage-1 and Stage-2 entries in the Stream Table Entry (STE), including > an `abort` field to handle partial configuration states. > > **Risk:** > Without proper handling, a partially applied Stage-1 configuration > might leave guest DMA mappings in an inconsistent state, potentially > enabling unauthorized access or causing cross-domain interference. > > **Mitigation:** *(Handled by design)* > This feature introduces logic that writes both `s1_cfg` and `s2_cfg` > to STE and manages the `abort` field-only considering Stage-1 > configuration if fully attached. This ensures incomplete or invalid > guest configurations are safely ignored by the hypervisor. > > 2. Observation: > --------------- > Guests can now invalidate Stage-1 caches; invalidation needs > forwarding to SMMUv3 hardware to maintain coherence. > > **Risk:** > Failing to propagate cache invalidation could allow stale mappings, > enabling access to old mappings and possibly data leakage or > misrouting. > > **Mitigation:** *(Handled by design)* > This feature ensures that guest-initiated invalidations are correctly > forwarded to the hardware, preserving IOMMU coherency. > > 3. Observation: > --------------- > This design introduces substantial new functionality, including the > `vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command > queues, event queues, domain management, and Device Tree modifications > (e.g., `iommus` nodes and `libxl` integration). > > **Risk:** > Large feature expansions increase the attack surface—potential for > race conditions, unchecked command inputs, or Device Tree-based > misconfigurations. 
> > **Mitigation:** > > - Sanity checks and error-handling improvements have been introduced > in this feature. > - Further audits have to be performed for this feature and its > dependencies in this area. Currently, feature is marked as *Tech > Preview* and is self-contained, reducing the risk to unrelated > components. > > 4. Observation: > --------------- > The code includes transformations to handle nested translation versus > standard modes and uses guest-configured command queues (e.g., > `CMD_CFGI_STE`) and event notifications. > > **Risk:** > Malicious or malformed queue commands from guests could bypass > validation, manipulate SMMUv3 state, or cause Dom0 instability. Only Dom0? > > **Mitigation:** *(Handled by design)* > Built-in validation of command queue entries and sanitization > mechanisms ensure only permitted configurations are applied. This is > supported via additions in `vsmmuv3` and `cmdqueue` handling code. > > 5. Observation: > --------------- > Device Tree modifications enable device assignment and > configuration—guest DT fragments (e.g., `iommus`) are added via > `libxl`. > > **Risk:** > Erroneous or malicious Device Tree injection could result in device > misbinding or guest access to unauthorized hardware. > > **Mitigation:** > > - `libxl` perform checks of guest configuration and parse only > predefined dt fragments and nodes, reducing risc. > - The system integrator must ensure correct resource mapping in the > guest Device Tree (DT) fragments. > > 6. Observation: > --------------- > Introducing optional per-guest enabled features (`viommu` argument in > xl guest config) means some guests may opt-out. > > **Risk:** > Differences between guests with and without `viommu` may cause > unexpected behavior or privilege drift. > > **Mitigation:** > Verify that downgrade paths are safe and well-isolated; ensure missing > support doesn't cause security issues. Additional audits on emulation > paths and domains interference need to be performed in a multi-guest > environment. > > 7. Observation: > --------------- > Guests have the ability to issue Stage-1 IOMMU commands like cache > invalidation, stream table entries configuration, etc. An adversarial > guest may issue a high volume of commands in rapid succession. > > **Risk** > Excessive commands requests can cause high hypervisor CPU consumption > and disrupt scheduling, leading to degraded system responsiveness and > potential denial-of-service scenarios. > > **Mitigation** > > - Xen credit scheduler limits guest vCPU execution time, securing > basic guest rate-limiting. I don't thing that this feature available only in credit schedulers, AFAIK, all schedulers except null scheduler will limit vCPU execution time. > - Batch multiple commands of same type to reduce overhead on the > virtual SMMUv3 hardware emulation. > - Implement vIOMMU commands execution restart and continuation support So, something like "hypercall continuation"? > > 8. Observation: > --------------- > Some guest commands issued towards vIOMMU are propagated to pIOMMU > command queue (e.g. TLB invalidate). For each pIOMMU, only one command > queue is > available for all domains. > > **Risk** > Excessive commands requests from abusive guest can cause flooding of > physical IOMMU command queue, leading to degraded pIOMMU responsivness > on commands issued from other guests. > > **Mitigation** > > - Xen credit scheduler limits guest vCPU execution time, securing > basic guest rate-limiting. 
> - Batch commands which should be propagated towards pIOMMU cmd queue > and enable support for batch execution pause/continuation > - If possible, implement domain penalization by adding a per-domain > cost counter for vIOMMU/pIOMMU usage. > > 9. Observation: > --------------- > vIOMMU feature includes event queue used for forwarding IOMMU events > to guest (e.g. translation faults, invalid stream IDs, permission > errors). A malicious guest can misconfigure its SMMU state or > intentionally trigger faults with high frequency. > > **Risk** > Occurance of IOMMU events with high frequency can cause Xen to flood > the event queue and disrupt scheduling with high hypervisor CPU load > for events handling. > > **Mitigation** > > - Implement fail-safe state by disabling events forwarding when faults > are occured with high frequency and not processed by guest. > - Batch multiple events of same type to reduce overhead on the virtual > SMMUv3 hardware emulation. > - Consider disabling event queue for untrusted guests > > Performance Impact > ================== > > With iommu stage-1 and nested translation inclusion, performance > overhead is introduced comparing to existing, stage-2 only usage in > Xen. > Once mappings are established, translations should not introduce > significant overhead. > Emulated paths may introduce moderate overhead, primarily affecting > device initialization and event handling. > Performance impact highly depends on target CPU capabilities. Testing > is performed on cortex-a53 based platform. Which platform exactly? While QEMU emulates SMMU to some extent, we are observing somewhat different SMMU behavior on real HW platforms (mostly due to cache coherence problems). Also, according to MMU-600 errata, it can have lower than expected performance in some use-cases. > Performance is mostly impacted by emulated vIOMMU operations, results > shown in the following table. > > +-------------------------------+---------------------------------+ > | vIOMMU Operation | Execution time in guest | > +===============================+=================================+ > | Reg read | median: 30μs, worst-case: 250μs | > +-------------------------------+---------------------------------+ > | Reg write | median: 35μs, worst-case: 280μs | > +-------------------------------+---------------------------------+ > | Invalidate TLB | median: 90μs, worst-case: 1ms+ | > +-------------------------------+---------------------------------+ > | Invalidate STE | median: 450μs worst_case: 7ms+ | > +-------------------------------+---------------------------------+ > > With vIOMMU exposed to guest, guest OS has to initialize IOMMU device > and configure stage-1 mappings for devices attached to it. > Following table shows initialization stages which impact stage-1 > enabled guest boot time and compares it with stage-1 disabled guest. > > "NOTE: Device probe execution time varies significantly depending on > device complexity. virtio-gpu was selected as a test case due to its > extensive use of dynamic DMA allocations and IOMMU mappings, making it > a suitable candidate for benchmarking stage-1 vIOMMU behavior." 
> > +---------------------+-----------------------+------------------------+ > | Stage | Stage-1 Enabled Guest | Stage-1 Disabled Guest | > +=====================+=======================+========================+ > | IOMMU Init | ~25ms | / | > +---------------------+-----------------------+------------------------+ > | Dev Attach / Mapping| ~220ms | ~200ms | > +---------------------+-----------------------+------------------------+ > > For devices configured with dynamic DMA mappings, DMA > allocate/map/unmap operations performance is also impacted on stage-1 > enabled guests. > Dynamic DMA mapping operation issues emulated IOMMU functions like > mmio write/read and TLB invalidations. > As a reference, following table shows performance results for runtime > dma operations for virtio-gpu device. > > +---------------+-------------------------+----------------------------+ > | DMA Op | Stage-1 Enabled Guest | Stage-1 Disabled Guest | > +===============+=========================+============================+ > | dma_alloc | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs| > +---------------+-------------------------+----------------------------+ > | dma_free | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs | > +---------------+-------------------------+----------------------------+ > | dma_map | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs| > +---------------+-------------------------+----------------------------+ > | dma_unmap | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs | > +---------------+-------------------------+----------------------------+ > > Testing > ============ > > - QEMU-based ARM system tests for Stage-1 translation and nested > virtualization. > - Actual hardware validation on platforms such as Renesas to ensure > compatibility with real SMMUv3 implementations. > - Unit/Functional tests validating correct translations (not implemented). > > Migration and Compatibility > =========================== > > This optional feature defaults to disabled (`viommu=""`) for backward > compatibility. > -- WBR, Volodymyr
Hi Volodymyr, On 8/29/25 18:27, Volodymyr Babchuk wrote: > Hi Milan, > > Thanks, "Security Considerations" sections looks really good. But I have > more questions. > > Milan Djokic <milan_djokic@epam.com> writes: > >> Hello Julien, Volodymyr >> >> On 8/27/25 01:28, Volodymyr Babchuk wrote: >>> Hi Milan, >>> Milan Djokic <milan_djokic@epam.com> writes: >>> >>>> Hello Julien, >>>> >>>> On 8/13/25 14:11, Julien Grall wrote: >>>>> On 13/08/2025 11:04, Milan Djokic wrote: >>>>>> Hello Julien, >>>>> Hi Milan, >>>>> >>>>>> >>>>>> We have prepared a design document and it will be part of the updated >>>>>> patch series (added in docs/design). I'll also extend cover letter with >>>>>> details on implementation structure to make review easier. >>>>> I would suggest to just iterate on the design document for now. >>>>> >>>>>> Following is the design document content which will be provided in >>>>>> updated patch series: >>>>>> >>>>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests >>>>>> ========================================================== >>>>>> >>>>>> Author: Milan Djokic <milan_djokic@epam.com> >>>>>> Date: 2025-08-07 >>>>>> Status: Draft >>>>>> >>>>>> Introduction >>>>>> ------------ >>>>>> >>>>>> The SMMUv3 supports two stages of translation. Each stage of translation >>>>>> can be independently enabled. An incoming address is logically >>>>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2 >>>>>> which translates the IPA to the output PA. Stage 1 translation support >>>>>> is required to provide isolation between different devices within the OS. >>>>>> >>>>>> Xen already supports Stage 2 translation but there is no support for >>>>>> Stage 1 translation. This design proposal outlines the introduction of >>>>>> Stage-1 SMMUv3 support in Xen for ARM guests. >>>>>> >>>>>> Motivation >>>>>> ---------- >>>>>> >>>>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to >>>>>> ensure correct and secure DMA behavior inside guests. >>>>> Can you clarify what you mean by "correct"? DMA would still work >>>>> without >>>>> stage-1. >>>> >>>> Correct in terms of working with guest managed I/O space. I'll >>>> rephrase this statement, it seems ambiguous. >>>> >>>>>> >>>>>> This feature enables: >>>>>> - Stage-1 translation in guest domain >>>>>> - Safe device passthrough under secure memory translation >>>>>> >>>>>> Design Overview >>>>>> --------------- >>>>>> >>>>>> These changes provide emulated SMMUv3 support: >>>>>> >>>>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in >>>>>> SMMUv3 driver >>>>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling >>>>> So what are you planning to expose to a guest? Is it one vIOMMU per >>>>> pIOMMU? Or a single one? >>>> >>>> Single vIOMMU model is used in this design. >>>> >>>>> Have you considered the pros/cons for both? >>>>>> - Register/Command Emulation: SMMUv3 register emulation and command >>>>>> queue handling >>>>> >>>> >>>> That's a point for consideration. >>>> single vIOMMU prevails in terms of less complex implementation and a >>>> simple guest iommmu model - single vIOMMU node, one interrupt path, >>>> event queue, single set of trap handlers for emulation, etc. >>>> Cons for a single vIOMMU model could be less accurate hw >>>> representation and a potential bottleneck with one emulated queue and >>>> interrupt path. 
>>>> On the other hand, vIOMMU per pIOMMU provides more accurate hw >>>> modeling and offers better scalability in case of many IOMMUs in the >>>> system, but this comes with more complex emulation logic and device >>>> tree, also handling multiple vIOMMUs on guest side. >>>> IMO, single vIOMMU model seems like a better option mostly because >>>> it's less complex, easier to maintain and debug. Of course, this >>>> decision can and should be discussed. >>>> >>> Well, I am not sure that this is possible, because of StreamID >>> allocation. The biggest offender is of course PCI, as each Root PCI >>> bridge will require own SMMU instance with own StreamID space. But even >>> without PCI you'll need some mechanism to map vStremID to >>> <pSMMU, pStreamID>, because there will be overlaps in SID space. >>> Actually, PCI/vPCI with vSMMU is its own can of worms... >>> >>>>> For each pSMMU, we have a single command queue that will receive command >>>>> from all the guests. How do you plan to prevent a guest hogging the >>>>> command queue? >>>>> In addition to that, AFAIU, the size of the virtual command queue is >>>>> fixed by the guest rather than Xen. If a guest is filling up the queue >>>>> with commands before notifying Xen, how do you plan to ensure we don't >>>>> spend too much time in Xen (which is not preemptible)? >>>>> >>>> >>>> We'll have to do a detailed analysis on these scenarios, they are not >>>> covered by the design (as well as some others which is clear after >>>> your comments). I'll come back with an updated design. >>> I think that can be handled akin to hypercall continuation, which is >>> used in similar places, like P2M code >>> [...] >>> >> >> I have updated vIOMMU design document with additional security topics >> covered and performance impact results. Also added some additional >> explanations for vIOMMU components following your comments. >> Updated document content: >> >> =============================================== >> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests >> =============================================== >> >> :Author: Milan Djokic <milan_djokic@epam.com> >> :Date: 2025-08-07 >> :Status: Draft >> >> Introduction >> ======== >> >> The SMMUv3 supports two stages of translation. Each stage of >> translation can be >> independently enabled. An incoming address is logically translated >> from VA to >> IPA in stage 1, then the IPA is input to stage 2 which translates the IPA to >> the output PA. Stage 1 translation support is required to provide >> isolation between different >> devices within OS. XEN already supports Stage 2 translation but there is no >> support for Stage 1 translation. >> This design proposal outlines the introduction of Stage-1 SMMUv3 >> support in Xen for ARM guests. >> >> Motivation >> ========== >> >> ARM systems utilizing SMMUv3 require stage-1 address translation to >> ensure secure DMA and guest managed I/O memory mappings. > > It is unclear for my what you mean by "guest manged IO memory mappings", > could you please provide an example? > Basically enabling stage-1 translation means that the guest is responsible for managing IOVA to IPA mappings through its own IOMMU driver. Guest manages its own stage-1 page tables and TLB. For example, when a guest driver wants to perform DMA mapping (e.g. with dma_map_single()), it will request mapping of its buffer physical address to IOVA through guest IOMMU driver. Guest IOMMU driver will further issue mapping commands emulated by Xen which translate it into stage-2 mappings. 
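As an illustration (not part of the patch series), here is a minimal sketch of
that flow from the guest driver's point of view, using the standard Linux DMA
API; the function name and device handling are made up for the example:

    /*
     * Illustrative sketch only: a guest driver using the standard Linux DMA
     * API. With a stage-1 capable vIOMMU exposed to the guest, dma_map_single()
     * ends up in the guest IOMMU driver, which creates the IOVA->IPA mapping in
     * its own stage-1 page tables and issues the related SMMUv3 commands
     * (e.g. TLB invalidations) that Xen traps and emulates.
     */
    #include <linux/dma-mapping.h>
    #include <linux/device.h>
    #include <linux/errno.h>

    static int example_do_dma(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t iova;

        /* Guest IOMMU driver allocates an IOVA and maps it to the buffer. */
        iova = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, iova))
            return -ENOMEM;

        /* ... program the device with 'iova' and run the transfer ... */

        /* Unmap triggers a stage-1 TLB invalidation, emulated by Xen. */
        dma_unmap_single(dev, iova, len, DMA_TO_DEVICE);
        return 0;
    }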
>> This feature enables: >> >> - Stage-1 translation in guest domain >> - Safe device passthrough under secure memory translation >> > > As I see it, ARM specs use "secure" mostly when referring to Secure mode > (S-EL1, S-EL2, EL3) and associated secure counterparts of architectural > devices, like secure GIC, secure Timer, etc. So I'd probably don't use > this word here to reduce confusion > Sure, secure in terms of isolation is the topic here. I'll rephrase this >> Design Overview >> =============== >> >> These changes provide emulated SMMUv3 support: >> >> - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation >> support in SMMUv3 driver. > > "Nested translation" as in "nested virtualization"? Or is this something else? > No, this refers to 2-stage translation IOVA->IPA->PA as a nested translation. Although with this feature, nested virtualization is also enabled since guest can emulate its own IOMMU e.g. when kvm is run in guest. >> - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1 >> handling. > > I think, this is the big topic. You see, apart from SMMU, there is > at least Renesas IP-MMU, which uses completely different API. And > probably there are other IO-MMU implementations possible. Right now > vIOMMU framework handles only SMMU, which is okay, but probably we > should design it in a such way, that other IO-MMUs will be supported as > well. Maybe even IO-MMUs for other architectures (RISC V maybe?). > I think that it is already designed in such manner. We have a generic vIOMMU framework and a backend implementation for target IOMMU as separate components. And the backend implements supported commands/mechanisms which are specific for target IOMMU type. At this point, only SMMUv3 is supported, but it is possible to implement other IOMMU types support under the same generic framework. AFAIK, RISC-V IOMMU stage-2 is still in early development stage, but I do believe that it will be also compatible with vIOMMU framework. >> - **Register/Command Emulation**: SMMUv3 register emulation and >> command queue handling. > > Continuing previous paragraph: what about other IO-MMUs? For example, if > platform provides only Renesas IO-MMU, will vIOMMU framework still > emulate SMMUv3 registers and queue handling? > Yes, this is not supported in current implementation. To support other IOMMU than SMMUv3, stage-1 emulation backend needs to be implemented for target IOMMU and probably Xen driver for target IOMMU has to be updated to handle stage-1 configuration. I will elaborate this part in the design, to make clear that we have a generic vIOMMU framework, but only SMMUv3 backend exists atm. >> - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes >> to device trees for dom0 and dom0less scenarios. >> - **Runtime Configuration**: Introduces a `viommu` boot parameter for >> dynamic enablement. >> >> vIOMMU is exposed to guest as a single device with predefined >> capabilities and commands supported. Single vIOMMU model abstracts the >> details of an actual IOMMU hardware, simplifying usage from the guest >> point of view. Guest OS handles only a single IOMMU, even if multiple >> IOMMU units are available on the host system. > > In the previous email I asked how are you planning to handle potential > SID overlaps, especially in PCI use case. I want to return to this > topic. I am not saying that this is impossible, but I'd like to see this > covered in the design document. > Sorry, I've missed this part in the previous mail. 
This is a valid point, SID overlapping would be an issue for a single vIOMMU model. To prevent it, design will have to be extended with SID namespace virtualization, introducing a remapping layer which will make sure that guest virtual SIDs are unique and maintain proper mappings of vSIDs to pSIDs. For PCI case, we need to have an extended remapping logic where iommu-map property will be also patched in the guest device tree since we need a range of unique vSIDs for every RC assigned to guest. Alternative approach would be to switch to vIOMMU per pIOMMU model. Since both approaches require major updates, I'll have to do a detailed analysis and come back with an updated design which would address this issue. >> >> Security Considerations >> ======================= >> >> **viommu security benefits:** >> >> - Stage-1 translation ensures guest devices cannot perform unauthorized DMA. >> - Emulated IOMMU removes guest dependency on IOMMU hardware while >> maintaining domains isolation. > > I am not sure that I got this paragraph. > First one refers to guest controlled DMA access. Only IOVA->IPA mappings created by guest are usable by the device when stage-1 is enabled. On the other hand, with stage-2 only enabled, device could access to complete IOVA->PA mapping created by Xen for guest. Since the guest has no control over device IOVA accesses, a malicious guest kernel could potentially access memory regions it shouldn't be allowed to, e.g. if stage-2 mappings are stale. With stage-1 enabled, guest device driver has to explicitly map IOVAs and this request is propagated through emulated IOMMU, making sure that IOVA mappings are valid all the time. Second claim means that with emulated IOMMU, guests don’t need direct access to physical IOMMU hardware. The hypervisor emulates IOMMU behavior for the guest, while still ensuring that memory access by devices remains properly isolated between guests, just like it would with real IOMMU hardware. >> >> >> 1. Observation: >> --------------- >> Support for Stage-1 translation in SMMUv3 introduces new data >> structures (`s1_cfg` alongside `s2_cfg`) and logic to write both >> Stage-1 and Stage-2 entries in the Stream Table Entry (STE), including >> an `abort` field to handle partial configuration states. >> >> **Risk:** >> Without proper handling, a partially applied Stage-1 configuration >> might leave guest DMA mappings in an inconsistent state, potentially >> enabling unauthorized access or causing cross-domain interference. >> >> **Mitigation:** *(Handled by design)* >> This feature introduces logic that writes both `s1_cfg` and `s2_cfg` >> to STE and manages the `abort` field-only considering Stage-1 >> configuration if fully attached. This ensures incomplete or invalid >> guest configurations are safely ignored by the hypervisor. >> >> 2. Observation: >> --------------- >> Guests can now invalidate Stage-1 caches; invalidation needs >> forwarding to SMMUv3 hardware to maintain coherence. >> >> **Risk:** >> Failing to propagate cache invalidation could allow stale mappings, >> enabling access to old mappings and possibly data leakage or >> misrouting. >> >> **Mitigation:** *(Handled by design)* >> This feature ensures that guest-initiated invalidations are correctly >> forwarded to the hardware, preserving IOMMU coherency. >> >> 3. 
Observation: >> --------------- >> This design introduces substantial new functionality, including the >> `vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command >> queues, event queues, domain management, and Device Tree modifications >> (e.g., `iommus` nodes and `libxl` integration). >> >> **Risk:** >> Large feature expansions increase the attack surface—potential for >> race conditions, unchecked command inputs, or Device Tree-based >> misconfigurations. >> >> **Mitigation:** >> >> - Sanity checks and error-handling improvements have been introduced >> in this feature. >> - Further audits have to be performed for this feature and its >> dependencies in this area. Currently, feature is marked as *Tech >> Preview* and is self-contained, reducing the risk to unrelated >> components. >> >> 4. Observation: >> --------------- >> The code includes transformations to handle nested translation versus >> standard modes and uses guest-configured command queues (e.g., >> `CMD_CFGI_STE`) and event notifications. >> >> **Risk:** >> Malicious or malformed queue commands from guests could bypass >> validation, manipulate SMMUv3 state, or cause Dom0 instability. > > Only Dom0? > This is a mistake, the whole system could be affected. I'll fix this. >> >> **Mitigation:** *(Handled by design)* >> Built-in validation of command queue entries and sanitization >> mechanisms ensure only permitted configurations are applied. This is >> supported via additions in `vsmmuv3` and `cmdqueue` handling code. >> >> 5. Observation: >> --------------- >> Device Tree modifications enable device assignment and >> configuration—guest DT fragments (e.g., `iommus`) are added via >> `libxl`. >> >> **Risk:** >> Erroneous or malicious Device Tree injection could result in device >> misbinding or guest access to unauthorized hardware. >> >> **Mitigation:** >> >> - `libxl` perform checks of guest configuration and parse only >> predefined dt fragments and nodes, reducing risc. >> - The system integrator must ensure correct resource mapping in the >> guest Device Tree (DT) fragments. >> >> 6. Observation: >> --------------- >> Introducing optional per-guest enabled features (`viommu` argument in >> xl guest config) means some guests may opt-out. >> >> **Risk:** >> Differences between guests with and without `viommu` may cause >> unexpected behavior or privilege drift. >> >> **Mitigation:** >> Verify that downgrade paths are safe and well-isolated; ensure missing >> support doesn't cause security issues. Additional audits on emulation >> paths and domains interference need to be performed in a multi-guest >> environment. >> >> 7. Observation: >> --------------- >> Guests have the ability to issue Stage-1 IOMMU commands like cache >> invalidation, stream table entries configuration, etc. An adversarial >> guest may issue a high volume of commands in rapid succession. >> >> **Risk** >> Excessive commands requests can cause high hypervisor CPU consumption >> and disrupt scheduling, leading to degraded system responsiveness and >> potential denial-of-service scenarios. >> >> **Mitigation** >> >> - Xen credit scheduler limits guest vCPU execution time, securing >> basic guest rate-limiting. > > I don't thing that this feature available only in credit schedulers, > AFAIK, all schedulers except null scheduler will limit vCPU execution time. > I was not aware of that. I'll rephrase this part. >> - Batch multiple commands of same type to reduce overhead on the >> virtual SMMUv3 hardware emulation. 
>> - Implement vIOMMU commands execution restart and continuation support > > So, something like "hypercall continuation"? > Yes >> >> 8. Observation: >> --------------- >> Some guest commands issued towards vIOMMU are propagated to pIOMMU >> command queue (e.g. TLB invalidate). For each pIOMMU, only one command >> queue is >> available for all domains. >> >> **Risk** >> Excessive commands requests from abusive guest can cause flooding of >> physical IOMMU command queue, leading to degraded pIOMMU responsivness >> on commands issued from other guests. >> >> **Mitigation** >> >> - Xen credit scheduler limits guest vCPU execution time, securing >> basic guest rate-limiting. >> - Batch commands which should be propagated towards pIOMMU cmd queue >> and enable support for batch execution pause/continuation >> - If possible, implement domain penalization by adding a per-domain >> cost counter for vIOMMU/pIOMMU usage. >> >> 9. Observation: >> --------------- >> vIOMMU feature includes event queue used for forwarding IOMMU events >> to guest (e.g. translation faults, invalid stream IDs, permission >> errors). A malicious guest can misconfigure its SMMU state or >> intentionally trigger faults with high frequency. >> >> **Risk** >> Occurance of IOMMU events with high frequency can cause Xen to flood >> the event queue and disrupt scheduling with high hypervisor CPU load >> for events handling. >> >> **Mitigation** >> >> - Implement fail-safe state by disabling events forwarding when faults >> are occured with high frequency and not processed by guest. >> - Batch multiple events of same type to reduce overhead on the virtual >> SMMUv3 hardware emulation. >> - Consider disabling event queue for untrusted guests >> >> Performance Impact >> ================== >> >> With iommu stage-1 and nested translation inclusion, performance >> overhead is introduced comparing to existing, stage-2 only usage in >> Xen. >> Once mappings are established, translations should not introduce >> significant overhead. >> Emulated paths may introduce moderate overhead, primarily affecting >> device initialization and event handling. >> Performance impact highly depends on target CPU capabilities. Testing >> is performed on cortex-a53 based platform. > > Which platform exactly? While QEMU emulates SMMU to some extent, we are > observing somewhat different SMMU behavior on real HW platforms (mostly > due to cache coherence problems). Also, according to MMU-600 errata, it > can have lower than expected performance in some use-cases. > Performance measurement are done on QEMU emulated Renesas platform. I'll add some details for this. >> Performance is mostly impacted by emulated vIOMMU operations, results >> shown in the following table. 
>> >> +-------------------------------+---------------------------------+ >> | vIOMMU Operation | Execution time in guest | >> +===============================+=================================+ >> | Reg read | median: 30μs, worst-case: 250μs | >> +-------------------------------+---------------------------------+ >> | Reg write | median: 35μs, worst-case: 280μs | >> +-------------------------------+---------------------------------+ >> | Invalidate TLB | median: 90μs, worst-case: 1ms+ | >> +-------------------------------+---------------------------------+ >> | Invalidate STE | median: 450μs worst_case: 7ms+ | >> +-------------------------------+---------------------------------+ >> >> With vIOMMU exposed to guest, guest OS has to initialize IOMMU device >> and configure stage-1 mappings for devices attached to it. >> Following table shows initialization stages which impact stage-1 >> enabled guest boot time and compares it with stage-1 disabled guest. >> >> "NOTE: Device probe execution time varies significantly depending on >> device complexity. virtio-gpu was selected as a test case due to its >> extensive use of dynamic DMA allocations and IOMMU mappings, making it >> a suitable candidate for benchmarking stage-1 vIOMMU behavior." >> >> +---------------------+-----------------------+------------------------+ >> | Stage | Stage-1 Enabled Guest | Stage-1 Disabled Guest | >> +=====================+=======================+========================+ >> | IOMMU Init | ~25ms | / | >> +---------------------+-----------------------+------------------------+ >> | Dev Attach / Mapping| ~220ms | ~200ms | >> +---------------------+-----------------------+------------------------+ >> >> For devices configured with dynamic DMA mappings, DMA >> allocate/map/unmap operations performance is also impacted on stage-1 >> enabled guests. >> Dynamic DMA mapping operation issues emulated IOMMU functions like >> mmio write/read and TLB invalidations. >> As a reference, following table shows performance results for runtime >> dma operations for virtio-gpu device. >> >> +---------------+-------------------------+----------------------------+ >> | DMA Op | Stage-1 Enabled Guest | Stage-1 Disabled Guest | >> +===============+=========================+============================+ >> | dma_alloc | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs| >> +---------------+-------------------------+----------------------------+ >> | dma_free | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs | >> +---------------+-------------------------+----------------------------+ >> | dma_map | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs| >> +---------------+-------------------------+----------------------------+ >> | dma_unmap | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs | >> +---------------+-------------------------+----------------------------+ >> >> Testing >> ============ >> >> - QEMU-based ARM system tests for Stage-1 translation and nested >> virtualization. >> - Actual hardware validation on platforms such as Renesas to ensure >> compatibility with real SMMUv3 implementations. >> - Unit/Functional tests validating correct translations (not implemented). >> >> Migration and Compatibility >> =========================== >> >> This optional feature defaults to disabled (`viommu=""`) for backward >> compatibility. >> > BR, Milan
On 9/1/25 13:06, Milan Djokic wrote: > Hi Volodymyr, > > On 8/29/25 18:27, Volodymyr Babchuk wrote: >> Hi Milan, >> >> Thanks, "Security Considerations" sections looks really good. But I have >> more questions. >> >> Milan Djokic <milan_djokic@epam.com> writes: >> >>> Hello Julien, Volodymyr >>> >>> On 8/27/25 01:28, Volodymyr Babchuk wrote: >>>> Hi Milan, >>>> Milan Djokic <milan_djokic@epam.com> writes: >>>> >>>>> Hello Julien, >>>>> >>>>> On 8/13/25 14:11, Julien Grall wrote: >>>>>> On 13/08/2025 11:04, Milan Djokic wrote: >>>>>>> Hello Julien, >>>>>> Hi Milan, >>>>>> >>>>>>> >>>>>>> We have prepared a design document and it will be part of the updated >>>>>>> patch series (added in docs/design). I'll also extend cover letter with >>>>>>> details on implementation structure to make review easier. >>>>>> I would suggest to just iterate on the design document for now. >>>>>> >>>>>>> Following is the design document content which will be provided in >>>>>>> updated patch series: >>>>>>> >>>>>>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests >>>>>>> ========================================================== >>>>>>> >>>>>>> Author: Milan Djokic <milan_djokic@epam.com> >>>>>>> Date: 2025-08-07 >>>>>>> Status: Draft >>>>>>> >>>>>>> Introduction >>>>>>> ------------ >>>>>>> >>>>>>> The SMMUv3 supports two stages of translation. Each stage of translation >>>>>>> can be independently enabled. An incoming address is logically >>>>>>> translated from VA to IPA in stage 1, then the IPA is input to stage 2 >>>>>>> which translates the IPA to the output PA. Stage 1 translation support >>>>>>> is required to provide isolation between different devices within the OS. >>>>>>> >>>>>>> Xen already supports Stage 2 translation but there is no support for >>>>>>> Stage 1 translation. This design proposal outlines the introduction of >>>>>>> Stage-1 SMMUv3 support in Xen for ARM guests. >>>>>>> >>>>>>> Motivation >>>>>>> ---------- >>>>>>> >>>>>>> ARM systems utilizing SMMUv3 require Stage-1 address translation to >>>>>>> ensure correct and secure DMA behavior inside guests. >>>>>> Can you clarify what you mean by "correct"? DMA would still work >>>>>> without >>>>>> stage-1. >>>>> >>>>> Correct in terms of working with guest managed I/O space. I'll >>>>> rephrase this statement, it seems ambiguous. >>>>> >>>>>>> >>>>>>> This feature enables: >>>>>>> - Stage-1 translation in guest domain >>>>>>> - Safe device passthrough under secure memory translation >>>>>>> >>>>>>> Design Overview >>>>>>> --------------- >>>>>>> >>>>>>> These changes provide emulated SMMUv3 support: >>>>>>> >>>>>>> - SMMUv3 Stage-1 Translation: stage-1 and nested translation support in >>>>>>> SMMUv3 driver >>>>>>> - vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling >>>>>> So what are you planning to expose to a guest? Is it one vIOMMU per >>>>>> pIOMMU? Or a single one? >>>>> >>>>> Single vIOMMU model is used in this design. >>>>> >>>>>> Have you considered the pros/cons for both? >>>>>>> - Register/Command Emulation: SMMUv3 register emulation and command >>>>>>> queue handling >>>>>> >>>>> >>>>> That's a point for consideration. >>>>> single vIOMMU prevails in terms of less complex implementation and a >>>>> simple guest iommmu model - single vIOMMU node, one interrupt path, >>>>> event queue, single set of trap handlers for emulation, etc. >>>>> Cons for a single vIOMMU model could be less accurate hw >>>>> representation and a potential bottleneck with one emulated queue and >>>>> interrupt path. 
>>>>> On the other hand, vIOMMU per pIOMMU provides more accurate hw >>>>> modeling and offers better scalability in case of many IOMMUs in the >>>>> system, but this comes with more complex emulation logic and device >>>>> tree, also handling multiple vIOMMUs on guest side. >>>>> IMO, single vIOMMU model seems like a better option mostly because >>>>> it's less complex, easier to maintain and debug. Of course, this >>>>> decision can and should be discussed. >>>>> >>>> Well, I am not sure that this is possible, because of StreamID >>>> allocation. The biggest offender is of course PCI, as each Root PCI >>>> bridge will require own SMMU instance with own StreamID space. But even >>>> without PCI you'll need some mechanism to map vStremID to >>>> <pSMMU, pStreamID>, because there will be overlaps in SID space. >>>> Actually, PCI/vPCI with vSMMU is its own can of worms... >>>> >>>>>> For each pSMMU, we have a single command queue that will receive command >>>>>> from all the guests. How do you plan to prevent a guest hogging the >>>>>> command queue? >>>>>> In addition to that, AFAIU, the size of the virtual command queue is >>>>>> fixed by the guest rather than Xen. If a guest is filling up the queue >>>>>> with commands before notifying Xen, how do you plan to ensure we don't >>>>>> spend too much time in Xen (which is not preemptible)? >>>>>> >>>>> >>>>> We'll have to do a detailed analysis on these scenarios, they are not >>>>> covered by the design (as well as some others which is clear after >>>>> your comments). I'll come back with an updated design. >>>> I think that can be handled akin to hypercall continuation, which is >>>> used in similar places, like P2M code >>>> [...] >>>> >>> >>> I have updated vIOMMU design document with additional security topics >>> covered and performance impact results. Also added some additional >>> explanations for vIOMMU components following your comments. >>> Updated document content: >>> >>> =============================================== >>> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests >>> =============================================== >>> >>> :Author: Milan Djokic <milan_djokic@epam.com> >>> :Date: 2025-08-07 >>> :Status: Draft >>> >>> Introduction >>> ======== >>> >>> The SMMUv3 supports two stages of translation. Each stage of >>> translation can be >>> independently enabled. An incoming address is logically translated >>> from VA to >>> IPA in stage 1, then the IPA is input to stage 2 which translates the IPA to >>> the output PA. Stage 1 translation support is required to provide >>> isolation between different >>> devices within OS. XEN already supports Stage 2 translation but there is no >>> support for Stage 1 translation. >>> This design proposal outlines the introduction of Stage-1 SMMUv3 >>> support in Xen for ARM guests. >>> >>> Motivation >>> ========== >>> >>> ARM systems utilizing SMMUv3 require stage-1 address translation to >>> ensure secure DMA and guest managed I/O memory mappings. >> >> It is unclear for my what you mean by "guest manged IO memory mappings", >> could you please provide an example? >> > > Basically enabling stage-1 translation means that the guest is > responsible for managing IOVA to IPA mappings through its own IOMMU > driver. Guest manages its own stage-1 page tables and TLB. > For example, when a guest driver wants to perform DMA mapping (e.g. with > dma_map_single()), it will request mapping of its buffer physical > address to IOVA through guest IOMMU driver. 
Guest IOMMU driver will > further issue mapping commands emulated by Xen which translate it into > stage-2 mappings. > >>> This feature enables: >>> >>> - Stage-1 translation in guest domain >>> - Safe device passthrough under secure memory translation >>> >> >> As I see it, ARM specs use "secure" mostly when referring to Secure mode >> (S-EL1, S-EL2, EL3) and associated secure counterparts of architectural >> devices, like secure GIC, secure Timer, etc. So I'd probably don't use >> this word here to reduce confusion >> > > Sure, secure in terms of isolation is the topic here. I'll rephrase this > >>> Design Overview >>> =============== >>> >>> These changes provide emulated SMMUv3 support: >>> >>> - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation >>> support in SMMUv3 driver. >> >> "Nested translation" as in "nested virtualization"? Or is this something else? >> > > No, this refers to 2-stage translation IOVA->IPA->PA as a nested > translation. Although with this feature, nested virtualization is also > enabled since guest can emulate its own IOMMU e.g. when kvm is run in guest. > > >>> - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1 >>> handling. >> >> I think, this is the big topic. You see, apart from SMMU, there is >> at least Renesas IP-MMU, which uses completely different API. And >> probably there are other IO-MMU implementations possible. Right now >> vIOMMU framework handles only SMMU, which is okay, but probably we >> should design it in a such way, that other IO-MMUs will be supported as >> well. Maybe even IO-MMUs for other architectures (RISC V maybe?). >> > > I think that it is already designed in such manner. We have a generic > vIOMMU framework and a backend implementation for target IOMMU as > separate components. And the backend implements supported > commands/mechanisms which are specific for target IOMMU type. At this > point, only SMMUv3 is supported, but it is possible to implement other > IOMMU types support under the same generic framework. AFAIK, RISC-V > IOMMU stage-2 is still in early development stage, but I do believe that > it will be also compatible with vIOMMU framework. > >>> - **Register/Command Emulation**: SMMUv3 register emulation and >>> command queue handling. >> >> Continuing previous paragraph: what about other IO-MMUs? For example, if >> platform provides only Renesas IO-MMU, will vIOMMU framework still >> emulate SMMUv3 registers and queue handling? >> > > Yes, this is not supported in current implementation. To support other > IOMMU than SMMUv3, stage-1 emulation backend needs to be implemented for > target IOMMU and probably Xen driver for target IOMMU has to be updated > to handle stage-1 configuration. I will elaborate this part in the > design, to make clear that we have a generic vIOMMU framework, but only > SMMUv3 backend exists atm. > >>> - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes >>> to device trees for dom0 and dom0less scenarios. >>> - **Runtime Configuration**: Introduces a `viommu` boot parameter for >>> dynamic enablement. >>> >>> vIOMMU is exposed to guest as a single device with predefined >>> capabilities and commands supported. Single vIOMMU model abstracts the >>> details of an actual IOMMU hardware, simplifying usage from the guest >>> point of view. Guest OS handles only a single IOMMU, even if multiple >>> IOMMU units are available on the host system. 
>> >> In the previous email I asked how are you planning to handle potential >> SID overlaps, especially in PCI use case. I want to return to this >> topic. I am not saying that this is impossible, but I'd like to see this >> covered in the design document. >> > > Sorry, I've missed this part in the previous mail. This is a valid point, > SID overlapping would be an issue for a single vIOMMU model. To prevent > it, design will have to be extended with SID namespace virtualization, > introducing a remapping layer which will make sure that guest virtual > SIDs are unique and maintain proper mappings of vSIDs to pSIDs. > For PCI case, we need to have an extended remapping logic where > iommu-map property will be also patched in the guest device tree since > we need a range of unique vSIDs for every RC assigned to guest. > Alternative approach would be to switch to vIOMMU per pIOMMU model. > Since both approaches require major updates, I'll have to do a detailed > analysis and come back with an updated design which would address this > issue. > > >>> >>> Security Considerations >>> ======================= >>> >>> **viommu security benefits:** >>> >>> - Stage-1 translation ensures guest devices cannot perform unauthorized DMA. >>> - Emulated IOMMU removes guest dependency on IOMMU hardware while >>> maintaining domains isolation. >> >> I am not sure that I got this paragraph. >> > > First one refers to guest controlled DMA access. Only IOVA->IPA mappings > created by guest are usable by the device when stage-1 is enabled. On > the other hand, with stage-2 only enabled, device could access to > complete IOVA->PA mapping created by Xen for guest. Since the guest has > no control over device IOVA accesses, a malicious guest kernel could > potentially access memory regions it shouldn't be allowed to, e.g. if > stage-2 mappings are stale. With stage-1 enabled, guest device driver > has to explicitly map IOVAs and this request is propagated through > emulated IOMMU, making sure that IOVA mappings are valid all the time. > > Second claim means that with emulated IOMMU, guests don’t need direct > access to physical IOMMU hardware. The hypervisor emulates IOMMU > behavior for the guest, while still ensuring that memory access by > devices remains properly isolated between guests, just like it would > with real IOMMU hardware. > >>> >>> >>> 1. Observation: >>> --------------- >>> Support for Stage-1 translation in SMMUv3 introduces new data >>> structures (`s1_cfg` alongside `s2_cfg`) and logic to write both >>> Stage-1 and Stage-2 entries in the Stream Table Entry (STE), including >>> an `abort` field to handle partial configuration states. >>> >>> **Risk:** >>> Without proper handling, a partially applied Stage-1 configuration >>> might leave guest DMA mappings in an inconsistent state, potentially >>> enabling unauthorized access or causing cross-domain interference. >>> >>> **Mitigation:** *(Handled by design)* >>> This feature introduces logic that writes both `s1_cfg` and `s2_cfg` >>> to STE and manages the `abort` field-only considering Stage-1 >>> configuration if fully attached. This ensures incomplete or invalid >>> guest configurations are safely ignored by the hypervisor. >>> >>> 2. Observation: >>> --------------- >>> Guests can now invalidate Stage-1 caches; invalidation needs >>> forwarding to SMMUv3 hardware to maintain coherence. 
>>> >>> **Risk:** >>> Failing to propagate cache invalidation could allow stale mappings, >>> enabling access to old mappings and possibly data leakage or >>> misrouting. >>> >>> **Mitigation:** *(Handled by design)* >>> This feature ensures that guest-initiated invalidations are correctly >>> forwarded to the hardware, preserving IOMMU coherency. >>> >>> 3. Observation: >>> --------------- >>> This design introduces substantial new functionality, including the >>> `vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command >>> queues, event queues, domain management, and Device Tree modifications >>> (e.g., `iommus` nodes and `libxl` integration). >>> >>> **Risk:** >>> Large feature expansions increase the attack surface—potential for >>> race conditions, unchecked command inputs, or Device Tree-based >>> misconfigurations. >>> >>> **Mitigation:** >>> >>> - Sanity checks and error-handling improvements have been introduced >>> in this feature. >>> - Further audits have to be performed for this feature and its >>> dependencies in this area. Currently, feature is marked as *Tech >>> Preview* and is self-contained, reducing the risk to unrelated >>> components. >>> >>> 4. Observation: >>> --------------- >>> The code includes transformations to handle nested translation versus >>> standard modes and uses guest-configured command queues (e.g., >>> `CMD_CFGI_STE`) and event notifications. >>> >>> **Risk:** >>> Malicious or malformed queue commands from guests could bypass >>> validation, manipulate SMMUv3 state, or cause Dom0 instability. >> >> Only Dom0? >> > > This is a mistake, the whole system could be affected. I'll fix this. > >>> >>> **Mitigation:** *(Handled by design)* >>> Built-in validation of command queue entries and sanitization >>> mechanisms ensure only permitted configurations are applied. This is >>> supported via additions in `vsmmuv3` and `cmdqueue` handling code. >>> >>> 5. Observation: >>> --------------- >>> Device Tree modifications enable device assignment and >>> configuration—guest DT fragments (e.g., `iommus`) are added via >>> `libxl`. >>> >>> **Risk:** >>> Erroneous or malicious Device Tree injection could result in device >>> misbinding or guest access to unauthorized hardware. >>> >>> **Mitigation:** >>> >>> - `libxl` perform checks of guest configuration and parse only >>> predefined dt fragments and nodes, reducing risc. >>> - The system integrator must ensure correct resource mapping in the >>> guest Device Tree (DT) fragments. >>> >>> 6. Observation: >>> --------------- >>> Introducing optional per-guest enabled features (`viommu` argument in >>> xl guest config) means some guests may opt-out. >>> >>> **Risk:** >>> Differences between guests with and without `viommu` may cause >>> unexpected behavior or privilege drift. >>> >>> **Mitigation:** >>> Verify that downgrade paths are safe and well-isolated; ensure missing >>> support doesn't cause security issues. Additional audits on emulation >>> paths and domains interference need to be performed in a multi-guest >>> environment. >>> >>> 7. Observation: >>> --------------- >>> Guests have the ability to issue Stage-1 IOMMU commands like cache >>> invalidation, stream table entries configuration, etc. An adversarial >>> guest may issue a high volume of commands in rapid succession. >>> >>> **Risk** >>> Excessive commands requests can cause high hypervisor CPU consumption >>> and disrupt scheduling, leading to degraded system responsiveness and >>> potential denial-of-service scenarios. 
>>> >>> **Mitigation** >>> >>> - Xen credit scheduler limits guest vCPU execution time, securing >>> basic guest rate-limiting. >> >> I don't thing that this feature available only in credit schedulers, >> AFAIK, all schedulers except null scheduler will limit vCPU execution time. >> > > I was not aware of that. I'll rephrase this part. > >>> - Batch multiple commands of same type to reduce overhead on the >>> virtual SMMUv3 hardware emulation. >>> - Implement vIOMMU commands execution restart and continuation support >> >> So, something like "hypercall continuation"? >> > > Yes > >>> >>> 8. Observation: >>> --------------- >>> Some guest commands issued towards vIOMMU are propagated to pIOMMU >>> command queue (e.g. TLB invalidate). For each pIOMMU, only one command >>> queue is >>> available for all domains. >>> >>> **Risk** >>> Excessive commands requests from abusive guest can cause flooding of >>> physical IOMMU command queue, leading to degraded pIOMMU responsivness >>> on commands issued from other guests. >>> >>> **Mitigation** >>> >>> - Xen credit scheduler limits guest vCPU execution time, securing >>> basic guest rate-limiting. >>> - Batch commands which should be propagated towards pIOMMU cmd queue >>> and enable support for batch execution pause/continuation >>> - If possible, implement domain penalization by adding a per-domain >>> cost counter for vIOMMU/pIOMMU usage. >>> >>> 9. Observation: >>> --------------- >>> vIOMMU feature includes event queue used for forwarding IOMMU events >>> to guest (e.g. translation faults, invalid stream IDs, permission >>> errors). A malicious guest can misconfigure its SMMU state or >>> intentionally trigger faults with high frequency. >>> >>> **Risk** >>> Occurance of IOMMU events with high frequency can cause Xen to flood >>> the event queue and disrupt scheduling with high hypervisor CPU load >>> for events handling. >>> >>> **Mitigation** >>> >>> - Implement fail-safe state by disabling events forwarding when faults >>> are occured with high frequency and not processed by guest. >>> - Batch multiple events of same type to reduce overhead on the virtual >>> SMMUv3 hardware emulation. >>> - Consider disabling event queue for untrusted guests >>> >>> Performance Impact >>> ================== >>> >>> With iommu stage-1 and nested translation inclusion, performance >>> overhead is introduced comparing to existing, stage-2 only usage in >>> Xen. >>> Once mappings are established, translations should not introduce >>> significant overhead. >>> Emulated paths may introduce moderate overhead, primarily affecting >>> device initialization and event handling. >>> Performance impact highly depends on target CPU capabilities. Testing >>> is performed on cortex-a53 based platform. >> >> Which platform exactly? While QEMU emulates SMMU to some extent, we are >> observing somewhat different SMMU behavior on real HW platforms (mostly >> due to cache coherence problems). Also, according to MMU-600 errata, it >> can have lower than expected performance in some use-cases. >> > > Performance measurement are done on QEMU emulated Renesas platform. I'll > add some details for this. > >>> Performance is mostly impacted by emulated vIOMMU operations, results >>> shown in the following table. 
>>> >>> +-------------------------------+---------------------------------+ >>> | vIOMMU Operation | Execution time in guest | >>> +===============================+=================================+ >>> | Reg read | median: 30μs, worst-case: 250μs | >>> +-------------------------------+---------------------------------+ >>> | Reg write | median: 35μs, worst-case: 280μs | >>> +-------------------------------+---------------------------------+ >>> | Invalidate TLB | median: 90μs, worst-case: 1ms+ | >>> +-------------------------------+---------------------------------+ >>> | Invalidate STE | median: 450μs worst_case: 7ms+ | >>> +-------------------------------+---------------------------------+ >>> >>> With vIOMMU exposed to guest, guest OS has to initialize IOMMU device >>> and configure stage-1 mappings for devices attached to it. >>> Following table shows initialization stages which impact stage-1 >>> enabled guest boot time and compares it with stage-1 disabled guest. >>> >>> "NOTE: Device probe execution time varies significantly depending on >>> device complexity. virtio-gpu was selected as a test case due to its >>> extensive use of dynamic DMA allocations and IOMMU mappings, making it >>> a suitable candidate for benchmarking stage-1 vIOMMU behavior." >>> >>> +---------------------+-----------------------+------------------------+ >>> | Stage | Stage-1 Enabled Guest | Stage-1 Disabled Guest | >>> +=====================+=======================+========================+ >>> | IOMMU Init | ~25ms | / | >>> +---------------------+-----------------------+------------------------+ >>> | Dev Attach / Mapping| ~220ms | ~200ms | >>> +---------------------+-----------------------+------------------------+ >>> >>> For devices configured with dynamic DMA mappings, DMA >>> allocate/map/unmap operations performance is also impacted on stage-1 >>> enabled guests. >>> Dynamic DMA mapping operation issues emulated IOMMU functions like >>> mmio write/read and TLB invalidations. >>> As a reference, following table shows performance results for runtime >>> dma operations for virtio-gpu device. >>> >>> +---------------+-------------------------+----------------------------+ >>> | DMA Op | Stage-1 Enabled Guest | Stage-1 Disabled Guest | >>> +===============+=========================+============================+ >>> | dma_alloc | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs| >>> +---------------+-------------------------+----------------------------+ >>> | dma_free | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs | >>> +---------------+-------------------------+----------------------------+ >>> | dma_map | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs| >>> +---------------+-------------------------+----------------------------+ >>> | dma_unmap | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs | >>> +---------------+-------------------------+----------------------------+ >>> >>> Testing >>> ============ >>> >>> - QEMU-based ARM system tests for Stage-1 translation and nested >>> virtualization. >>> - Actual hardware validation on platforms such as Renesas to ensure >>> compatibility with real SMMUv3 implementations. >>> - Unit/Functional tests validating correct translations (not implemented). >>> >>> Migration and Compatibility >>> =========================== >>> >>> This optional feature defaults to disabled (`viommu=""`) for backward >>> compatibility. >>> >> > > BR, > Milan > Hello Volodymyr, Julien Sorry for the delayed follow-up on this topic. 
We have changed the vIOMMU design from a 1-N to an N-N mapping between vIOMMU
and pIOMMU. Considering the single-vIOMMU model limitation pointed out by
Volodymyr (SID overlaps), the vIOMMU-per-pIOMMU model turned out to be the
only proper solution.
The updated design document follows.
I have added additional details to the design and performance impact sections,
and also indicated future improvements. The security considerations section is
unchanged apart from some minor details according to review comments.
Let me know what you think about the updated design. Once approved, I will
send the updated vIOMMU patch series.

==========================================================
Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================

:Author: Milan Djokic <milan_djokic@epam.com>
:Date: 2025-11-03
:Status: Draft

Introduction
============

The SMMUv3 supports two stages of translation. Each stage of translation can
be independently enabled. An incoming address is logically translated from VA
to IPA in stage 1, then the IPA is input to stage 2, which translates the IPA
to the output PA. Stage-1 translation support is required to provide isolation
between different devices within an OS. Xen already supports Stage-2
translation, but there is no support for Stage-1 translation.
This design proposal outlines the introduction of Stage-1 SMMUv3 support in
Xen for ARM guests.

Motivation
==========

ARM systems utilizing SMMUv3 require stage-1 address translation to ensure
secure DMA and guest-managed I/O memory mappings.
With stage-1 enabled, the guest manages IOVA to IPA mappings through its own
IOMMU driver.

This feature enables:

- Stage-1 translation in the guest domain
- Safe device passthrough with a per-device address translation table

Design Overview
===============

These changes provide emulated SMMUv3 support:

- **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support in
  the SMMUv3 driver.
- **vIOMMU Abstraction**: virtual IOMMU framework for guest stage-1 handling.
- **Register/Command Emulation**: SMMUv3 register emulation and command queue
  handling.
- **Device Tree Extensions**: adds `iommus` and virtual SMMUv3 nodes to device
  trees for dom0 and dom0less scenarios.
- **Runtime Configuration**: introduces a `viommu` boot parameter for dynamic
  enablement.

A separate vIOMMU device is exposed to the guest for every physical IOMMU in
the system.
The vIOMMU feature is split into two components: a generic vIOMMU framework
and a backend implementation for the target IOMMU.
The backend implementation contains the IOMMU-specific structures and command
handling (only SMMUv3 is currently supported).
This structure allows potential reuse of the stage-1 feature for other IOMMU
types.
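To make the framework/backend split more concrete, the following is a minimal
C sketch of how the generic core could dispatch guest accesses to an SMMUv3
backend. It is illustrative only: the structure and function names
(`viommu_ops`, `viommu_register_type`, `viommu_create_instance`) are
assumptions made for this document, not the interface of the actual patches::

    /* Illustrative sketch only -- not the interface of the patch series. */
    struct viommu_ops {
        int  (*domain_init)(struct domain *d);     /* allocate per-domain vIOMMU state */
        void (*domain_destroy)(struct domain *d);  /* tear it down with the domain */
        int  (*mmio_read)(struct vcpu *v, paddr_t addr, unsigned int len,
                          register_t *out);        /* emulate register reads */
        int  (*mmio_write)(struct vcpu *v, paddr_t addr, unsigned int len,
                           register_t val);        /* emulate register writes */
    };

    struct viommu_desc {
        const char *dt_compat;                     /* e.g. "arm,smmu-v3" for vSMMUv3 */
        const struct viommu_ops *ops;
    };

    /* The generic core keeps one descriptor per supported backend type... */
    int viommu_register_type(const struct viommu_desc *desc);

    /*
     * ...and instantiates one virtual IOMMU per physical IOMMU for a guest,
     * i.e. a system with N pIOMMUs exposes N vIOMMU devices (the N-N model).
     */
    int viommu_create_instance(struct domain *d, const struct viommu_desc *desc,
                               paddr_t base, unsigned int virq);

In the guest configuration, the feature would then be enabled per domain via
the `viommu` xl option (for example `viommu = "smmuv3"`; the exact accepted
values are defined by the series, with `viommu=""` meaning disabled).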
Security Considerations
=======================

**viommu security benefits:**

- Stage-1 translation ensures guest devices cannot perform unauthorized DMA
  (device I/O address mappings are managed by the guest).
- The emulated IOMMU removes the guest's direct dependency on IOMMU hardware,
  while maintaining domain isolation.

1. Observation:
---------------
Support for Stage-1 translation in SMMUv3 introduces new data structures
(`s1_cfg` alongside `s2_cfg`) and logic to write both Stage-1 and Stage-2
entries in the Stream Table Entry (STE), including an `abort` field to handle
partial configuration states.

**Risk:**
Without proper handling, a partially applied Stage-1 configuration might leave
guest DMA mappings in an inconsistent state, potentially enabling unauthorized
access or causing cross-domain interference.

**Mitigation:** *(Handled by design)*
This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to the
STE and manages the `abort` field so that a Stage-1 configuration is only
considered once it is fully attached. This ensures incomplete or invalid guest
configurations are safely ignored by the hypervisor.

2. Observation:
---------------
Guests can now invalidate Stage-1 caches; invalidations need to be forwarded
to the SMMUv3 hardware to maintain coherence.

**Risk:**
Failing to propagate cache invalidations could leave stale mappings in place,
enabling access through old mappings and possibly data leakage or misrouting.

**Mitigation:** *(Handled by design)*
This feature ensures that guest-initiated invalidations are correctly
forwarded to the hardware, preserving IOMMU coherency.

3. Observation:
---------------
This design introduces substantial new functionality, including the `vIOMMU`
framework, virtual SMMUv3 devices (`vsmmuv3`), command queues, event queues,
domain management, and Device Tree modifications (e.g., `iommus` nodes and
`libxl` integration).

**Risk:**
Large feature expansions increase the attack surface, creating potential for
race conditions, unchecked command inputs, or Device Tree-based
misconfigurations.

**Mitigation:**

- Sanity checks and error-handling improvements have been introduced in this
  feature.
- Further audits still have to be performed for this feature and its
  dependencies in this area.

4. Observation:
---------------
The code includes transformations to handle nested translation versus standard
modes and uses guest-configured command queues (e.g., `CMD_CFGI_STE`) and
event notifications.

**Risk:**
Malicious or malformed queue commands from guests could bypass validation,
manipulate SMMUv3 state, or cause system instability.

**Mitigation:** *(Handled by design)*
Built-in validation of command queue entries and sanitization mechanisms
ensure that only permitted configurations are applied. This is supported via
additions in the `vsmmuv3` and `cmdqueue` handling code.
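As an illustration of the kind of command sanitization referred to above, the
sketch below shows how a guest `CMD_CFGI_STE` command could be validated
before any hardware state is derived from it. This is a simplified,
hypothetical example; the field accessors and the `vsmmu_sid_owned_by()` /
`vsmmu_attach_guest_ste()` helpers are assumptions, not code from the patches::

    /* Hypothetical sketch of guest command sanitization -- not the series' code. */

    #define CMDQ_OP_CFGI_STE  0x03            /* SMMUv3 CMD_CFGI_STE opcode */

    struct vsmmu_cmd {
        uint64_t dw[2];                       /* a command is two 64-bit words */
    };

    static inline uint32_t cmd_opcode(const struct vsmmu_cmd *cmd)
    {
        return cmd->dw[0] & 0xff;             /* opcode lives in bits [7:0] */
    }

    static inline uint32_t cmd_sid(const struct vsmmu_cmd *cmd)
    {
        return cmd->dw[0] >> 32;              /* StreamID in bits [63:32] */
    }

    /*
     * Only act on commands the emulation understands, and only for StreamIDs
     * assigned to this domain.  Anything else is rejected (and may be reported
     * back to the guest as a command queue error, e.g. CERROR_ILL).
     */
    static int vsmmu_handle_cfgi_ste(struct domain *d, const struct vsmmu_cmd *cmd)
    {
        uint32_t sid = cmd_sid(cmd);

        if ( cmd_opcode(cmd) != CMDQ_OP_CFGI_STE )
            return -EOPNOTSUPP;

        if ( !vsmmu_sid_owned_by(d, sid) )    /* ownership check, hypothetical */
            return -EPERM;

        /* Fetch the guest STE, validate its Stage-1 fields, then attach s1_cfg. */
        return vsmmu_attach_guest_ste(d, sid);
    }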
5. Observation:
---------------
Device Tree modifications enable device assignment and configuration; guest DT
fragments (e.g., `iommus` properties) are added via `libxl`.

**Risk:**
Erroneous or malicious Device Tree injection could result in device misbinding
or guest access to unauthorized hardware.

**Mitigation:**

- `libxl` performs checks of the guest configuration and parses only
  predefined DT fragments and nodes, reducing risk.
- The system integrator must ensure correct resource mapping in the guest
  Device Tree (DT) fragments.

6. Observation:
---------------
Introducing optional per-guest features (the `viommu` argument in the xl guest
config) means some guests may opt out.

**Risk:**
Differences between guests with and without `viommu` may cause unexpected
behavior or privilege drift.

**Mitigation:**
Verify that downgrade paths are safe and well-isolated; ensure missing support
doesn't cause security issues. Additional audits on emulation paths and domain
interference need to be performed in a multi-guest environment.

7. Observation:
---------------
Guests have the ability to issue Stage-1 IOMMU commands such as cache
invalidation, stream table entry configuration, etc. An adversarial guest may
issue a high volume of commands in rapid succession.

**Risk:**
Excessive command requests can cause high hypervisor CPU consumption and
disrupt scheduling, leading to degraded system responsiveness and potential
denial-of-service scenarios.

**Mitigation:**

- The Xen scheduler limits guest vCPU execution time, securing basic guest
  rate-limiting.
- Batch multiple commands of the same type to reduce overhead on the virtual
  SMMUv3 hardware emulation.
- Implement vIOMMU command execution restart and continuation support (a
  sketch of one possible approach follows the security considerations).

8. Observation:
---------------
Some guest commands issued towards the vIOMMU are propagated to the pIOMMU
command queue (e.g. TLB invalidate).

**Risk:**
Excessive command requests from an abusive guest can cause flooding of the
physical IOMMU command queue, leading to degraded pIOMMU responsiveness for
commands issued from other guests.

**Mitigation:**

- The Xen scheduler limits guest vCPU execution time, securing basic guest
  rate-limiting.
- Batch commands which should be propagated towards the pIOMMU command queue
  and enable support for batch execution pause/continuation.
- If possible, implement domain penalization by adding a per-domain cost
  counter for vIOMMU/pIOMMU usage.

9. Observation:
---------------
The vIOMMU feature includes an event queue used for forwarding IOMMU events to
the guest (e.g. translation faults, invalid stream IDs, permission errors). A
malicious guest can misconfigure its SMMU state or intentionally trigger
faults with high frequency.

**Risk:**
IOMMU events occurring at high frequency can cause Xen to flood the event
queue and disrupt scheduling with high hypervisor CPU load for event handling.

**Mitigation:**

- Implement a fail-safe state by disabling event forwarding when faults occur
  at high frequency and are not processed by the guest.
- Batch multiple events of the same type to reduce overhead on the virtual
  SMMUv3 hardware emulation.
- Consider disabling the event queue for untrusted guests.
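The batching and continuation mitigations in Observations 7 and 8 could, for
example, bound the number of guest command queue entries processed per trap
and remember the consumer index so that processing resumes later instead of
monopolizing a physical CPU. The sketch below illustrates that idea; the
budget value, structure fields and helpers (`vsmmu_read_guest_cmd()`,
`vsmmu_handle_cmd()`, `vsmmu_schedule_continuation()`) are assumptions for
this document, not the implementation in the patches::

    /* Hypothetical sketch of bounded command queue draining -- not the series' code. */

    #define VSMMU_CMDQ_BUDGET  64        /* max commands handled per CMDQ_PROD write */

    struct vsmmu_cmdq {
        uint32_t cons;                   /* emulated consumer index */
        uint32_t prod;                   /* producer index last written by the guest */
        uint32_t log2size;               /* queue size configured by the guest */
    };

    /*
     * Called when the guest updates SMMU_CMDQ_PROD.  At most VSMMU_CMDQ_BUDGET
     * commands are consumed per invocation; any remainder stays in the queue
     * and is handled on a later invocation (e.g. from a tasklet), so a single
     * guest cannot monopolize the CPU or flood the physical command queue.
     * Wrap-bit handling is omitted for brevity.
     */
    static void vsmmu_drain_cmdq(struct domain *d, struct vsmmu_cmdq *q)
    {
        unsigned int handled;

        for ( handled = 0;
              handled < VSMMU_CMDQ_BUDGET && q->cons != q->prod;
              handled++ )
        {
            struct vsmmu_cmd cmd;

            if ( vsmmu_read_guest_cmd(d, q, q->cons, &cmd) )
                break;                   /* unreadable guest memory: stop draining */

            vsmmu_handle_cmd(d, &cmd);   /* validate, then emulate or propagate */
            q->cons = (q->cons + 1) & ((1u << q->log2size) - 1);
        }

        /* Work left over?  Schedule a continuation instead of spinning here. */
        if ( q->cons != q->prod )
            vsmmu_schedule_continuation(d);
    }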
Performance Impact
==================

With IOMMU stage-1 and nested translation included, a performance overhead is
introduced compared to the existing, stage-2-only usage in Xen.
Once mappings are established, translations should not introduce significant
overhead.
Emulated paths may introduce moderate overhead, primarily affecting device
initialization and event handling.
The performance impact depends heavily on the capabilities of the target CPU.
Testing was performed on QEMU virt and Renesas R-Car (QEMU emulated)
platforms.
Performance is mostly impacted by emulated vIOMMU operations; results are
shown in the following table.

+-------------------------------+---------------------------------+
| vIOMMU Operation              | Execution time in guest         |
+===============================+=================================+
| Reg read                      | median: 30μs, worst-case: 250μs |
+-------------------------------+---------------------------------+
| Reg write                     | median: 35μs, worst-case: 280μs |
+-------------------------------+---------------------------------+
| Invalidate TLB                | median: 90μs, worst-case: 1ms+  |
+-------------------------------+---------------------------------+
| Invalidate STE                | median: 450μs, worst-case: 7ms+ |
+-------------------------------+---------------------------------+

With a vIOMMU exposed to the guest, the guest OS has to initialize the IOMMU
device and configure stage-1 mappings for the devices attached to it.
The following table shows the initialization stages which impact the boot time
of a stage-1 enabled guest and compares them with a stage-1 disabled guest.

NOTE: Device probe execution time varies significantly depending on device
complexity. virtio-gpu was selected as a test case due to its extensive use of
dynamic DMA allocations and IOMMU mappings, making it a suitable candidate for
benchmarking stage-1 vIOMMU behavior.

+---------------------+-----------------------+------------------------+
| Stage               | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
+=====================+=======================+========================+
| IOMMU Init          | ~25ms                 | /                      |
+---------------------+-----------------------+------------------------+
| Dev Attach / Mapping| ~220ms                | ~200ms                 |
+---------------------+-----------------------+------------------------+

For devices configured with dynamic DMA mappings, the performance of DMA
allocate/map/unmap operations is also impacted on stage-1 enabled guests.
Dynamic DMA mapping operations trigger emulated IOMMU functions such as MMIO
writes/reads and TLB invalidations.
As a reference, the following table shows performance results for runtime DMA
operations on a virtio-gpu device.

+---------------+-------------------------+----------------------------+
| DMA Op        | Stage-1 Enabled Guest   | Stage-1 Disabled Guest     |
+===============+=========================+============================+
| dma_alloc     | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs|
+---------------+-------------------------+----------------------------+
| dma_free      | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs |
+---------------+-------------------------+----------------------------+
| dma_map       | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs|
+---------------+-------------------------+----------------------------+
| dma_unmap     | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs |
+---------------+-------------------------+----------------------------+

Testing
=======

- QEMU-based ARM system tests for Stage-1 translation.
- Actual hardware validation to ensure compatibility with real SMMUv3
  implementations.
- Unit/Functional tests validating correct translations (not implemented).

Migration and Compatibility
===========================

This optional feature defaults to disabled (`viommu=""`) for backward
compatibility.

Future improvements
===================

- Implement the proposed mitigations for security risks that are not covered
  by the current design (event batching, command execution continuation).
- Support for other IOMMU hardware (Renesas, RISC-V, etc.).
- The current implementation statically defines SPIs and MMIO regions for
  emulated devices, supporting up to 16 vIOMMUs per guest. Future improvements
  would include a configurable number of vIOMMUs or automatic runtime
  resolution for the target platform.

References
==========

- Original feature implemented by Rahul Singh:
  https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@arm.com/
- SMMUv3 architecture documentation
- Existing vIOMMU code patterns

BR,
Milan
On 03/11/2025 13:16, Milan Djokic wrote: > Hello Volodymyr, Julien Hi Milan, Thanks for the new update. For the future, can you trim your reply? > Sorry for the delayed follow-up on this topic. > We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU and > pIOMMU. Considering single vIOMMU model limitation pointed out by > Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the > only proper solution. I am not sure to fully understand. My assumption with the single vIOMMU is you have a virtual SID that would be mapped to a (pIOMMU, physical SID). Does this means in your solution you will end up with multiple vPCI as well and then map pBDF == vBDF? (this because the SID have to be fixed at boot) > Following is the updated design document. > I have added additional details to the design and performance impact > sections, and also indicated future improvements. Security > considerations section is unchanged apart from some minor details > according to review comments. > Let me know what do you think about updated design. Once approved, I > will send the updated vIOMMU patch series. > > > ========================================================== > Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests > ========================================================== > > :Author: Milan Djokic <milan_djokic@epam.com> > :Date: 2025-11-03 > :Status: Draft > > Introduction > ============ > > The SMMUv3 supports two stages of translation. Each stage of translation > can be > independently enabled. An incoming address is logically translated from > VA to > IPA in stage 1, then the IPA is input to stage 2 which translates the > IPA to > the output PA. Stage 1 translation support is required to provide > isolation between different > devices within OS. XEN already supports Stage 2 translation but there is no > support for Stage 1 translation. > This design proposal outlines the introduction of Stage-1 SMMUv3 support > in Xen for ARM guests. > > Motivation > ========== > > ARM systems utilizing SMMUv3 require stage-1 address translation to > ensure secure DMA and > guest managed I/O memory mappings. > With stage-1 enabed, guest manages IOVA to IPA mappings through its own > IOMMU driver. > > This feature enables: > > - Stage-1 translation in guest domain > - Safe device passthrough with per-device address translation table I find this misleading. Even without this feature, device passthrough is still safe in the sense a device will be isolated (assuming all the DMA goes through the IOMMU) and will not be able to DMA outside of the guest memory. What the stage-1 is doing is providing an extra layer to control what each device can see. This is useful if you don't trust your devices or you want to assign a device to userspace (e.g. for DPDK). > > Design Overview > =============== > > These changes provide emulated SMMUv3 support: If my understanding is correct, there are all some implications in how we create the PCI topology. It would be good to spell them out. > > - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support > in SMMUv3 driver. > - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1 > handling. > - **Register/Command Emulation**: SMMUv3 register emulation and command > queue handling. > - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to > device trees for dom0 and dom0less scenarios. What about ACPI? > - **Runtime Configuration**: Introduces a `viommu` boot parameter for > dynamic enablement. 
> > Separate vIOMMU device is exposed to guest for every physical IOMMU in > the system. > vIOMMU feature is designed in a way to provide a generic vIOMMU > framework and a backend implementation > for target IOMMU as separate components. > Backend implementation contains specific IOMMU structure and commands > handling (only SMMUv3 currently supported). > This structure allows potential reuse of stage-1 feature for other IOMMU > types. > > Security Considerations > ======================= > > **viommu security benefits:** > > - Stage-1 translation ensures guest devices cannot perform unauthorized > DMA (device I/O address mapping managed by guest). > - Emulated IOMMU removes guest direct dependency on IOMMU hardware, > while maintaining domains isolation. Sorry, I don't follow this argument. Are you saying that it would be possible to emulate a SMMUv3 vIOMMU on top of the IPMMU? > 1. Observation: > --------------- > Support for Stage-1 translation in SMMUv3 introduces new data structures > (`s1_cfg` alongside `s2_cfg`) > and logic to write both Stage-1 and Stage-2 entries in the Stream Table > Entry (STE), including an `abort` > field to handle partial configuration states. > > **Risk:** > Without proper handling, a partially applied Stage-1 configuration might > leave guest DMA mappings in an > inconsistent state, potentially enabling unauthorized access or causing > cross-domain interference. How so? Even if you misconfigure the S1, the S2 would still be properly configured (you just mention partially applied stage-1). > > **Mitigation:** *(Handled by design)* > This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to > STE and manages the `abort` field-only > considering Stage-1 configuration if fully attached. This ensures > incomplete or invalid guest configurations > are safely ignored by the hypervisor. Can you clarify what you mean by invalid guest configurations? > > 2. Observation: > --------------- > Guests can now invalidate Stage-1 caches; invalidation needs forwarding > to SMMUv3 hardware to maintain coherence. > > **Risk:** > Failing to propagate cache invalidation could allow stale mappings, > enabling access to old mappings and possibly > data leakage or misrouting. You are referring to data leakage/misrouting between two devices own by the same guest, right? Xen would still be in charge of flush when the stage-2 is updated. > > **Mitigation:** *(Handled by design)* > This feature ensures that guest-initiated invalidations are correctly > forwarded to the hardware, > preserving IOMMU coherency. How is this a mitigation? You have to properly handle commands. If you don't properly handle them, then yes it will break. > > 4. Observation: > --------------- > The code includes transformations to handle nested translation versus > standard modes and uses guest-configured > command queues (e.g., `CMD_CFGI_STE`) and event notifications. > > **Risk:** > Malicious or malformed queue commands from guests could bypass > validation, manipulate SMMUv3 state, > or cause system instability. > > **Mitigation:** *(Handled by design)* > Built-in validation of command queue entries and sanitization mechanisms > ensure only permitted configurations > are applied. This is true as long as we didn't make an mistake in the configurations ;). > This is supported via additions in `vsmmuv3` and `cmdqueue` > handling code. > > 5. 
Observation: > --------------- > Device Tree modifications enable device assignment and configuration > through guest DT fragments (e.g., `iommus`) > are added via `libxl`. > > **Risk:** > Erroneous or malicious Device Tree injection could result in device > misbinding or guest access to unauthorized > hardware. The DT fragment are not security support and will never be at least until you have can a libfdt that is able to detect malformed Device-Tree (I haven't checked if this has changed recently). > > **Mitigation:** > > - `libxl` perform checks of guest configuration and parse only > predefined dt fragments and nodes, reducing risk. > - The system integrator must ensure correct resource mapping in the > guest Device Tree (DT) fragments. > > 6. Observation: > --------------- > Introducing optional per-guest enabled features (`viommu` argument in xl > guest config) means some guests > may opt-out. > > **Risk:** > Differences between guests with and without `viommu` may cause > unexpected behavior or privilege drift. I don't understand this risk. Can you clarify? > > **Mitigation:** > Verify that downgrade paths are safe and well-isolated; ensure missing > support doesn't cause security issues. > Additional audits on emulation paths and domains interference need to be > performed in a multi-guest environment. > > 7. Observation: > --------------- This observation with 7, 8 and 9 are the most important observations but it seems to be missing some details on how this will be implemented. I will try to provide some questions that should help filling the gaps. > Guests have the ability to issue Stage-1 IOMMU commands like cache > invalidation, stream table entries > configuration, etc. An adversarial guest may issue a high volume of > commands in rapid succession. > > **Risk:** > Excessive commands requests can cause high hypervisor CPU consumption > and disrupt scheduling, > leading to degraded system responsiveness and potential denial-of- > service scenarios. > > **Mitigation:** > > - Xen scheduler limits guest vCPU execution time, securing basic guest > rate-limiting. This really depends on your scheduler. Some scheduler (e.g. NULL) will not do any scheduling at all. Furthermore, the scheduler only preempt EL1/EL0. It doesn't preempt EL2, so any long running operation need manual preemption. Therefore, I wouldn't consider this as a mitigation. > - Batch multiple commands of same type to reduce overhead on the virtual > SMMUv3 hardware emulation. The guest can send commands in any order. So can you expand how this would work? Maybe with some example. > - Implement vIOMMU commands execution restart and continuation support This needs a bit more details. How will you decide whether to restart and what would be the action? (I guess it will be re-executing the instruction to write to the CWRITER). > > 8. Observation: > --------------- > Some guest commands issued towards vIOMMU are propagated to pIOMMU > command queue (e.g. TLB invalidate). > > **Risk:** > Excessive commands requests from abusive guest can cause flooding of > physical IOMMU command queue, > leading to degraded pIOMMU responsivness on commands issued from other > guests. > > **Mitigation:** > > - Xen credit scheduler limits guest vCPU execution time, securing basic > guest rate-limiting. Same as above. This mitigation cannot be used. > - Batch commands which should be propagated towards pIOMMU cmd queue and > enable support for batch > execution pause/continuation Can this be expanded? 
> - If possible, implement domain penalization by adding a per-domain cost > counter for vIOMMU/pIOMMU usage. Can this be expanded? > > 9. Observation: > --------------- > vIOMMU feature includes event queue used for forwarding IOMMU events to > guest > (e.g. translation faults, invalid stream IDs, permission errors). > A malicious guest can misconfigure its SMMU state or intentionally > trigger faults with high frequency. > > **Risk:** > Occurance of IOMMU events with high frequency can cause Xen to flood the s/occurance/occurrence/ > event queue and disrupt scheduling with > high hypervisor CPU load for events handling. > > **Mitigation:** > > - Implement fail-safe state by disabling events forwarding when faults > are occured with high frequency and > not processed by guest. I am not sure to understand how this would work. Can you expand? > - Batch multiple events of same type to reduce overhead on the virtual > SMMUv3 hardware emulation. Ditto. > - Consider disabling event queue for untrusted guests My understanding is there is only a single physical event queue. Xen would be responsible to handle the events in the queue and forward to the respective guests. If so, it is not clear what you mean by "disable event queue". > > Performance Impact > ================== > > With iommu stage-1 and nested translation inclusion, performance > overhead is introduced comparing to existing, > stage-2 only usage in Xen. Once mappings are established, translations > should not introduce significant overhead. > Emulated paths may introduce moderate overhead, primarily affecting > device initialization and event handling. > Performance impact highly depends on target CPU capabilities. > Testing is performed on QEMU virt and Renesas R-Car (QEMU emulated) > platforms. I am afraid QEMU is not a reliable platform to do performance testing. Don't you have a real HW with vIOMMU support? [...] > References > ========== > > - Original feature implemented by Rahul Singh: > > https://patchwork.kernel.org/project/xen-devel/cover/ > cover.1669888522.git.rahul.singh@arm.com/ > - SMMUv3 architecture documentation > - Existing vIOMMU code patterns I am not sure what this is referring to? Cheers, -- Julien Grall
Hi Julien, On 11/27/25 11:22, Julien Grall wrote: >> We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU and >> pIOMMU. Considering single vIOMMU model limitation pointed out by >> Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the >> only proper solution. > > I am not sure to fully understand. My assumption with the single vIOMMU > is you have a virtual SID that would be mapped to a (pIOMMU, physical > SID). In the original single vIOMMU implementation, vSID was also equal to pSID, we didn't have SW mapping layer between them. Once SID overlap issue was discovered with this model, I have switched to vIOMMU-per-pIOMMU model. Alternative was to introduce a SW mapping layer and stick with a single vIOMMU model. Imo, vSID->pSID mapping layer would overcomplicate the design, especially for PCI RC streamIDs handling. On the other hand, if even a multi-vIOMMU model introduces problems that I am not aware of yet, adding a complex mapping layer would be the only viable solution. > Does this means in your solution you will end up with multiple > vPCI as well and then map pBDF == vBDF? (this because the SID have to be > fixed at boot) > The important thing which I haven't mentioned here is that our focus is on non-PCI devices for this feature atm. If I'm not mistaken, arm PCI passthrough is still work in progress, so our plan was to implement full vIOMMU PCI support in the future, once PCI passthrough support is complete for arm. Of course, we need to make sure that vIOMMU design provides a suitable infrastructure for PCI. To answer your question, yes we will have multiple vPCI nodes with this model, establishing 1-1 vSID-pSID mapping (same iommu-map range between pPCI-vPCI). For pBDF to vBDF 1-1 mapping, I'm not sure if this is necessary. My understanding is that vBDF->pBDF mapping does not affect vSID->pSID mapping. Am I wrong here? >> ========================================================== >> Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests >> ========================================================== >> >> :Author: Milan Djokic <milan_djokic@epam.com> >> :Date: 2025-11-03 >> :Status: Draft >> >> Introduction >> ============ >> >> The SMMUv3 supports two stages of translation. Each stage of translation >> can be >> independently enabled. An incoming address is logically translated from >> VA to >> IPA in stage 1, then the IPA is input to stage 2 which translates the >> IPA to >> the output PA. Stage 1 translation support is required to provide >> isolation between different >> devices within OS. XEN already supports Stage 2 translation but there is no >> support for Stage 1 translation. >> This design proposal outlines the introduction of Stage-1 SMMUv3 support >> in Xen for ARM guests. >> >> Motivation >> ========== >> >> ARM systems utilizing SMMUv3 require stage-1 address translation to >> ensure secure DMA and >> guest managed I/O memory mappings. >> With stage-1 enabed, guest manages IOVA to IPA mappings through its own >> IOMMU driver. >> >> This feature enables: >> >> - Stage-1 translation in guest domain >> - Safe device passthrough with per-device address translation table > > I find this misleading. Even without this feature, device passthrough is > still safe in the sense a device will be isolated (assuming all the DMA > goes through the IOMMU) and will not be able to DMA outside of the guest > memory. What the stage-1 is doing is providing an extra layer to control > what each device can see. 
This is useful if you don't trust your devices > or you want to assign a device to userspace (e.g. for DPDK). > I'll rephrase this. >> >> Design Overview >> =============== >> >> These changes provide emulated SMMUv3 support: > > If my understanding is correct, there are all some implications in how > we create the PCI topology. It would be good to spell them out. > Sure, I will outline them. >> >> - **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support >> in SMMUv3 driver. >> - **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1 >> handling. >> - **Register/Command Emulation**: SMMUv3 register emulation and command >> queue handling. >> - **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to >> device trees for dom0 and dom0less scenarios. > > What about ACPI? > ACPI support is not part of this feature atm. This will be a topic for future updates. >> - **Runtime Configuration**: Introduces a `viommu` boot parameter for >> dynamic enablement. >> >> Separate vIOMMU device is exposed to guest for every physical IOMMU in >> the system. >> vIOMMU feature is designed in a way to provide a generic vIOMMU >> framework and a backend implementation >> for target IOMMU as separate components. >> Backend implementation contains specific IOMMU structure and commands >> handling (only SMMUv3 currently supported). >> This structure allows potential reuse of stage-1 feature for other IOMMU >> types. >> >> Security Considerations >> ======================= >> >> **viommu security benefits:** >> >> - Stage-1 translation ensures guest devices cannot perform unauthorized >> DMA (device I/O address mapping managed by guest). >> - Emulated IOMMU removes guest direct dependency on IOMMU hardware, >> while maintaining domains isolation. > > Sorry, I don't follow this argument. Are you saying that it would be > possible to emulate a SMMUv3 vIOMMU on top of the IPMMU? > No, this would not work. Emulated IOMMU has to match with the pIOMMU type. The argument only points out that we are emulating IOMMU, so the guest does not need direct HW interface for IOMMU functions. >> 1. Observation: >> --------------- >> Support for Stage-1 translation in SMMUv3 introduces new data structures >> (`s1_cfg` alongside `s2_cfg`) >> and logic to write both Stage-1 and Stage-2 entries in the Stream Table >> Entry (STE), including an `abort` >> field to handle partial configuration states. >> >> **Risk:** >> Without proper handling, a partially applied Stage-1 configuration might >> leave guest DMA mappings in an >> inconsistent state, potentially enabling unauthorized access or causing >> cross-domain interference. > > How so? Even if you misconfigure the S1, the S2 would still be properly > configured (you just mention partially applied stage-1). > This could be the case when we have only stage-1. But yes, this is improbable case for xen, stage-2 should be mentioned also, will fix this. >> >> **Mitigation:** *(Handled by design)* >> This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to >> STE and manages the `abort` field-only >> considering Stage-1 configuration if fully attached. This ensures >> incomplete or invalid guest configurations >> are safely ignored by the hypervisor. > > Can you clarify what you mean by invalid guest configurations? > s1 and s2 config will be considered only if configured for the guest device. E.g. if only stage-2 is attached for the guest device, stage-1 configuration will be invalid, but safely ignored. 
I'll change this to "device configuration" instead of ambiguous "guest configuration". >> >> 2. Observation: >> --------------- >> Guests can now invalidate Stage-1 caches; invalidation needs forwarding >> to SMMUv3 hardware to maintain coherence. >> >> **Risk:** >> Failing to propagate cache invalidation could allow stale mappings, >> enabling access to old mappings and possibly >> data leakage or misrouting. > > You are referring to data leakage/misrouting between two devices own by > the same guest, right? Xen would still be in charge of flush when the > stage-2 is updated. > Yes, this risk could affect only guests, not xen. >> >> **Mitigation:** *(Handled by design)* >> This feature ensures that guest-initiated invalidations are correctly >> forwarded to the hardware, >> preserving IOMMU coherency. > > How is this a mitigation? You have to properly handle commands. If you > don't properly handle them, then yes it will break. > Not really a mitigation, will remove it. Guest is responsible for the regular initiation of invalidation requests to mitigate this risk. >> >> 4. Observation: >> --------------- >> The code includes transformations to handle nested translation versus >> standard modes and uses guest-configured >> command queues (e.g., `CMD_CFGI_STE`) and event notifications. >> >> **Risk:** >> Malicious or malformed queue commands from guests could bypass >> validation, manipulate SMMUv3 state, >> or cause system instability. >> >> **Mitigation:** *(Handled by design)* >> Built-in validation of command queue entries and sanitization mechanisms >> ensure only permitted configurations >> are applied. > > This is true as long as we didn't make an mistake in the configurations ;). > Yes, but I don’t see anything we can do to prevent configuration mistakes. > >> This is supported via additions in `vsmmuv3` and `cmdqueue` >> handling code. >> >> 5. Observation: >> --------------- >> Device Tree modifications enable device assignment and configuration >> through guest DT fragments (e.g., `iommus`) >> are added via `libxl`. >> >> **Risk:** >> Erroneous or malicious Device Tree injection could result in device >> misbinding or guest access to unauthorized >> hardware. > > The DT fragment are not security support and will never be at least > until you have can a libfdt that is able to detect malformed Device-Tree > (I haven't checked if this has changed recently). > But this should still be considered a risk? Similar to the previous observation, system integrator should ensure that DT fragments are correct. >> >> **Mitigation:** >> >> - `libxl` perform checks of guest configuration and parse only >> predefined dt fragments and nodes, reducing risk. >> - The system integrator must ensure correct resource mapping in the >> guest Device Tree (DT) fragments. > > > 6. Observation: >> --------------- >> Introducing optional per-guest enabled features (`viommu` argument in xl >> guest config) means some guests >> may opt-out. >> >> **Risk:** >> Differences between guests with and without `viommu` may cause >> unexpected behavior or privilege drift. > > I don't understand this risk. Can you clarify? > This risk is similar to the topics discussed in Observations 8 and 9, but in the context of vIOMMU-disabled guests potentially hogging the command and event queues due to faster processing of iommu requests. I will expand this. >> >> **Mitigation:** >> Verify that downgrade paths are safe and well-isolated; ensure missing >> support doesn't cause security issues. 
>> Additional audits on emulation paths and domains interference need to be >> performed in a multi-guest environment. >> >> 7. Observation: >> --------------- > > This observation with 7, 8 and 9 are the most important observations but > it seems to be missing some details on how this will be implemented. I > will try to provide some questions that should help filling the gaps. > Thanks, I will expand these observations according to comments. >> Guests have the ability to issue Stage-1 IOMMU commands like cache >> invalidation, stream table entries >> configuration, etc. An adversarial guest may issue a high volume of >> commands in rapid succession. >> >> **Risk:** >> Excessive commands requests can cause high hypervisor CPU consumption >> and disrupt scheduling, >> leading to degraded system responsiveness and potential denial-of- >> service scenarios. >> >> **Mitigation:** >> >> - Xen scheduler limits guest vCPU execution time, securing basic guest >> rate-limiting. > > This really depends on your scheduler. Some scheduler (e.g. NULL) will > not do any scheduling at all. Furthermore, the scheduler only preempt > EL1/EL0. It doesn't preempt EL2, so any long running operation need > manual preemption. Therefore, I wouldn't consider this as a mitigation. > >> - Batch multiple commands of same type to reduce overhead on the virtual >> SMMUv3 hardware emulation. > > The guest can send commands in any order. So can you expand how this > would work? Maybe with some example. > >> - Implement vIOMMU commands execution restart and continuation support > > This needs a bit more details. How will you decide whether to restart > and what would be the action? (I guess it will be re-executing the > instruction to write to the CWRITER). > >> >> 8. Observation: >> --------------- >> Some guest commands issued towards vIOMMU are propagated to pIOMMU >> command queue (e.g. TLB invalidate). >> >> **Risk:** >> Excessive commands requests from abusive guest can cause flooding of >> physical IOMMU command queue, >> leading to degraded pIOMMU responsivness on commands issued from other >> guests. >> >> **Mitigation:** >> >> - Xen credit scheduler limits guest vCPU execution time, securing basic >> guest rate-limiting. > > Same as above. This mitigation cannot be used. > > >> - Batch commands which should be propagated towards pIOMMU cmd queue and >> enable support for batch >> execution pause/continuation > > Can this be expanded? > >> - If possible, implement domain penalization by adding a per-domain cost >> counter for vIOMMU/pIOMMU usage. > > Can this be expanded? > >> >> 9. Observation: >> --------------- >> vIOMMU feature includes event queue used for forwarding IOMMU events to >> guest >> (e.g. translation faults, invalid stream IDs, permission errors). >> A malicious guest can misconfigure its SMMU state or intentionally >> trigger faults with high frequency. >> >> **Risk:** >> Occurance of IOMMU events with high frequency can cause Xen to flood the > > s/occurance/occurrence/ > >> event queue and disrupt scheduling with >> high hypervisor CPU load for events handling. >> >> **Mitigation:** >> >> - Implement fail-safe state by disabling events forwarding when faults >> are occured with high frequency and >> not processed by guest. > > I am not sure to understand how this would work. Can you expand? > >> - Batch multiple events of same type to reduce overhead on the virtual >> SMMUv3 hardware emulation. > > Ditto. 
> >> - Consider disabling event queue for untrusted guests > > My understanding is there is only a single physical event queue. Xen > would be responsible to handle the events in the queue and forward to > the respective guests. If so, it is not clear what you mean by "disable > event queue". > I was referring to emulated IOMMU event queue. The idea is to make it optional for guests. When disabled, events won't be propagated to the guest. >> >> Performance Impact >> ================== >> >> With iommu stage-1 and nested translation inclusion, performance >> overhead is introduced comparing to existing, >> stage-2 only usage in Xen. Once mappings are established, translations >> should not introduce significant overhead. >> Emulated paths may introduce moderate overhead, primarily affecting >> device initialization and event handling. >> Performance impact highly depends on target CPU capabilities. >> Testing is performed on QEMU virt and Renesas R-Car (QEMU emulated) >> platforms. > > I am afraid QEMU is not a reliable platform to do performance testing. > Don't you have a real HW with vIOMMU support? > Yes, I will provide performance measurement for Renesas HW also. > [...] > >> References >> ========== >> >> - Original feature implemented by Rahul Singh: >> >> https://patchwork.kernel.org/project/xen-devel/cover/ >> cover.1669888522.git.rahul.singh@arm.com/ >> - SMMUv3 architecture documentation >> - Existing vIOMMU code patterns > > I am not sure what this is referring to? > QEMU and KVM IOMMU emulation patterns were used as a reference. BR, Milan
Hi, On 02/12/2025 22:08, Milan Djokic wrote: > Hi Julien, > > On 11/27/25 11:22, Julien Grall wrote: >>> We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU and >>> pIOMMU. Considering single vIOMMU model limitation pointed out by >>> Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the >>> only proper solution. >> >> I am not sure to fully understand. My assumption with the single vIOMMU >> is you have a virtual SID that would be mapped to a (pIOMMU, physical >> SID). > > In the original single vIOMMU implementation, vSID was also equal to > pSID, we didn't have SW mapping layer between them. Once SID overlap > issue was discovered with this model, I have switched to vIOMMU-per- > pIOMMU model. Alternative was to introduce a SW mapping layer and stick > with a single vIOMMU model. Imo, vSID->pSID mapping layer would > overcomplicate the design, especially for PCI RC streamIDs handling. > On the other hand, if even a multi-vIOMMU model introduces problems that > I am not aware of yet, adding a complex mapping layer would be the only > viable solution. > > > Does this means in your solution you will end up with multiple > > vPCI as well and then map pBDF == vBDF? (this because the SID have to be > > fixed at boot) > > > > The important thing which I haven't mentioned here is that our focus is > on non-PCI devices for this feature atm. If I'm not mistaken, arm PCI > passthrough is still work in progress, so our plan was to implement full > vIOMMU PCI support in the future, once PCI passthrough support is > complete for arm. Of course, we need to make sure that vIOMMU design > provides a suitable infrastructure for PCI. > To answer your question, yes we will have multiple vPCI nodes with this > model, establishing 1-1 vSID-pSID mapping (same iommu-map range between > pPCI-vPCI). > For pBDF to vBDF 1-1 mapping, I'm not sure if this is necessary. My > understanding is that vBDF->pBDF mapping does not affect vSID->pSID > mapping. Am I wrong here? From my understanding, the mapping between a vBDF and vSID is setup at domain creation (as this is described in ACPI/Device-Tree). As PCI devices can be hotplug, if you want to enforce vSID == pSID, then you indirectly need to enforce vBDF == pBDF. [...] >>> - **Runtime Configuration**: Introduces a `viommu` boot parameter for >>> dynamic enablement. >>> >>> Separate vIOMMU device is exposed to guest for every physical IOMMU in >>> the system. >>> vIOMMU feature is designed in a way to provide a generic vIOMMU >>> framework and a backend implementation >>> for target IOMMU as separate components. >>> Backend implementation contains specific IOMMU structure and commands >>> handling (only SMMUv3 currently supported). >>> This structure allows potential reuse of stage-1 feature for other IOMMU >>> types. >>> >>> Security Considerations >>> ======================= >>> >>> **viommu security benefits:** >>> >>> - Stage-1 translation ensures guest devices cannot perform unauthorized >>> DMA (device I/O address mapping managed by guest). >>> - Emulated IOMMU removes guest direct dependency on IOMMU hardware, >>> while maintaining domains isolation. >> >> Sorry, I don't follow this argument. Are you saying that it would be >> possible to emulate a SMMUv3 vIOMMU on top of the IPMMU? >> > > No, this would not work. Emulated IOMMU has to match with the pIOMMU type. > The argument only points out that we are emulating IOMMU, so the guest > does not need direct HW interface for IOMMU functions. 
Sorry, but I am still missing how this is a security benefits. [...] >>> >>> 2. Observation: >>> --------------- >>> Guests can now invalidate Stage-1 caches; invalidation needs forwarding >>> to SMMUv3 hardware to maintain coherence. >>> >>> **Risk:** >>> Failing to propagate cache invalidation could allow stale mappings, >>> enabling access to old mappings and possibly >>> data leakage or misrouting. >> >> You are referring to data leakage/misrouting between two devices own by >> the same guest, right? Xen would still be in charge of flush when the >> stage-2 is updated. >> > > Yes, this risk could affect only guests, not xen. But it would affect a single guest right? IOW, it is not possible for guest A to leak data to guest B even if we don't properly invalidate stage-1. Correct? > >>> >>> **Mitigation:** *(Handled by design)* >>> This feature ensures that guest-initiated invalidations are correctly >>> forwarded to the hardware, >>> preserving IOMMU coherency. >> >> How is this a mitigation? You have to properly handle commands. If you >> don't properly handle them, then yes it will break. >> > > Not really a mitigation, will remove it. Guest is responsible for the > regular initiation of invalidation requests to mitigate this risk. > >>> >>> 4. Observation: >>> --------------- >>> The code includes transformations to handle nested translation versus >>> standard modes and uses guest-configured >>> command queues (e.g., `CMD_CFGI_STE`) and event notifications. >>> >>> **Risk:** >>> Malicious or malformed queue commands from guests could bypass >>> validation, manipulate SMMUv3 state, >>> or cause system instability. >>> >>> **Mitigation:** *(Handled by design)* >>> Built-in validation of command queue entries and sanitization mechanisms >>> ensure only permitted configurations >>> are applied. >> >> This is true as long as we didn't make an mistake in the >> configurations ;). >> > > Yes, but I don’t see anything we can do to prevent configuration mistakes. There is nothing really preventing it. Same for ... > >> >>> This is supported via additions in `vsmmuv3` and `cmdqueue` >>> handling code. >>> >>> 5. Observation: >>> --------------- >>> Device Tree modifications enable device assignment and configuration >>> through guest DT fragments (e.g., `iommus`) >>> are added via `libxl`. >>> >>> **Risk:** >>> Erroneous or malicious Device Tree injection could result in device >>> misbinding or guest access to unauthorized >>> hardware. >> >> The DT fragment are not security support and will never be at least >> until you have can a libfdt that is able to detect malformed Device-Tree >> (I haven't checked if this has changed recently). >> > > But this should still be considered a risk? Similar to the previous > observation, system integrator should ensure that DT fragments are correct. ... this one. I agree they are risks, but they don't provide much input in the design of the vIOMMU. I am a lot more concerned for the scheduling part because the resources are shared. >> My understanding is there is only a single physical event queue. Xen >> would be responsible to handle the events in the queue and forward to >> the respective guests. If so, it is not clear what you mean by "disable >> event queue". >> > > I was referring to emulated IOMMU event queue. The idea is to make it > optional for guests. When disabled, events won't be propagated to the > guest. But Xen will still receive the events, correct? If so, how does it make it better? 
> >>> >>> Performance Impact >>> ================== >>> >>> With iommu stage-1 and nested translation inclusion, performance >>> overhead is introduced comparing to existing, >>> stage-2 only usage in Xen. Once mappings are established, translations >>> should not introduce significant overhead. >>> Emulated paths may introduce moderate overhead, primarily affecting >>> device initialization and event handling. >>> Performance impact highly depends on target CPU capabilities. >>> Testing is performed on QEMU virt and Renesas R-Car (QEMU emulated) >>> platforms. >> >> I am afraid QEMU is not a reliable platform to do performance testing. >> Don't you have a real HW with vIOMMU support? >> > > Yes, I will provide performance measurement for Renesas HW also. FWIW, I don't need to know the performance right now. I am mostly pointing out that if you want to provide performance number, then they should really come from real HW rather than QEMU. Cheers, -- Julien Grall
Hi Julien, On 12/3/25 11:32, Julien Grall wrote: > Hi, > > On 02/12/2025 22:08, Milan Djokic wrote: >> Hi Julien, >> >> On 11/27/25 11:22, Julien Grall wrote: >>>> We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU and >>>> pIOMMU. Considering single vIOMMU model limitation pointed out by >>>> Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the >>>> only proper solution. >> >> > Does this means in your solution you will end up with multiple >> > vPCI as well and then map pBDF == vBDF? (this because the SID have to be >> > fixed at boot) >> > >> >> To answer your question, yes we will have multiple vPCI nodes with this >> model, establishing 1-1 vSID-pSID mapping (same iommu-map range between >> pPCI-vPCI). >> For pBDF to vBDF 1-1 mapping, I'm not sure if this is necessary. My >> understanding is that vBDF->pBDF mapping does not affect vSID->pSID >> mapping. Am I wrong here? > > From my understanding, the mapping between a vBDF and vSID is setup at > domain creation (as this is described in ACPI/Device-Tree). As PCI > devices can be hotplug, if you want to enforce vSID == pSID, then you > indirectly need to enforce vBDF == pBDF. > I was not aware of that. I will have to do a detailed analysis on this and come back with a solution. Right now I'm not sure how and if enumeration will work with multi vIOMMU/vPCI model. If that's not possible, we will have to introduce a mapping layer for vSID->pSID and go back to single vPCI/vIOMMU model. > [...] > >>>> - **Runtime Configuration**: Introduces a `viommu` boot parameter for >>>> dynamic enablement. >>>> >>>> Separate vIOMMU device is exposed to guest for every physical IOMMU in >>>> the system. >>>> vIOMMU feature is designed in a way to provide a generic vIOMMU >>>> framework and a backend implementation >>>> for target IOMMU as separate components. >>>> Backend implementation contains specific IOMMU structure and commands >>>> handling (only SMMUv3 currently supported). >>>> This structure allows potential reuse of stage-1 feature for other IOMMU >>>> types. >>>> >>>> Security Considerations >>>> ======================= >>>> >>>> **viommu security benefits:** >>>> >>>> - Stage-1 translation ensures guest devices cannot perform unauthorized >>>> DMA (device I/O address mapping managed by guest). >>>> - Emulated IOMMU removes guest direct dependency on IOMMU hardware, >>>> while maintaining domains isolation. >>> >>> Sorry, I don't follow this argument. Are you saying that it would be >>> possible to emulate a SMMUv3 vIOMMU on top of the IPMMU? >>> >> >> No, this would not work. Emulated IOMMU has to match with the pIOMMU type. >> The argument only points out that we are emulating IOMMU, so the guest >> does not need direct HW interface for IOMMU functions. > > Sorry, but I am still missing how this is a security benefits. > Yes, this is a mistake. This should be in the design section. > [...] > > >>>> >>>> 2. Observation: >>>> --------------- >>>> Guests can now invalidate Stage-1 caches; invalidation needs forwarding >>>> to SMMUv3 hardware to maintain coherence. >>>> >>>> **Risk:** >>>> Failing to propagate cache invalidation could allow stale mappings, >>>> enabling access to old mappings and possibly >>>> data leakage or misrouting. >>> >>> You are referring to data leakage/misrouting between two devices own by >>> the same guest, right? Xen would still be in charge of flush when the >>> stage-2 is updated. >>> >> >> Yes, this risk could affect only guests, not xen. 
> > But it would affect a single guest right? IOW, it is not possible for > guest A to leak data to guest B even if we don't properly invalidate > stage-1. Correct? > Correct. I don't see any possible scenario for data leakage between different guests, just between 2 devices assigned to the same guest. I will elaborate on this risk to make it clearer. >>>> >>>> 4. Observation: >>>> --------------- >>>> The code includes transformations to handle nested translation versus >>>> standard modes and uses guest-configured >>>> command queues (e.g., `CMD_CFGI_STE`) and event notifications. >>>> >>>> **Risk:** >>>> Malicious or malformed queue commands from guests could bypass >>>> validation, manipulate SMMUv3 state, >>>> or cause system instability. >>>> >>>> **Mitigation:** *(Handled by design)* >>>> Built-in validation of command queue entries and sanitization mechanisms >>>> ensure only permitted configurations >>>> are applied. >>> >>> This is true as long as we didn't make an mistake in the >>> configurations ;). >>> >> >> Yes, but I don’t see anything we can do to prevent configuration mistakes. > > There is nothing really preventing it. Same for ... >> >>> >>>> This is supported via additions in `vsmmuv3` and `cmdqueue` >>>> handling code. >>>> >>>> 5. Observation: >>>> --------------- >>>> Device Tree modifications enable device assignment and configuration >>>> through guest DT fragments (e.g., `iommus`) >>>> are added via `libxl`. >>>> >>>> **Risk:** >>>> Erroneous or malicious Device Tree injection could result in device >>>> misbinding or guest access to unauthorized >>>> hardware. >>> >>> The DT fragment are not security support and will never be at least >>> until you have can a libfdt that is able to detect malformed Device-Tree >>> (I haven't checked if this has changed recently). >>> >> >> But this should still be considered a risk? Similar to the previous >> observation, system integrator should ensure that DT fragments are correct. > > ... this one. I agree they are risks, but they don't provide much input > in the design of the vIOMMU. > I get your point. I can remove them if considered to be overhead in this context. > I am a lot more concerned for the scheduling part because the resources > are shared. > >>> My understanding is there is only a single physical event queue. Xen >>> would be responsible to handle the events in the queue and forward to >>> the respective guests. If so, it is not clear what you mean by "disable >>> event queue". >>> >> >> I was referring to emulated IOMMU event queue. The idea is to make it >> optional for guests. When disabled, events won't be propagated to the >> guest. > > But Xen will still receive the events, correct? If so, how does it make > it better? > You are correct, Xen will still receive events and handle them in pIOMMU driver. This is only a mitigation for the part introduced by vIOMMU design (events emulation), not the complete solution. This risk has more general context and could be related to stage-2 only guests also (e.g. guests that perform DMA to an address they are not allowed to access, causing translation faults). But imo mitigation for the physical event queue flooding should be part of the pIOMMU driver design Best regards, Milan
Hi Milan, Milan Djokic <milan_djokic@epam.com> writes: > On 9/1/25 13:06, Milan Djokic wrote: [...] > > Hello Volodymyr, Julien > > Sorry for the delayed follow-up on this topic. > We have changed vIOMMU design from 1-N to N-N mapping between vIOMMU > and pIOMMU. Considering single vIOMMU model limitation pointed out by > Volodymyr (SID overlaps), vIOMMU-per-pIOMMU model turned out to be the > only proper solution. > Following is the updated design document. > I have added additional details to the design and performance impact > sections, and also indicated future improvements. Security > considerations section is unchanged apart from some minor details > according to review comments. > Let me know what do you think about updated design. Once approved, I > will send the updated vIOMMU patch series. This looks fine for me. I can't see any immediate flaws here. So let's get to patches :) [...] -- WBR, Volodymyr