Defer the IOMMU translation and support access_type

[PATCH v2 0/5] Defer the IOMMU translation and support access_type

Posted by Jim Shu 1 month, 1 week ago

Note: v1 title is "accel/tcg: Pass the access_type to IOMMUMemoryRegion"

Upcoming security protection devices have a more complex
IOMMUMemoryRegion implementation in the CPU path than the ARM MPC device.
For example, the RISC-V wgChecker [1] may permit access with only RO or
WO permission. Consequently, the IOMMUMemoryRegion can return different
sections for read and write accesses.

To support such IOMMUMemoryRegion behavior in the CPU path, the design
of IOMMU translation must be updated:

1. address_space_translate*() must now pass the access_type to
   IOMMUMemoryRegion.
2. Since IOMMU translation results are too complex to be fully stored
   in the CPU TLB, we defer the translation until the actual access
   occurs. The TLB is also allowed to store the untranslated IOMMU region.
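
As a rough illustration of point 1, the access type of a CPU access can
be turned into a read-only or write-only request before the IOMMU region
is asked to translate. The sketch below is a toy model: every name in it
(toy_access, toy_iommu_flag, toy_access_to_iommu_flag) is hypothetical,
and treating an instruction fetch as a read request is an assumption, not
something the cover letter states.

```c
/* Toy model only: all names are hypothetical, not QEMU definitions. */
enum toy_access { TOY_DATA_LOAD, TOY_DATA_STORE, TOY_INST_FETCH };

enum toy_iommu_flag {
    TOY_IOMMU_NONE = 0,
    TOY_IOMMU_RO   = 1,   /* request read permission only */
    TOY_IOMMU_WO   = 2,   /* request write permission only */
    TOY_IOMMU_RW   = 3
};

/* Map one CPU access type to the permission the IOMMU is asked for. */
enum toy_iommu_flag toy_access_to_iommu_flag(enum toy_access acc)
{
    switch (acc) {
    case TOY_DATA_STORE:
        return TOY_IOMMU_WO;
    case TOY_DATA_LOAD:
    case TOY_INST_FETCH:   /* assumption: a fetch is checked as a read */
        return TOY_IOMMU_RO;
    }
    return TOY_IOMMU_NONE;
}
```

With a mapping of this kind, a load and a store to the same address
produce two distinct translation requests, which is what lets the IOMMU
region answer them differently.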

To implement deferred IOMMU translation, this patchset introduces the
following changes:

1. tlb_set_page_full() no longer translates the IOMMU region
   immediately. Instead, it stores the untranslated region directly in
   the TLB. A new slow-path flag, TLB_IOMMU, is introduced to force
   access into the slow path when a region has not yet been translated
   in the TLB entry.

2. When the CPU utilizes a TLB entry in the slow path, it should perform
   the lazy IOMMU translation of the access_type first. The resulting
   translated region and access type are stored in CPUTLBEntryFull.
   Since the slow path always performs lazy translation first, we can
   switch the CPUTLBEntryFull content to the correct access type before
   use.

3. To accelerate memory access in the fast path, lazy translation can
   update the addend of the CPUTLBEntry when translating the region to a
   host memory region. We restrict the IOMMU region to have a single
   non-zero 'addend' across all permissions. If a second 'addend' is
   present for a CPUTLBEntry, QEMU will trigger an assertion. This
   limitation is sufficient for security devices, as their "secondary"
   region is typically an IO region used to emulate device error
   handling when access is rejected.

4. To support non-slow TLB flags, lazy translation can update the TLB
   flags in the 'addr_idx' of the CPUTLBEntry. Lazy translation only
   updates the flags for the permissions specified in @prot. This
   ensures that each access_type of a translated region maintains
   independent TLB flags. For example, the TLB_DIRTY flag of a memory
   region will not be "polluted" by another permission that translates
   to a different region.
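
The four changes above can be sketched with a small toy model. This is
not the code from the series: the types, flag values and function names
(toy_tlb_entry, TOY_TLB_IOMMU, toy_tlb_lazy_translate and so on) are all
hypothetical, and where the cover letter says QEMU asserts on a second
non-zero addend, the sketch returns false instead so that the case is
easy to observe.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: all names and values are hypothetical. */
enum access { ACC_READ, ACC_WRITE, ACC_EXEC, ACC_COUNT };

#define TOY_TLB_IOMMU 0x1u   /* not yet translated: force the slow path */
#define TOY_TLB_MMIO  0x2u   /* translated to an I/O region: slow path */

struct toy_xlat {            /* result of one IOMMU translation */
    bool is_ram;
    uintptr_t addend;        /* host offset, meaningful only for RAM */
};

typedef struct toy_xlat (*toy_translate_fn)(enum access acc);

struct toy_tlb_entry {
    unsigned flags[ACC_COUNT];   /* independent flags per access type */
    uintptr_t addend;            /* single addend shared by all types */
    toy_translate_fn translate;  /* stands in for the untranslated region */
};

/* Fill without translating: every access type starts on the slow path. */
void toy_tlb_fill(struct toy_tlb_entry *e, toy_translate_fn fn)
{
    for (int a = 0; a < ACC_COUNT; a++) {
        e->flags[a] = TOY_TLB_IOMMU;
    }
    e->addend = 0;
    e->translate = fn;
}

/*
 * Slow path: translate for the requested access type only and update
 * only that access type's flags. Returns false when a second, different
 * non-zero addend appears (the restricted case).
 */
bool toy_tlb_lazy_translate(struct toy_tlb_entry *e, enum access acc)
{
    struct toy_xlat x = e->translate(acc);

    if (!x.is_ram) {
        e->flags[acc] = TOY_TLB_MMIO;
        return true;
    }
    if (e->addend != 0 && e->addend != x.addend) {
        return false;
    }
    e->addend = x.addend;
    e->flags[acc] = 0;           /* later accesses may use the fast path */
    return true;
}

/* Example region: reads and fetches hit RAM, writes hit an I/O region. */
struct toy_xlat toy_ro_checker(enum access acc)
{
    struct toy_xlat x = { .is_ram = (acc != ACC_WRITE), .addend = 0 };

    if (x.is_ram) {
        x.addend = 0x1000;
    }
    return x;
}
```

In this model a read through toy_ro_checker clears the read flags and
sets the shared addend, while a later write only marks the write flags as
I/O and leaves the read side untouched.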

Both RISC-V wgChecker [1] and RISC-V IOPMP [2] devices require this
feature.

[1] RISC-V WG:
https://patchew.org/QEMU/20251021155548.584543-1-jim.shu@sifive.com/
[2] RISC-V IOPMP:
https://patchew.org/QEMU/20250312093735.1517740-1-ethan84@andestech.com/

Changed since v1:
- Remove the ping-pong TLB entry behavior. Instead, defer the IOMMU
  translation until actual access in the CPU path. Provide an IOMMU
  lazy translation function with special handling of the 'addend'
  and 'addr_idx' fields of CPUTLBEntry.
- Fix the checkpatch warning.


Jim Shu (5):
  accel/tcg: Pass access_type as an argument of tlb_set_page*()
  accel/tcg: address_space_translate*() will pass the correct
    iommu_flags
  accel/tcg: Provide early AS translate function
  accel/tcg: Add IOMMU lazy translation function
  accel/tcg: Support IOMMU lazy translation in CPU TLB

 accel/tcg/cputlb.c                   | 247 +++++++++++++++++++++++++--
 include/accel/tcg/iommu.h            |  17 +-
 include/exec/cputlb.h                |  11 +-
 include/exec/tlb-flags.h             |   4 +-
 include/hw/core/cpu.h                |  15 ++
 system/physmem.c                     |  60 ++++++-
 target/alpha/helper.c                |   2 +-
 target/avr/helper.c                  |   3 +-
 target/hppa/mem_helper.c             |   1 -
 target/i386/tcg/system/excp_helper.c |   3 +-
 target/loongarch/tcg/tlb_helper.c    |   2 +-
 target/m68k/helper.c                 |  10 +-
 target/microblaze/helper.c           |   8 +-
 target/mips/tcg/system/tlb_helper.c  |   4 +-
 target/or1k/mmu.c                    |   2 +-
 target/ppc/mmu_helper.c              |   2 +-
 target/riscv/cpu_helper.c            |   2 +-
 target/rx/cpu.c                      |   3 +-
 target/s390x/tcg/excp_helper.c       |   2 +-
 target/sh4/helper.c                  |   3 +-
 target/sparc/mmu_helper.c            |   6 +-
 target/tricore/helper.c              |   2 +-
 target/xtensa/helper.c               |   3 +-
 23 files changed, 354 insertions(+), 58 deletions(-)

-- 
2.43.0

Re: [PATCH v2 0/5] Defer the IOMMU translation and support access_type

Posted by Pierrick Bouvier 1 month ago

For the series:
Acked-by: Pierrick Bouvier <pierrick.bouvier@oss.qualcomm.com>

Re: [PATCH v2 0/5] Defer the IOMMU translation and support access_type

Posted by Pierrick Bouvier 1 month ago

Adding Peter, who might have a good opinion on it since he worked on MPC
implementation.

Re: [PATCH v2 0/5] Defer the IOMMU translation and support access_type

Posted by Pierrick Bouvier 1 month, 1 week ago

Hi Jim,

On 4/21/2026 9:29 AM, Jim Shu wrote:
> Note: v1 title is "accel/tcg: Pass the access_type to IOMMUMemoryRegion"
> 
> Incoming security protection devices feature more complex IOMMUMemoryRegion
> implementation in the CPU path than ARM MPC device. For example,
> RISC-V wgChecker [1] may permit the access with only RO/WO permissions.
> Consequently, the IOMMUMemoryRegion could return different sections for
> read & write access.
>

The consequence does not follow directly for me.

Why is a single translation, with page protection handling read/write
access, not enough here?

What is the (real-world) use case for returning different physical memory
locations depending on read/write access for a given virtual address?

I'm not against the goal of this series, but I'm trying to understand why
it went in this direction, which complicates an already complex code path.

Re: [PATCH v2 0/5] Defer the IOMMU translation and support access_type

Posted by Jim Shu 1 month, 1 week ago

On Thu, Apr 23, 2026 at 12:01 AM Pierrick Bouvier
<pierrick.bouvier@oss.qualcomm.com> wrote:

Hi Pierrick,

Thanks for discussing the design!

>
> Hi Jim,
>
> On 4/21/2026 9:29 AM, Jim Shu wrote:
> > Note: v1 title is "accel/tcg: Pass the access_type to IOMMUMemoryRegion"
> >
> > Incoming security protection devices feature more complex IOMMUMemoryRegion
> > implementation in the CPU path than ARM MPC device. For example,
> > RISC-V wgChecker [1] may permit the access with only RO/WO permissions.
> > Consequently, the IOMMUMemoryRegion could return different sections for
> > read & write access.
> >
>
> The consequence does not follow directly for me.
>
> Why is a single translation, with page protection handling read/write
> access, not enough here?
>
> What is the (real-world) use case for returning different physical memory
> locations depending on read/write access for a given virtual address?

When permission is granted, the wgChecker device returns the original
region of the address, called the downstream region.
When permission is denied, the wgChecker device returns a special
MemoryRegion for error handling, called the blocked_io region.
This region may trigger an IRQ and a bus error. Additionally, memory
accesses to this region will result in 'read 0, write ignored' rather
than triggering exceptions from the MMU or MPU.

Consequently, implementing RWX permission checks on security
protection devices may return different regions (either downstream or
blocked_io) depending on the access type.
This design is similar to the existing ARM MPC device
(hw/misc/tz-mpc.c). The primary difference is that the MPC only checks
the physical address rather than RWX permissions, meaning it returns
either the downstream or blocked_io region for a single address
regardless of the access type.
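
The difference between the two devices can be sketched as follows. This
is a toy model under stated assumptions: the function names, the
permission bits and the boolean "address allowed" input are all
hypothetical and do not come from hw/misc/tz-mpc.c or the wgChecker
patches.

```c
#include <stdbool.h>

/* Toy model only: all names are hypothetical. */
enum access { ACC_READ, ACC_WRITE, ACC_EXEC };
enum region { REGION_DOWNSTREAM, REGION_BLOCKED_IO };

#define PERM_R 0x1u
#define PERM_W 0x2u
#define PERM_X 0x4u

/* Permission bit required by one access type. */
static unsigned perm_needed(enum access acc)
{
    switch (acc) {
    case ACC_READ:
        return PERM_R;
    case ACC_WRITE:
        return PERM_W;
    case ACC_EXEC:
        return PERM_X;
    }
    return 0;
}

/* MPC-like: one decision per address; the access type is ignored. */
enum region toy_mpc_translate(bool addr_allowed, enum access acc)
{
    (void)acc;
    return addr_allowed ? REGION_DOWNSTREAM : REGION_BLOCKED_IO;
}

/* wgChecker-like: the decision depends on the RWX permission held. */
enum region toy_wg_translate(unsigned perms, enum access acc)
{
    return (perms & perm_needed(acc)) ? REGION_DOWNSTREAM
                                      : REGION_BLOCKED_IO;
}
```

With read-only permission, toy_wg_translate() sends reads downstream and
writes to blocked_io for the same address, whereas toy_mpc_translate()
always gives both access types the same region.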

I also considered an alternative design where CPUTLBEntryFull stores
both the successful permission and the blocked_io region of the IOMMU
region.
In this scenario, the slow-path code utilizing CPUTLBEntryFull would
check permissions and return the blocked_io region if access is denied.

However, the address_space_translate_iommu() implementation supports
recursive IOMMU region translation.
If multiple IOMMU regions are encountered during a single address
translation, storing only the first region's permissions is
insufficient.
Ultimately, we still face a situation where RWX permissions might
return different regions separately (e.g., downstream,
blocked_io_iommu1, and blocked_io_iommu2).
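
The recursive case can be sketched with a toy chain of checkers. The
function and constants below are hypothetical, and the chain is modelled
as a flat array of permission masks rather than nested memory regions.

```c
#include <stddef.h>

/* Toy model only: all names are hypothetical. */
enum access { ACC_READ, ACC_WRITE, ACC_EXEC };

#define PERM_R (1u << ACC_READ)
#define PERM_W (1u << ACC_WRITE)
#define PERM_X (1u << ACC_EXEC)

/*
 * Walk a chain of n checkers in translation order.
 * Returns 0 for the downstream region, or i (1-based) for the
 * blocked_io region of the first checker that denies the access.
 */
int toy_chain_translate(const unsigned *perms, size_t n, enum access acc)
{
    unsigned need = 1u << acc;

    for (size_t i = 0; i < n; i++) {
        if (!(perms[i] & need)) {
            return (int)i + 1;
        }
    }
    return 0;
}
```

For a chain where the first checker grants RW and the second grants only
R, a read reaches downstream, a write lands in the second checker's
blocked_io region, and a fetch lands in the first checker's blocked_io
region: three different regions for one address.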

>
> I'm not against the goal of this series, but I'm trying to understand why
> it went in this direction, which complicates an already complex code path.

Regards,
Jim

Re: [PATCH v2 0/5] Defer the IOMMU translation and support access_type

Posted by Pierrick Bouvier 1 month, 1 week ago

On 4/22/2026 11:16 PM, Jim Shu wrote:
> On Thu, Apr 23, 2026 at 12:01 AM Pierrick Bouvier
> <pierrick.bouvier@oss.qualcomm.com> wrote:
> 
> Hi Pierrick,
> 
> Thanks for discussing the design!
> 
>>
>> Hi Jim,
>>
>> On 4/21/2026 9:29 AM, Jim Shu wrote:
>>> Note: v1 title is "accel/tcg: Pass the access_type to IOMMUMemoryRegion"
>>>
>>> Incoming security protection devices feature more complex IOMMUMemoryRegion
>>> implementation in the CPU path than ARM MPC device. For example,
>>> RISC-V wgChecker [1] may permit the access with only RO/WO permissions.
>>> Consequently, the IOMMUMemoryRegion could return different sections for
>>> read & write access.
>>>
>>
>> The consequence does not follow directly for me.
>>
>> Why is a single translation, with page protection handling read/write
>> access, not enough here?
>>
>> What is the (real-world) use case for returning different physical memory
>> locations depending on read/write access for a given virtual address?
> 
> When permission is granted, the wgChecker device returns the original
> region of the address, called the downstream region.
> When permission is denied, the wgChecker device returns a special
> MemoryRegion for error handling, called the blocked_io region.
> This region may trigger an IRQ and a bus error. Additionally, memory
> accesses to this region will result in 'read 0, write ignored' rather
> than triggering exceptions from the MMU or MPU.
> 
> Consequently, implementing RWX permission checks on security
> protection devices may return different regions (either downstream or
> blocked_io) depending on the access type.
> This design is similar to the existing ARM MPC device
> (hw/misc/tz-mpc.c). The primary difference is that the MPC only checks
> the physical address rather than RWX permissions, meaning it returns
> either the downstream or blocked_io region for a single address
> regardless of the access type.
>

It is indeed a big difference. Translation with the MPC device still
results in a single address space being chosen.

> I also considered an alternative design where CPUTLBEntryFull stores
> both the successful permission and the blocked_io region of the IOMMU
> region.
> In this scenario, the slow-path code utilizing CPUTLBEntryFull would
> check permissions and return the blocked_io region if access is denied.
> 
> However, the address_space_translate_iommu() implementation supports
> recursive IOMMU region translation.
> If multiple IOMMU regions are encountered during a single address
> translation, storing only the first region's permissions is
> insufficient.
> Ultimately, we still face a situation where RWX permissions might
> return different regions separately (e.g., downstream,
> blocked_io_iommu1, and blocked_io_iommu2).
>

The "blocked_io_iommu_X" regions are a consequence of the current choice
of implementation, but I still don't see why they are an absolute
necessity.

An alternative would be to treat all accesses as MMIO, but I guess your
goal here is to optimize the code path where we access RAM?

Why can't you reuse the existing page protection mechanism for this,
authorizing read or write? The wgChecker just seems to be an additional
check on top of the existing translation. The fact that it's an
"additional" device is a design/implementation detail, and it could
simply be part of the page protection mechanism. It might require some
additional plumbing when an interrupt is raised though, to redirect it
correctly instead of signaling the CPU as would normally happen.

The fact that it takes a different action based on the read/write
attribute does not fit very well with the existing implementation, as you
noticed. And I wonder if it's worth changing the global design just to
optimize this (optional) use case for RISC-V.
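
One possible reading of this alternative, as a hedged sketch only: the
checker's permissions are folded into the page protection bits, and a
denied access is routed either to the normal CPU exception or to the
checker's own error handling. All names below are hypothetical, and the
routing rule (MMU denial takes priority over checker denial) is an
assumption made for illustration, not something proposed in the thread.

```c
/* Toy model only: all names are hypothetical. */
#define TOY_PAGE_READ  0x1u
#define TOY_PAGE_WRITE 0x2u
#define TOY_PAGE_EXEC  0x4u

enum toy_fault_route {
    TOY_ROUTE_NONE,            /* access is allowed */
    TOY_ROUTE_CPU_EXCEPTION,   /* ordinary MMU/MPU fault */
    TOY_ROUTE_CHECKER_ERROR    /* checker-style handling (IRQ, bus error) */
};

/* Protection actually installed for the page: both must allow it. */
unsigned toy_effective_prot(unsigned mmu_prot, unsigned checker_prot)
{
    return mmu_prot & checker_prot;
}

/* Decide who handles an access that needs the permission bit 'need'. */
enum toy_fault_route toy_route_fault(unsigned mmu_prot,
                                     unsigned checker_prot,
                                     unsigned need)
{
    if (!(mmu_prot & need)) {
        return TOY_ROUTE_CPU_EXCEPTION;
    }
    if (!(checker_prot & need)) {
        return TOY_ROUTE_CHECKER_ERROR;
    }
    return TOY_ROUTE_NONE;
}
```

The second function is where the "additional plumbing" mentioned above
would live in this model: a write denied only by the checker is not
reported to the guest as a page fault.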

>>
>> I'm not against the goal of this series, but I'm trying to understand why
>> it went in this direction, which complicates an already complex code path.
>>
>>> To support such IOMMUMemoryRegion behavior in the CPU path, the design
>>> of IOMMU translation must be updated:
>>>
>>> 1. address_space_translate*() must now pass the access_type to
>>>      IOMMUMemoryRegion.
>>> 2. Since IOMMU translation results are too complex to be fully stored
>>>      in the CPU TLB. we will defer the translation until the actual access
>>>      occurs. Also, TLB is allowed to store the untranslated IOMMU region.
>>>
>>> To implement deferred IOMMU translation, this patchset introduces the
>>> following changes:
>>>
>>> 1. tlb_set_page_full() no longer translates the IOMMU region
>>>      immediately. Instead, it stores the untranslated region directly in
>>>      the TLB. A new slow-path flag, TLB_IOMMU, is introduced to force
>>>      access into the slow path when a region has not yet been translated
>>>      in the TLB entry.
>>>
>>> 2. When the CPU utilizes a TLB entry in the slow path, it should perform
>>>      the lazy IOMMU translation of the access_type first. The resulting
>>>      translated region and access type are stored in CPUTLBEntryFull.
>>>      Since the slow path always performs lazy translation first, we can
>>>      switch the CPUTLBEntryFull content to the correct access type before
>>>      use.
>>>
>>> 3. To accelerate memory access in the fast path, lazy translation can
>>>      update the addend of the CPUTLBEntry when translating the region to a
>>>      host memory region. We restrict the IOMMU region to have a single
>>>      non-zero 'addend' across all permissions. If a second 'addend' is
>>>      present for a CPUTLBEntry, QEMU will trigger an assertion. This
>>>      limitation is sufficient for security devices, as their "secondary"
>>>      region is typically an IO region used to emulate device error
>>>      handling when access is rejected.
>>>
>>> 4. To support non-slow TLB flags, lazy translation can update the TLB
>>>      flags in the 'addr_idx' of the CPUTLBEntry. Lazy translation only
>>>      updates the flags for the permissions specified in @prot. This
>>>      ensures that each access_type of a translated region to maintains
>>>      independent TLB flags. For example, TLB_DIRTY of memory region will
>>>      not be "polluted" from other permission that translated to different
>>>      region.
>>>
>>> Both RISC-V wgChecker [1] and RISC-V IOPMP [2] devices require this
>>> feature.
>>>
>>> [1] RISC-V WG:
>>> https://patchew.org/QEMU/20251021155548.584543-1-jim.shu@sifive.com/
>>> [2] RISC-V IOPMP:
>>> https://patchew.org/QEMU/20250312093735.1517740-1-ethan84@andestech.com/
>>>
>>> Changed since v1:
>>> - Remove the ping-pong TLB entry behavior. Instead, defer the IOMMU
>>>     translation until actual access in the CPU path. Provide a IOMMU
>>>     lazy translation function with the special handling of 'addend'
>>>     and 'addr_idx' fields of CPUTLBEntry.
>>> - Fix the checkpatch warning.
>>>
>>>
>>> Jim Shu (5):
>>>     accel/tcg: Pass access_type as an argument of tlb_set_page*()
>>>     accel/tcg: address_space_translate*() will pass the correct
>>>       iommu_flags
>>>     accel/tcg: Provide early AS translate function
>>>     accel/tcg: Add IOMMU lazy translation function
>>>     accel/tcg: Support IOMMU lazy translation in CPU TLB
>>>
>>>    accel/tcg/cputlb.c                   | 247 +++++++++++++++++++++++++--
>>>    include/accel/tcg/iommu.h            |  17 +-
>>>    include/exec/cputlb.h                |  11 +-
>>>    include/exec/tlb-flags.h             |   4 +-
>>>    include/hw/core/cpu.h                |  15 ++
>>>    system/physmem.c                     |  60 ++++++-
>>>    target/alpha/helper.c                |   2 +-
>>>    target/avr/helper.c                  |   3 +-
>>>    target/hppa/mem_helper.c             |   1 -
>>>    target/i386/tcg/system/excp_helper.c |   3 +-
>>>    target/loongarch/tcg/tlb_helper.c    |   2 +-
>>>    target/m68k/helper.c                 |  10 +-
>>>    target/microblaze/helper.c           |   8 +-
>>>    target/mips/tcg/system/tlb_helper.c  |   4 +-
>>>    target/or1k/mmu.c                    |   2 +-
>>>    target/ppc/mmu_helper.c              |   2 +-
>>>    target/riscv/cpu_helper.c            |   2 +-
>>>    target/rx/cpu.c                      |   3 +-
>>>    target/s390x/tcg/excp_helper.c       |   2 +-
>>>    target/sh4/helper.c                  |   3 +-
>>>    target/sparc/mmu_helper.c            |   6 +-
>>>    target/tricore/helper.c              |   2 +-
>>>    target/xtensa/helper.c               |   3 +-
>>>    23 files changed, 354 insertions(+), 58 deletions(-)
>>>
>>
> 
> Regards,
> Jim

Re: [PATCH v2 0/5] Defer the IOMMU translation and support access_type

Posted by Jim Shu 1 month ago

On Fri, Apr 24, 2026 at 2:25 AM Pierrick Bouvier
<pierrick.bouvier@oss.qualcomm.com> wrote:
>
> On 4/22/2026 11:16 PM, Jim Shu wrote:
> > On Thu, Apr 23, 2026 at 12:01 AM Pierrick Bouvier
> > <pierrick.bouvier@oss.qualcomm.com> wrote:
> >
> > Hi Pierrick,
> >
> > Thanks for discussing the design!
> >
> >>
> >> Hi Jim,
> >>
> >> On 4/21/2026 9:29 AM, Jim Shu wrote:
> >>> Note: v1 title is "accel/tcg: Pass the access_type to IOMMUMemoryRegion"
> >>>
> >>> Incoming security protection devices feature more complex IOMMUMemoryRegion
> >>> implementation in the CPU path than ARM MPC device. For example,
> >>> RISC-V wgChecker [1] may permit the access with only RO/WO permissions.
> >>> Consequently, the IOMMUMemoryRegion could return different sections for
> >>> read & write access.
> >>>
> >>
> >> The consequence is not directly implied for me.
> >>
> >> Why having a single translation and page protection handle read/write
> >> access is not enough here?
> >>
> >> What is the (real world) use case to return different physical memory
> >> locations depending on read/write access for a given virtual address?
> >
> > When permission is granted, the wgChecker device returns the original
> > region of the address, called the downstream region
> > When permission is denied, the wgChecker device returns a special
> > MemoryRegion for error handling, called the blocked_io region).
> > This region may trigger an IRQ and a bus error. Additionally, memory
> > accesses to this region will result in 'read 0, write ignored' rather
> > than triggering exceptions from the MMU or MPU.
> >
> > Consequently, implementing RWX permission checks on security
> > protection devices may return different regions (either downstream or
> > blocked_io) depending on the access type.
> > This design is similar to the existing ARM MPC device
> > (hw/misc/tz-mpc.c). The primary difference is that the MPC only checks
> > the physical address rather than RWX permissions, meaning it returns
> > either the downstream or blocked_io region for a single address
> > regardless of the access type.
> >
>
> It is indeed, a big difference. Translation with MPC device still
> results in a single address space being chosen.
>
> > I also considered an alternative design where CPUTLBEntryFull stores
> > both the successful permission and the blocked_io region of the IOMMU
> > region.
> > In this scenario, the slow-path code utilizing CPUTLBEntryFull would
> > check permissions and return the blocked_io region if access is denied
> >
> > However, the address_space_translate_iommu() implementation supports
> > recursive IOMMU region translation.
> > if multiple IOMMU regions are encountered during a single address
> > translation, storing only the first region's permissions is
> > insufficient.
> > Ultimately, we still face a situation where RWX permissions might
> > return different regions separately (e.g., downstream,
> > blocked_io_iommu1, and blocked_io_iommu2).
> >
>
> The "blocked_io_iommu_X" is a consequence of the current choice of
> implementation, but I still don't see why it's an absolute necessity.
>

I am open to removing it if the community agrees. I believe platforms
with the ARM MPC do not use multiple IOMMU regions when translating a
single address.

> An alternative would be to treat all accesses as MMIO, but I guess your
> goal here is to optimize the code path where we access RAM?
>

If the downstream region is a memory region, lazy translation will
update the TLB flags and vaddr in addr_idx for the permissions that the
IOMMU region grants.
To explain further: in the alternative approach, we would obtain the
downstream region, IOMMU permissions, and the blocked_io region from
the IOMMU translation function (which would require an additional
return value or API to get the blocked_io region). The TLB entry would
handle the downstream region as a non-IOMMU region but would only
update the TLB flags for the IOMMU permissions; it would also store the
permissions and the blocked_io region. The slow-path code would check
the IOMMU permissions and, if permission is denied, perform the
transaction on the blocked_io region instead.
With this alternative approach, we could remove the lazy IOMMU
translation.
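
As a rough illustration of that alternative, here is a minimal sketch
in plain C, outside QEMU. Every name in it (Region, TlbEntrySketch,
slow_path_region, PERM_R, PERM_W) is a hypothetical stand-in and not
one of QEMU's actual structures; it only shows the selection the slow
path would have to make.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical permission bits for a read/write split. */
enum { PERM_R = 1 << 0, PERM_W = 1 << 1 };

/* Hypothetical stand-in for a memory region; a name is enough here. */
typedef struct Region {
    const char *name;
} Region;

/*
 * Hypothetical TLB entry: the downstream region, the permissions the
 * IOMMU granted for it, and the region that absorbs denied accesses.
 */
typedef struct TlbEntrySketch {
    const Region *downstream;
    const Region *blocked_io;
    unsigned iommu_perm;
} TlbEntrySketch;

/*
 * Slow-path selection: use the downstream region only if every
 * requested permission bit was granted, else fall back to blocked_io.
 */
static const Region *slow_path_region(const TlbEntrySketch *e,
                                      unsigned access)
{
    return (e->iommu_perm & access) == access ? e->downstream
                                              : e->blocked_io;
}
```

With an entry that grants only PERM_R, a read resolves to the
downstream region and a write resolves to blocked_io, without any
second translation.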

> Why can't you reuse existing page protection mechanism for this,
> authorizing read or write? WorldChecker just seems to be an additional
> check on top of existing translation. The fact it's an "additional"
> device is a design/implementation details, and could simply be part of
> page protection mechanism. It might require some additional plumbing
> when an interrupt is raised though, to redirect this correctly instead
> of signaling CPU like it would normally do.
>
> The fact is does different action based on read/write attribute does not
> really fit very well with existing implementation, as you noticed. And I
> wonder if it's worth changing the global design just to optimize this
> (optional) use case for RISC-V.
>

In the SoC architecture, the wgCheckers are positioned in front of
devices (either memory or MMIO). When the CPU or DMA sends
transactions to the device, the device's wgChecker first performs a
permission check to determine if the transaction should be forwarded.
If we move this check to the architecture's tlb_fill() function, how
would the CPU code determine if transactions were sent to the device
(or to the upstream region of the wgChecker)? The CPU code would need
to be aware of the memory tree hierarchy to do that, which I believe
is more difficult to implement.

In the current QEMU architecture, when a security protection device
sits in front of a device, placing it within the memory tree seems to
be the most suitable approach, similar to the ARM MPC device. The
address_space*() API can check if a transaction will be forwarded to
the device and perform the permission check via the IOMMU region. Both
the CPU-side address_space_translate_for_iotlb() and the DMA-side
address_space_translate_iommu() handle this. The only missing
component is that the CPU TLB cannot handle RWX permission checks for
IOMMU regions within the memory tree.




> >>
> >> I'm not against the goal of this series, but trying to understand why it
> >> went in this direction, which complexify this already complex code path.
> >>
> >>> To support such IOMMUMemoryRegion behavior in the CPU path, the design
> >>> of IOMMU translation must be updated:
> >>>
> >>> 1. address_space_translate*() must now pass the access_type to
> >>>      IOMMUMemoryRegion.
> >>> 2. Since IOMMU translation results are too complex to be fully stored
> >>>      in the CPU TLB. we will defer the translation until the actual access
> >>>      occurs. Also, TLB is allowed to store the untranslated IOMMU region.
> >>>
> >>> To implement deferred IOMMU translation, this patchset introduces the
> >>> following changes:
> >>>
> >>> 1. tlb_set_page_full() no longer translates the IOMMU region
> >>>      immediately. Instead, it stores the untranslated region directly in
> >>>      the TLB. A new slow-path flag, TLB_IOMMU, is introduced to force
> >>>      access into the slow path when a region has not yet been translated
> >>>      in the TLB entry.
> >>>
> >>> 2. When the CPU utilizes a TLB entry in the slow path, it should perform
> >>>      the lazy IOMMU translation of the access_type first. The resulting
> >>>      translated region and access type are stored in CPUTLBEntryFull.
> >>>      Since the slow path always performs lazy translation first, we can
> >>>      switch the CPUTLBEntryFull content to the correct access type before
> >>>      use.
> >>>
> >>> 3. To accelerate memory access in the fast path, lazy translation can
> >>>      update the addend of the CPUTLBEntry when translating the region to a
> >>>      host memory region. We restrict the IOMMU region to have a single
> >>>      non-zero 'addend' across all permissions. If a second 'addend' is
> >>>      present for a CPUTLBEntry, QEMU will trigger an assertion. This
> >>>      limitation is sufficient for security devices, as their "secondary"
> >>>      region is typically an IO region used to emulate device error
> >>>      handling when access is rejected.
> >>>
> >>> 4. To support non-slow TLB flags, lazy translation can update the TLB
> >>>      flags in the 'addr_idx' of the CPUTLBEntry. Lazy translation only
> >>>      updates the flags for the permissions specified in @prot. This
> >>>      ensures that each access_type of a translated region to maintains
> >>>      independent TLB flags. For example, TLB_DIRTY of memory region will
> >>>      not be "polluted" from other permission that translated to different
> >>>      region.
> >>>
> >>> Both RISC-V wgChecker [1] and RISC-V IOPMP [2] devices require this
> >>> feature.
> >>>
> >>> [1] RISC-V WG:
> >>> https://patchew.org/QEMU/20251021155548.584543-1-jim.shu@sifive.com/
> >>> [2] RISC-V IOPMP:
> >>> https://patchew.org/QEMU/20250312093735.1517740-1-ethan84@andestech.com/
> >>>
> >>> Changed since v1:
> >>> - Remove the ping-pong TLB entry behavior. Instead, defer the IOMMU
> >>>     translation until actual access in the CPU path. Provide a IOMMU
> >>>     lazy translation function with the special handling of 'addend'
> >>>     and 'addr_idx' fields of CPUTLBEntry.
> >>> - Fix the checkpatch warning.
> >>>
> >>>
> >>> Jim Shu (5):
> >>>     accel/tcg: Pass access_type as an argument of tlb_set_page*()
> >>>     accel/tcg: address_space_translate*() will pass the correct
> >>>       iommu_flags
> >>>     accel/tcg: Provide early AS translate function
> >>>     accel/tcg: Add IOMMU lazy translation function
> >>>     accel/tcg: Support IOMMU lazy translation in CPU TLB
> >>>
> >>>    accel/tcg/cputlb.c                   | 247 +++++++++++++++++++++++++--
> >>>    include/accel/tcg/iommu.h            |  17 +-
> >>>    include/exec/cputlb.h                |  11 +-
> >>>    include/exec/tlb-flags.h             |   4 +-
> >>>    include/hw/core/cpu.h                |  15 ++
> >>>    system/physmem.c                     |  60 ++++++-
> >>>    target/alpha/helper.c                |   2 +-
> >>>    target/avr/helper.c                  |   3 +-
> >>>    target/hppa/mem_helper.c             |   1 -
> >>>    target/i386/tcg/system/excp_helper.c |   3 +-
> >>>    target/loongarch/tcg/tlb_helper.c    |   2 +-
> >>>    target/m68k/helper.c                 |  10 +-
> >>>    target/microblaze/helper.c           |   8 +-
> >>>    target/mips/tcg/system/tlb_helper.c  |   4 +-
> >>>    target/or1k/mmu.c                    |   2 +-
> >>>    target/ppc/mmu_helper.c              |   2 +-
> >>>    target/riscv/cpu_helper.c            |   2 +-
> >>>    target/rx/cpu.c                      |   3 +-
> >>>    target/s390x/tcg/excp_helper.c       |   2 +-
> >>>    target/sh4/helper.c                  |   3 +-
> >>>    target/sparc/mmu_helper.c            |   6 +-
> >>>    target/tricore/helper.c              |   2 +-
> >>>    target/xtensa/helper.c               |   3 +-
> >>>    23 files changed, 354 insertions(+), 58 deletions(-)
> >>>
> >>
> >
> > Regards,
> > Jim
>

Re: [PATCH v2 0/5] Defer the IOMMU translation and support access_type

Posted by Pierrick Bouvier 1 month ago

On 4/24/2026 8:15 AM, Jim Shu wrote:
> On Fri, Apr 24, 2026 at 2:25 AM Pierrick Bouvier
> <pierrick.bouvier@oss.qualcomm.com> wrote:
>>
>> On 4/22/2026 11:16 PM, Jim Shu wrote:
>>> On Thu, Apr 23, 2026 at 12:01 AM Pierrick Bouvier
>>> <pierrick.bouvier@oss.qualcomm.com> wrote:
>>>
>>> Hi Pierrick,
>>>
>>> Thanks for discussing the design!
>>>
>>>>
>>>> Hi Jim,
>>>>
>>>> On 4/21/2026 9:29 AM, Jim Shu wrote:
>>>>> Note: v1 title is "accel/tcg: Pass the access_type to IOMMUMemoryRegion"
>>>>>
>>>>> Incoming security protection devices feature more complex IOMMUMemoryRegion
>>>>> implementation in the CPU path than ARM MPC device. For example,
>>>>> RISC-V wgChecker [1] may permit the access with only RO/WO permissions.
>>>>> Consequently, the IOMMUMemoryRegion could return different sections for
>>>>> read & write access.
>>>>>
>>>>
>>>> The consequence is not directly implied for me.
>>>>
>>>> Why having a single translation and page protection handle read/write
>>>> access is not enough here?
>>>>
>>>> What is the (real world) use case to return different physical memory
>>>> locations depending on read/write access for a given virtual address?
>>>
>>> When permission is granted, the wgChecker device returns the original
>>> region of the address, called the downstream region
>>> When permission is denied, the wgChecker device returns a special
>>> MemoryRegion for error handling, called the blocked_io region).
>>> This region may trigger an IRQ and a bus error. Additionally, memory
>>> accesses to this region will result in 'read 0, write ignored' rather
>>> than triggering exceptions from the MMU or MPU.
>>>
>>> Consequently, implementing RWX permission checks on security
>>> protection devices may return different regions (either downstream or
>>> blocked_io) depending on the access type.
>>> This design is similar to the existing ARM MPC device
>>> (hw/misc/tz-mpc.c). The primary difference is that the MPC only checks
>>> the physical address rather than RWX permissions, meaning it returns
>>> either the downstream or blocked_io region for a single address
>>> regardless of the access type.
>>>
>>
>> It is indeed, a big difference. Translation with MPC device still
>> results in a single address space being chosen.
>>
>>> I also considered an alternative design where CPUTLBEntryFull stores
>>> both the successful permission and the blocked_io region of the IOMMU
>>> region.
>>> In this scenario, the slow-path code utilizing CPUTLBEntryFull would
>>> check permissions and return the blocked_io region if access is denied
>>>
>>> However, the address_space_translate_iommu() implementation supports
>>> recursive IOMMU region translation.
>>> if multiple IOMMU regions are encountered during a single address
>>> translation, storing only the first region's permissions is
>>> insufficient.
>>> Ultimately, we still face a situation where RWX permissions might
>>> return different regions separately (e.g., downstream,
>>> blocked_io_iommu1, and blocked_io_iommu2).
>>>
>>
>> The "blocked_io_iommu_X" is a consequence of the current choice of
>> implementation, but I still don't see why it's an absolute necessity.
>>
> 
> I am open to removing it if the community agrees. I believe platforms
> with the ARM MPC do not use multiple IOMMU regions when translating a
> single address.
>

I am not an expert on the topic, but it seems MPC is mostly used with
microcontrollers with TrustZone and is much more limited in scope than
WorldChecker. WorldChecker seems to be a mix of MMU + IOMMU + something
like the Arm Granule Protection Table in a single external component.

I am probably biased by the Arm "World" implementation of this (outside
of MPC), where the CPU and SMMU play this role, instead of an external
component.

Going back to MPC, the translation is still deterministic and not based
on the type of access. From what I understand, MPC can statically forbid
accesses to specific blocks, simply based on the original address.
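
To make the contrast concrete, here is a small sketch in plain C. It
follows the description above and is a deliberate simplification, not
the behaviour of hw/misc/tz-mpc.c or of the wgChecker model; the block
size, table size and all function names are assumptions made up for
the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical access bits. */
enum { ACC_R = 1 << 0, ACC_W = 1 << 1 };

#define BLOCK_SHIFT 12 /* assumed 4 KiB blocks */
#define NUM_BLOCKS  4  /* assumed table size */

/* MPC-like check: one "blocked" bit per block, access type ignored. */
static bool mpc_like_allows(const bool blocked[NUM_BLOCKS], uint64_t addr)
{
    return !blocked[(addr >> BLOCK_SHIFT) % NUM_BLOCKS];
}

/* wgChecker-like check: the answer also depends on the access type. */
static bool wg_like_allows(const unsigned perm[NUM_BLOCKS], uint64_t addr,
                           unsigned access)
{
    return (perm[(addr >> BLOCK_SHIFT) % NUM_BLOCKS] & access) == access;
}
```

In the first function a given address always yields the same answer, so
one cached translation is enough. In the second, the same address can
be allowed for reads and denied for writes, which is the case the
series is trying to handle.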

>> An alternative would be to treat all accesses as MMIO, but I guess your
>> goal here is to optimize the code path where we access RAM?
>>
> 
> If the downstream region is a memory region, it will update the TLB
> flags and vaddr in addr_idx for the successful permissions of IOMMU
> region.
> To explain further: In the alternative approach, we would obtain the
> downstream region, IOMMU permissions, and the blocked_io region from
> the IOMMU translation function (which would require an additional
> return value or API to get the blocked_io region). The TLB entry will
> handle the downstream region as a non-IOMMU region but will only
> update the TLB flags for the IOMMU permissions; it will also store the
> permissions and the blocked_io region. The slow-path code checks the
> IOMMU permissions. If permission is denied, it will perform the
> transaction on the blocked_io region instead.
> We can remove the lazy IOMMU translation if using an alternative approach.
>

What I was suggesting in my question was to force any access to a 
MemoryRegion handled by a wgChecker to go through a read or write 
callback, and to do the actual check there.

For instance, MPC redirects blocked calls to:
static const MemoryRegionOps tz_mpc_mem_blocked_ops = {
    .read_with_attrs = tz_mpc_mem_blocked_read,
    .write_with_attrs = tz_mpc_mem_blocked_write,
    ...
};

So I was suggesting something like:
static const MemoryRegionOps wgchecker_ops = {
    .read_with_attrs = wgchecker_mem_read,
    .write_with_attrs = wgchecker_mem_write,
    ...
};

And redirect all reads/writes to those callbacks, thus turning the whole
range into an MMIO range. Hence my original question, to understand
whether your main concern is runtime performance or not.
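
A minimal sketch of that idea in plain C, outside QEMU, might look like
the following. The WgCheckerSketch type, the helper names and the byte
array standing in for the downstream device are all hypothetical; the
"read 0, write ignored" behaviour on denial is taken from Jim's
description of the blocked_io region earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical permission bits. */
enum { WG_PERM_R = 1 << 0, WG_PERM_W = 1 << 1 };

/* Hypothetical checker state in front of a small backing store. */
typedef struct WgCheckerSketch {
    uint8_t backing[16]; /* stands in for the downstream device */
    unsigned perm;       /* permissions granted to the requester */
    unsigned violations; /* where an IRQ or bus error would be raised */
} WgCheckerSketch;

/* Read callback: the check happens on every access; denied reads give 0. */
static uint8_t wgchecker_sketch_read(WgCheckerSketch *s, unsigned offset)
{
    if (!(s->perm & WG_PERM_R)) {
        s->violations++;
        return 0;
    }
    return s->backing[offset % sizeof(s->backing)];
}

/* Write callback: denied writes are counted and otherwise ignored. */
static void wgchecker_sketch_write(WgCheckerSketch *s, unsigned offset,
                                   uint8_t val)
{
    if (!(s->perm & WG_PERM_W)) {
        s->violations++;
        return;
    }
    s->backing[offset % sizeof(s->backing)] = val;
}
```

The cost is that every access, including accesses to RAM behind the
checker, goes through a callback instead of the TLB fast path.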

>> Why can't you reuse existing page protection mechanism for this,
>> authorizing read or write? WorldChecker just seems to be an additional
>> check on top of existing translation. The fact it's an "additional"
>> device is a design/implementation details, and could simply be part of
>> page protection mechanism. It might require some additional plumbing
>> when an interrupt is raised though, to redirect this correctly instead
>> of signaling CPU like it would normally do.
>>
>> The fact is does different action based on read/write attribute does not
>> really fit very well with existing implementation, as you noticed. And I
>> wonder if it's worth changing the global design just to optimize this
>> (optional) use case for RISC-V.
>>
> 
> In the SoC architecture, the wgCheckers are positioned in front of
> devices (either memory or MMIO). When the CPU or DMA sends
> transactions to the device, the device's wgChecker first performs a
> permission check to determine if the transaction should be forwarded.
> If we move this check to the architecture's tlb_fill() function, how
> would the CPU code determine if transactions were sent to the device
> (or to the upstream region of the wgChecker)? The CPU code would need
> to be aware of the memory tree hierarchy to do that, which I believe
> is more difficult to implement.
>

The memory hierarchy has to be known anyway by the component dealing 
with that. Currently, TLB is responsible for it, iterating through the 
different regions. However, it could be wgChecker code that does that, 
or tlb_fill() for RISC-V.

My only concern with the existing design is that it pushes all of that
onto generic TLB code, for a feature that is optional and only used by
RISC-V, but that still impacts all architectures.

It's not a no, and I don't have any authority on this part of the
codebase; it's just a question to understand why it can't be solved
another way that would be RISC-V specific.

> In the current QEMU architecture, when a security protection device
> sits in front of a device, placing it within the memory tree seems to
> be the most suitable approach, similar to the ARM MPC device. The
> address_space*() API can check if a transaction will be forwarded to
> the device and perform the permission check via the IOMMU region. Both
> the CPU-side address_space_translate_for_iotlb() and the DMA-side
> address_space_translate_iommu() handle this. The only missing
> component is that the CPU TLB cannot handle RWX permission checks for
> IOMMU regions within the memory tree.
>

I understand this and what you want.
I'm really open to it, and I would welcome feedback from anyone else to
help decide if it's worth changing generic TLB code for this use case.

> 
> 
> 
>>>>
>>>> I'm not against the goal of this series, but trying to understand why it
>>>> went in this direction, which complexify this already complex code path.
>>>>
>>>>> To support such IOMMUMemoryRegion behavior in the CPU path, the design
>>>>> of IOMMU translation must be updated:
>>>>>
>>>>> 1. address_space_translate*() must now pass the access_type to
>>>>>       IOMMUMemoryRegion.
>>>>> 2. Since IOMMU translation results are too complex to be fully stored
>>>>>       in the CPU TLB. we will defer the translation until the actual access
>>>>>       occurs. Also, TLB is allowed to store the untranslated IOMMU region.
>>>>>
>>>>> To implement deferred IOMMU translation, this patchset introduces the
>>>>> following changes:
>>>>>
>>>>> 1. tlb_set_page_full() no longer translates the IOMMU region
>>>>>       immediately. Instead, it stores the untranslated region directly in
>>>>>       the TLB. A new slow-path flag, TLB_IOMMU, is introduced to force
>>>>>       access into the slow path when a region has not yet been translated
>>>>>       in the TLB entry.
>>>>>
>>>>> 2. When the CPU utilizes a TLB entry in the slow path, it should perform
>>>>>       the lazy IOMMU translation of the access_type first. The resulting
>>>>>       translated region and access type are stored in CPUTLBEntryFull.
>>>>>       Since the slow path always performs lazy translation first, we can
>>>>>       switch the CPUTLBEntryFull content to the correct access type before
>>>>>       use.
>>>>>
>>>>> 3. To accelerate memory access in the fast path, lazy translation can
>>>>>       update the addend of the CPUTLBEntry when translating the region to a
>>>>>       host memory region. We restrict the IOMMU region to have a single
>>>>>       non-zero 'addend' across all permissions. If a second 'addend' is
>>>>>       present for a CPUTLBEntry, QEMU will trigger an assertion. This
>>>>>       limitation is sufficient for security devices, as their "secondary"
>>>>>       region is typically an IO region used to emulate device error
>>>>>       handling when access is rejected.
>>>>>
>>>>> 4. To support non-slow TLB flags, lazy translation can update the TLB
>>>>>       flags in the 'addr_idx' of the CPUTLBEntry. Lazy translation only
>>>>>       updates the flags for the permissions specified in @prot. This
>>>>>       ensures that each access_type of a translated region to maintains
>>>>>       independent TLB flags. For example, TLB_DIRTY of memory region will
>>>>>       not be "polluted" from other permission that translated to different
>>>>>       region.
>>>>>
>>>>> Both RISC-V wgChecker [1] and RISC-V IOPMP [2] devices require this
>>>>> feature.
>>>>>
>>>>> [1] RISC-V WG:
>>>>> https://patchew.org/QEMU/20251021155548.584543-1-jim.shu@sifive.com/
>>>>> [2] RISC-V IOPMP:
>>>>> https://patchew.org/QEMU/20250312093735.1517740-1-ethan84@andestech.com/
>>>>>
>>>>> Changed since v1:
>>>>> - Remove the ping-pong TLB entry behavior. Instead, defer the IOMMU
>>>>>      translation until actual access in the CPU path. Provide a IOMMU
>>>>>      lazy translation function with the special handling of 'addend'
>>>>>      and 'addr_idx' fields of CPUTLBEntry.
>>>>> - Fix the checkpatch warning.
>>>>>
>>>>>
>>>>> Jim Shu (5):
>>>>>      accel/tcg: Pass access_type as an argument of tlb_set_page*()
>>>>>      accel/tcg: address_space_translate*() will pass the correct
>>>>>        iommu_flags
>>>>>      accel/tcg: Provide early AS translate function
>>>>>      accel/tcg: Add IOMMU lazy translation function
>>>>>      accel/tcg: Support IOMMU lazy translation in CPU TLB
>>>>>
>>>>>     accel/tcg/cputlb.c                   | 247 +++++++++++++++++++++++++--
>>>>>     include/accel/tcg/iommu.h            |  17 +-
>>>>>     include/exec/cputlb.h                |  11 +-
>>>>>     include/exec/tlb-flags.h             |   4 +-
>>>>>     include/hw/core/cpu.h                |  15 ++
>>>>>     system/physmem.c                     |  60 ++++++-
>>>>>     target/alpha/helper.c                |   2 +-
>>>>>     target/avr/helper.c                  |   3 +-
>>>>>     target/hppa/mem_helper.c             |   1 -
>>>>>     target/i386/tcg/system/excp_helper.c |   3 +-
>>>>>     target/loongarch/tcg/tlb_helper.c    |   2 +-
>>>>>     target/m68k/helper.c                 |  10 +-
>>>>>     target/microblaze/helper.c           |   8 +-
>>>>>     target/mips/tcg/system/tlb_helper.c  |   4 +-
>>>>>     target/or1k/mmu.c                    |   2 +-
>>>>>     target/ppc/mmu_helper.c              |   2 +-
>>>>>     target/riscv/cpu_helper.c            |   2 +-
>>>>>     target/rx/cpu.c                      |   3 +-
>>>>>     target/s390x/tcg/excp_helper.c       |   2 +-
>>>>>     target/sh4/helper.c                  |   3 +-
>>>>>     target/sparc/mmu_helper.c            |   6 +-
>>>>>     target/tricore/helper.c              |   2 +-
>>>>>     target/xtensa/helper.c               |   3 +-
>>>>>     23 files changed, 354 insertions(+), 58 deletions(-)
>>>>>
>>>>
>>>
>>> Regards,
>>> Jim
>>

Regards,
Pierrick

Re: [PATCH v2 0/5] Defer the IOMMU translation and support access_type

Posted by Jim Shu 1 month ago

On Sat, Apr 25, 2026 at 6:44 AM Pierrick Bouvier
<pierrick.bouvier@oss.qualcomm.com> wrote:
>
> On 4/24/2026 8:15 AM, Jim Shu wrote:
> > On Fri, Apr 24, 2026 at 2:25 AM Pierrick Bouvier
> > <pierrick.bouvier@oss.qualcomm.com> wrote:
> >>
> >> On 4/22/2026 11:16 PM, Jim Shu wrote:
> >>> On Thu, Apr 23, 2026 at 12:01 AM Pierrick Bouvier
> >>> <pierrick.bouvier@oss.qualcomm.com> wrote:
> >>>
> >>> Hi Pierrick,
> >>>
> >>> Thanks for discussing the design!
> >>>
> >>>>
> >>>> Hi Jim,
> >>>>
> >>>> On 4/21/2026 9:29 AM, Jim Shu wrote:
> >>>>> Note: v1 title is "accel/tcg: Pass the access_type to IOMMUMemoryRegion"
> >>>>>
> >>>>> Incoming security protection devices feature more complex IOMMUMemoryRegion
> >>>>> implementation in the CPU path than ARM MPC device. For example,
> >>>>> RISC-V wgChecker [1] may permit the access with only RO/WO permissions.
> >>>>> Consequently, the IOMMUMemoryRegion could return different sections for
> >>>>> read & write access.
> >>>>>
> >>>>
> >>>> The consequence is not directly implied for me.
> >>>>
> >>>> Why having a single translation and page protection handle read/write
> >>>> access is not enough here?
> >>>>
> >>>> What is the (real world) use case to return different physical memory
> >>>> locations depending on read/write access for a given virtual address?
> >>>
> >>> When permission is granted, the wgChecker device returns the original
> >>> region of the address, called the downstream region
> >>> When permission is denied, the wgChecker device returns a special
> >>> MemoryRegion for error handling, called the blocked_io region).
> >>> This region may trigger an IRQ and a bus error. Additionally, memory
> >>> accesses to this region will result in 'read 0, write ignored' rather
> >>> than triggering exceptions from the MMU or MPU.
> >>>
> >>> Consequently, implementing RWX permission checks on security
> >>> protection devices may return different regions (either downstream or
> >>> blocked_io) depending on the access type.
> >>> This design is similar to the existing ARM MPC device
> >>> (hw/misc/tz-mpc.c). The primary difference is that the MPC only checks
> >>> the physical address rather than RWX permissions, meaning it returns
> >>> either the downstream or blocked_io region for a single address
> >>> regardless of the access type.
> >>>
> >>
> >> It is indeed, a big difference. Translation with MPC device still
> >> results in a single address space being chosen.
> >>
> >>> I also considered an alternative design where CPUTLBEntryFull stores
> >>> both the successful permission and the blocked_io region of the IOMMU
> >>> region.
> >>> In this scenario, the slow-path code utilizing CPUTLBEntryFull would
> >>> check permissions and return the blocked_io region if access is denied
> >>>
> >>> However, the address_space_translate_iommu() implementation supports
> >>> recursive IOMMU region translation.
> >>> if multiple IOMMU regions are encountered during a single address
> >>> translation, storing only the first region's permissions is
> >>> insufficient.
> >>> Ultimately, we still face a situation where RWX permissions might
> >>> return different regions separately (e.g., downstream,
> >>> blocked_io_iommu1, and blocked_io_iommu2).
> >>>
> >>
> >> The "blocked_io_iommu_X" is a consequence of the current choice of
> >> implementation, but I still don't see why it's an absolute necessity.
> >>
> >
> > I am open to removing it if the community agrees. I believe platforms
> > with the ARM MPC do not use multiple IOMMU regions when translating a
> > single address.
> >
>
> I am not an expert on the topic, but it seems MPC is mostly used with
> microcontrollers with trustzone and is much more limited in scope than
> WorldChecker. WorldChecker seems to be a mix of MMU + IOMMU + something
> like Arm Granule Protection Table in a single external component.
>
> I am probably biased by Arm "World" implementation of this (out of MPC),
> where CPU and SMMU play this role, instead of an external component.
>
> Going back to MPC, the translation is still deterministic and not based
> on type of access. From what I understand, MPC can statically forbid
> accesses to specific blocks, simply based on original address.

ARMv9-A GPC verifies permissions within the CPU MMU and IOMMU. It
performs checks as a memory transaction travels from the source
(such as a CPU or DMA) toward the system bus. In contrast, the RISC-V
wgChecker is positioned directly in front of peripheral devices. It
performs checks as transactions travel from the bus toward the
target devices. In this regard, the ARMv8-A TZASC [1] and ARMv8-M MPC
are architecturally similar to the RISC-V wgChecker.

We can monitor all memory transactions in the SoC by implementing
security filters at either every transaction source or every
transaction target. Both designs are reasonable; they represent
different approaches to SoC security architecture.

The ARMv8-A TZASC [1] is also a target-side filter and includes RW
permission checks. Although the TZASC device is not currently supported
in QEMU, an implementation could leverage these generic code changes.
Target-side filtering with RW permission checks is not a RISC-V-only
design.

[1] https://developer.arm.com/documentation/ddi0504/c/introduction/about-the-tzc-400/tzc-400-example-system
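
To make the contrast concrete, here is a minimal sketch of the two kinds of
target-side filter. All names (toy_mpc_translate, toy_rw_translate, the Toy*
types) are hypothetical and are not QEMU APIs. The MPC-like filter picks a
region from the address alone, while the wgChecker/TZASC-like filter can
pick a different region for reads and writes to the same address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy types; none of these exist in QEMU. */
typedef enum { TOY_ACCESS_READ, TOY_ACCESS_WRITE } ToyAccessType;
typedef enum { TOY_DOWNSTREAM, TOY_BLOCKED_IO } ToyRegion;
enum { TOY_PERM_R = 1, TOY_PERM_W = 2 };

/* MPC-like filter: the result depends only on the address block. */
static ToyRegion toy_mpc_translate(const bool *block_ok, uint64_t block_size,
                                   uint64_t addr, ToyAccessType type)
{
    assert(block_size != 0);
    (void)type; /* the access type is ignored on purpose */
    return block_ok[addr / block_size] ? TOY_DOWNSTREAM : TOY_BLOCKED_IO;
}

/* RW filter: the same address may map to different regions per access type. */
static ToyRegion toy_rw_translate(const unsigned *block_perm,
                                  uint64_t block_size,
                                  uint64_t addr, ToyAccessType type)
{
    unsigned need = (type == TOY_ACCESS_READ) ? TOY_PERM_R : TOY_PERM_W;

    assert(block_size != 0);
    return (block_perm[addr / block_size] & need) ? TOY_DOWNSTREAM
                                                  : TOY_BLOCKED_IO;
}
```

With a read-only block, toy_rw_translate() returns the downstream region for
a read and the blocked_io region for a write, which is the case a single
cached translation cannot represent.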

>
> >> An alternative would be to treat all accesses as MMIO, but I guess your
> >> goal here is to optimize the code path where we access RAM?
> >>
> >
> > If the downstream region is a memory region, it will update the TLB
> > flags and vaddr in addr_idx for the successful permissions of IOMMU
> > region.
> > To explain further: In the alternative approach, we would obtain the
> > downstream region, IOMMU permissions, and the blocked_io region from
> > the IOMMU translation function (which would require an additional
> > return value or API to get the blocked_io region). The TLB entry will
> > handle the downstream region as a non-IOMMU region but will only
> > update the TLB flags for the IOMMU permissions; it will also store the
> > permissions and the blocked_io region. The slow-path code checks the
> > IOMMU permissions. If permission is denied, it will perform the
> > transaction on the blocked_io region instead.
> > We can remove the lazy IOMMU translation if using an alternative approach.
> >
>
> What I was suggesting in my question was to force any access to a
> MemoryRegion handled by a wgChecker to go through a read or write
> callback, and to do the actual check there.
>
> For instance, MPC redirects blocked calls to:
> static const MemoryRegionOps tz_mpc_mem_blocked_ops = {
>    .read_with_attrs = tz_mpc_mem_blocked_read,
>    .write_with_attrs = tz_mpc_mem_blocked_write,
>
> So I was suggesting something like:
> static const MemoryRegionOps wgchecker_ops = {
>    .read_with_attrs = wgchecker_mem_read,
>    .write_with_attrs = wgchecker_mem_write,
>
> And redirect all read/write to those callbacks, thus turning the whole
> range into an MMIO range. Thus my original question to understand if
> your main concern is runtime performance or not.

Yes, runtime performance is also critical to us.
We have positioned the wgChecker in front of the DRAM [2] to partition
the memory for each world.
If DRAM accesses took the slow path, Linux would be too slow to boot.

[2] Add WG support to virt machine:
https://patchew.org/QEMU/20251021155548.584543-1-jim.shu@sifive.com/20251021162108.585468-4-jim.shu@sifive.com/
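
As a rough illustration of why this matters, the sketch below models a
software TLB entry with a fast path and a slow path. It is a toy, not the
cputlb.c code: ToyTlbEntry, TOY_SLOW_FLAG and toy_load8() are made-up names.
On the fast path a guest load is a plain host load through a cached host
pointer (the role the 'addend' plays); on the slow path every access goes
through a callback, which is what turning the whole DRAM range into MMIO
would imply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOW_FLAG 1u /* hypothetical stand-in for a slow-path TLB flag */

typedef uint8_t (*ToyReadFn)(void *opaque, uint64_t offset);

typedef struct {
    unsigned flags;
    uint8_t *host_base;       /* plays the role of the TLB addend */
    ToyReadFn read;           /* device-style callback for the slow path */
    void *opaque;
    unsigned long slow_calls; /* counts callback round trips */
} ToyTlbEntry;

static uint8_t toy_load8(ToyTlbEntry *e, uint64_t offset)
{
    if (!(e->flags & TOY_SLOW_FLAG)) {
        /* Fast path: one host memory load, no function call. */
        return e->host_base[offset];
    }
    /* Slow path: dispatch through the region's read callback. */
    assert(e->read != NULL);
    e->slow_calls++;
    return e->read(e->opaque, offset);
}

/* Example callback that forwards the read to a backing buffer. */
static uint8_t toy_ram_read(void *opaque, uint64_t offset)
{
    return ((uint8_t *)opaque)[offset];
}
```

Both paths return the same data; the difference is the per-access cost,
which dominates when the region behind the checker is main memory.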

> >> Why can't you reuse the existing page protection mechanism for this,
> >> authorizing read or write? WorldChecker just seems to be an additional
> >> check on top of the existing translation. The fact it's an "additional"
> >> device is a design/implementation detail, and it could simply be part of
> >> the page protection mechanism. It might require some additional plumbing
> >> when an interrupt is raised though, to redirect this correctly instead
> >> of signaling the CPU like it would normally do.
> >>
> >> The fact that it takes a different action based on the read/write
> >> attribute does not fit very well with the existing implementation, as
> >> you noticed. And I wonder if it's worth changing the global design just
> >> to optimize this (optional) use case for RISC-V.
> >>
> >
> > In the SoC architecture, the wgCheckers are positioned in front of
> > devices (either memory or MMIO). When the CPU or DMA sends
> > transactions to the device, the device's wgChecker first performs a
> > permission check to determine if the transaction should be forwarded.
> > If we move this check to the architecture's tlb_fill() function, how
> > would the CPU code determine if transactions were sent to the device
> > (or to the upstream region of the wgChecker)? The CPU code would need
> > to be aware of the memory tree hierarchy to do that, which I believe
> > is more difficult to implement.
> >
>
> The memory hierarchy has to be known anyway by the component dealing
> with that. Currently, TLB is responsible for it, iterating through the
> different regions. However, it could be wgChecker code that does that,
> or tlb_fill() for RISC-V.
>
> My only concern with existing design is that it pushes all that on
> generic TLB code, for a feature that is optional and only used by
> RISC-V, but still impacting all architectures.
>

I don't think the current address_space/flatview APIs can determine whether
a PA will access an IOMMUMemoryRegion and return that specific region.
To perform this check in tlb_fill() or in the wgChecker device, I would
need to add a new API to look up the correct wgChecker instance and
run its permission check. Thus, this approach still requires
modifying the generic code to add the new address_space API.

I can try my best to minimize the changes to the generic TLB code, but it
is impossible to support wgChecker without any modification of generic
code. The original MPC patchset also added CPU-side IOMMU region
support to address_space_translate_for_iotlb() [3] and added the
TCGIOMMUNotifier.

My v1 patchset modified less of the generic TLB code, but it
relied on ping-pong TLB entries to handle R and W permissions separately.
That is a little tricky and may not be easy to maintain. In the v2
patchset, I tried to formalize how a TLB entry changes for an IOMMU
region with RW permissions via lazy IOMMU translation.

[3] https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg00666.html
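
The lazy translation idea can be sketched roughly as follows. This is a
simplified model under stated assumptions, not the code in the patchset:
ToyEntry, TOY_TLB_IOMMU, TOY_TLB_MMIO and toy_lazy_translate() are
hypothetical names, and the real CPUTLBEntry/CPUTLBEntryFull layout is more
involved. Each access type starts out flagged as untranslated; the first
slow-path access of that type runs the translation, and only that access
type's flags are updated. A RAM result records the single shared addend and
enables the fast path; an IO result (such as blocked_io) stays slow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_READ = 0, TOY_WRITE = 1, TOY_NTYPES } ToyAccess;

#define TOY_TLB_IOMMU 1u /* hypothetical: not yet translated for this type */
#define TOY_TLB_MMIO  2u /* hypothetical: translated, target is an IO region */

typedef struct {
    bool is_ram;
    uint8_t *host; /* valid only when is_ram is true */
} ToyTarget;

typedef ToyTarget (*ToyTranslateFn)(void *opaque, ToyAccess type);

typedef struct {
    unsigned flags[TOY_NTYPES]; /* independent flags per access type */
    uint8_t *addend;            /* single host pointer shared by all types */
    ToyTranslateFn translate;
    void *opaque;
    unsigned translations;      /* how many times translate() actually ran */
} ToyEntry;

/* Filling the entry stores the untranslated region and defers all work. */
static void toy_entry_init(ToyEntry *e, ToyTranslateFn fn, void *opaque)
{
    e->flags[TOY_READ] = TOY_TLB_IOMMU;
    e->flags[TOY_WRITE] = TOY_TLB_IOMMU;
    e->addend = NULL;
    e->translate = fn;
    e->opaque = opaque;
    e->translations = 0;
}

/* Called at the start of the slow path for one access type. */
static void toy_lazy_translate(ToyEntry *e, ToyAccess type)
{
    if (!(e->flags[type] & TOY_TLB_IOMMU)) {
        return; /* already translated for this access type */
    }
    ToyTarget t = e->translate(e->opaque, type);
    e->translations++;
    if (t.is_ram) {
        /* Only one distinct non-NULL addend is allowed per entry. */
        assert(e->addend == NULL || e->addend == t.host);
        e->addend = t.host;
        e->flags[type] = 0;            /* this type may use the fast path */
    } else {
        e->flags[type] = TOY_TLB_MMIO; /* this type stays on the slow path */
    }
}

static bool toy_is_fast(const ToyEntry *e, ToyAccess type)
{
    return e->flags[type] == 0;
}

/* Example read-only filter: reads reach RAM, writes hit a blocked IO region. */
static ToyTarget toy_read_only_translate(void *opaque, ToyAccess type)
{
    ToyTarget t = { type == TOY_READ, type == TOY_READ ? opaque : NULL };
    return t;
}
```

With the read-only filter, reads become fast after their first access while
writes keep going through the slow path, and neither access type disturbs
the other's flags.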

> It's not a no, I don't have any authority on this part of codebase, but
> just a question to understand why it can't be solved another way that
> would be RISC-V specific.

As I mentioned above, ARMv8-A TZASC also has a similar design.


>
> > In the current QEMU architecture, when a security protection device
> > sits in front of a device, placing it within the memory tree seems to
> > be the most suitable approach, similar to the ARM MPC device. The
> > address_space*() API can check if a transaction will be forwarded to
> > the device and perform the permission check via the IOMMU region. Both
> > the CPU-side address_space_translate_for_iotlb() and the DMA-side
> > address_space_translate_iommu() handle this. The only missing
> > component is that the CPU TLB cannot handle RWX permission checks for
> > IOMMU regions within the memory tree.
> >
>
> I understand this and what you want.
> I'm really open to it if anyone else has feedback on this, and help
> decide if it's worth changing generic TLB code for this use case.
>
> >
> >
> >
> >>>>
> >>>> I'm not against the goal of this series, but trying to understand why it
> >>>> went in this direction, which complicates this already complex code path.
> >>>>
> >>>>> To support such IOMMUMemoryRegion behavior in the CPU path, the design
> >>>>> of IOMMU translation must be updated:
> >>>>>
> >>>>> 1. address_space_translate*() must now pass the access_type to
> >>>>>       IOMMUMemoryRegion.
> >>>>> 2. Since IOMMU translation results are too complex to be fully stored
> >>>>>       in the CPU TLB, we will defer the translation until the actual access
> >>>>>       occurs. Also, TLB is allowed to store the untranslated IOMMU region.
> >>>>>
> >>>>> To implement deferred IOMMU translation, this patchset introduces the
> >>>>> following changes:
> >>>>>
> >>>>> 1. tlb_set_page_full() no longer translates the IOMMU region
> >>>>>       immediately. Instead, it stores the untranslated region directly in
> >>>>>       the TLB. A new slow-path flag, TLB_IOMMU, is introduced to force
> >>>>>       access into the slow path when a region has not yet been translated
> >>>>>       in the TLB entry.
> >>>>>
> >>>>> 2. When the CPU utilizes a TLB entry in the slow path, it should perform
> >>>>>       the lazy IOMMU translation of the access_type first. The resulting
> >>>>>       translated region and access type are stored in CPUTLBEntryFull.
> >>>>>       Since the slow path always performs lazy translation first, we can
> >>>>>       switch the CPUTLBEntryFull content to the correct access type before
> >>>>>       use.
> >>>>>
> >>>>> 3. To accelerate memory access in the fast path, lazy translation can
> >>>>>       update the addend of the CPUTLBEntry when translating the region to a
> >>>>>       host memory region. We restrict the IOMMU region to have a single
> >>>>>       non-zero 'addend' across all permissions. If a second 'addend' is
> >>>>>       present for a CPUTLBEntry, QEMU will trigger an assertion. This
> >>>>>       limitation is sufficient for security devices, as their "secondary"
> >>>>>       region is typically an IO region used to emulate device error
> >>>>>       handling when access is rejected.
> >>>>>
> >>>>> 4. To support non-slow TLB flags, lazy translation can update the TLB
> >>>>>       flags in the 'addr_idx' of the CPUTLBEntry. Lazy translation only
> >>>>>       updates the flags for the permissions specified in @prot. This
> >>>>>       ensures that each access_type of a translated region maintains
> >>>>>       independent TLB flags. For example, TLB_DIRTY of a memory region will
> >>>>>       not be "polluted" by another permission that translates to a different
> >>>>>       region.
> >>>>>
> >>>>> Both RISC-V wgChecker [1] and RISC-V IOPMP [2] devices require this
> >>>>> feature.
> >>>>>
> >>>>> [1] RISC-V WG:
> >>>>> https://patchew.org/QEMU/20251021155548.584543-1-jim.shu@sifive.com/
> >>>>> [2] RISC-V IOPMP:
> >>>>> https://patchew.org/QEMU/20250312093735.1517740-1-ethan84@andestech.com/
> >>>>>
> >>>>> Changed since v1:
> >>>>> - Remove the ping-pong TLB entry behavior. Instead, defer the IOMMU
> >>>>>      translation until actual access in the CPU path. Provide an IOMMU
> >>>>>      lazy translation function with the special handling of 'addend'
> >>>>>      and 'addr_idx' fields of CPUTLBEntry.
> >>>>> - Fix the checkpatch warning.
> >>>>>
> >>>>>
> >>>>> Jim Shu (5):
> >>>>>      accel/tcg: Pass access_type as an argument of tlb_set_page*()
> >>>>>      accel/tcg: address_space_translate*() will pass the correct
> >>>>>        iommu_flags
> >>>>>      accel/tcg: Provide early AS translate function
> >>>>>      accel/tcg: Add IOMMU lazy translation function
> >>>>>      accel/tcg: Support IOMMU lazy translation in CPU TLB
> >>>>>
> >>>>>     accel/tcg/cputlb.c                   | 247 +++++++++++++++++++++++++--
> >>>>>     include/accel/tcg/iommu.h            |  17 +-
> >>>>>     include/exec/cputlb.h                |  11 +-
> >>>>>     include/exec/tlb-flags.h             |   4 +-
> >>>>>     include/hw/core/cpu.h                |  15 ++
> >>>>>     system/physmem.c                     |  60 ++++++-
> >>>>>     target/alpha/helper.c                |   2 +-
> >>>>>     target/avr/helper.c                  |   3 +-
> >>>>>     target/hppa/mem_helper.c             |   1 -
> >>>>>     target/i386/tcg/system/excp_helper.c |   3 +-
> >>>>>     target/loongarch/tcg/tlb_helper.c    |   2 +-
> >>>>>     target/m68k/helper.c                 |  10 +-
> >>>>>     target/microblaze/helper.c           |   8 +-
> >>>>>     target/mips/tcg/system/tlb_helper.c  |   4 +-
> >>>>>     target/or1k/mmu.c                    |   2 +-
> >>>>>     target/ppc/mmu_helper.c              |   2 +-
> >>>>>     target/riscv/cpu_helper.c            |   2 +-
> >>>>>     target/rx/cpu.c                      |   3 +-
> >>>>>     target/s390x/tcg/excp_helper.c       |   2 +-
> >>>>>     target/sh4/helper.c                  |   3 +-
> >>>>>     target/sparc/mmu_helper.c            |   6 +-
> >>>>>     target/tricore/helper.c              |   2 +-
> >>>>>     target/xtensa/helper.c               |   3 +-
> >>>>>     23 files changed, 354 insertions(+), 58 deletions(-)
> >>>>>
> >>>>
> >>>
> >>> Regards,
> >>> Jim
> >>
>
> Regards,
> Pierrick

Regards,
Jim

Re: [PATCH v2 0/5] Defer the IOMMU translation and support access_type

Posted by Pierrick Bouvier 1 month ago

On 4/29/2026 1:08 AM, Jim Shu wrote:
> On Sat, Apr 25, 2026 at 6:44 AM Pierrick Bouvier
> <pierrick.bouvier@oss.qualcomm.com> wrote:
>>
>> On 4/24/2026 8:15 AM, Jim Shu wrote:
>>> On Fri, Apr 24, 2026 at 2:25 AM Pierrick Bouvier
>>> <pierrick.bouvier@oss.qualcomm.com> wrote:
>>>>
>>>> On 4/22/2026 11:16 PM, Jim Shu wrote:
>>>>> On Thu, Apr 23, 2026 at 12:01 AM Pierrick Bouvier
>>>>> <pierrick.bouvier@oss.qualcomm.com> wrote:
>>>>>
>>>>> Hi Pierrick,
>>>>>
>>>>> Thanks for discussing the design!
>>>>>
>>>>>>
>>>>>> Hi Jim,
>>>>>>
>>>>>> On 4/21/2026 9:29 AM, Jim Shu wrote:
>>>>>>> Note: v1 title is "accel/tcg: Pass the access_type to IOMMUMemoryRegion"
>>>>>>>
>>>>>>> Incoming security protection devices feature more complex IOMMUMemoryRegion
>>>>>>> implementation in the CPU path than ARM MPC device. For example,
>>>>>>> RISC-V wgChecker [1] may permit the access with only RO/WO permissions.
>>>>>>> Consequently, the IOMMUMemoryRegion could return different sections for
>>>>>>> read & write access.
>>>>>>>
>>>>>>
>>>>>> The consequence is not directly implied for me.
>>>>>>
>>>>>> Why having a single translation and page protection handle read/write
>>>>>> access is not enough here?
>>>>>>
>>>>>> What is the (real world) use case to return different physical memory
>>>>>> locations depending on read/write access for a given virtual address?
>>>>>
>>>>> When permission is granted, the wgChecker device returns the original
>>>>> region of the address, called the downstream region.
>>>>> When permission is denied, the wgChecker device returns a special
>>>>> MemoryRegion for error handling, called the blocked_io region.
>>>>> This region may trigger an IRQ and a bus error. Additionally, memory
>>>>> accesses to this region will result in 'read 0, write ignored' rather
>>>>> than triggering exceptions from the MMU or MPU.
>>>>>
>>>>> Consequently, implementing RWX permission checks on security
>>>>> protection devices may return different regions (either downstream or
>>>>> blocked_io) depending on the access type.
>>>>> This design is similar to the existing ARM MPC device
>>>>> (hw/misc/tz-mpc.c). The primary difference is that the MPC only checks
>>>>> the physical address rather than RWX permissions, meaning it returns
>>>>> either the downstream or blocked_io region for a single address
>>>>> regardless of the access type.
>>>>>
>>>>
>>>> It is indeed a big difference. Translation with the MPC device still
>>>> results in a single address space being chosen.
>>>>
>>>>> I also considered an alternative design where CPUTLBEntryFull stores
>>>>> both the successful permission and the blocked_io region of the IOMMU
>>>>> region.
>>>>> In this scenario, the slow-path code utilizing CPUTLBEntryFull would
>>>>> check permissions and return the blocked_io region if access is denied.
>>>>>
>>>>> However, the address_space_translate_iommu() implementation supports
>>>>> recursive IOMMU region translation.
>>>>> If multiple IOMMU regions are encountered during a single address
>>>>> translation, storing only the first region's permissions is
>>>>> insufficient.
>>>>> Ultimately, we still face a situation where RWX permissions might
>>>>> return different regions separately (e.g., downstream,
>>>>> blocked_io_iommu1, and blocked_io_iommu2).
>>>>>
>>>>
>>>> The "blocked_io_iommu_X" is a consequence of the current choice of
>>>> implementation, but I still don't see why it's an absolute necessity.
>>>>
>>>
>>> I am open to removing it if the community agrees. I believe platforms
>>> with the ARM MPC do not use multiple IOMMU regions when translating a
>>> single address.
>>>
>>
>> I am not an expert on the topic, but it seems MPC is mostly used with
>> microcontrollers with trustzone and is much more limited in scope than
>> WorldChecker. WorldChecker seems to be a mix of MMU + IOMMU + something
>> like Arm Granule Protection Table in a single external component.
>>
>> I am probably biased by Arm "World" implementation of this (out of MPC),
>> where CPU and SMMU play this role, instead of an external component.
>>
>> Going back to MPC, the translation is still deterministic and not based
>> on type of access. From what I understand, MPC can statically forbid
>> accesses to specific blocks, simply based on original address.
> 
> ARMv9-A GPC verifies permissions within the CPU MMU and IOMMU. It
> performs checks as a memory transaction travels from the source
> (such as a CPU or DMA) toward the system bus. In contrast, the RISC-V
> wgChecker is positioned directly in front of peripheral devices. It
> performs checks as transactions travel from the bus toward the
> target devices. In this regard, the ARMv8-A TZASC [1] and ARMv8-M MPC
> are architecturally similar to the RISC-V wgChecker.
> 
> We can monitor all memory transactions in the SoC by implementing
> security filters at either every transaction source or every
> transaction target. Both designs are reasonable; they represent
> different approaches to SoC security architecture.
>

Sure, I was not implying anything is wrong here, just asking why
it was implemented this particular way in QEMU, and which alternatives
are possible.

> The ARMv8-A TZASC [1] is also a target-side filter and includes RW
> permission checks. Although the TZASC device is not currently supported
> in QEMU, an implementation could leverage these generic code changes.
> Target-side filtering with RW permission checks is not a RISC-V-only
> design.
>

That's an excellent argument for bringing this in generic TLB management 👍!

> [1] https://developer.arm.com/documentation/ddi0504/c/introduction/about-the-tzc-400/tzc-400-example-system
> 
>>
>>>> An alternative would be to treat all accesses as MMIO, but I guess your
>>>> goal here is to optimize the code path where we access RAM?
>>>>
>>>
>>> If the downstream region is a memory region, it will update the TLB
>>> flags and vaddr in addr_idx for the successful permissions of IOMMU
>>> region.
>>> To explain further: In the alternative approach, we would obtain the
>>> downstream region, IOMMU permissions, and the blocked_io region from
>>> the IOMMU translation function (which would require an additional
>>> return value or API to get the blocked_io region). The TLB entry will
>>> handle the downstream region as a non-IOMMU region but will only
>>> update the TLB flags for the IOMMU permissions; it will also store the
>>> permissions and the blocked_io region. The slow-path code checks the
>>> IOMMU permissions. If permission is denied, it will perform the
>>> transaction on the blocked_io region instead.
>>> We can remove the lazy IOMMU translation if using an alternative approach.
>>>
>>
>> What I was suggesting in my question was to force any access to a
>> MemoryRegion handled by a wgChecker to go through a read or write
>> callback, and to do the actual check there.
>>
>> For instance, MPC redirects blocked calls to:
>> static const MemoryRegionOps tz_mpc_mem_blocked_ops = {
>>    .read_with_attrs = tz_mpc_mem_blocked_read,
>>    .write_with_attrs = tz_mpc_mem_blocked_write,
>>
>> So I was suggesting something like:
>> static const MemoryRegionOps wgchecker_ops = {
>>    .read_with_attrs = wgchecker_mem_read,
>>    .write_with_attrs = wgchecker_mem_write,
>>
>> And redirect all read/write to those callbacks, thus turning the whole
>> range into an MMIO range. Thus my original question to understand if
>> your main concern is runtime performance or not.
> 
> Yes, runtime performance is also critical to us.
> We have positioned the wgChecker in front of the DRAM [2] to partition
> the memory for each world.
> If DRAM accesses took the slow path, Linux would be too slow to boot.
> 
> [2] Add WG support to virt machine:
> https://patchew.org/QEMU/20251021155548.584543-1-jim.shu@sifive.com/20251021162108.585468-4-jim.shu@sifive.com/
>

Sure, makes sense.

>>>> Why can't you reuse the existing page protection mechanism for this,
>>>> authorizing read or write? WorldChecker just seems to be an additional
>>>> check on top of the existing translation. The fact it's an "additional"
>>>> device is a design/implementation detail, and it could simply be part of
>>>> the page protection mechanism. It might require some additional plumbing
>>>> when an interrupt is raised though, to redirect this correctly instead
>>>> of signaling the CPU like it would normally do.
>>>>
>>>> The fact that it takes a different action based on the read/write
>>>> attribute does not fit very well with the existing implementation, as
>>>> you noticed. And I wonder if it's worth changing the global design just
>>>> to optimize this (optional) use case for RISC-V.
>>>>
>>>
>>> In the SoC architecture, the wgCheckers are positioned in front of
>>> devices (either memory or MMIO). When the CPU or DMA sends
>>> transactions to the device, the device's wgChecker first performs a
>>> permission check to determine if the transaction should be forwarded.
>>> If we move this check to the architecture's tlb_fill() function, how
>>> would the CPU code determine if transactions were sent to the device
>>> (or to the upstream region of the wgChecker)? The CPU code would need
>>> to be aware of the memory tree hierarchy to do that, which I believe
>>> is more difficult to implement.
>>>
>>
>> The memory hierarchy has to be known anyway by the component dealing
>> with that. Currently, TLB is responsible for it, iterating through the
>> different regions. However, it could be wgChecker code that does that,
>> or tlb_fill() for RISC-V.
>>
>> My only concern with existing design is that it pushes all that on
>> generic TLB code, for a feature that is optional and only used by
>> RISC-V, but still impacting all architectures.
>>
> 
> I don't think the current address_space/flatview APIs can determine whether
> a PA will access an IOMMUMemoryRegion and return that specific region.
> To perform this check in tlb_fill() or in the wgChecker device, I would
> need to add a new API to look up the correct wgChecker instance and
> run its permission check. Thus, this approach still requires
> modifying the generic code to add the new address_space API.
> 
> I can try my best to minimize the changes to the generic TLB code, but it
> is impossible to support wgChecker without any modification of generic
> code. The original MPC patchset also added CPU-side IOMMU region
> support to address_space_translate_for_iotlb() [3] and added the
> TCGIOMMUNotifier.
> 
> My v1 patchset modified less of the generic TLB code, but it
> relied on ping-pong TLB entries to handle R and W permissions separately.
> That is a little tricky and may not be easy to maintain. In the v2
> patchset, I tried to formalize how a TLB entry changes for an IOMMU
> region with RW permissions via lazy IOMMU translation.
>

Your approach is good, and I just wanted to understand why it's the
right way to go, and what other use cases there might be for this.

> [3] https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg00666.html
> 
>> It's not a no, I don't have any authority on this part of codebase, but
>> just a question to understand why it can't be solved another way that
>> would be RISC-V specific.
> 
> As I mentioned above, ARMv8-A TZASC also has a similar design.
> 
> 
>>
>>> In the current QEMU architecture, when a security protection device
>>> sits in front of a device, placing it within the memory tree seems to
>>> be the most suitable approach, similar to the ARM MPC device. The
>>> address_space*() API can check if a transaction will be forwarded to
>>> the device and perform the permission check via the IOMMU region. Both
>>> the CPU-side address_space_translate_for_iotlb() and the DMA-side
>>> address_space_translate_iommu() handle this. The only missing
>>> component is that the CPU TLB cannot handle RWX permission checks for
>>> IOMMU regions within the memory tree.
>>>
>>
>> I understand this and what you want.
>> I'm really open to it if anyone else has feedback on this, and help
>> decide if it's worth changing generic TLB code for this use case.
>>
>>>
>>>
>>>
>>>>>>
>>>>>> I'm not against the goal of this series, but trying to understand why it
>>>>>> went in this direction, which complicates this already complex code path.
>>>>>>
>>>>>>> To support such IOMMUMemoryRegion behavior in the CPU path, the design
>>>>>>> of IOMMU translation must be updated:
>>>>>>>
>>>>>>> 1. address_space_translate*() must now pass the access_type to
>>>>>>>       IOMMUMemoryRegion.
>>>>>>> 2. Since IOMMU translation results are too complex to be fully stored
>>>>>>>       in the CPU TLB, we will defer the translation until the actual access
>>>>>>>       occurs. Also, TLB is allowed to store the untranslated IOMMU region.
>>>>>>>
>>>>>>> To implement deferred IOMMU translation, this patchset introduces the
>>>>>>> following changes:
>>>>>>>
>>>>>>> 1. tlb_set_page_full() no longer translates the IOMMU region
>>>>>>>       immediately. Instead, it stores the untranslated region directly in
>>>>>>>       the TLB. A new slow-path flag, TLB_IOMMU, is introduced to force
>>>>>>>       access into the slow path when a region has not yet been translated
>>>>>>>       in the TLB entry.
>>>>>>>
>>>>>>> 2. When the CPU utilizes a TLB entry in the slow path, it should perform
>>>>>>>       the lazy IOMMU translation of the access_type first. The resulting
>>>>>>>       translated region and access type are stored in CPUTLBEntryFull.
>>>>>>>       Since the slow path always performs lazy translation first, we can
>>>>>>>       switch the CPUTLBEntryFull content to the correct access type before
>>>>>>>       use.
>>>>>>>
>>>>>>> 3. To accelerate memory access in the fast path, lazy translation can
>>>>>>>       update the addend of the CPUTLBEntry when translating the region to a
>>>>>>>       host memory region. We restrict the IOMMU region to have a single
>>>>>>>       non-zero 'addend' across all permissions. If a second 'addend' is
>>>>>>>       present for a CPUTLBEntry, QEMU will trigger an assertion. This
>>>>>>>       limitation is sufficient for security devices, as their "secondary"
>>>>>>>       region is typically an IO region used to emulate device error
>>>>>>>       handling when access is rejected.
>>>>>>>
>>>>>>> 4. To support non-slow TLB flags, lazy translation can update the TLB
>>>>>>>       flags in the 'addr_idx' of the CPUTLBEntry. Lazy translation only
>>>>>>>       updates the flags for the permissions specified in @prot. This
>>>>>>>       ensures that each access_type of a translated region maintains
>>>>>>>       independent TLB flags. For example, TLB_DIRTY of a memory region will
>>>>>>>       not be "polluted" by another permission that translates to a different
>>>>>>>       region.
>>>>>>>
>>>>>>> Both RISC-V wgChecker [1] and RISC-V IOPMP [2] devices require this
>>>>>>> feature.
>>>>>>>
>>>>>>> [1] RISC-V WG:
>>>>>>> https://patchew.org/QEMU/20251021155548.584543-1-jim.shu@sifive.com/
>>>>>>> [2] RISC-V IOPMP:
>>>>>>> https://patchew.org/QEMU/20250312093735.1517740-1-ethan84@andestech.com/
>>>>>>>
>>>>>>> Changed since v1:
>>>>>>> - Remove the ping-pong TLB entry behavior. Instead, defer the IOMMU
>>>>>>>      translation until actual access in the CPU path. Provide an IOMMU
>>>>>>>      lazy translation function with the special handling of 'addend'
>>>>>>>      and 'addr_idx' fields of CPUTLBEntry.
>>>>>>> - Fix the checkpatch warning.
>>>>>>>
>>>>>>>
>>>>>>> Jim Shu (5):
>>>>>>>      accel/tcg: Pass access_type as an argument of tlb_set_page*()
>>>>>>>      accel/tcg: address_space_translate*() will pass the correct
>>>>>>>        iommu_flags
>>>>>>>      accel/tcg: Provide early AS translate function
>>>>>>>      accel/tcg: Add IOMMU lazy translation function
>>>>>>>      accel/tcg: Support IOMMU lazy translation in CPU TLB
>>>>>>>
>>>>>>>     accel/tcg/cputlb.c                   | 247 +++++++++++++++++++++++++--
>>>>>>>     include/accel/tcg/iommu.h            |  17 +-
>>>>>>>     include/exec/cputlb.h                |  11 +-
>>>>>>>     include/exec/tlb-flags.h             |   4 +-
>>>>>>>     include/hw/core/cpu.h                |  15 ++
>>>>>>>     system/physmem.c                     |  60 ++++++-
>>>>>>>     target/alpha/helper.c                |   2 +-
>>>>>>>     target/avr/helper.c                  |   3 +-
>>>>>>>     target/hppa/mem_helper.c             |   1 -
>>>>>>>     target/i386/tcg/system/excp_helper.c |   3 +-
>>>>>>>     target/loongarch/tcg/tlb_helper.c    |   2 +-
>>>>>>>     target/m68k/helper.c                 |  10 +-
>>>>>>>     target/microblaze/helper.c           |   8 +-
>>>>>>>     target/mips/tcg/system/tlb_helper.c  |   4 +-
>>>>>>>     target/or1k/mmu.c                    |   2 +-
>>>>>>>     target/ppc/mmu_helper.c              |   2 +-
>>>>>>>     target/riscv/cpu_helper.c            |   2 +-
>>>>>>>     target/rx/cpu.c                      |   3 +-
>>>>>>>     target/s390x/tcg/excp_helper.c       |   2 +-
>>>>>>>     target/sh4/helper.c                  |   3 +-
>>>>>>>     target/sparc/mmu_helper.c            |   6 +-
>>>>>>>     target/tricore/helper.c              |   2 +-
>>>>>>>     target/xtensa/helper.c               |   3 +-
>>>>>>>     23 files changed, 354 insertions(+), 58 deletions(-)
>>>>>>>
>>>>>>
>>>>>
>>>>> Regards,
>>>>> Jim
>>>>
>>
>> Regards,
>> Pierrick
> 
> Regards,
> Jim

Regards,
Pierrick

Re: [PATCH v2 0/5] Defer the IOMMU translation and support access_type

Posted by Philippe Mathieu-Daudé 1 month, 1 week ago

Cc'ing Pierrick & Mark

On 21/4/26 18:29, Jim Shu wrote:
> Note: v1 title is "accel/tcg: Pass the access_type to IOMMUMemoryRegion"
> 
> Incoming security protection devices feature more complex IOMMUMemoryRegion
> implementation in the CPU path than ARM MPC device. For example,
> RISC-V wgChecker [1] may permit the access with only RO/WO permissions.
> Consequently, the IOMMUMemoryRegion could return different sections for
> read & write access.
> 
> To support such IOMMUMemoryRegion behavior in the CPU path, the design
> of IOMMU translation must be updated:
> 
> 1. address_space_translate*() must now pass the access_type to
>     IOMMUMemoryRegion.
> 2. Since IOMMU translation results are too complex to be fully stored
>     in the CPU TLB, we will defer the translation until the actual access
>     occurs. Also, TLB is allowed to store the untranslated IOMMU region.
> 
> To implement deferred IOMMU translation, this patchset introduces the
> following changes:
> 
> 1. tlb_set_page_full() no longer translates the IOMMU region
>     immediately. Instead, it stores the untranslated region directly in
>     the TLB. A new slow-path flag, TLB_IOMMU, is introduced to force
>     access into the slow path when a region has not yet been translated
>     in the TLB entry.
> 
> 2. When the CPU utilizes a TLB entry in the slow path, it should perform
>     the lazy IOMMU translation of the access_type first. The resulting
>     translated region and access type are stored in CPUTLBEntryFull.
>     Since the slow path always performs lazy translation first, we can
>     switch the CPUTLBEntryFull content to the correct access type before
>     use.
> 
> 3. To accelerate memory access in the fast path, lazy translation can
>     update the addend of the CPUTLBEntry when translating the region to a
>     host memory region. We restrict the IOMMU region to have a single
>     non-zero 'addend' across all permissions. If a second 'addend' is
>     present for a CPUTLBEntry, QEMU will trigger an assertion. This
>     limitation is sufficient for security devices, as their "secondary"
>     region is typically an IO region used to emulate device error
>     handling when access is rejected.
> 
> 4. To support non-slow TLB flags, lazy translation can update the TLB
>     flags in the 'addr_idx' of the CPUTLBEntry. Lazy translation only
>     updates the flags for the permissions specified in @prot. This
>     ensures that each access_type of a translated region maintains
>     independent TLB flags. For example, the TLB_DIRTY flag of a memory
>     region will not be "polluted" by another permission that translates
>     to a different region.
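
As a rough illustration of points 3 and 4 above, here is a toy model of
the bookkeeping. It is not code from this series: every Toy*/toy_* name
is hypothetical, and the layout is far simpler than QEMU's real
CPUTLBEntry. It only shows the two stated rules: one non-zero addend per
entry, and flags tracked separately per access type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not QEMU's real definitions. */
enum { ACC_READ = 0, ACC_WRITE = 1, ACC_EXEC = 2, ACC_COUNT = 3 };

#define TOY_TLB_IOMMU (1u << 0) /* not yet translated: force slow path */
#define TOY_TLB_MMIO  (1u << 1) /* translated to an I/O region */

typedef struct {
    uint32_t flags[ACC_COUNT]; /* independent flags per access type */
    uintptr_t addend;          /* the single non-zero addend, 0 if unset */
} ToyTLBEntry;

typedef struct {
    bool is_ram;      /* true if the access resolved to host memory */
    uintptr_t addend; /* only meaningful when is_ram is true */
} ToyTranslation;

/* Fill the entry without translating: every access type is deferred. */
static inline void toy_tlb_fill(ToyTLBEntry *e)
{
    for (int i = 0; i < ACC_COUNT; i++) {
        e->flags[i] = TOY_TLB_IOMMU;
    }
    e->addend = 0;
}

/* Record the translation result for one access type only. */
static inline void toy_lazy_translate(ToyTLBEntry *e, int access,
                                      ToyTranslation t)
{
    if (t.is_ram) {
        /* A second, different non-zero addend is not supported. */
        assert(e->addend == 0 || e->addend == t.addend);
        e->addend = t.addend;
        e->flags[access] = 0;            /* fast path now allowed */
    } else {
        e->flags[access] = TOY_TLB_MMIO; /* stays on the slow path */
    }
}

static inline bool toy_needs_slow_path(const ToyTLBEntry *e, int access)
{
    return e->flags[access] != 0;
}
```

In this model, translating a read to RAM clears only the read flags, so a
later write that resolves to an I/O region stays on the slow path without
disturbing the read side or the stored addend.
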
> 
> Both RISC-V wgChecker [1] and RISC-V IOPMP [2] devices require this
> feature.
> 
> [1] RISC-V WG:
> https://patchew.org/QEMU/20251021155548.584543-1-jim.shu@sifive.com/
> [2] RISC-V IOPMP:
> https://patchew.org/QEMU/20250312093735.1517740-1-ethan84@andestech.com/
> 
> Changed since v1:
> - Remove the ping-pong TLB entry behavior. Instead, defer the IOMMU
>    translation until the actual access in the CPU path. Provide an IOMMU
>    lazy translation function with special handling of the 'addend'
>    and 'addr_idx' fields of CPUTLBEntry.
> - Fix the checkpatch warning.
> 
> 
> Jim Shu (5):
>    accel/tcg: Pass access_type as an argument of tlb_set_page*()
>    accel/tcg: address_space_translate*() will pass the correct
>      iommu_flags
>    accel/tcg: Provide early AS translate function
>    accel/tcg: Add IOMMU lazy translation function
>    accel/tcg: Support IOMMU lazy translation in CPU TLB