[RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Mostafa Saleh 1 year ago
This is v2 of the series sent last year:
https://lore.kernel.org/kvmarm/20230201125328.2186498-1-jean-philippe@linaro.org/

pKVM overview:
=============
The pKVM hypervisor, recently introduced on arm64, provides a separation
of privileges between the host and hypervisor parts of KVM, where the
hypervisor is trusted by guests but the host is not [1][2]. The host is
initially trusted during boot, but its privileges are reduced after KVM
is initialized so that, if an adversary later gains access to the large
attack surface of the host, it cannot access guest data.

Currently with pKVM, the host can still instruct DMA-capable devices
like the GPU to access guest and hypervisor memory, which undermines
this isolation. Preventing DMA attacks requires an IOMMU, owned by the
hypervisor.

This series adds a hypervisor driver for the Arm SMMUv3 IOMMU. Since the
hypervisor part of pKVM (called nVHE here) is minimal, moving the whole
host SMMU driver into nVHE isn't really an option. It is too large and
complex and requires infrastructure from all over the kernel. We add a
reduced nVHE driver that deals with populating the SMMU tables and the
command queue, and the host driver still deals with probing and some
initialization.

Some of the pKVM infrastructure that this series depends on is not upstream
yet, so this should be considered a forward-looking RFC describing how we
think DMA isolation can be supported in pKVM, or in other similar
confidential computing solutions, rather than a ready-to-merge solution.
This is discussed further in the dependencies section below.

Patches overview
================
The patches are split as follows:
Patches 1-10: Mostly about splitting the current SMMUv3 driver and
io-pgtable-arm library, so the code can be re-used in the KVM driver
either inside the kernel or the hypervisor.
Most of these patches are best reviewed with git's --color-moved.

Patches 11-24: Introduce the hypervisor core code for IOMMUs, which is
not specific to SMMUv3: the hypercall handlers and common logic in the
hypervisor.
They also introduce the key functions __pkvm_host_{un}use_dma_page, which
are used to track DMA-mapped pages; more on this in the design section.

Patches 25-41: Add the hypervisor part of the KVM SMMUv3 driver, which
is called by the hypervisor core IOMMU code to implement para-virtualized
operations such as attach/detach, map/unmap...

Patches 42-54: Add the kernel part of the KVM SMMUv3 driver. This
probes and initialises the IOMMUs and passes the list of SMMUs to the
hypervisor; it also implements the kernel iommu_ops and registers the
IOMMUs with the kernel.

Patches 55-58: Two extra optimizations, introduced at the end to avoid
complicating the start of the series: one optimises iommu_map_sg and the
other batches TLB invalidation, which I noticed to be a problem while
testing as my HW doesn’t support range invalidation.

A development branch is available at:
https://android-kvm.googlesource.com/linux/+log/refs/heads/for-upstream/pkvm-smmu

Design
======
We've explored 4 solutions so far; only the two I believe are the most
promising are described here, as they offer private IO spaces. The others
were discussed in the cover letter of v1 of the series.

1. Paravirtual I/O page tables
This is the solution implemented in this series. The host creates
IOVA->HPA mappings with two hypercalls, map_pages() and unmap_pages(), and
the hypervisor populates the page tables. Page tables are abstracted into
IOMMU domains, which allow multiple devices to share the same address
space. Another four hypercalls, alloc_domain(), attach_dev(), detach_dev()
and free_domain(), manage the domains; their semantics are almost
identical to the IOMMU ops, which keeps the kernel driver part simple.
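
As a rough illustration of the shape of this interface (the names and
signatures below are assumptions for exposition, not the series' actual
symbols), the host driver essentially forwards its iommu_ops callbacks to
the hypervisor, which owns the real page tables:

/*
 * Illustrative sketch only: hypercall surface between the kernel driver
 * (EL1) and the hypervisor IOMMU core (EL2). Domain IDs are opaque
 * handles shared between both sides.
 */
struct pkvm_iommu_hvc_example {
	/* Domain lifecycle. */
	int (*alloc_domain)(unsigned int domain_id, int type);
	int (*free_domain)(unsigned int domain_id);

	/* Attach/detach an endpoint (stream ID) to/from a domain. */
	int (*attach_dev)(unsigned int iommu_id, unsigned int domain_id,
			  unsigned int endpoint_id, unsigned int pasid);
	int (*detach_dev)(unsigned int iommu_id, unsigned int domain_id,
			  unsigned int endpoint_id, unsigned int pasid);

	/* IOVA -> host PA mappings, validated and installed at EL2. */
	int (*map_pages)(unsigned int domain_id, unsigned long iova,
			 phys_addr_t paddr, size_t pgsize, size_t pgcount,
			 int prot);
	size_t (*unmap_pages)(unsigned int domain_id, unsigned long iova,
			      size_t pgsize, size_t pgcount);
};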

Some key points in the hypervisor design:
a- Tracking mapped pages: the hypervisor must prevent pages that are
   mapped in the IOMMU from being donated to a protected guest or the
   hypervisor, and must not allow a protected guest/hypervisor page to be
   mapped in an IOMMU domain.

   For that we rely on the vmemmap refcount: each time a page is mapped,
   its refcount is incremented and its ownership is checked, and each
   time it's successfully unmapped the refcount is decremented. Any
   memory donation is denied for refcounted pages (see the sketch after
   this list).

b- Locking: io-pgtable-arm is lockless under some guarantees about how
   the IOMMU code behaves. However, with pKVM the kernel is not trusted,
   and a malicious kernel can issue concurrent requests causing memory
   corruption or UAF, so the page table code has to be locked in the
   hypervisor.

c- Memory management: The hypervisor needs a way to allocate pages for
   the paravirtual page tables. For that, an IOMMU pool is created which
   can be topped up via a hypercall, and the IOMMU hypercalls return
   encoded memory requests which can be fulfilled by the kernel driver.
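
To illustrate point (a), below is a simplified sketch of what the map
path does before installing a mapping; the entry points mirror the
__pkvm_host_{un}use_dma_page functions added by the series, but the
bodies and the page-state helper used here are assumptions, not the exact
hypervisor code:

/*
 * Simplified sketch: on map, the page must be owned by the host and its
 * vmemmap refcount is raised so that a later donation to a protected
 * guest or the hypervisor is refused; on unmap the refcount is dropped.
 */
static int __pkvm_host_use_dma_page(phys_addr_t phys)
{
	struct hyp_page *p = hyp_phys_to_page(phys);

	/* Refuse pages the host does not own (guest/hypervisor memory). */
	if (host_page_state(phys) != PKVM_PAGE_OWNED)	/* assumed helper */
		return -EPERM;

	hyp_page_ref_inc(p);	/* donation checks refuse refcounted pages */
	return 0;
}

static void __pkvm_host_unuse_dma_page(phys_addr_t phys)
{
	hyp_page_ref_dec(hyp_phys_to_page(phys));
}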

2. Nested SMMUv3 translation (with emulation)
Another approach is to rely on nested translation support, which is
optional in SMMUv3. That requires an architecturally accurate emulation
of SMMUv3, which can be complicated, including cmdq emulation.

With this approach, we can use the same page tables as the CPU stage-2,
which adds more constraints on the HW (SMMUv3 features must match the
CPU) and requires devices to handle faults, as the CPU side relies on
lazy mapping and gives no guarantees about pages being mapped.
Alternatively, we can use a shadow IOMMU page table instead.

I have a prototype for nested that is not yet ready to be posted:
https://android-kvm.googlesource.com/linux/+log/refs/heads/smostafa/android15-6.6-smmu-nesting-wip


The trade off between the 2 approaches can be roughly summarised as:
Paravirtualization:
- Compatible with more HW (and IOMMUs).
- Better DMA performance due to shorter table walks/less TLB pressure
- Needs extra complexity to squeeze the last bit of optimization (around
  unmap, and map_sg).

Nested Emulation
- Faster map_pages (not sure about unmap because it requires cmdq
  emulation for TLB invalidation if DVM not used).
- Needs extra complexity for architecturally emulating SMMUv3.

I believe that the first approach looks more promising with this trade
off. However, I plan to complete the nested emulation and post it with
a comparison with this approach in terms of performance, and maybe this
topic can be discussed in an upcoming conference.

Dependencies
============
This series depends on some parts of pKVM that are not upstream yet;
some of them are currently posted [3][4]. To avoid spamming the list with
many changes that are not relevant to IOMMU/SMMUv3, the patches are
developed on top of them.

This series also depends on another series reworking the io-pgtable
walker [5].

Performance
===========
Measured with CONFIG_DMA_MAP_BENCHMARK on a 4-core Morello board.
Numbers represent the average time needed for one dma_map/dma_unmap call
in μs (shown as map/unmap); lower is better.
The comparison is against the kernel driver, which is not quite fair as
it doesn't fulfil pKVM's DMA isolation requirements. However, these
numbers are provided just to give a rough idea of what the overhead
looks like.
			Kernel driver	      pKVM driver
4K - 1 thread		0.1/0.7               0.3/1.3
4K - 4 threads		0.1/1.1               0.5/3.3
1M - 1 thread		0.8/21.5              2.6/27.3
1M - 4 threads		1.1/45.7              3.6/46.2

And tested as follows:
echo dma_map_benchmark > /sys/bus/pci/devices/0000\:06\:00.0/driver_override
echo 0000:06:00.0 >  /sys/bus/pci/devices/0000\:06\:00.0/driver/unbind
echo 0000:06:00.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
./dma_map_benchmark -t $threads -g $nr_pages


Future work
==========
- Add IDENTITY_DOMAIN support; I already have some patches for that, but
  didn’t want to complicate this series. I can send them separately.
- Complete the comparison with the nesting support and find the most
  suitable solution for upstream.


Main changes since v1
=====================
- Patches are reordered to split the introduction of the KVM IOMMU
  code and the SMMUv3 driver.
- The KVM EL2 code is closer to the EL1 code, where domains are decoupled
  from IOMMUs.
- SMMUv3 new features (stage-1 support, IRQ and EVTQ in the kernel).
- Adaptions to the new SMMUv3 cleanups.
- Rework tracking of mapped pages to improve performance.
- Rework locking to improve performance.
- Rework unmap to improve performance.
- Adding iotlb_gather to optimize unmap.
- Add new operations to optimize map_sg operation.
- Driver registration is done dynamically instead of being statically
  checked.
- Memory allocation for page table pages is changed to a separate pool
  and HVCs instead of a shared memcache that required atomic allocation.
- Support for higher order page allocation.
- Support for non-coherent SMMUs.
- Support for DABT and MMIO emulation.


[1] https://lore.kernel.org/kvmarm/20220519134204.5379-1-will@kernel.org/
[2] https://www.youtube.com/watch?v=9npebeVFbFw
[3] https://lore.kernel.org/kvmarm/20241203103735.2267589-1-qperret@google.com/
[4] https://lore.kernel.org/all/20241202154742.3611749-1-tabba@google.com/
[5] https://lore.kernel.org/linux-iommu/20241028213146.238941-1-robdclark@gmail.com/T/#t


Jean-Philippe Brucker (23):
  iommu/io-pgtable-arm: Split the page table driver
  iommu/io-pgtable-arm: Split initialization
  iommu/io-pgtable: Add configure() operation
  iommu/arm-smmu-v3: Move some definitions to arm64 include/
  iommu/arm-smmu-v3: Extract driver-specific bits from probe function
  iommu/arm-smmu-v3: Move some functions to arm-smmu-v3-common.c
  iommu/arm-smmu-v3: Move queue and table allocation to
    arm-smmu-v3-common.c
  iommu/arm-smmu-v3: Move firmware probe to arm-smmu-v3-common
  iommu/arm-smmu-v3: Move IOMMU registration to arm-smmu-v3-common.c
  KVM: arm64: pkvm: Add pkvm_udelay()
  KVM: arm64: pkvm: Add __pkvm_host_add_remove_page()
  KVM: arm64: pkvm: Support SCMI power domain
  KVM: arm64: iommu: Support power management
  KVM: arm64: iommu: Add SMMUv3 driver
  KVM: arm64: smmu-v3: Initialize registers
  KVM: arm64: smmu-v3: Setup command queue
  KVM: arm64: smmu-v3: Reset the device
  KVM: arm64: smmu-v3: Support io-pgtable
  iommu/arm-smmu-v3-kvm: Add host driver for pKVM
  iommu/arm-smmu-v3-kvm: Pass a list of SMMU devices to the hypervisor
  iommu/arm-smmu-v3-kvm: Validate device features
  iommu/arm-smmu-v3-kvm: Allocate structures and reset device
  iommu/arm-smmu-v3-kvm: Probe power domains

Mostafa Saleh (35):
  iommu/arm-smmu-v3: Move common irq code to common file
  KVM: arm64: Add __pkvm_{use, unuse}_dma()
  KVM: arm64: Introduce IOMMU driver infrastructure
  KVM: arm64: pkvm: Add IOMMU hypercalls
  KVM: arm64: iommu: Add a memory pool for the IOMMU
  KVM: arm64: iommu: Add domains
  KVM: arm64: iommu: Add {attach, detach}_dev
  KVM: arm64: iommu: Add map/unmap() operations
  KVM: arm64: iommu: support iommu_iotlb_gather
  KVM: arm64: Support power domains
  KVM: arm64: iommu: Support DABT for IOMMU
  KVM: arm64: smmu-v3: Setup stream table
  KVM: arm64: smmu-v3: Setup event queue
  KVM: arm64: smmu-v3: Add {alloc/free}_domain
  KVM: arm64: smmu-v3: Add TLB ops
  KVM: arm64: smmu-v3: Add context descriptor functions
  KVM: arm64: smmu-v3: Add attach_dev
  KVM: arm64: smmu-v3: Add detach_dev
  iommu/io-pgtable: Generalize walker interface
  iommu/io-pgtable-arm: Add post table walker callback
  drivers/iommu: io-pgtable: Add IO_PGTABLE_QUIRK_UNMAP_INVAL
  KVM: arm64: smmu-v3: Add map/unmap pages and iova_to_phys
  KVM: arm64: smmu-v3: Add DABT handler
  KVM: arm64: Add function to topup generic allocator
  KVM: arm64: Add macro for SMCCC call with all returns
  iommu/arm-smmu-v3-kvm: Add function to topup IOMMU allocator
  iommu/arm-smmu-v3-kvm: Add IOMMU ops
  iommu/arm-smmu-v3-kvm: Add map, unmap and iova_to_phys operations
  iommu/arm-smmu-v3-kvm: Support PASID operations
  iommu/arm-smmu-v3-kvm: Add IRQs for the driver
  iommu/arm-smmu-v3-kvm: Enable runtime PM
  drivers/iommu: Add deferred map_sg operations
  KVM: arm64: iommu: Add hypercall for map_sg
  iommu/arm-smmu-v3-kvm: Implement sg operations
  iommu/arm-smmu-v3-kvm: Support command queue batching

 arch/arm64/include/asm/arm-smmu-v3-common.h   |  592 +++++++
 arch/arm64/include/asm/kvm_asm.h              |    9 +
 arch/arm64/include/asm/kvm_host.h             |   48 +-
 arch/arm64/include/asm/kvm_hyp.h              |    2 +
 arch/arm64/kvm/Makefile                       |    2 +-
 arch/arm64/kvm/arm.c                          |    8 +-
 arch/arm64/kvm/hyp/hyp-constants.c            |    1 +
 arch/arm64/kvm/hyp/include/nvhe/iommu.h       |   91 ++
 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |    3 +
 arch/arm64/kvm/hyp/include/nvhe/mm.h          |    1 +
 arch/arm64/kvm/hyp/include/nvhe/pkvm.h        |   37 +
 .../arm64/kvm/hyp/include/nvhe/trap_handler.h |    2 +
 arch/arm64/kvm/hyp/nvhe/Makefile              |    6 +-
 arch/arm64/kvm/hyp/nvhe/alloc_mgt.c           |    2 +
 arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  114 ++
 arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c   | 1390 +++++++++++++++++
 .../arm64/kvm/hyp/nvhe/iommu/io-pgtable-arm.c |  153 ++
 arch/arm64/kvm/hyp/nvhe/iommu/iommu.c         |  490 ++++++
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         |  133 +-
 arch/arm64/kvm/hyp/nvhe/mm.c                  |   17 +
 arch/arm64/kvm/hyp/nvhe/power/hvc.c           |   47 +
 arch/arm64/kvm/hyp/nvhe/power/scmi.c          |  231 +++
 arch/arm64/kvm/hyp/nvhe/setup.c               |    9 +
 arch/arm64/kvm/hyp/nvhe/timer-sr.c            |   42 +
 arch/arm64/kvm/iommu.c                        |   89 ++
 arch/arm64/kvm/mmu.c                          |   20 +
 arch/arm64/kvm/pkvm.c                         |   20 +
 drivers/gpu/drm/msm/msm_iommu.c               |    5 +-
 drivers/iommu/Kconfig                         |    9 +
 drivers/iommu/Makefile                        |    2 +-
 drivers/iommu/arm/arm-smmu-v3/Makefile        |    7 +
 .../arm/arm-smmu-v3/arm-smmu-v3-common.c      |  824 ++++++++++
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c   | 1093 +++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  989 +-----------
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  758 +++------
 drivers/iommu/io-pgtable-arm-common.c         |  929 +++++++++++
 drivers/iommu/io-pgtable-arm.c                | 1061 +------------
 drivers/iommu/io-pgtable-arm.h                |   30 -
 drivers/iommu/io-pgtable.c                    |   15 +
 drivers/iommu/iommu.c                         |   53 +-
 include/kvm/arm_smmu_v3.h                     |   46 +
 include/kvm/iommu.h                           |   59 +
 include/kvm/power_domain.h                    |   24 +
 include/linux/io-pgtable-arm.h                |  233 +++
 include/linux/io-pgtable.h                    |   38 +-
 include/linux/iommu.h                         |   43 +-
 46 files changed, 7169 insertions(+), 2608 deletions(-)
 create mode 100644 arch/arm64/include/asm/arm-smmu-v3-common.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/iommu.h
 create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/io-pgtable-arm.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/power/hvc.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/power/scmi.c
 create mode 100644 arch/arm64/kvm/iommu.c
 create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
 create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
 create mode 100644 drivers/iommu/io-pgtable-arm-common.c
 delete mode 100644 drivers/iommu/io-pgtable-arm.h
 create mode 100644 include/kvm/arm_smmu_v3.h
 create mode 100644 include/kvm/iommu.h
 create mode 100644 include/kvm/power_domain.h
 create mode 100644 include/linux/io-pgtable-arm.h

-- 
2.47.0.338.g60cca15819-goog
Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Jason Gunthorpe 1 year ago
On Thu, Dec 12, 2024 at 06:03:24PM +0000, Mostafa Saleh wrote:

> This series adds a hypervisor driver for the Arm SMMUv3 IOMMU. Since the
> hypervisor part of pKVM (called nVHE here) is minimal, moving the whole
> host SMMU driver into nVHE isn't really an option. It is too large and
> complex and requires infrastructure from all over the kernel. We add a
> reduced nVHE driver that deals with populating the SMMU tables and the
> command queue, and the host driver still deals with probing and some
> initialization.

The cover letter doesn't explain why someone needs page tables in the
guest at all?

If you are able to implement nested support then you can boot the
guest with no-iommu and an effective identity translation through a
hypervisor controlled S2. ie no guest map/unmap. Great DMA
performance.

I thought the point of doing the paravirt here was to allow dynamic
pinning of the guest memory? This is the primary downside with nested.
The entire guest memory has to be pinned down at guest boot.

> 1. Paravirtual I/O page tables
> This is the solution implemented in this series. The host creates
> IOVA->HPA mappings with two hypercalls map_pages() and unmap_pages(), and
> the hypervisor populates the page tables. Page tables are abstracted into
> IOMMU domains, which allow multiple devices to share the same address
> space. Another four hypercalls, alloc_domain(), attach_dev(), detach_dev()
> and free_domain(), manage the domains, the semantics of those hypercalls
> are almost identical to the IOMMU ops which make the kernel driver part
> simpler.

That is re-inventing virtio-iommu. I don't really understand why this
series is hacking up arm-smmuv3 so much, that is not, and should not,
be a paravirt driver. Why not create a clean new pkvm specific driver
for the paravirt?? Or find a way to re-use parts of virtio-iommu?

Shouldn't other arch versions of pkvm be able to re-use the same guest
iommu driver?

> b- Locking: The io-pgtable-arm is lockless under some guarantees of how
>    the IOMMU code behaves. However with pKVM, the kernel is not trusted
>    and a malicious kernel can issue concurrent requests causing memory
>    corruption or UAF, so that it has to be locked in the hypervisor.

? I don't get it, the hypervisor page table has to be private to the
hypervisor. It is not that io-pgtable-arm is lockless, it is that it
relies on a particular kind of caller supplied locking. pkvm's calls
into its private io-pgtable-arm would need pkvm specific locking that
makes sense for it. Where does a malicious guest kernel get into this?

> 2. Nested SMMUv3 translation (with emulation)
> Another approach is to rely on nested translation support which is
> optional in SMMUv3, that requires an architecturally accurate emulation
> of SMMUv3 which can be complicated including cmdq emulation.

The confidential compute folks are going in this direction.

> The trade off between the 2 approaches can be roughly summarised as:
> Paravirtualization:
> - Compatible with more HW (and IOMMUs).
> - Better DMA performance due to shorter table walks/less TLB pressure
> - Needs extra complexity to squeeze the last bit of optimization (around
>   unmap, and map_sg).

It has better straight line DMA performance if the DMAs are all
static. Generally much, much worse performance if the DMAs are
dynamically mapped as you have to trap so much stuff.

The other negative is there is no way to get SVA support with
para-virtualization.

The positive is you don't have to pin the VM's memory.

> Nested Emulation
> - Faster map_pages (not sure about unmap because it requires cmdq
>   emulation for TLB invalidation if DVM not used).

If you can do nested then you can run in identity mode and then you
don't have any performance down side. It is a complete win.

Even if you do non-identity, nested is still likely faster for changing
translation than paravirt approaches. A single cmdq range invalidate
should be about the same broad overhead as a single paravirt call to
unmap except they can be batched under load.

Things like vCMDQ eliminate this overhead entirely, to my mind that is
the future direction of this HW as you obviously need to HW optimize
invalidation...

> - Needs extra complexity for architecturally emulating SMMUv3.

Lots of people have now done this, it is not really so bad. In
exchange you get a full architected feature set, better performance,
and are ready for HW optimizations.

> - Add IDENTITY_DOMAIN support, I already have some patches for that, but
>   didn’t want to complicate this series, I can send them separately.

This seems kind of pointless to me. If you can tolerate identity (ie
pin all memory) then do nested, and maybe don't even bother with a
guest iommu.

If you want most of the guest memory to be swappable/movable/whatever
then paravirt is the only choice, and you really don't want the guest
to have any identity support at all.

Really, I think you'd want to have both options, there is no "best"
here. It depends what people want to use the VM for.

My advice for merging would be to start with the pkvm side setting up
a fully pinned S2 and do not have a guest driver. Nesting without
emulating smmuv3. Basically you get protected identity DMA support. I
think that would be a much less sprawling patch series. From there it
would be well positioned to add both smmuv3 emulation and a paravirt
iommu flow.

Jason
Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Mostafa Saleh 1 year ago
Hi Jason,

Thanks a lot for taking the time to review this; I tried to reply to all
points. However, I think a main source of confusion was that this is only
for the host kernel, not guests: with this series guests still have no
access to DMA under pKVM. I hope that clarifies some of the points.

On Thu, Dec 12, 2024 at 03:41:19PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 12, 2024 at 06:03:24PM +0000, Mostafa Saleh wrote:
> 
> > This series adds a hypervisor driver for the Arm SMMUv3 IOMMU. Since the
> > hypervisor part of pKVM (called nVHE here) is minimal, moving the whole
> > host SMMU driver into nVHE isn't really an option. It is too large and
> > complex and requires infrastructure from all over the kernel. We add a
> > reduced nVHE driver that deals with populating the SMMU tables and the
> > command queue, and the host driver still deals with probing and some
> > initialization.
> 
> The cover letter doesn't explain why someone needs page tables in the
> guest at all?

This is not for guests but for the host: the hypervisor needs to
establish DMA isolation between the host and the hypervisor/guests.
Before these patches, as mentioned, the host can program a DMA-capable
device to read/write any memory (that has nothing to do with whether the
guest has DMA access or not).

So it’s mandatory for pKVM to establish DMA isolation, otherwise
it can be easily defeated.

However, guest DMA support is optional and only needed for device
passthrough. I have some patches to support that in pKVM as well (only
with vfio-platform), but they are unlikely to be posted upstream before a
host DMA isolation solution is merged first, as that is mandatory.

> 
> If you are able to implement nested support then you can boot the
> guest with no-iommu and an effective identity translation through a
> hypervisor controlled S2. ie no guest map/unmap. Great DMA
> performance.

We can do that for the host also, which is discussed in the v1 cover
letter. However, we try to keep feature parity with the normal (VHE)
KVM arm64 support, so constraining KVM support to not have IOVA spaces
for devices seems too much and impractical on modern systems (phones for
example).

> 
> I thought the point of doing the paravirt here was to allow dynamic
> pinning of the guest memory? This is the primary downside with nested.
> The entire guest memory has to be pinned down at guest boot.

As this is for the host, memory pinning is not really an issue (however,
with nesting and a shared CPU stage-2 there are other challenges, as
mentioned).

> 
> > 1. Paravirtual I/O page tables
> > This is the solution implemented in this series. The host creates
> > IOVA->HPA mappings with two hypercalls map_pages() and unmap_pages(), and
> > the hypervisor populates the page tables. Page tables are abstracted into
> > IOMMU domains, which allow multiple devices to share the same address
> > space. Another four hypercalls, alloc_domain(), attach_dev(), detach_dev()
> > and free_domain(), manage the domains, the semantics of those hypercalls
> > are almost identical to the IOMMU ops which make the kernel driver part
> > simpler.
> 
> That is re-inventing virtio-iommu. I don't really understand why this
> series is hacking up arm-smmuv3 so much, that is not, and should not,
> be a paravirt driver. Why not create a clean new pkvm specific driver
> for the paravirt?? Or find a way to re-use parts of virtio-iommu?
> 
> Shouldn't other arch versions of pkvm be able to re-use the same guest
> iommu driver?

As mentioned, this is for the host kernel, not the guest. However, the
hypervisor/kernel interface is not IOMMU specific, and it can be extended
to other IOMMUs/archs.

There is no hacking of the arm-smmu-v3 driver; it is mostly split so it
can be re-used, plus the introduction of a separate hypervisor driver.
It's similar to how SVA re-uses part of the driver, just on a bigger
scale.

> 
> > b- Locking: The io-pgtable-arm is lockless under some guarantees of how
> >    the IOMMU code behaves. However with pKVM, the kernel is not trusted
> >    and a malicious kernel can issue concurrent requests causing memory
> >    corruption or UAF, so that it has to be locked in the hypervisor.
> 
> ? I don't get it, the hypervisor page table has to be private to the
> hypervisor. It is not that io-pgtable-arm is lockless, it is that it
> relies on a particular kind of caller supplied locking. pkvm's calls
> into its private io-pgtable-arm would need pkvm specific locking that
> makes sense for it. Where does a malicious guest kernel get into this?

At the moment, when the kernel driver uses io-pgtable-arm it doesn’t
protect it with any locks, relying on some assumptions; for example,
concurrently unmapping a block-sized table and a page-sized leaf inside
it can cause a UAF, but the DMA API never does that.

With pKVM, the host kernel is not trusted, and if compromised it can
mount such attacks to corrupt hypervisor memory, so the hypervisor locks
io-pgtable-arm operations at EL2 to avoid that.
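
As a rough illustration (not the series' actual EL2 code; the structure,
field and lock names are assumptions), the hypervisor simply serializes
the untrusted host's page table operations per domain:

/*
 * Sketch only: a per-domain lock around io-pgtable-arm calls, so that a
 * malicious host issuing conflicting concurrent map/unmap hypercalls
 * cannot violate io-pgtable-arm's lockless assumptions.
 */
static size_t hyp_domain_unmap_pages(struct hyp_iommu_domain *domain,
				     unsigned long iova, size_t pgsize,
				     size_t pgcount)
{
	struct io_pgtable_ops *ops = domain->pgtable_ops;
	size_t unmapped;

	hyp_spin_lock(&domain->lock);
	unmapped = ops->unmap_pages(ops, iova, pgsize, pgcount, NULL);
	hyp_spin_unlock(&domain->lock);

	return unmapped;
}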

> 
> > 2. Nested SMMUv3 translation (with emulation)
> > Another approach is to rely on nested translation support which is
> > optional in SMMUv3, that requires an architecturally accurate emulation
> > of SMMUv3 which can be complicated including cmdq emulation.
> 
> The confidential compute folks are going in this direction.

I see, but one key advantage for pKVM is that it requires minimal
hardware; with the paravirtual approach we can support single-stage
SMMUv3 or even non-architected IOMMUs. That, plus the DMA performance,
might give it a slight edge, but as I mentioned, I plan to do a more
thorough comparison with nesting and maybe discuss it in a conference
this year.

> 
> > The trade off between the 2 approaches can be roughly summarised as:
> > Paravirtualization:
> > - Compatible with more HW (and IOMMUs).
> > - Better DMA performance due to shorter table walks/less TLB pressure
> > - Needs extra complexity to squeeze the last bit of optimization (around
> >   unmap, and map_sg).
> 
> It has better straight line DMA performance if the DMAs are all
> static. Generally much, much worse performance if the DMAs are
> dynamically mapped as you have to trap so much stuff.

I agree it’s not that clear; I will finish the nested implementation
and run some standard IO benchmarks.

> 
> The other negative is there is no way to get SVA support with
> para-virtualization.
> 
Yeah, SVA is tricky, I guess for that we would have to use nesting,
but tbh, I don’t think it’s a deal breaker for now.

> The positive is you don't have to pin the VM's memory.
> 
> > Nested Emulation
> > - Faster map_pages (not sure about unmap because it requires cmdq
> >   emulation for TLB invalidation if DVM not used).
> 
> If you can do nested then you can run in identity mode and then you
> don't have any performance down side. It is a complete win.

Unfortunately, as mentioned above it’s not that practical, many devices
in mobile space expect IO translation capability.

> 
> Even if you do non-idenity nested is still likely faster for changing
> translation than paravirt approaches. A single cmdq range invalidate
> should be about the same broad overhead as a single paravirt call to
> unmap except they can be batched under load.
> 
> Things like vCMDQ eliminate this overhead entirely, to my mind that is
> the future direction of this HW as you obviously need to HW optimize
> invalidation...
> 
> > - Needs extra complexity for architecturally emulating SMMUv3.
> 
> Lots of people have now done this, it is not really so bad. In
> exchange you get a full architected feature set, better performance,
> and are ready for HW optimizations.

It’s not impossible, it’s just more complicated doing it in the
hypervisor, which has limited features compared to the kernel. Also, I
haven’t seen any open source implementation of that except for QEMU,
which is in userspace.

> 
> > - Add IDENTITY_DOMAIN support, I already have some patches for that, but
> >   didn’t want to complicate this series, I can send them separately.
> 
> This seems kind of pointless to me. If you can tolerate identity (ie
> pin all memory) then do nested, and maybe don't even bother with a
> guest iommu.

As mentioned, the choice of para-virt was not only to avoid pinning.
As this is the host, for an IDENTITY_DOMAIN we either share the page
table, in which case we have to deal with lazy mapping (SMMU features,
BBM...), or mirror the table in a shadow SMMU-only identity page table.

> 
> If you want most of the guest memory to be swappable/movable/whatever
> then paravirt is the only choice, and you really don't want the guest
> to have any identiy support at all.
> 
> Really, I think you'd want to have both options, there is no "best"
> here. It depends what people want to use the VM for.
> 
> My advice for merging would be to start with the pkvm side setting up
> a fully pinned S2 and do not have a guest driver. Nesting without
> emulating smmuv3. Basically you get protected identity DMA support. I
> think that would be a much less sprawling patch series. From there it
> would be well positioned to add both smmuv3 emulation and a paravirt
> iommu flow.
> 

I am open to any suggestions, but I believe any solution considered for
merge should have enough features to be usable on actual systems (a
translating IOMMU can be used, for example), so either para-virt as in
this series or full nesting as in the PoC above (or maybe both?), which
IMO comes down to the trade-off mentioned above.

Thanks,
Mostafa

> Jason
Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Jason Gunthorpe 11 months, 2 weeks ago
On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> Thanks a lot for taking the time to review this, I tried to reply to all
> points. However I think a main source of confusion was that this is only
> for the host kernel not guests, with this series guests still have no
> access to DMA under pKVM. I hope that clarifies some of the points.

I think I just used different words; I meant the direct guest of pkvm,
including what you are calling the host kernel.

> > The cover letter doesn't explain why someone needs page tables in the
> > guest at all?
> 
> This is not for guests but for the host, the hypervisor needs to
> establish DMA isolation between the host and the hypervisor/guests.

Why isn't this done directly in pkvm by setting up IOMMU tables that
identity map the host/guest's CPU mapping? Why does the host kernel or
guest kernel need to have page tables?

> However, guest DMA support is optional and only needed for device
> passthrough, 

Why? The CC cases are having the pkvm layer control the translation,
so when the host spawns a guest the pkvm will setup a contained IOMMU
translation for that guest as well.

Don't you also want to protect the guests from the host in this model?

> We can do that for the host also, which is discussed in the v1 cover
> letter. However, we try to keep feature parity with the normal (VHE)
> KVM arm64 support, so constraining KVM support to not have IOVA spaces
> for devices seems too much and impractical on modern systems (phones for
> example).

But why? Do you have current use cases on phone where you need to have
device-specific iommu_domains? What are they? Answering this goes a
long way to understanding the real performance of a para virt approach.

> There is no hacking for the arm-smmu-v3 driver, but mostly splitting
> the driver so it can be re-used + introduction for a separate
> hypervisor

I understood splitting some of it so you could share code with the
pkvm side, but I don't see that it should be connected to the
host/guest driver. Surely that should be a generic pkvm-iommu driver
that is arch neutral, like virtio-iommu.

> With pKVM, the host kernel is not trusted, and if compromised it can
> instrument such attacks to corrupt hypervisor memory, so the hypervisor
> would lock io-pgtable-arm operations in EL2 to avoid that.

io-pgtable-arm has a particular set of locking assumptions that the
caller has to follow. When pkvm converts the hypercalls for the
para-virtualization into io-pgtable-arm calls it has to also ensure it
follows io-pgtable-arm's locking model if it is going to use that as
its code base. This has nothing to do with the guest or trust, it is
just implementing concurrency correctly in pkvm.

> Yeah, SVA is tricky, I guess for that we would have to use nesting,
> but tbh, I don’t think it’s a deal breaker for now.

Again, it depends what your actual use case for translation is inside
the host/guest environments. It would be good to clearly spell this out.
There are a few drivers that directly manipulate the iommu_domains of a
device: a few GPUs, ath1x wireless, some Tegra stuff, "venus". Which
of those are you targeting?

> > Lots of people have now done this, it is not really so bad. In
> > exchange you get a full architected feature set, better performance,
> > and are ready for HW optimizations.
> 
> It’s not impossible, it’s just more complicated doing it in the
> hypervisor which has limited features compared to the kernel + I haven’t
> seen any open source implementation for that except for Qemu which is in
> userspace.

People are doing it in their CC stuff, which is about the same as
pkvm. I'm not sure if it will be open source, I hope so since it needs
security auditing..

> > > - Add IDENTITY_DOMAIN support, I already have some patches for that, but
> > >   didn’t want to complicate this series, I can send them separately.
> > 
> > This seems kind of pointless to me. If you can tolerate identity (ie
> > pin all memory) then do nested, and maybe don't even bother with a
> > guest iommu.
> 
> As mentioned, the choice for para-virt was not only to avoid pinning,
> as this is the host, for IDENTITY_DOMAIN we either share the page table,
> then we have to deal with lazy mapping (SMMU features, BBM...) or mirror
> the table in a shadow SMMU only identity page table.

AFAIK you always have to mirror unless you significantly change how
the KVM S1 page table stuff is working. The CC people have made those
changes and won't mirror, so it is doable..

> > My advice for merging would be to start with the pkvm side setting up
> > a fully pinned S2 and do not have a guest driver. Nesting without
> > emulating smmuv3. Basically you get protected identity DMA support. I
> > think that would be a much less sprawling patch series. From there it
> > would be well positioned to add both smmuv3 emulation and a paravirt
> > iommu flow.
> 
> I am open to any suggestions, but I believe any solution considered for
> merge, should have enough features to be usable on actual systems (translating
> IOMMU can be used for example) so either para-virt as this series or full
> nesting as the PoC above (or maybe both?), which IMO comes down to the
> trade-off mentioned above.

IMHO no, you can have a completely usable solution without host/guest
controlled translation. This is equivalent to a bare metal system with
no IOMMU HW. This exists and is still broadly useful. The majority of
cloud VMs out there are in this configuration.

That is the simplest/smallest thing to start with. Adding host/guest
controlled translation is a build-on-top exercise that seems to have
a lot of options and people may end up wanting to do all of them.

I don't think you need to show that host/guest controlled translation
is possible to make progress, of course it is possible. Just getting
to the point where pkvm can own the SMMU HW and provide DMA isolation
between all of it's direct host/guest is a good step.

Jason
Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Mostafa Saleh 11 months, 1 week ago
On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > Thanks a lot for taking the time to review this, I tried to reply to all
> > points. However I think a main source of confusion was that this is only
> > for the host kernel not guests, with this series guests still have no
> > access to DMA under pKVM. I hope that clarifies some of the points.
> 
> I think I just used different words, I ment the direct guest of pvkm,
> including what you are calling the host kernel.
> 

KVM treats the host and guests very differently, so I think the
distinction between the two is important in this context, as this driver
is for the host only; guests are another story.

> > > The cover letter doesn't explain why someone needs page tables in the
> > > guest at all?
> > 
> > This is not for guests but for the host, the hypervisor needs to
> > establish DMA isolation between the host and the hypervisor/guests.
> 
> Why isn't this done directly in pkvm by setting up IOMMU tables that
> identity map the host/guest's CPU mapping? Why does the host kernel or
> guest kernel need to have page tables?
> 

If we set up identity tables, that either means there is no translation
capability for the guest (or the host here), or that nesting should be
used, which is discussed later in this cover letter.

> > However, guest DMA support is optional and only needed for device
> > passthrough, 
> 
> Why? The CC cases are having the pkvm layer control the translation,
> so when the host spawns a guest the pkvm will setup a contained IOMMU
> translation for that guest as well.
> 
> Don't you also want to protect the guests from the host in this model?
> 

We do protect the guests from the host: in the proposed approach, by
preventing guest (or hypervisor) memory from being mapped in the IOMMU,
and by refusing to donate memory that is currently mapped in the IOMMU.
However, at the moment pKVM doesn’t support device passthrough, so guests
don't need IOMMU page tables as they can’t use any device or issue DMA
directly.
I have some patches to support device passthrough in guests + guest
IOMMU page tables, which are not part of this series. As mentioned, host
DMA isolation is critical for the pKVM model, while guest device
passthrough is an optional feature (but we plan to upstream that later).

> > We can do that for the host also, which is discussed in the v1 cover
> > letter. However, we try to keep feature parity with the normal (VHE)
> > KVM arm64 support, so constraining KVM support to not have IOVA spaces
> > for devices seems too much and impractical on modern systems (phones for
> > example).
> 
> But why? Do you have current use cases on phone where you need to have
> device-specific iommu_domains? What are they? Answering this goes a
> long way to understanding the real performance of a para virt approach.
> 

I don’t think having one domain for all devices fits most cases: SoCs
can have heterogeneous SMMUs, different address sizes, coherency...
There is also the basic idea of isolation between devices, where some can
be controlled from userspace or influenced by external entities and
should be isolated (we wouldn't want USB/network devices to have access
to other devices' memory, for example).
Another example would be accelerators, which only operate on contiguous
memory; having such large physically contiguous buffers on phones is
almost impossible.

I don’t think having a single domain is practical (nor does it help in
this case).

> > There is no hacking for the arm-smmu-v3 driver, but mostly splitting
> > the driver so it can be re-used + introduction for a separate
> > hypervisor
> 
> I understood splitting some of it so you could share code with the
> pkvm side, but I don't see that it should be connected to the
> host/guest driver. Surely that should be a generic pkvm-iommu driver
> that is arch neutral, like virtio-iommu.
> 

The host driver follows the KVM (nvhe/hvhe) model, where at boot the
kernel (EL1) does a lot of the initialization and then it becomes untrusted
and the hypervisor manages everything after.

Similarly, the driver first probes in EL1 and does many of the
complicated things that are not supported at the hypervisor (EL2), such
as parsing firmware tables. It ends up populating a simplified
description of the SMMU topology.
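
For illustration only (the field names here are assumptions, not the
actual include/kvm/arm_smmu_v3.h layout), that description is roughly an
array of small per-SMMU records that EL2 can consume without any firmware
parsing:

/* Illustrative sketch of what the EL1 probe could hand to EL2 per SMMU. */
struct hyp_smmu_desc_example {
	phys_addr_t	mmio_addr;	/* register frame, to be mapped at EL2 */
	size_t		mmio_size;
	unsigned long	features;	/* validated IDR bits (stages, coherency, ...) */
	unsigned int	sid_bits;	/* stream table sizing */
	/* power domain and IRQ details are also recorded by the kernel driver */
};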

The KVM <-> SMMU interface is then not arch specific; you can check that
in hyp-main.c or nvhe/iommu.c, where there is no reference to the SMMU
and all hypercalls are abstracted so other IOMMUs can be supported under
pKVM (that’s the case in Android).

Maybe the driver at EL1 can also be further split into a standard part
for the hypercall interface and an init part which is SMMUv3 specific,
but I’d rather not complicate things until we have other users upstream.

For guest VMs (not part of this series), the interface and the kernel driver
are completely arch agnostic, similarly to virtio-iommu.

> > With pKVM, the host kernel is not trusted, and if compromised it can
> > instrument such attacks to corrupt hypervisor memory, so the hypervisor
> > would lock io-pgtable-arm operations in EL2 to avoid that.
> 
> io-pgtable-arm has a particular set of locking assumptions, the caller
> has to follow it. When pkvm converts the hypercalls for the
> para-virtualization into io-pgtable-arm calls it has to also ensure it
> follows io-pgtable-arm's locking model if it is going to use that as
> its code base. This has nothing to do with the guest or trust, it is
> just implementing concurrency correctly in pkvm..
> 

AFAICT, io-pgtable-arm has a set of assumptions about how it's called;
that’s why it’s lockless, as the DMA API follows those assumptions.
For example, you can’t unmap a table and an entry inside that table
concurrently; that can lead to UAF/memory corruption, and this never
happens at the moment as the kernel has no bugs :)

However, pKVM always assumes that the kernel can be malicious, so a bad
kernel can issue calls that break those assumptions, leading to
UAF/memory corruption inside the hypervisor. That is not acceptable, so
the solution is to use a lock to prevent such issues from concurrent
requests.

> > Yeah, SVA is tricky, I guess for that we would have to use nesting,
> > but tbh, I don’t think it’s a deal breaker for now.
> 
> Again, it depends what your actual use case for translation is inside
> the host/guest environments. It would be good to clearly spell this out..
> There are few drivers that directly manpulate the iommu_domains of a
> device. a few gpus, ath1x wireless, some tegra stuff, "venus". Which
> of those are you targetting?
> 

Not sure I understand this point about manipulating domains.
AFAIK, SVA is not that common, including in the mobile space, but I can
be wrong; that’s why it’s not a priority here.

> > > Lots of people have now done this, it is not really so bad. In
> > > exchange you get a full architected feature set, better performance,
> > > and are ready for HW optimizations.
> > 
> > It’s not impossible, it’s just more complicated doing it in the
> > hypervisor which has limited features compared to the kernel + I haven’t
> > seen any open source implementation for that except for Qemu which is in
> > userspace.
> 
> People are doing it in their CC stuff, which is about the same as
> pkvm. I'm not sure if it will be open source, I hope so since it needs
> security auditing..
> 

Yes, as mentioned later I also have a WIP implementation for KVM (which is
open source[1] :)) that I plan to send to the list (maybe in 3-4 months when
ready) as an alternative approach.

> > > > - Add IDENTITY_DOMAIN support, I already have some patches for that, but
> > > >   didn’t want to complicate this series, I can send them separately.
> > > 
> > > This seems kind of pointless to me. If you can tolerate identity (ie
> > > pin all memory) then do nested, and maybe don't even bother with a
> > > guest iommu.
> > 
> > As mentioned, the choice for para-virt was not only to avoid pinning,
> > as this is the host, for IDENTITY_DOMAIN we either share the page table,
> > then we have to deal with lazy mapping (SMMU features, BBM...) or mirror
> > the table in a shadow SMMU only identity page table.
> 
> AFAIK you always have to mirror unless you significantly change how
> the KVM S1 page table stuff is working. The CC people have made those
> changes and won't mirror, so it is doable..
> 

Yes, I agree, AFAIK, the current KVM pgtable code is not ready for shared
page tables with the IOMMU.

> > > My advice for merging would be to start with the pkvm side setting up
> > > a fully pinned S2 and do not have a guest driver. Nesting without
> > > emulating smmuv3. Basically you get protected identity DMA support. I
> > > think that would be a much less sprawling patch series. From there it
> > > would be well positioned to add both smmuv3 emulation and a paravirt
> > > iommu flow.
> > 
> > I am open to any suggestions, but I believe any solution considered for
> > merge, should have enough features to be usable on actual systems (translating
> > IOMMU can be used for example) so either para-virt as this series or full
> > nesting as the PoC above (or maybe both?), which IMO comes down to the
> > trade-off mentioned above.
> 
> IMHO no, you can have a completely usable solution without host/guest
> controlled translation. This is equivilant to a bare metal system with
> no IOMMU HW. This exists and is still broadly useful. The majority of
> cloud VMs out there are in this configuration.
> 
> That is the simplest/smallest thing to start with. Adding host/guest
> controlled translation is a build-on-top excercise that seems to have
> a lot of options and people may end up wanting to do all of them.
> 
> I don't think you need to show that host/guest controlled translation
> is possible to make progress, of course it is possible. Just getting
> to the point where pkvm can own the SMMU HW and provide DMA isolation
> between all of it's direct host/guest is a good step.

My plan was basically:
1) Finish and send nested SMMUv3 as RFC, with more insights about
   performance and complexity trade-offs of both approaches.

2) Discuss next steps for the upstream solution in an upcoming conference
   (like LPC or earlier if possible) and work on upstreaming it.

3) Work on guest device passthrough and IOMMU support.

I am open to gradually upstreaming this as you mentioned, where as a
first step pKVM would establish DMA isolation without translation for the
host; that should be enough to have a functional pKVM and run protected
workloads.

But although that might be usable on some systems, I don’t think that’s
practical in the long term, as it limits the amount of HW that can run
pKVM.

[1] https://android-kvm.googlesource.com/linux/+/refs/heads/smostafa/android15-6.6-smmu-nesting-wip

Thanks,
Mostafa

> 
> Jason
RE: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Tian, Kevin 11 months ago
> From: Mostafa Saleh <smostafa@google.com>
> Sent: Wednesday, January 8, 2025 8:10 PM
> 
> On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > I am open to any suggestions, but I believe any solution considered for
> > > merge, should have enough features to be usable on actual systems
> (translating
> > > IOMMU can be used for example) so either para-virt as this series or full
> > > nesting as the PoC above (or maybe both?), which IMO comes down to
> the
> > > trade-off mentioned above.
> >
> > IMHO no, you can have a completely usable solution without host/guest
> > controlled translation. This is equivilant to a bare metal system with
> > no IOMMU HW. This exists and is still broadly useful. The majority of
> > cloud VMs out there are in this configuration.
> >
> > That is the simplest/smallest thing to start with. Adding host/guest
> > controlled translation is a build-on-top excercise that seems to have
> > a lot of options and people may end up wanting to do all of them.
> >
> > I don't think you need to show that host/guest controlled translation
> > is possible to make progress, of course it is possible. Just getting
> > to the point where pkvm can own the SMMU HW and provide DMA
> isolation
> > between all of it's direct host/guest is a good step.
> 
> My plan was basically:
> 1) Finish and send nested SMMUv3 as RFC, with more insights about
>    performance and complexity trade-offs of both approaches.
> 
> 2) Discuss next steps for the upstream solution in an upcoming conference
>    (like LPC or earlier if possible) and work on upstreaming it.
> 
> 3) Work on guest device passthrough and IOMMU support.
> 
> I am open to gradually upstream this as you mentioned where as a first
> step pKVM would establish DMA isolation without translation for host,
> that should be enough to have functional pKVM and run protected
> workloads.

Does that approach assume starting from a full-fledged SMMU driver 
inside pKVM or do we still expect the host to enumerate/initialize
the hw (but skip any translation) so the pKVM part can focus only
on managing translation?

I'm curious about the burden of maintaining another IOMMU
subsystem under the KVM directory. It's not built into the host kernel
image, but hosted in the same kernel repo. This series tried to
reduce the duplication via io-pgtable-arm but still considerable 
duplication exists (~2000 LOC in pKVM). That would be very confusing
moving forward and hard to maintain, e.g. ensuring bugs are fixed on
both sides.

The CPU side is a different story. IIUC, KVM-ARM has been a split driver
model from day one for nVHE. This is kept even for VHE, with the only
difference being the use of a hypercall vs a direct function call. pKVM
is added incrementally on top of nVHE, hence it's natural to maintain the
pKVM logic in the kernel repo. No duplication.

But there is no such thing on the IOMMU side. Probably we'd want to
try reusing the entire IOMMU sub-system in pKVM if it's agreed
to use full-fledged drivers in pKVM. Or, if continuing the split-driver
model, should we try splitting the existing drivers into two parts, then
connecting the two together via function call natively and via hypercall
in pKVM (similar to how KVM-ARM does it)?

Thanks
Kevin
Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Mostafa Saleh 10 months, 3 weeks ago
Hi Kevin,

On Thu, Jan 16, 2025 at 08:51:11AM +0000, Tian, Kevin wrote:
> > From: Mostafa Saleh <smostafa@google.com>
> > Sent: Wednesday, January 8, 2025 8:10 PM
> > 
> > On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > > I am open to any suggestions, but I believe any solution considered for
> > > > merge, should have enough features to be usable on actual systems
> > (translating
> > > > IOMMU can be used for example) so either para-virt as this series or full
> > > > nesting as the PoC above (or maybe both?), which IMO comes down to
> > the
> > > > trade-off mentioned above.
> > >
> > > IMHO no, you can have a completely usable solution without host/guest
> > > controlled translation. This is equivilant to a bare metal system with
> > > no IOMMU HW. This exists and is still broadly useful. The majority of
> > > cloud VMs out there are in this configuration.
> > >
> > > That is the simplest/smallest thing to start with. Adding host/guest
> > > controlled translation is a build-on-top excercise that seems to have
> > > a lot of options and people may end up wanting to do all of them.
> > >
> > > I don't think you need to show that host/guest controlled translation
> > > is possible to make progress, of course it is possible. Just getting
> > > to the point where pkvm can own the SMMU HW and provide DMA
> > isolation
> > > between all of it's direct host/guest is a good step.
> > 
> > My plan was basically:
> > 1) Finish and send nested SMMUv3 as RFC, with more insights about
> >    performance and complexity trade-offs of both approaches.
> > 
> > 2) Discuss next steps for the upstream solution in an upcoming conference
> >    (like LPC or earlier if possible) and work on upstreaming it.
> > 
> > 3) Work on guest device passthrough and IOMMU support.
> > 
> > I am open to gradually upstream this as you mentioned where as a first
> > step pKVM would establish DMA isolation without translation for host,
> > that should be enough to have functional pKVM and run protected
> > workloads.
> 
> Does that approach assume starting from a full-fledged SMMU driver 
> inside pKVM or do we still expect the host to enumerate/initialize
> the hw (but skip any translation) so the pKVM part can focus only
> on managing translation?

I have been thinking about this, and I think most of the initialization
won't change: we would do any initialization we can in the kernel,
avoiding complexity in the hypervisor (parsing device-tree/ACPI...).
That also makes code re-use easier if both drivers do it in kernel
space.

> 
> I'm curious about the burden of maintaining another IOMMU
> subsystem under the KVM directory. It's not built into the host kernel
> image, but hosted in the same kernel repo. This series tried to
> reduce the duplication via io-pgtable-arm but still considerable 
> duplication exists (~2000LOC in pKVM). The would be very confusing
> moving forward and hard to maintain e.g. ensure bugs fixed in
> both sides.

The KVM IOMMU subsystem is very different from the kernel one; it's
about paravirtualization and abstraction. I tried my best to make sure
all possible code can be re-used, by splitting out arm-smmu-v3-common.c
and io-pgtable-arm-common.c and even re-using iommu_iotlb_gather from
the iommu code.
So my guess is there won't be much of that effort, as there is no
duplication in logic.
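
For context, this is the gather/sync pattern of the regular kernel API;
the series only re-uses the struct on the hypervisor side, so the
snippet below is an illustration, not code from the series:

#include <linux/iommu.h>

/* Unmap a range and batch the TLB invalidation into one sync. */
static size_t unmap_batched(struct iommu_domain *domain,
			    unsigned long iova, size_t size)
{
	struct iommu_iotlb_gather gather;
	size_t unmapped;

	iommu_iotlb_gather_init(&gather);
	/* Tears down the mappings, only records the range to invalidate */
	unmapped = iommu_unmap_fast(domain, iova, size, &gather);
	/* One invalidation covering the whole gathered range */
	iommu_iotlb_sync(domain, &gather);

	return unmapped;
}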

I am still thinking about how v3 will look, but as mentioned I am
inclined towards Jason's suggestion to reduce the series and remove the
paravirtualization parts, only establishing DMA isolation as a starting
point. That will remove a lot of code from the KVM IOMMU layer for now,
but we'd need to address it later.
And we can build on top of this code either a para-virtual approach or a
nested-emulation one.

> 
> The CPU side is a different story. iiuc KVM-ARM is a split driver model 
> from day one for nVHE. It's kept even for VHE with difference only
> on using hypercall vs using direct function call. pKVM is added 
> incrementally on top of nVHE hence it's natural to maintain the
>  pKVM logic in the kernel repo. No duplication.
> 
> But there is no such thing in the IOMMU side. Probably we'd want to
> try reusing the entire IOMMU sub-system in pKVM if it's agreed
> to use full-fledged drivers in pKVM. Or if continuing the split-driver
> model should we try splitting the existing drivers into two parts then
> connecting two together via function call on native and via hypercall
> in pKVM (similar to how KVM-ARM does)?

For the KVM IOMMU code, it's quite different from the kernel one and
serves different purposes, so there is no logic duplication there.
The idea of using hypercalls vs. function calls in some places for
nVHE/VHE doesn't really translate here, as the driver is already
abstracted by iommu_ops, unlike KVM which has one code base for
everything. As I mentioned in another reply, we can standardize the
hypercall part of the kernel driver into an IOMMU-agnostic file (as
virtio-iommu does), and the KVM SMMUv3 kernel driver would only be
responsible for initialization; that should be the closest to the split
model in nVHE.
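
For illustration only, a rough sketch of what that split could look
like; all names below (the hypercall and the helper) are placeholders,
not the interface used in this series:

#include <linux/types.h>
#include <asm/kvm_host.h>	/* kvm_call_hyp_nvhe() */

/*
 * IOMMU-agnostic kernel-side helper: forwards map requests to the
 * hypervisor, so vendor drivers (e.g. the KVM SMMUv3 kernel driver)
 * only have to handle probing and initialization.
 */
static int kvm_iommu_map(unsigned long domain_id, unsigned long iova,
			 phys_addr_t paddr, size_t size, int prot)
{
	/* __pkvm_host_iommu_map is a made-up hypercall name */
	return kvm_call_hyp_nvhe(__pkvm_host_iommu_map, domain_id, iova,
				 paddr, size, prot);
}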

Also, pKVM has some different code paths in the kernel; for example, it
has a different mem abort handler and different initialization (pkvm.c).

Thanks,
Mostafa

> 
> Thanks
> Kevin
RE: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Tian, Kevin 10 months, 3 weeks ago
> From: Mostafa Saleh <smostafa@google.com>
> Sent: Wednesday, January 22, 2025 7:29 PM
> 
> Hi Kevin,
> 
> On Thu, Jan 16, 2025 at 08:51:11AM +0000, Tian, Kevin wrote:
> > > From: Mostafa Saleh <smostafa@google.com>
> > > Sent: Wednesday, January 8, 2025 8:10 PM
> > >
> > > My plan was basically:
> > > 1) Finish and send nested SMMUv3 as RFC, with more insights about
> > >    performance and complexity trade-offs of both approaches.
> > >
> > > 2) Discuss next steps for the upstream solution in an upcoming
> conference
> > >    (like LPC or earlier if possible) and work on upstreaming it.
> > >
> > > 3) Work on guest device passthrough and IOMMU support.
> > >
> > > I am open to gradually upstream this as you mentioned where as a first
> > > step pKVM would establish DMA isolation without translation for host,
> > > that should be enough to have functional pKVM and run protected
> > > workloads.
> >
> > Does that approach assume starting from a full-fledged SMMU driver
> > inside pKVM or do we still expect the host to enumerate/initialize
> > the hw (but skip any translation) so the pKVM part can focus only
> > on managing translation?
> 
> I have been thinking about this, and I think most of the initialization
> won’t be changed, and we would do any possible initialization in the
> kernel avoiding complexity in the hypervisor (parsing
> device-tree/acpi...) also that makes code re-use easier if both drivers
> do that in the kernel space.

yeah that'd make sense for now. 

> 
> >
> > I'm curious about the burden of maintaining another IOMMU
> > subsystem under the KVM directory. It's not built into the host kernel
> > image, but hosted in the same kernel repo. This series tried to
> > reduce the duplication via io-pgtable-arm but still considerable
> > duplication exists (~2000LOC in pKVM). The would be very confusing
> > moving forward and hard to maintain e.g. ensure bugs fixed in
> > both sides.
> 
> KVM IOMMU subsystem is very different from the one kernel, it’s about
> paravirtualtion and abstraction, I tried my best to make sure all
> possible code can be re-used by splitting arm-smmu-v3-common.c and
> io-pgtable-arm-common.c and even re-using iommu_iotlb_gather from the
> iommu code.
> So my guess, there won't be much of that effort as there is no
> duplication in logic.

I'm not sure how different it is. In concept it still manages iommu
mappings, just with additional restrictions. Bear with me, as I haven't
looked into the details of the ~2000 LOC pKVM SMMU driver, but the size
does scare me, especially considering the case where other vendors are
supported later.

Let's keep it in mind and re-check after you have v3. Since it will be
simpler, I suppose the actual difference between a pKVM IOMMU driver
and a normal kernel IOMMU driver can be judged more easily than now.

The learning here would be beneficial to the design of other pKVM
components, e.g. when porting pKVM to x86. Currently KVM x86 is
monolithic. Maintaining pKVM under KVM/x86 would be a much
bigger challenge than doing it under KVM/arm. There will also be
questions about what can be shared and how to better maintain
the pKVM-specific logic in KVM/x86.

Overall my gut-feeling is that the pKVM specific code must be small
enough otherwise maintaining a run-time irrelevant project in the
kernel repo would be questionable. 😊

Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Mostafa Saleh 10 months, 2 weeks ago
On Thu, Jan 23, 2025 at 08:25:13AM +0000, Tian, Kevin wrote:
> > From: Mostafa Saleh <smostafa@google.com>
> > Sent: Wednesday, January 22, 2025 7:29 PM
> > 
> > Hi Kevin,
> > 
> > On Thu, Jan 16, 2025 at 08:51:11AM +0000, Tian, Kevin wrote:
> > > > From: Mostafa Saleh <smostafa@google.com>
> > > > Sent: Wednesday, January 8, 2025 8:10 PM
> > > >
> > > > My plan was basically:
> > > > 1) Finish and send nested SMMUv3 as RFC, with more insights about
> > > >    performance and complexity trade-offs of both approaches.
> > > >
> > > > 2) Discuss next steps for the upstream solution in an upcoming
> > conference
> > > >    (like LPC or earlier if possible) and work on upstreaming it.
> > > >
> > > > 3) Work on guest device passthrough and IOMMU support.
> > > >
> > > > I am open to gradually upstream this as you mentioned where as a first
> > > > step pKVM would establish DMA isolation without translation for host,
> > > > that should be enough to have functional pKVM and run protected
> > > > workloads.
> > >
> > > Does that approach assume starting from a full-fledged SMMU driver
> > > inside pKVM or do we still expect the host to enumerate/initialize
> > > the hw (but skip any translation) so the pKVM part can focus only
> > > on managing translation?
> > 
> > I have been thinking about this, and I think most of the initialization
> > won’t be changed, and we would do any possible initialization in the
> > kernel avoiding complexity in the hypervisor (parsing
> > device-tree/acpi...) also that makes code re-use easier if both drivers
> > do that in the kernel space.
> 
> yeah that'd make sense for now. 
> 
> > 
> > >
> > > I'm curious about the burden of maintaining another IOMMU
> > > subsystem under the KVM directory. It's not built into the host kernel
> > > image, but hosted in the same kernel repo. This series tried to
> > > reduce the duplication via io-pgtable-arm but still considerable
> > > duplication exists (~2000LOC in pKVM). The would be very confusing
> > > moving forward and hard to maintain e.g. ensure bugs fixed in
> > > both sides.
> > 
> > KVM IOMMU subsystem is very different from the one kernel, it’s about
> > paravirtualtion and abstraction, I tried my best to make sure all
> > possible code can be re-used by splitting arm-smmu-v3-common.c and
> > io-pgtable-arm-common.c and even re-using iommu_iotlb_gather from the
> > iommu code.
> > So my guess, there won't be much of that effort as there is no
> > duplication in logic.
> 
> I'm not sure how different it is. In concept it still manages iommu
> mappings, just with additional restrictions. Bear me that I haven't
> looked into the detail of the 2000LOC driver in pKVM smmu driver. 
> but the size does scare me, especially considering the case when
> other vendors are supported later.
> 
> Let's keep it in mind and re-check after you have v3. It's simpler hence
> suppose the actual difference between a pKVM iommu driver and
> a normal kernel IOMMU driver can be judged more easily than now.

I see. I believe we can reduce the size by re-using more data-structure
types, plus more refactoring on the kernel side.

Also, we can standardize many parts of the code outside the driver,
such as issuing hypercalls and dealing with memory allocation, so other
IOMMUs will only need to add minimal code.

> 
> The learning here would be beneficial to the design in other pKVM
> components, e.g. when porting pKVM to x86. Currently KVM x86 is 
> monothetic. Maintaining pKVM under KVM/x86 would be a much
> bigger challenge than doing it under KVM/arm. There will also be
> question about what can be shared and how to better maintain
> the pKVM specific logic in KVM/x86.
> 
> Overall my gut-feeling is that the pKVM specific code must be small
> enough otherwise maintaining a run-time irrelevant project in the
> kernel repo would be questionable. 😊
> 

I am not sure I understand, but I don't see how pKVM is irrelevant;
it's a mode in KVM (just like nVHE/hVHE, which run in two exception
levels) and can't be separated from the kernel as that defeats the
point of KVM, that means that all hypercalls have to be stable ABI,
same for the shared data, shared structs, types...

Thanks,
Mostafa
RE: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Tian, Kevin 9 months, 4 weeks ago
> From: Mostafa Saleh <smostafa@google.com>
> Sent: Wednesday, January 29, 2025 8:21 PM
> 
> On Thu, Jan 23, 2025 at 08:25:13AM +0000, Tian, Kevin wrote:
> >
> > The learning here would be beneficial to the design in other pKVM
> > components, e.g. when porting pKVM to x86. Currently KVM x86 is
> > monothetic. Maintaining pKVM under KVM/x86 would be a much
> > bigger challenge than doing it under KVM/arm. There will also be
> > question about what can be shared and how to better maintain
> > the pKVM specific logic in KVM/x86.
> >
> > Overall my gut-feeling is that the pKVM specific code must be small
> > enough otherwise maintaining a run-time irrelevant project in the
> > kernel repo would be questionable. 😊
> >
> 
> I am not sure I understand, but I don’t see how pKVM is irrelevant,
> it’s a mode in KVM (just like, nvhe/hvhe where they run in 2 exception
> levels) and can’t be separated from the kernel as that defeats the
> point of KVM, that means that all hypercalls have to be stable ABI,
> same for the shared data, shared structs, types...
> 

Yes, pKVM doesn't favor a stable ABI. My point was more that nVHE is a
hardware limitation, so kvm-arm already coped with it from day one and
adding the concept of pKVM on top was relatively easy, but changing
other subsystems to support this split model just for pKVM adds more
maintenance burden. The maintainers may then challenge the value of
supporting pKVM if the cost of maintaining the split model becomes too
large... Anyway, we will see how it turns out with more discussion on
your next version.
Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Jason Gunthorpe 10 months, 2 weeks ago
On Wed, Jan 29, 2025 at 12:21:01PM +0000, Mostafa Saleh wrote:
> levels) and can’t be separated from the kernel as that defeats the
> point of KVM, that means that all hypercalls have to be stable ABI,
> same for the shared data, shared structs, types...

Sorry, just trying to understand this sentence: today pKVM has no
stable ABI, right? That is the whole point of building it into the
kernel?

Things like the CC world are creating stable ABIs for their pKVM-like
environments because they are not built into the kernel? And thus they
take the pain of that?

Jason
Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Mostafa Saleh 10 months, 2 weeks ago
On Wed, Jan 29, 2025 at 09:50:53AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 12:21:01PM +0000, Mostafa Saleh wrote:
> > levels) and can’t be separated from the kernel as that defeats the
> > point of KVM, that means that all hypercalls have to be stable ABI,
> > same for the shared data, shared structs, types...
> 
> Sorry, just trying to understand this sentance, today pkvm has no
> stable ABI right? That is the whole point of building it into the
> kernel?

Yes.

> 
> Things like the CC world are creating stable ABIs for their pkvm like
> environments because they are not built into the kernel? And thus they
> take the pain of that?

Yes, my point is that we can't just separate pKVM as Kevin was
suggesting, as it has no stable ABI and is tightly coupled with the
kernel.


Thanks,
Mostafa

> 
> Jason
Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Jason Gunthorpe 11 months ago
On Wed, Jan 08, 2025 at 12:09:53PM +0000, Mostafa Saleh wrote:

> I am open to gradually upstream this as you mentioned where as a first
> step pKVM would establish DMA isolation without translation for host,
> that should be enough to have functional pKVM and run protected workloads.

Personally I hate these giant patch series, you should strip it down
to small meaningful steps and try to stay below 20 per series.

I think getting pkvm to own the SMMU HW is a great first step that
everything else can build on

> But although that might be usable on some systems, I don’t think that’s
> practical in the long term as it limits the amount of HW that can run pKVM.

I suspect you will end up doing everything. Old HW needs paravirt, new
HW will want nesting and its performance. Users other than mobile will
come. If we were to use pKVM on server workloads we need nesting for
performance.

Jason
Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Mostafa Saleh 10 months, 3 weeks ago
On Thu, Jan 16, 2025 at 03:19:52PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 08, 2025 at 12:09:53PM +0000, Mostafa Saleh wrote:
> 
> > I am open to gradually upstream this as you mentioned where as a first
> > step pKVM would establish DMA isolation without translation for host,
> > that should be enough to have functional pKVM and run protected workloads.
> 
> Personally I hate these giant patch series, you should strip it down
> to small meaningful steps and try to stay below 20 per series.
> 
> I think getting pkvm to own the SMMU HW is a great first step that
> everything else can build on

I plan to do that for v3; I think it also removes the out-of-tree
dependencies, so the code applies directly on top of upstream.
Thanks for the feedback!

> 
> > But although that might be usable on some systems, I don’t think that’s
> > practical in the long term as it limits the amount of HW that can run pKVM.
> 
> I suspect you will end up doing everything. Old HW needs paravirt, new
> HW will want nesting and its performance. Users other than mobile will
> come. If we were to use pKVM on server workloads we need nesting for
> performance.

Yes, I guess that would be the case. As I mentioned in another reply,
it would be interesting to get the order of magnitude for both, which I
am looking into; I hope it'd help decide which direction we should
prioritize upstream.

Thanks,
Mostafa

> 
> Jason
RE: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Tian, Kevin 11 months ago
> From: Mostafa Saleh <smostafa@google.com>
> Sent: Wednesday, January 8, 2025 8:10 PM
> 
> On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > Yeah, SVA is tricky, I guess for that we would have to use nesting,
> > > but tbh, I don’t think it’s a deal breaker for now.
> >
> > Again, it depends what your actual use case for translation is inside
> > the host/guest environments. It would be good to clearly spell this out..
> > There are few drivers that directly manpulate the iommu_domains of a
> > device. a few gpus, ath1x wireless, some tegra stuff, "venus". Which
> > of those are you targetting?
> >
> 
> Not sure I understand this point about manipulating domains.
> AFAIK, SVA is not that common, including mobile spaces but I can be wrong,
> that’s why it’s not a priority here.

Nested translation is required beyond SVA. A scenario which requires
a vIOMMU and multiple device domains within the guest would like to
embrace nesting. Especially for ARM vSMMU nesting is a must.

But I'm not sure that I got Jason's point about "there is no way to get
SVA support with para-virtualization." virtio-iommu is a para-virtualized
model and SVA support is in its plan. The main requirement is to pass
the base pointer of the guest CPU page table to the backend and PRI
faults/responses back and forth.
Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Jason Gunthorpe 11 months ago
On Thu, Jan 16, 2025 at 06:39:31AM +0000, Tian, Kevin wrote:
> > From: Mostafa Saleh <smostafa@google.com>
> > Sent: Wednesday, January 8, 2025 8:10 PM
> > 
> > On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > > Yeah, SVA is tricky, I guess for that we would have to use nesting,
> > > > but tbh, I don’t think it’s a deal breaker for now.
> > >
> > > Again, it depends what your actual use case for translation is inside
> > > the host/guest environments. It would be good to clearly spell this out..
> > > There are few drivers that directly manpulate the iommu_domains of a
> > > device. a few gpus, ath1x wireless, some tegra stuff, "venus". Which
> > > of those are you targetting?
> > >
> > 
> > Not sure I understand this point about manipulating domains.
> > AFAIK, SVA is not that common, including mobile spaces but I can be wrong,
> > that’s why it’s not a priority here.
> 
> Nested translation is required beyond SVA. A scenario which requires
> a vIOMMU and multiple device domains within the guest would like to
> embrace nesting. Especially for ARM vSMMU nesting is a must.

Right, if you need an iommu domain in the guest there are only three
mainstream ways to get this in Linux:
 1) Use the DMA API and have the iommu group be translating. This is
    optional in that the DMA API usually supports identity as an option.
 2) A driver directly calls iommu_paging_domain_alloc() and manually
    attaches it to some device, and does not use the DMA API. My list
    above of ath1x/etc are examples doing this
 3) Use VFIO
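
For reference, a minimal sketch of what #2 typically looks like with
the current kernel API (illustrative only; the device, IOVA and size
are made up, this is not code from the series):

#include <linux/err.h>
#include <linux/iommu.h>

static int attach_own_domain(struct device *dev, phys_addr_t pa,
			     size_t size)
{
	struct iommu_domain *dom;
	int ret;

	dom = iommu_paging_domain_alloc(dev);
	if (IS_ERR(dom))
		return PTR_ERR(dom);

	ret = iommu_attach_device(dom, dev);
	if (ret)
		goto out_free;

	/* Map a buffer at the fixed IOVA this device expects */
	ret = iommu_map(dom, 0x10000000, pa, size,
			IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
	if (ret)
		goto out_detach;

	return 0;

out_detach:
	iommu_detach_device(dom, dev);
out_free:
	iommu_domain_free(dom);
	return ret;
}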

My remark to Mostafa is to be specific, which of the above do you want
to do in your mobile guest (and what driver exactly if #2) and why.

This will help inform what the performance profile looks like and
guide if nesting/para virt is appropriate.

> But I'm not sure that I got Jason's point about " there is no way to get
> SVA support with para-virtualization." virtio-iommu is a para-virtualized
> model and SVA support is in its plan. The main requirement is to pass
> the base pointer of the guest CPU page table to backend and PRI faults/
> responses back forth.

That's nesting, you have a full page table under the control of the
guest, and the guest needs to have a level of HW-specific
knowledge. It is just an alternative to using the native nesting
vIOMMU.

What I mean by "para-virtualization" is the guest does map/unmap calls
to the hypervisor and has no page tbale.

Jason
RE: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Tian, Kevin 11 months ago
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Friday, January 17, 2025 3:15 AM
> 
> On Thu, Jan 16, 2025 at 06:39:31AM +0000, Tian, Kevin wrote:
> > > From: Mostafa Saleh <smostafa@google.com>
> > > Sent: Wednesday, January 8, 2025 8:10 PM
> > >
> > > On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > > > Yeah, SVA is tricky, I guess for that we would have to use nesting,
> > > > > but tbh, I don’t think it’s a deal breaker for now.
> > > >
> > > > Again, it depends what your actual use case for translation is inside
> > > > the host/guest environments. It would be good to clearly spell this out..
> > > > There are few drivers that directly manpulate the iommu_domains of a
> > > > device. a few gpus, ath1x wireless, some tegra stuff, "venus". Which
> > > > of those are you targetting?
> > > >
> > >
> > > Not sure I understand this point about manipulating domains.
> > > AFAIK, SVA is not that common, including mobile spaces but I can be
> wrong,
> > > that’s why it’s not a priority here.
> >
> > Nested translation is required beyond SVA. A scenario which requires
> > a vIOMMU and multiple device domains within the guest would like to
> > embrace nesting. Especially for ARM vSMMU nesting is a must.
> 
> Right, if you need an iommu domain in the guest there are only three
> mainstream ways to get this in Linux:
>  1) Use the DMA API and have the iommu group be translating. This is
>     optional in that the DMA API usually supports identity as an option.
>  2) A driver directly calls iommu_paging_domain_alloc() and manually
>     attaches it to some device, and does not use the DMA API. My list
>     above of ath1x/etc are examples doing this
>  3) Use VFIO
> 
> My remark to Mostafa is to be specific, which of the above do you want
> to do in your mobile guest (and what driver exactly if #2) and why.
> 
> This will help inform what the performance profile looks like and
> guide if nesting/para virt is appropriate.

Yeah, that part would be critical to help decide which route to pursue
first, even if all options might be required in the end once pKVM
is scaled to more scenarios; as you mentioned in another mail, a staged
approach would be much preferable.

The pros/cons between nesting and para-virt are clear: the more static
the mapping is, the more the para approach gains, due to less page-table
walking and a smaller TLB footprint, while vice versa nesting performs
much better by avoiding frequent para calls on page table mgmt. 😊

> 
> > But I'm not sure that I got Jason's point about " there is no way to get
> > SVA support with para-virtualization." virtio-iommu is a para-virtualized
> > model and SVA support is in its plan. The main requirement is to pass
> > the base pointer of the guest CPU page table to backend and PRI faults/
> > responses back forth.
> 
> That's nesting, you have a full page table under the control of the
> guest, and the guest needs to have a level of HW-specific
> knowledge. It is just an alternative to using the native nesting
> vIOMMU.
> 
> What I mean by "para-virtualization" is the guest does map/unmap calls
> to the hypervisor and has no page tbale.
> 

Yes, that should never happen for SVA.
Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Mostafa Saleh 10 months, 3 weeks ago
On Fri, Jan 17, 2025 at 06:57:12AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@ziepe.ca>
> > Sent: Friday, January 17, 2025 3:15 AM
> > 
> > On Thu, Jan 16, 2025 at 06:39:31AM +0000, Tian, Kevin wrote:
> > > > From: Mostafa Saleh <smostafa@google.com>
> > > > Sent: Wednesday, January 8, 2025 8:10 PM
> > > >
> > > > On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > > > > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > > > > Yeah, SVA is tricky, I guess for that we would have to use nesting,
> > > > > > but tbh, I don’t think it’s a deal breaker for now.
> > > > >
> > > > > Again, it depends what your actual use case for translation is inside
> > > > > the host/guest environments. It would be good to clearly spell this out..
> > > > > There are few drivers that directly manpulate the iommu_domains of a
> > > > > device. a few gpus, ath1x wireless, some tegra stuff, "venus". Which
> > > > > of those are you targetting?
> > > > >
> > > >
> > > > Not sure I understand this point about manipulating domains.
> > > > AFAIK, SVA is not that common, including mobile spaces but I can be
> > wrong,
> > > > that’s why it’s not a priority here.
> > >
> > > Nested translation is required beyond SVA. A scenario which requires
> > > a vIOMMU and multiple device domains within the guest would like to
> > > embrace nesting. Especially for ARM vSMMU nesting is a must.

We can still do para-virtualization for guests the same way we do for
the host and use a single-stage IOMMU.

> > 
> > Right, if you need an iommu domain in the guest there are only three
> > mainstream ways to get this in Linux:
> >  1) Use the DMA API and have the iommu group be translating. This is
> >     optional in that the DMA API usually supports identity as an option.
> >  2) A driver directly calls iommu_paging_domain_alloc() and manually
> >     attaches it to some device, and does not use the DMA API. My list
> >     above of ath1x/etc are examples doing this
> >  3) Use VFIO
> > 
> > My remark to Mostafa is to be specific, which of the above do you want
> > to do in your mobile guest (and what driver exactly if #2) and why.
> > 
> > This will help inform what the performance profile looks like and
> > guide if nesting/para virt is appropriate.
> 

AFAIK, the most common use cases would be:
- Devices using the DMA API because they require a lot of memory to be
  contiguous in IOVA, which is hard to do with identity mapping
- Devices with security requirements/constraints to be isolated from the
  rest of the system, also using the DMA API
- VFIO is something we are looking into at the moment and have
  prototyped with pKVM, and it should be supported soon in Android
  (only for platform devices for now)

> Yeah that part would be critical to help decide which route to pursue
> first. Even when all options might be required in the end when pKVM
> is scaled to more scenarios, as you mentioned in another mail, a staging
> approach would be much preferrable to evolve.

I agree that would probably be the case. I will work on a more staged
approach for v3, mostly without the pv part as Jason suggested.

> 
> The pros/cons between nesting/para virt is clear - more static the 
> mapping is, more gain from the para approach due to less paging 
> walking and smaller tlb footprint, while vice versa nesting performs
> much better by avoiding frequent para calls on page table mgmt. 😊

I am also working to get the numbers for both cases so we know
the order of magnitude of each, as I guess it won't be as clear
for large systems with many DMA initiators which approach is best.


Thanks,
Mostafa

> 
> > 
> > > But I'm not sure that I got Jason's point about " there is no way to get
> > > SVA support with para-virtualization." virtio-iommu is a para-virtualized
> > > model and SVA support is in its plan. The main requirement is to pass
> > > the base pointer of the guest CPU page table to backend and PRI faults/
> > > responses back forth.
> > 
> > That's nesting, you have a full page table under the control of the
> > guest, and the guest needs to have a level of HW-specific
> > knowledge. It is just an alternative to using the native nesting
> > vIOMMU.
> > 
> > What I mean by "para-virtualization" is the guest does map/unmap calls
> > to the hypervisor and has no page tbale.
> > 
> 
> Yes, that should never happen for SVA.
RE: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Tian, Kevin 10 months, 3 weeks ago
> From: Mostafa Saleh <smostafa@google.com>
> Sent: Wednesday, January 22, 2025 7:04 PM
> 
> On Fri, Jan 17, 2025 at 06:57:12AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@ziepe.ca>
> > > Sent: Friday, January 17, 2025 3:15 AM
> > >
> > > On Thu, Jan 16, 2025 at 06:39:31AM +0000, Tian, Kevin wrote:
> > > > > From: Mostafa Saleh <smostafa@google.com>
> > > > > Sent: Wednesday, January 8, 2025 8:10 PM
> > > > >
> > > > > On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > > > > > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > > > > > Yeah, SVA is tricky, I guess for that we would have to use nesting,
> > > > > > > but tbh, I don’t think it’s a deal breaker for now.
> > > > > >
> > > > > > Again, it depends what your actual use case for translation is inside
> > > > > > the host/guest environments. It would be good to clearly spell this
> out..
> > > > > > There are few drivers that directly manpulate the iommu_domains
> of a
> > > > > > device. a few gpus, ath1x wireless, some tegra stuff, "venus". Which
> > > > > > of those are you targetting?
> > > > > >
> > > > >
> > > > > Not sure I understand this point about manipulating domains.
> > > > > AFAIK, SVA is not that common, including mobile spaces but I can be
> > > wrong,
> > > > > that’s why it’s not a priority here.
> > > >
> > > > Nested translation is required beyond SVA. A scenario which requires
> > > > a vIOMMU and multiple device domains within the guest would like to
> > > > embrace nesting. Especially for ARM vSMMU nesting is a must.
> 
> We can still do para-virtualization for guests the same way we do for the
> host and use a single stage IOMMU.

Same way, but both require a nested setup.

In concept there are two layers of address translations: GVA->GPA via
guest page table, and GPA->HPA via pKVM page table.

The difference between host/guest is just on the GPA mapping. For host
it's 1:1 with additional hardening for which portion can be mapped and
which cannot. For guest it's non-identical with the mapping established
from the host.

A nested translation naturally fits those conceptual layers.

Using a single-stage IOMMU means you need to combine the two layers
into one, i.e. GVA->HPA, by removing GPA. Then you have to paravirt
the guest page table so every guest PTE change is intercepted to
replace GPA with HPA.

Doing so completely kills the benefit of SVA, which is why Jason said
a no-go.
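
Purely as a conceptual sketch (every name below is a placeholder, none
of this is from the series), the combined single-stage paravirt map
path looks like:

#include <linux/errno.h>
#include <linux/types.h>

struct pv_domain;	/* placeholder paravirt domain */

/* Placeholder helpers: resolve/pin GPA via the pKVM stage-2, and
 * install the mapping in the domain's single-stage IOMMU table.
 */
phys_addr_t pv_gpa_to_hpa(struct pv_domain *dom, u64 gpa, size_t size);
int pv_pgtable_map(struct pv_domain *dom, unsigned long giova,
		   phys_addr_t hpa, size_t size, int prot);

static int pv_iommu_map(struct pv_domain *dom, unsigned long giova,
			u64 gpa, size_t size, int prot)
{
	/* Layer 2: GPA -> HPA through the hypervisor's stage-2 */
	phys_addr_t hpa = pv_gpa_to_hpa(dom, gpa, size);

	if (!hpa)	/* sketch: 0 stands for "not mappable" */
		return -EFAULT;
	/* Both layers collapsed: install gIOVA -> HPA directly */
	return pv_pgtable_map(dom, giova, hpa, size, prot);
}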

> 
> > >
> > > Right, if you need an iommu domain in the guest there are only three
> > > mainstream ways to get this in Linux:
> > >  1) Use the DMA API and have the iommu group be translating. This is
> > >     optional in that the DMA API usually supports identity as an option.
> > >  2) A driver directly calls iommu_paging_domain_alloc() and manually
> > >     attaches it to some device, and does not use the DMA API. My list
> > >     above of ath1x/etc are examples doing this
> > >  3) Use VFIO
> > >
> > > My remark to Mostafa is to be specific, which of the above do you want
> > > to do in your mobile guest (and what driver exactly if #2) and why.
> > >
> > > This will help inform what the performance profile looks like and
> > > guide if nesting/para virt is appropriate.
> >
> 
> AFAIK, the most common use cases would be:
> - Devices using DMA API because it requires a lot of memory to be
>   contiguous in IOVA, which is hard to do with identity
> - Devices with security requirements/constraints to be isolated from the
>   rest of the system, also using DMA API
> - VFIO is something we are looking at the moment and have prototyped with
>   pKVM, and it should be supported soon in Android (only for platform
>   devices for now)

what really matters is the frequency of map/unmap.

> 
> > Yeah that part would be critical to help decide which route to pursue
> > first. Even when all options might be required in the end when pKVM
> > is scaled to more scenarios, as you mentioned in another mail, a staging
> > approach would be much preferrable to evolve.
> 
> I agree that would probably be the case. I will work on more staging
> approach for v3, mostly without the pv part as Jason suggested.
> 
> >
> > The pros/cons between nesting/para virt is clear - more static the
> > mapping is, more gain from the para approach due to less paging
> > walking and smaller tlb footprint, while vice versa nesting performs
> > much better by avoiding frequent para calls on page table mgmt. 😊
> 
> I am also working to get the numbers for both cases so we know
> the order of magnitude of each case, as I guess it won't be as clear
> for large systems with many DMA initiators what approach is best.
> 
> 

That'd be great!
Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Mostafa Saleh 10 months, 2 weeks ago
On Thu, Jan 23, 2025 at 08:13:34AM +0000, Tian, Kevin wrote:
> > From: Mostafa Saleh <smostafa@google.com>
> > Sent: Wednesday, January 22, 2025 7:04 PM
> > 
> > On Fri, Jan 17, 2025 at 06:57:12AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@ziepe.ca>
> > > > Sent: Friday, January 17, 2025 3:15 AM
> > > >
> > > > On Thu, Jan 16, 2025 at 06:39:31AM +0000, Tian, Kevin wrote:
> > > > > > From: Mostafa Saleh <smostafa@google.com>
> > > > > > Sent: Wednesday, January 8, 2025 8:10 PM
> > > > > >
> > > > > > On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > > > > > > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > > > > > > Yeah, SVA is tricky, I guess for that we would have to use nesting,
> > > > > > > > but tbh, I don’t think it’s a deal breaker for now.
> > > > > > >
> > > > > > > Again, it depends what your actual use case for translation is inside
> > > > > > > the host/guest environments. It would be good to clearly spell this
> > out..
> > > > > > > There are few drivers that directly manpulate the iommu_domains
> > of a
> > > > > > > device. a few gpus, ath1x wireless, some tegra stuff, "venus". Which
> > > > > > > of those are you targetting?
> > > > > > >
> > > > > >
> > > > > > Not sure I understand this point about manipulating domains.
> > > > > > AFAIK, SVA is not that common, including mobile spaces but I can be
> > > > wrong,
> > > > > > that’s why it’s not a priority here.
> > > > >
> > > > > Nested translation is required beyond SVA. A scenario which requires
> > > > > a vIOMMU and multiple device domains within the guest would like to
> > > > > embrace nesting. Especially for ARM vSMMU nesting is a must.
> > 
> > We can still do para-virtualization for guests the same way we do for the
> > host and use a single stage IOMMU.
> 
> same way but both require a nested setup.
> 
> In concept there are two layers of address translations: GVA->GPA via
> guest page table, and GPA->HPA via pKVM page table.
> 
> The difference between host/guest is just on the GPA mapping. For host
> it's 1:1 with additional hardening for which portion can be mapped and
> which cannot. For guest it's non-identical with the mapping established
> from the host.
> 
> A nested translation naturally fits that conceptual layers.
> 
> Using a single-stage IOMMU means you need to combine two layers
> into one layer i.e. GVA->HPA by removing GPA. Then you have to
> paravirt guest page table so every guest PTE change is intercepted
> to replace GPA with HPA.
> 
> Doing so completely kills the benefit of SVA, which is why Jason said
> a no-go.

I agree, this can't work with SVA; to make it work we would need some
new para-virt operation to install the S1 table, and the hypervisor
would have to configure the device for nested translation.

But guests that don't need SVA can just use single-stage para-virt
(like virtio-iommu).
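
Hypothetical illustration only, nothing like this exists in the series:
an "install the stage-1 table" operation would roughly need to carry
the following information to the hypervisor:

#include <linux/types.h>

/* All fields/names invented for illustration */
struct pv_iommu_attach_s1 {
	u64 domain_id;	/* paravirt domain the device is attached to */
	u64 s1_pgd_gpa;	/* guest PA of the stage-1 page-table root */
	u64 asid;	/* address space ID for TLB maintenance */
	u64 cfg;	/* arch-specific bits, e.g. SMMUv3 CD fields */
};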

> 
> > 
> > > >
> > > > Right, if you need an iommu domain in the guest there are only three
> > > > mainstream ways to get this in Linux:
> > > >  1) Use the DMA API and have the iommu group be translating. This is
> > > >     optional in that the DMA API usually supports identity as an option.
> > > >  2) A driver directly calls iommu_paging_domain_alloc() and manually
> > > >     attaches it to some device, and does not use the DMA API. My list
> > > >     above of ath1x/etc are examples doing this
> > > >  3) Use VFIO
> > > >
> > > > My remark to Mostafa is to be specific, which of the above do you want
> > > > to do in your mobile guest (and what driver exactly if #2) and why.
> > > >
> > > > This will help inform what the performance profile looks like and
> > > > guide if nesting/para virt is appropriate.
> > >
> > 
> > AFAIK, the most common use cases would be:
> > - Devices using DMA API because it requires a lot of memory to be
> >   contiguous in IOVA, which is hard to do with identity
> > - Devices with security requirements/constraints to be isolated from the
> >   rest of the system, also using DMA API
> > - VFIO is something we are looking at the moment and have prototyped with
> >   pKVM, and it should be supported soon in Android (only for platform
> >   devices for now)
> 
> what really matters is the frequency of map/unmap.

Yes, though it differs between devices/systems :/ that's why I reckon we
would need both in the long term. However, starting with some benchmarks
for these cases can help us understand the magnitude of both solutions
and prioritise which one is more suitable to start with upstream.

Thanks,
Mostafa
> 
> > 
> > > Yeah that part would be critical to help decide which route to pursue
> > > first. Even when all options might be required in the end when pKVM
> > > is scaled to more scenarios, as you mentioned in another mail, a staging
> > > approach would be much preferrable to evolve.
> > 
> > I agree that would probably be the case. I will work on more staging
> > approach for v3, mostly without the pv part as Jason suggested.
> > 
> > >
> > > The pros/cons between nesting/para virt is clear - more static the
> > > mapping is, more gain from the para approach due to less paging
> > > walking and smaller tlb footprint, while vice versa nesting performs
> > > much better by avoiding frequent para calls on page table mgmt. 😊
> > 
> > I am also working to get the numbers for both cases so we know
> > the order of magnitude of each case, as I guess it won't be as clear
> > for large systems with many DMA initiators what approach is best.
> > 
> > 
> 
> That'd be great!
Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Jason Gunthorpe 10 months, 3 weeks ago
On Wed, Jan 22, 2025 at 11:04:24AM +0000, Mostafa Saleh wrote:
> AFAIK, the most common use cases would be:
> - Devices using DMA API because it requires a lot of memory to be
>   contiguous in IOVA, which is hard to do with identity

This is not a feature of the DMA API any driver should rely on .. Are
you aware of one that does?

> - Devices with security requirements/constraints to be isolated from the
>   rest of the system, also using DMA API

This is real, but again, in a mobile context does this even exist? It isn't
like there are external PCIe ports that need securing on a phone?

> - VFIO is something we are looking at the moment and have prototyped with
>   pKVM, and it should be supported soon in Android (only for platform
>   devices for now)

Yes, this makes sense

Jason
Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Mostafa Saleh 10 months, 3 weeks ago
On Wed, Jan 22, 2025 at 12:20:55PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 22, 2025 at 11:04:24AM +0000, Mostafa Saleh wrote:
> > AFAIK, the most common use cases would be:
> > - Devices using DMA API because it requires a lot of memory to be
> >   contiguous in IOVA, which is hard to do with identity
> 
> This is not a feature of the DMA API any driver should rely on .. Are
> you aware of one that does?
> 

I'd guess one example is media drivers; they usually need large
contiguous buffers and would use, for example, dma_alloc_coherent().
If the IOMMU is disabled or bypassed, the kernel has to find such a
contiguous range in physical memory, which can be impossible on devices
with small memory such as mobile devices.
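
As a hedged illustration (the size and names are invented, not from any
real driver): behind a translating IOMMU the allocation below only has
to be IOVA-contiguous, while with identity/bypass it must be physically
contiguous and may fail:

#include <linux/dma-mapping.h>
#include <linux/sizes.h>

static void *alloc_frame_buffer(struct device *dev, dma_addr_t *iova)
{
	/* e.g. a 32MB buffer for a hypothetical media block */
	return dma_alloc_coherent(dev, SZ_32M, iova, GFP_KERNEL);
}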

I will look more into this while working on the patches to identity-map
everything for v3, and I'll see what kind of issues I hit.

> > - Devices with security requirements/constraints to be isolated from the
> >   rest of the system, also using DMA API
> 
> This is real, but again, in a mobile context does this even exist? It isn't
> like there are external PCIe ports that need securing on a phone?

It's not just about completely external devices; it's a defence-in-depth
measure. For example, network devices can be poked externally, and there
have been cases in the past where exploits were found[1], so some vendors
might have a policy to isolate such devices, which I believe is valid.

[1] https://lwn.net/ml/oss-security/20221013101046.GB20615@suse.de/

Thanks,
Mostafa

> 
> > - VFIO is something we are looking at the moment and have prototyped with
> >   pKVM, and it should be supported soon in Android (only for platform
> >   devices for now)
> 
> Yes, this makes sense
> 
> Jason
Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Posted by Jason Gunthorpe 10 months, 3 weeks ago
On Wed, Jan 22, 2025 at 05:17:50PM +0000, Mostafa Saleh wrote:
> On Wed, Jan 22, 2025 at 12:20:55PM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 22, 2025 at 11:04:24AM +0000, Mostafa Saleh wrote:
> > > AFAIK, the most common use cases would be:
> > > - Devices using DMA API because it requires a lot of memory to be
> > >   contiguous in IOVA, which is hard to do with identity
> > 
> > This is not a feature of the DMA API any driver should rely on .. Are
> > you aware of one that does?
> > 
> 
> I’d guess one example is media drivers, they usually need large contiguous
> buffers, and would use for ex dma_alloc_coherent(), if the IOMMU is disabled or
> bypassed, that means that the kernel has to find such contiguous size in the
> physical address which can be impossible on devices with small memory as
> mobile devices. Similarly.

I see, that makes sense.

> It’s not just about completely external devices, it’s a defence in depth
> measure, where for example, network devices can be poked externally an
> there have cases in the past where exploits were found[1], so some vendors
> might have a policy to isolate such devices. Which I believe is a valid.

The performance cost of doing isolation like that with networking is
probably prohibitive with paravirt.

Jason