This patch series implements support for Device Assignment in the ARM CCA
architecture. The code changes are based on the Alp12 specification published
here [1].

The code builds on the TSM framework patches posted at [2]. We add extensions
to that framework so that TSM is now used in both the host and the guest.

A DA workflow can be summarized as follows:

Host:
step 1.
echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/${DEVICE}/driver_override
echo ${DEVICE} > /sys/bus/pci/drivers_probe

step 2.
echo 1 > /sys/bus/pci/devices/$DEVICE/tsm/connect

Now in the guest we follow the steps below:

step 1:
echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind

step 2: Move the device to TDISP LOCK state
echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/lock

step 3: Move the device to TDISP RUN state
echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/accept

step 4: Load the driver again.
echo ${DEVICE} > /sys/bus/pci/drivers_probe

I'm currently working against TSM v3, as TSM v4 lacks the bind, unbind,
and guest_req callbacks required for guest interactions.

The implementation also makes use of RHI interfaces that fall outside the
current RHI specification [5]. Once the spec is finalized, the code will be
aligned accordingly.

For now, I've retained validate_mmio and vdev_req exit handling within KVM.
This will transition to a guest_req-based mechanism once the specification
is updated.

At that point, all device assignment (DA)-specific VM exits will exit
directly to the VMM, and will use the guest_req ioctl to handle exit
reasons. As part of this change, the handlers realm_exit_vdev_req_handler,
realm_exit_vdev_comm_handler, and realm_exit_dev_mem_map_handler will be
removed.

The full patchset for the kernel and kvmtool can be found at [3] and [4].

[1] https://developer.arm.com/-/cdn-downloads/permalink/Architectures/Armv9/DEN0137_1.1-alp12.zip
[2] https://lore.kernel.org/all/20250516054732.2055093-1-dan.j.williams@intel.com
[3] https://git.gitlab.arm.com/linux-arm/linux-cca.git cca/tdisp-upstream-post-v1
[4] https://git.gitlab.arm.com/linux-arm/kvmtool-cca.git cca/tdisp-upstream-post-v1
[5] https://developer.arm.com/documentation/den0148/latest/

Aneesh Kumar K.V (Arm) (35):
  tsm: Add tsm_bind/unbind helpers
  tsm: Move tsm core outside the host directory
  tsm: Move dsm_dev from pci_tdi to pci_tsm
  tsm: Support DMA Allocation from private memory
  tsm: Don't overload connect
  iommufd: Add an option to request for bar mapping with IORESOURCE_EXCLUSIVE
  iommufd/viommu: Add support to associate viommu with kvm instance
  iommufd/tsm: Add tsm_op iommufd ioctls
  iommufd/vdevice: Add TSM Guest request uAPI
  iommufd/vdevice: Add TSM map ioctl
  KVM: arm64: CCA: register host tsm platform device
  coco: host: arm64: CCA host platform device driver
  coco: host: arm64: Create a PDEV with rmm
  coco: host: arm64: Device communication support
  coco: host: arm64: Stop and destroy the physical device
  coco: host: arm64: set_pubkey support
  coco: host: arm64: Add support for creating a virtual device
  coco: host: arm64: Add support for virtual device communication
  coco: host: arm64: Stop and destroy virtual device
  coco: guest: arm64: Update arm CCA guest driver
  arm64: CCA: Register guest tsm callback
  cca: guest: arm64: Realm device lock support
  KVM: arm64: Add exit handler related to device assignment
  coco: host: arm64: add RSI_RDEV_GET_INSTANCE_ID related exit handler
  coco: host: arm64: Add support for device communication exit handler
  coco: guest: arm64: Add support for collecting interface reports
  coco: host: arm64: Add support for realm host interface (RHI)
  coco: guest: arm64: Add support for fetching interface report and
    certificate chain from host
  coco: guest: arm64: Add support for guest initiated TDI bind/unbind
  KVM: arm64: CCA: handle dev mem map/unmap
  coco: guest: arm64: Validate mmio range found in the interface report
  coco: guest: arm64: Add Realm device start and stop support
  KVM: arm64: CCA: enable DA in realm create parameters
  coco: guest: arm64: Add support for fetching device measurements
  coco: guest: arm64: Add support for fetching device info

Lukas Wunner (3):
  X.509: Make certificate parser public
  X.509: Parse Subject Alternative Name in certificates
  X.509: Move certificate length retrieval into new helper

 arch/arm64/include/asm/kvm_rme.h              |   3 +
 arch/arm64/include/asm/mem_encrypt.h          |   6 +-
 arch/arm64/include/asm/rhi.h                  |  39 +
 arch/arm64/include/asm/rmi_cmds.h             | 173 ++++
 arch/arm64/include/asm/rmi_smc.h              | 210 ++++-
 arch/arm64/include/asm/rsi.h                  |   5 +-
 arch/arm64/include/asm/rsi_cmds.h             | 129 +++
 arch/arm64/include/asm/rsi_smc.h              |  60 ++
 arch/arm64/kernel/Makefile                    |   2 +-
 arch/arm64/kernel/rhi.c                       |  35 +
 arch/arm64/kernel/rsi.c                       |  26 +-
 arch/arm64/kvm/mmu.c                          |  45 +
 arch/arm64/kvm/rme-exit.c                     |  87 ++
 arch/arm64/kvm/rme.c                          | 208 ++++-
 arch/arm64/mm/mem_encrypt.c                   |  10 +
 crypto/asymmetric_keys/x509_cert_parser.c     |   9 +
 crypto/asymmetric_keys/x509_loader.c          |  38 +-
 crypto/asymmetric_keys/x509_parser.h          |  40 +-
 drivers/iommu/iommufd/device.c                |  54 ++
 drivers/iommu/iommufd/iommufd_private.h       |   7 +
 drivers/iommu/iommufd/main.c                  |  13 +
 drivers/iommu/iommufd/viommu.c                | 178 +++-
 drivers/pci/tsm.c                             | 229 ++++-
 drivers/vfio/pci/vfio_pci_core.c              |  20 +-
 drivers/virt/coco/Kconfig                     |   5 +-
 drivers/virt/coco/Makefile                    |   7 +-
 drivers/virt/coco/arm-cca-guest/Kconfig       |  10 +-
 drivers/virt/coco/arm-cca-guest/Makefile      |   3 +
 .../{arm-cca-guest.c => arm-cca.c}            | 175 +++-
 drivers/virt/coco/arm-cca-guest/rsi-da.c      | 576 ++++++++++++
 drivers/virt/coco/arm-cca-guest/rsi-da.h      |  73 ++
 drivers/virt/coco/arm-cca-host/Kconfig        |  17 +
 drivers/virt/coco/arm-cca-host/Makefile       |   5 +
 drivers/virt/coco/arm-cca-host/arm-cca.c      | 384 ++++++++
 drivers/virt/coco/arm-cca-host/rmm-da.c       | 857 ++++++++++++++++++
 drivers/virt/coco/arm-cca-host/rmm-da.h       | 108 +++
 drivers/virt/coco/host/Kconfig                |   6 -
 drivers/virt/coco/host/Makefile               |   6 -
 drivers/virt/coco/{host => }/tsm-core.c       |  27 +
 include/keys/asymmetric-type.h                |   2 +
 include/keys/x509-parser.h                    |  55 ++
 include/linux/device.h                        |   1 +
 include/linux/iommufd.h                       |   4 +
 include/linux/kvm_host.h                      |   1 +
 include/linux/pci-tsm.h                       |  37 +-
 include/linux/swiotlb.h                       |   4 +
 include/linux/tsm.h                           |  29 +
 include/uapi/linux/iommufd.h                  |  69 ++
 48 files changed, 3887 insertions(+), 200 deletions(-)
 create mode 100644 arch/arm64/include/asm/rhi.h
 create mode 100644 arch/arm64/kernel/rhi.c
 rename drivers/virt/coco/arm-cca-guest/{arm-cca-guest.c => arm-cca.c} (62%)
 create mode 100644 drivers/virt/coco/arm-cca-guest/rsi-da.c
 create mode 100644 drivers/virt/coco/arm-cca-guest/rsi-da.h
 create mode 100644 drivers/virt/coco/arm-cca-host/Kconfig
 create mode 100644 drivers/virt/coco/arm-cca-host/Makefile
 create mode 100644 drivers/virt/coco/arm-cca-host/arm-cca.c
 create mode 100644 drivers/virt/coco/arm-cca-host/rmm-da.c
 create mode 100644 drivers/virt/coco/arm-cca-host/rmm-da.h
 delete mode 100644 drivers/virt/coco/host/Kconfig
 delete mode 100644 drivers/virt/coco/host/Makefile
 rename drivers/virt/coco/{host => }/tsm-core.c (85%)
 create mode 100644 include/keys/x509-parser.h

-- 
2.43.0
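[As a rough orientation for the workflow above: the tsm/lock, tsm/accept,
and tsm/connect attributes dispatch into per-platform TSM callbacks. A
minimal sketch of the guest-side hook set, assuming the TSM v3 shape this
cover letter targets; the names and signatures below are approximations,
not the posted API:]

#include <linux/pci.h>

struct pci_tdi;		/* per-TDI state, opaque here */

/*
 * Illustrative only: approximate guest-side TSM hooks. TSM v3 carried
 * bind/unbind/guest_req; lock/accept back the sysfs attributes above.
 */
struct pci_tsm_guest_ops {
	/* tsm/lock: drive the TDI from UNLOCKED to LOCKED */
	int (*lock)(struct pci_dev *pdev);
	/* tsm/accept: verify the interface report, then LOCKED -> RUN */
	int (*accept)(struct pci_dev *pdev);
	/* associate/disassociate the TDI with this confidential guest */
	struct pci_tdi *(*bind)(struct pci_dev *pdev);
	void (*unbind)(struct pci_tdi *tdi);
	/* tunnel TSM-specific requests (reports, measurements, ...) */
	int (*guest_req)(struct pci_dev *pdev, void *req, size_t len);
};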
On Mon, Jul 28, 2025 at 07:21:37PM +0530, Aneesh Kumar K.V (Arm) wrote:
> This patch series implements support for Device Assignment in the ARM CCA
> architecture. The code changes are based on the Alp12 specification
> published here [1].

Robin and I were talking about CCA and DMA here:

https://lore.kernel.org/r/6c5fb9f0-c608-4e19-8c60-5d8cef3efbdf@arm.com

What do you think about pulling some of this out and trying to independently
push a series getting the DMA API layers ready for device assignment?

I think there will be some discussion on these points, and it would be good
to get started.

Jason
Aneesh Kumar K.V (Arm) wrote:
> This patch series implements support for Device Assignment in the ARM CCA
> architecture. The code changes are based on the Alp12 specification
> published here [1].
>
> The code builds on the TSM framework patches posted at [2]. We add
> extensions to that framework so that TSM is now used in both the host
> and the guest.
>
> A DA workflow can be summarized as follows:
>
> Host:
> step 1.
> echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
> echo vfio-pci > /sys/bus/pci/devices/${DEVICE}/driver_override
> echo ${DEVICE} > /sys/bus/pci/drivers_probe
>
> step 2.
> echo 1 > /sys/bus/pci/devices/$DEVICE/tsm/connect

Just for my own understanding... presumably there is no ordering constraint
for ARM CCA between step1 and step2, right? I.e. the connect state is
independent of the bind state.

In the v4 PCI/TSM scheme the connect command is now:

echo $tsm_dev > /sys/bus/pci/devices/$DEVICE/tsm/connect

> Now in the guest we follow the steps below:

I assume a significant amount of kvmtool magic happens here to get the TDI
into a "bind capable" state, can you share that command? I had been assuming
that everyone was prototyping with QEMU. Not a problem per se, but the
memory management for shared device assignment / bounce buffering has had
quite a bit of work on the QEMU side, so just curious about the difference
in approach here. Like, does kvmtool support operating the device in shared
mode with bounce buffering and page conversion (shared <=> private) support?

In any event, happy to see multiple simultaneous consumers of this new
kernel infrastructure.

> step 1:
> echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
>
> step 2: Move the device to TDISP LOCK state
> echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/lock

Ok, so my stance has recently picked up some nuance here. As Jason mentions
here:

http://lore.kernel.org/20250410235008.GC63245@ziepe.ca

"However it works, it should be done before the driver is probed and remain
stable for the duration of the driver attachment. From the iommu side the
correct iommu domain, on the correct IOMMU instance to handle the expected
traffic should be setup as the DMA API's iommu domain."

I agree with that up until the point where the implication is userspace
control of the UNLOCKED->LOCKED transition. That transition requires
enabling bus-mastering (BME), configuring the device into an expected state,
and *then* locking the device. That means userspace is blindly hoping that
the device is in a state where it will remain quiet on the bus between BME
and LOCKED, and that the previous unbind left the device in a state where
it is prepared to be locked again.

The BME concern may be overblown given that major PCI drivers blindly set
BME without validating that the device is in a quiesced state, but the
"device is prepped for locking" problem seems harder.

2 potential ways to solve this, but open to other ideas:

- Userspace only picks the iommu domain context for the device, not the
  lock state. Something like:

  private > /sys/bus/pci/devices/${DEVICE}/tsm/domain

  ...where the default is "shared" and from that point the device can not
  issue DMA until a driver attaches. Driver controls UNLOCKED->LOCKED->RUN.

- Userspace is not involved in this transition and the dma mapping API is
  updated to allow a driver to switch the iommu domain at runtime, but only
  if the device has no outstanding mappings and the transition can only
  happen from ->probe() context. Driver controls joining secure-world-DMA
  and UNLOCKED->LOCKED->RUN.
Clearly the first option is less work in the kernel, but in both options the
driver is in control of when BME is set relative to being ready for the
LOCKED transition.

> step 3: Move the device to TDISP RUN state
> echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/accept

This has the same concern from me about userspace being in control of BME.
It feels like a departure from typical expectations. At least in the case
of a driver setting BME, the driver's probe routine is going to get the
device in order shortly and otherwise have error handlers at the ready to
effect any needed recovery. Userspace just leaves the device enabled
indefinitely and hopes.

Now, the nice thing about the scheme as proposed in this set is that
userspace has all the time in the world between "lock" and "accept" to talk
to a verifier. With the driver in control there would need to be something
like a usermodehelper to notify userspace that the device is in the locked
state and to go ahead and run the attestation while the driver waits*.

* or the driver could decide not to wait, especially useful for debug and
  development

> step 4: Load the driver again.
> echo ${DEVICE} > /sys/bus/pci/drivers_probe

TIL drivers_probe

Maybe want to recommend:

echo ${DEVICE} > /sys/bus/pci/drivers/${DRIVER}/bind

...to users just in case there are multiple drivers loaded for the device
for the "shared" vs "private" case?

> I'm currently working against TSM v3, as TSM v4 lacks the bind, unbind,
> and guest_req callbacks required for guest interactions.

For staging purposes I wanted to put the "connect" flow to bed before moving
on to the guest side.

> The implementation also makes use of RHI interfaces that fall outside the
> current RHI specification [5]. Once the spec is finalized, the code will
> be aligned accordingly.
>
> For now, I've retained validate_mmio and vdev_req exit handling within
> KVM. This will transition to a guest_req-based mechanism once the
> specification is updated.
>
> At that point, all device assignment (DA)-specific VM exits will exit
> directly to the VMM, and will use the guest_req ioctl to handle exit
> reasons. As part of this change, the handlers realm_exit_vdev_req_handler,
> realm_exit_vdev_comm_handler, and realm_exit_dev_mem_map_handler will be
> removed.
>
> The full patchset for the kernel and kvmtool can be found at [3] and [4].
>
> [1] https://developer.arm.com/-/cdn-downloads/permalink/Architectures/Armv9/DEN0137_1.1-alp12.zip
>
> [2] https://lore.kernel.org/all/20250516054732.2055093-1-dan.j.williams@intel.com
>
> [3] https://git.gitlab.arm.com/linux-arm/linux-cca.git cca/tdisp-upstream-post-v1
> [4] https://git.gitlab.arm.com/linux-arm/kvmtool-cca.git cca/tdisp-upstream-post-v1
> [5] https://developer.arm.com/documentation/den0148/latest/

Thanks for this and the help reviewing PCI/TSM so far! I want to get this
into tsm.git#staging so we can start to make hard claims ("look at the
shared tree!") of hardware vendor consensus.
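[For concreteness, the first option above could look something like the
store handler below on the kernel side. This is only a sketch of the
proposal, not code from any posted series; pci_tsm_set_domain() is a
hypothetical helper:]

#include <linux/device.h>
#include <linux/pci.h>

/*
 * Hypothetical helper: switch the device between the "shared" and
 * "private" (secure world) iommu domain / DMA API state.
 */
int pci_tsm_set_domain(struct pci_dev *pdev, bool private);

static ssize_t domain_store(struct device *dev, struct device_attribute *attr,
			    const char *buf, size_t len)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	bool private;
	int rc;

	if (sysfs_streq(buf, "private"))
		private = true;
	else if (sysfs_streq(buf, "shared"))
		private = false;
	else
		return -EINVAL;

	/* per the discussion: only legal while no driver is bound */
	device_lock(dev);
	if (dev->driver)
		rc = -EBUSY;
	else
		rc = pci_tsm_set_domain(pdev, private);
	device_unlock(dev);

	return rc ?: len;
}
static DEVICE_ATTR_WO(domain);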
On Thu, Jul 31, 2025 at 07:07:17PM -0700, dan.j.williams@intel.com wrote:
> Aneesh Kumar K.V (Arm) wrote:
> > Host:
> > step 1.
> > echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
> > echo vfio-pci > /sys/bus/pci/devices/${DEVICE}/driver_override
> > echo ${DEVICE} > /sys/bus/pci/drivers_probe
> >
> > step 2.
> > echo 1 > /sys/bus/pci/devices/$DEVICE/tsm/connect
>
> Just for my own understanding... presumably there is no ordering
> constraint for ARM CCA between step1 and step2, right? I.e. the connect
> state is independent of the bind state.
>
> In the v4 PCI/TSM scheme the connect command is now:
>
> echo $tsm_dev > /sys/bus/pci/devices/$DEVICE/tsm/connect

What does this do on the host? It seems to somehow prep it for VM
assignment? Seems pretty strange that this is here in sysfs and not part of
creating the vPCI function in the VM through VFIO and iommufd?

Frankly, I'm nervous about making any uAPI whatsoever for the hypervisor
side at this point. I don't think we have enough of the solution even in
draft format. I'd really like your first merged TSM series to only have
uAPI for the guest side, where things are hopefully closer to complete..

> > step 1:
> > echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
> >
> > step 2: Move the device to TDISP LOCK state
> > echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/lock
>
> Ok, so my stance has recently picked up some nuance here. As Jason
> mentions here:
>
> http://lore.kernel.org/20250410235008.GC63245@ziepe.ca
>
> "However it works, it should be done before the driver is probed and
> remain stable for the duration of the driver attachment. From the
> iommu side the correct iommu domain, on the correct IOMMU instance to
> handle the expected traffic should be setup as the DMA API's iommu
> domain."

I think it is not just the dma api, but also the MMIO registers may move
location (from shared to protected IPA space, for example). Meaning any
attached driver is completely wrecked.

> I agree with that up until the point where the implication is userspace
> control of the UNLOCKED->LOCKED transition. That transition requires
> enabling bus-mastering (BME),

Why? That's sad. BME should be controlled by the VM driver, not the TSM, and
it should be set only when a VM driver is probed to the RUN state device?

> and *then* locking the device. That means userspace is blindly hoping
> that the device is in a state where it will remain quiet on the bus
> between BME and LOCKED, and that the previous unbind left the device in
> a state where it is prepared to be locked again.

Yes, but we broadly assume this already in Linux. Drivers assume their
devices are quiet when they are bound the first time, and we expect that on
unbinding a driver quiets the device before removing.

So broadly I think you can assume that a device with no driver is quiet
regardless of BME.

> 2 potential ways to solve this, but open to other ideas:
>
> - Userspace only picks the iommu domain context for the device, not the
>   lock state. Something like:
>
>   private > /sys/bus/pci/devices/${DEVICE}/tsm/domain
>
>   ...where the default is "shared" and from that point the device can
>   not issue DMA until a driver attaches. Driver controls
>   UNLOCKED->LOCKED->RUN.

What? Gross, no way can we let userspace control such intimate details of
the kernel. The kernel must auto set based on what T=x mode the device
driver binds into.
> - Userspace is not involved in this transition and the dma mapping API
>   is updated to allow a driver to switch the iommu domain at runtime,
>   but only if the device has no outstanding mappings and the transition
>   can only happen from ->probe() context. Driver controls joining
>   secure-world-DMA and UNLOCKED->LOCKED->RUN.

I don't see why it is so complicated. The driver is unbound before it
reaches T=1, so we expect the device to be quiet (bigger problems if not).
When the PCI core reaches T=1 it tells the DMA API to reconfigure things
for the unbound struct device. Then we bind a driver as normal.

Driver controls nothing. All existing T=0 drivers "just work" with no
source changes in T=1 mode. DMA API magically hides the bounce buffering.
Surely this should be the baseline target functionality from a Linux
perspective?

So we should not have "driver controls" statements at all. Userspace
prepares the PCI device, the driver probes onto a T=1 environment and just
works.

> > step 3: Move the device to TDISP RUN state
> > echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/accept
>
> This has the same concern from me about userspace being in control of
> BME. It feels like a departure from typical expectations.

It is, it is architecturally broken for BME to be controlled by the TSM.
BME is controlled by the guest OS driver only.

IMHO if this is a real worry (and I don't think it is) then the right
answer is for physical BME to be set on during locking, but VIRTUAL BME is
left off. Virtual BME is created by the hypervisor/tsm by telling the IOMMU
to block DMA.

The Guest OS should not participate in this broken design; the hypervisor
can set pBME automatically when the lock request comes in, and the quality
of vBME emulation is left up to the implementation, but the implementation
must provide at least a NOP vBME once locked.

> Now, the nice thing about the scheme as proposed in this set is that
> userspace has all the time in the world between "lock" and "accept" to
> talk to a verifier.

Seems right to me. There should be NO trusted kernel driver bound until the
verifier accepts the attestation. Anything else allows unaccepted devices
to attack the kernel drivers. Few kernel drivers today distrust their HW
interfaces as hostile actors and security-defend against them. Therefore we
should be very reluctant to bind drivers to anything..

Arguably a CC secure kernel should have an allow list of audited secure
drivers that can autoprobe, and all other drivers must be approved by
userspace in some way, either through T=1 and attestation or some
customer-aware risk assumption.

From that principle the kernel should NOT auto probe drivers to T=0 devices
that can be made T=1. Userspace should handle attaching HW to such devices,
and userspace can sequence whatever is required, including the attestation
and verifying.

Otherwise, if you, say, have a TDISP capable mlx5 device and boot up the
cVM in a compromised host, the host can probably completely hack your cVM
by exploiting the mlx5 driver's total trust in the HW interface while
running in T=0 mode.

You must attest it and switch to T=1 before binding any driver if you care
about mitigating this risk.

> With the driver in control there would need to be something like a
> usermodehelper to notify userspace that the device is in the locked
> state and to go ahead and run the attestation while the driver waits*.

It doesn't make sense to require modification to all existing drivers in
Linux!
The starting point must have the core code do this sequence for every
driver. Once that is working we can talk about whether other flows are
needed.

> > step 4: Load the driver again.
> > echo ${DEVICE} > /sys/bus/pci/drivers_probe
>
> TIL drivers_probe
>
> Maybe want to recommend:
>
> echo ${DEVICE} > /sys/bus/pci/drivers/${DRIVER}/bind
>
> ...to users just in case there are multiple drivers loaded for the
> device for the "shared" vs "private" case?

Generic userspace will have a hard time knowing what the driver names are..

The drivers_probe option looks good to me as the default.

I'm not sure how generic code can handle "multiple drivers".. Most devices
will be able to work just fine in T=0 mode with bounce buffers, so we
should generally not encourage people to make completely different drivers
for T=0/T=1 mode.

I think what is needed is some way for userspace to trigger the "locking
configuration" you mentioned; that may need a special driver, but ONLY if
userspace is sequencing the device to T=1 mode. Not sure how to make that
generic, but I think so long as userspace is explicitly controlling driver
binding we can punt on that solution to the userspace project :)

The real nastiness is RAS - what do you do when the device falls out of
RUN? The kernel driver should pretty much explode. But lots of people would
like the kernel driver to stay alive and somehow we FLR, re-attest and
"resume" the kernel driver without allowing any T=0 risks. For instance you
can keep your netdev and just see a lot of lost packets while the driver
thrashes.

But I think we can start with the idea that such RAS failures have to
reload the driver too, and work on improvements. Realistically few drivers
have the sort of RAS features to consume this anyhow, and maybe we
introduce some "enhanced" driver mode to opt into down the road.

Jason
Jason Gunthorpe wrote:
> On Thu, Jul 31, 2025 at 07:07:17PM -0700, dan.j.williams@intel.com wrote:
> > Aneesh Kumar K.V (Arm) wrote:
> > > Host:
> > > step 1.
> > > echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
> > > echo vfio-pci > /sys/bus/pci/devices/${DEVICE}/driver_override
> > > echo ${DEVICE} > /sys/bus/pci/drivers_probe
> > >
> > > step 2.
> > > echo 1 > /sys/bus/pci/devices/$DEVICE/tsm/connect
> >
> > Just for my own understanding... presumably there is no ordering
> > constraint for ARM CCA between step1 and step2, right? I.e. the connect
> > state is independent of the bind state.
> >
> > In the v4 PCI/TSM scheme the connect command is now:
> >
> > echo $tsm_dev > /sys/bus/pci/devices/$DEVICE/tsm/connect
>
> What does this do on the host? It seems to somehow prep it for VM
> assignment? Seems pretty strange that this is here in sysfs and not part
> of creating the vPCI function in the VM through VFIO and iommufd?

vPCI is out of the picture at this phase.

On the host this establishes an SPDM session and sets up link encryption
(IDE) with the physical device. Leave VMs out of the picture; this
capability in isolation is a useful property. It addresses a similar threat
model to the one Intel Total Memory Encryption (TME) or AMD Secure Memory
Encryption (SME) go after, i.e. an interposer on a physical link capturing
data in flight.

With that established, one can then go further to do the full TDISP dance.

> Frankly, I'm nervous about making any uAPI whatsoever for the hypervisor
> side at this point. I don't think we have enough of the solution even in
> draft format. I'd really like your first merged TSM series to only have
> uAPI for the guest side, where things are hopefully closer to complete.

Aligned. I am not comfortable merging any of this until we have that end to
end reliably stable for a kernel cycle or 2. The proposal is to soak all
the vendor solutions together in tsm.git#staging.

Now, if the guest side graduates out of that staging before the host side,
I am ok with that.

> > > step 1:
> > > echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
> > >
> > > step 2: Move the device to TDISP LOCK state
> > > echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/lock
> >
> > Ok, so my stance has recently picked up some nuance here. As Jason
> > mentions here:
> >
> > http://lore.kernel.org/20250410235008.GC63245@ziepe.ca
> >
> > "However it works, it should be done before the driver is probed and
> > remain stable for the duration of the driver attachment. From the
> > iommu side the correct iommu domain, on the correct IOMMU instance to
> > handle the expected traffic should be setup as the DMA API's iommu
> > domain."
>
> I think it is not just the dma api, but also the MMIO registers may move
> location (from shared to protected IPA space, for example). Meaning any
> attached driver is completely wrecked.

True.

> > I agree with that up until the point where the implication is userspace
> > control of the UNLOCKED->LOCKED transition. That transition requires
> > enabling bus-mastering (BME),
>
> Why? That's sad. BME should be controlled by the VM driver, not the TSM,
> and it should be set only when a VM driver is probed to the RUN state
> device?

To me it is an unfortunate PCI specification wrinkle that writing to the
command register drops the device from RUN to ERROR. So you can LOCK
without setting BME, but then no DMA.
> > and *then* locking the device. That means userspace is blindly hoping
> > that the device is in a state where it will remain quiet on the bus
> > between BME and LOCKED, and that the previous unbind left the device
> > in a state where it is prepared to be locked again.
>
> Yes, but we broadly assume this already in Linux. Drivers assume their
> devices are quiet when they are bound the first time, and we expect that
> on unbinding a driver quiets the device before removing.
>
> So broadly I think you can assume that a device with no driver is quiet
> regardless of BME.
>
> > 2 potential ways to solve this, but open to other ideas:
> >
> > - Userspace only picks the iommu domain context for the device, not the
> >   lock state. Something like:
> >
> >   private > /sys/bus/pci/devices/${DEVICE}/tsm/domain
> >
> >   ...where the default is "shared" and from that point the device can
> >   not issue DMA until a driver attaches. Driver controls
> >   UNLOCKED->LOCKED->RUN.
>
> What? Gross, no way can we let userspace control such intimate details of
> the kernel. The kernel must auto set based on what T=x mode the device
> driver binds into.

Flummoxed. Any way this gets sliced, userspace is asking for "private world
attach" because it alone knows whether this device is acceptable, and
devices need to arrive in "shared world attach" mode.

> > - Userspace is not involved in this transition and the dma mapping API
> >   is updated to allow a driver to switch the iommu domain at runtime,
> >   but only if the device has no outstanding mappings and the transition
> >   can only happen from ->probe() context. Driver controls joining
> >   secure-world-DMA and UNLOCKED->LOCKED->RUN.
>
> I don't see why it is so complicated. The driver is unbound before it
> reaches T=1, so we expect the device to be quiet (bigger problems if
> not). When the PCI core reaches T=1 it tells the DMA API to reconfigure
> things for the unbound struct device. Then we bind a driver as normal.
>
> Driver controls nothing. All existing T=0 drivers "just work" with no
> source changes in T=1 mode. DMA API magically hides the bounce buffering.
> Surely this should be the baseline target functionality from a Linux
> perspective?

I started this project with "all existing T=0 drivers 'just work'" as a
goal and a virtue. I have been begrudgingly pulled away from it by the slow
drip of complexity it appears to push into the PCI core.

Now, I suspect the number of devices that are willing to spend gates and
firmware on TDISP capabilities in the near term is small. The "just works"
case is saved for either an L1 VMM to hide all this from an L2 guest, or a
simplified TDISP specification that actually allows an OS PCI core to
handle these details in a standard way.

> So we should not have "driver controls" statements at all. Userspace
> prepares the PCI device, the driver probes onto a T=1 environment and
> just works.

The concern is that neither userspace nor the PCI core has everything it
needs to get the device to T=1. The PCI core knows that the device is T=1
capable, but does not know how to preconfigure the device-specific lock
state, and needs to wait for attestation. Userspace knows how to
attest/verify the device but really has no business running the device
outside of binding a driver, and can not rely on the PCI core to have
prepped the device's device-specific lock state.
Userspace might be able to bind a new driver that leaves the device in a
lockable state on unbind, but that is not "just works". That is, "introduce
a new concept of skinny TDISP setup drivers that leave devices in LOCKED
state on driver unbind, so that userspace can do the work to verify the
device and move it to RUN before loading the main driver that expects the
device to arrive already running. Also, that main driver needs to be
careful not to trigger typically benign actions, like touching the command
register, that would trip the device into ERROR state, or any
device-specific actions that trip ERROR state but would otherwise be benign
outside of TDISP."

If locking the device was just a toggle it would be possible. As far as I
can see it is a "prep+toggle" where "prep" needs a driver.

> > > step 3: Move the device to TDISP RUN state
> > > echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/accept
> >
> > This has the same concern from me about userspace being in control of
> > BME. It feels like a departure from typical expectations.
>
> It is, it is architecturally broken for BME to be controlled by the TSM.
> BME is controlled by the guest OS driver only.

Agree. That "accept" attribute does not belong with TSM. That is where
Aneesh has it in this RFC. "Accept" as an action is the combination of "the
device entered the LOCKED state in a configuration the verifier is willing
to accept" and the mechanics of triggering the LOCKED->RUN transition.

> IMHO if this is a real worry (and I don't think it is) then the right
> answer is for physical BME to be set on during locking, but VIRTUAL BME
> is left off. Virtual BME is created by the hypervisor/tsm by telling the
> IOMMU to block DMA.
>
> The Guest OS should not participate in this broken design; the hypervisor
> can set pBME automatically when the lock request comes in, and the
> quality of vBME emulation is left up to the implementation, but the
> implementation must provide at least a NOP vBME once locked.

I can let go of the "BME without driver" worry, but that does nothing to
solve the "device specific configuration required before lock" problem.

> > Now, the nice thing about the scheme as proposed in this set is that
> > userspace has all the time in the world between "lock" and "accept" to
> > talk to a verifier.
>
> Seems right to me. There should be NO trusted kernel driver bound until
> the verifier accepts the attestation. Anything else allows unaccepted
> devices to attack the kernel drivers. Few kernel drivers today distrust
> their HW interfaces as hostile actors and security-defend against them.
> Therefore we should be very reluctant to bind drivers to anything.
>
> Arguably a CC secure kernel should have an allow list of audited secure
> drivers that can autoprobe, and all other drivers must be approved by
> userspace in some way, either through T=1 and attestation or some
> customer-aware risk assumption.

Yes, today, where nothing is T=1 capable for an L1 guest*, the onus is 100%
on the distribution, not the kernel. I.e. trim the kernel config and set
modprobe policy to prevent unwanted drivers.

* For L2 there are proposals like this, where if you already trust your
  paravisor you also pre-trust all the devices it tells you to trust:

  [1]: http://lore.kernel.org/20250714221545.5615-1-romank@linux.microsoft.com
> From that principle the kernel should NOT auto probe drivers to T=0
> devices that can be made T=1. Userspace should handle attaching HW to
> such devices, and userspace can sequence whatever is required, including
> the attestation and verifying.

Agree, for PCI it would be simple to set a no-auto-probe policy for T=1
capable devices.

> Otherwise, if you, say, have a TDISP capable mlx5 device and boot up the
> cVM in a compromised host, the host can probably completely hack your cVM
> by exploiting the mlx5 driver's total trust in the HW interface while
> running in T=0 mode.
>
> You must attest it and switch to T=1 before binding any driver if you
> care about mitigating this risk.

Yes, userspace must have a chance to say "no" before a driver attempts to
launch DMA to private memory after secrets have been deployed to the TVM.

> > With the driver in control there would need to be something like a
> > usermodehelper to notify userspace that the device is in the locked
> > state and to go ahead and run the attestation while the driver waits*.
>
> It doesn't make sense to require modification to all existing drivers in
> Linux!

I do not want to burden the PCI core with TDISP compatibility hacks and
workarounds if it turns out only a small handful of devices ever deploy a
first generation TDISP Device Security Manager (DSM). L1 aiding L2, or
TDISP simplicity improvements that allow the PCI core to handle this in a
non-broken way, are what I expect if secure device assignment takes off.

> The starting point must have the core code do this sequence for every
> driver. Once that is working we can talk about whether other flows are
> needed.

Do you agree that "device-specific-prep+lock" is the problem to solve?

> > > step 4: Load the driver again.
> > > echo ${DEVICE} > /sys/bus/pci/drivers_probe
> >
> > TIL drivers_probe
> >
> > Maybe want to recommend:
> >
> > echo ${DEVICE} > /sys/bus/pci/drivers/${DRIVER}/bind
> >
> > ...to users just in case there are multiple drivers loaded for the
> > device for the "shared" vs "private" case?
>
> Generic userspace will have a hard time knowing what the driver names
> are..
>
> The drivers_probe option looks good to me as the default.
>
> I'm not sure how generic code can handle "multiple drivers".. Most
> devices will be able to work just fine in T=0 mode with bounce buffers,
> so we should generally not encourage people to make completely different
> drivers for T=0/T=1 mode.
>
> I think what is needed is some way for userspace to trigger the "locking
> configuration" you mentioned; that may need a special driver, but ONLY if
> userspace is sequencing the device to T=1 mode. Not sure how to make that
> generic, but I think so long as userspace is explicitly controlling
> driver binding we can punt on that solution to the userspace project :)
>
> The real nastiness is RAS - what do you do when the device falls out of
> RUN? The kernel driver should pretty much explode. But lots of people
> would like the kernel driver to stay alive and somehow we FLR, re-attest
> and "resume" the kernel driver without allowing any T=0 risks. For
> instance you can keep your netdev and just see a lot of lost packets
> while the driver thrashes.

Ideally the RUN->ERROR->UNLOCKED->LOCKED->RUN recovery can fit into the
existing 'struct pci_error_handlers' regime in some farther out future.

It was a "fun" discovery to see that virtual AER injection does not exist
in QEMU (at least last time I checked) and assigned devices that throw
physical AER events just kill the VM.
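[For reference, 'struct pci_error_handlers' is the existing AER recovery
hook set a driver registers via its 'struct pci_driver'. A sketch of how a
TDISP-aware driver might slot re-attestation into it; the error-handler
structure and pci_ers_result_t codes are the real kernel API, while
pci_tsm_reattest() is a hypothetical stand-in for "re-run attestation and
return the TDI to RUN":]

#include <linux/pci.h>

int pci_tsm_reattest(struct pci_dev *pdev);	/* hypothetical */

static pci_ers_result_t foo_error_detected(struct pci_dev *pdev,
					   pci_channel_state_t state)
{
	/* the TDI fell out of RUN; ask the core for a reset (FLR) */
	return PCI_ERS_RESULT_NEED_RESET;
}

static pci_ers_result_t foo_slot_reset(struct pci_dev *pdev)
{
	/* hypothetical: re-attest and re-lock before resuming traffic */
	if (pci_tsm_reattest(pdev))
		return PCI_ERS_RESULT_DISCONNECT;
	return PCI_ERS_RESULT_RECOVERED;
}

static void foo_resume(struct pci_dev *pdev)
{
	/* restart I/O now that the device is back in RUN */
}

static const struct pci_error_handlers foo_err_handlers = {
	.error_detected	= foo_error_detected,
	.slot_reset	= foo_slot_reset,
	.resume		= foo_resume,
};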
> But I think we can start with the idea that such RAS failures have to
> reload the driver too, and work on improvements. Realistically few
> drivers have the sort of RAS features to consume this anyhow, and maybe
> we introduce some "enhanced" driver mode to opt into down the road.

Hmm, having trouble not reading that back as supporting my argument above:

Realistically few devices support TDISP, so let's require enhanced drivers
to opt into TDISP for the time being.
On Fri, Aug 01, 2025 at 02:19:54PM -0700, dan.j.williams@intel.com wrote:

> On the host this establishes an SPDM session and sets up link encryption
> (IDE) with the physical device. Leave VMs out of the picture; this
> capability in isolation is a useful property. It addresses a similar
> threat model to the one Intel Total Memory Encryption (TME) or AMD Secure
> Memory Encryption (SME) go after, i.e. an interposer on a physical link
> capturing data in flight.

Okay, maybe connect is not an intuitive name for opening IDE sessions..

> I started this project with "all existing T=0 drivers 'just work'" as a
> goal and a virtue. I have been begrudgingly pulled away from it by the
> slow drip of complexity it appears to push into the PCI core.

Do you have some examples? I don't really see what complexity there is if
the solution is simply to not auto bind any drivers to TDISP capable
devices, and userspace is responsible for manually binding a driver once it
has reached T=1.

This seems like the minimum possible simplicity for the kernel, as simply
everything is managed by userspace, and there is really no special kernel
behavior beyond switching the DMA API of an unbound driver on the T=0/1
change.

> The concern is that neither userspace nor the PCI core has everything it
> needs to get the device to T=1.

Disagree, I think userspace can have everything. It may need some
per-device userspace support in difficult cases, but userspace can deal
with it..

> The PCI core knows that the device is T=1 capable, but does not know how
> to preconfigure the device-specific lock state,

Userspace can do this. Can we define exactly what is needed to do this
"pre-configure the device specific lock state"? At the very worst, for the
most poorly designed device, userspace would have to bind a T=0 driver and
then unbind it.

Again, I am trying to make something simple for the kernel that gets us to
a working solution before we jump ahead to far more complex in-kernel
models, like aware drivers that can toggle themselves between T=0/1.

> Userspace might be able to bind a new driver that leaves the device in a
> lockable state on unbind, but that is not "just works". That is,

I wouldn't have the kernel leave the device in the locked state. That
should always be userspace. The special driver may do whatever special
setup is needed, then unbind and leave a normal unlocked device "prepped"
for userspace locking without doing an FLR or something. Realistically I
expect this to be a very rare requirement; I think this coming up just
reflects the HW immaturity of some early TDISP devices.

Sensible mature devices should have no need of a pre-locking step. I think
we should design toward that goal as the stable future and only try to
enable a hacky workaround for the problematic early devices. I certainly am
not keen on seeing significant permanent kernel complexity to support this
device design defect.

> driver that expects the device to arrive already running. Also, that
> main driver needs to be careful not to trigger typically benign actions,
> like touching the command register, that would trip the device into
> ERROR state, or any device-specific actions that trip ERROR state but
> would otherwise be benign outside of TDISP."

As I said below, I disagree with this. You can't touch the *physical*
command register, but the cVM can certainly touch the *virtualized* command
register. It is up to the VMM to ensure this doesn't cause the device to
fall out of RUN as part of virtualization.
I'd also say that the VMM should be responsible for setting pBME=1 even if
vBME=0? Shouldn't it? That simplifies even more things for the guest.

> > From that principle the kernel should NOT auto probe drivers to T=0
> > devices that can be made T=1. Userspace should handle attaching HW to
> > such devices, and userspace can sequence whatever is required,
> > including the attestation and verifying.
>
> Agree, for PCI it would be simple to set a no-auto-probe policy for T=1
> capable devices.

So then it is just a question of what a userspace component needs to do.

> I do not want to burden the PCI core with TDISP compatibility hacks and
> workarounds if it turns out only a small handful of devices ever deploy
> a first generation TDISP Device Security Manager (DSM). L1 aiding L2, or
> TDISP simplicity improvements that allow the PCI core to handle this in
> a non-broken way, are what I expect if secure device assignment takes
> off.

Same feeling about pre-configuration :)

> > The starting point must have the core code do this sequence for every
> > driver. Once that is working we can talk about whether other flows are
> > needed.
>
> Do you agree that "device-specific-prep+lock" is the problem to solve?

Not "the" problem, but a design issue we need to accommodate but not
endorse.

> > But I think we can start with the idea that such RAS failures have to
> > reload the driver too, and work on improvements. Realistically few
> > drivers have the sort of RAS features to consume this anyhow, and
> > maybe we introduce some "enhanced" driver mode to opt into down the
> > road.
>
> Hmm, having trouble not reading that back as supporting my argument
> above:
>
> Realistically few devices support TDISP, so let's require enhanced
> drivers to opt into TDISP for the time being.

I would be comfortable if hitless RAS recovery for TDISP devices requires
some kernel opt-in. But also I'm not sure how this should work from a
security perspective. Should userspace also have to re-attest before
allowing the device back to RUN? Clearly this is complicated.

Also, I would be comfortable supporting this only for devices that do not
require pre-configuration.

Jason
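[To make the vBME idea concrete, a VMM-side sketch of "virtual BME backed
by the IOMMU, physical command register untouched". Everything here is
hypothetical illustration, not code from kvmtool or QEMU; only
PCI_COMMAND_MASTER is a real constant:]

#include <stdbool.h>
#include <stdint.h>
#include <linux/pci_regs.h>	/* PCI_COMMAND_MASTER */

/* hypothetical VMM-side types and IOMMU helpers */
struct vpci_dev {
	uint16_t vcmd;	/* emulated command register the guest sees */
};
void vpci_iommu_allow_dma(struct vpci_dev *vdev);
void vpci_iommu_block_dma(struct vpci_dev *vdev);

/*
 * Emulate a guest write of the command register. While the TDI is
 * LOCKED/RUN the physical register is never written (that would trip the
 * device to ERROR); pBME was set by the hypervisor at lock time and vBME
 * is realized purely through the IOMMU.
 */
static void vpci_cmd_write(struct vpci_dev *vdev, uint16_t val)
{
	bool vbme = val & PCI_COMMAND_MASTER;

	vdev->vcmd = val;
	if (vbme)
		vpci_iommu_allow_dma(vdev);
	else
		vpci_iommu_block_dma(vdev);
}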
Jason Gunthorpe wrote:
> On Fri, Aug 01, 2025 at 02:19:54PM -0700, dan.j.williams@intel.com wrote:
>
> > On the host this establishes an SPDM session and sets up link
> > encryption (IDE) with the physical device. Leave VMs out of the
> > picture; this capability in isolation is a useful property. It
> > addresses a similar threat model to the one Intel Total Memory
> > Encryption (TME) or AMD Secure Memory Encryption (SME) go after, i.e.
> > an interposer on a physical link capturing data in flight.
>
> Okay, maybe connect is not an intuitive name for opening IDE sessions..

Part of the rationale for a generic name is that the TSM is free to assert
that the link is secure without IDE. Think integrated devices where there
is no expectation that the link can be observed.

The host and guest side TSM operations are split into link/transport
security and device/state security (private MMIO/DMA) concerns,
respectively. So maybe "secure_link" would be a better name for this
host-side-only operation.

> > I started this project with "all existing T=0 drivers 'just work'" as
> > a goal and a virtue. I have been begrudgingly pulled away from it by
> > the slow drip of complexity it appears to push into the PCI core.
>
> Do you have some examples? I don't really see what complexity there is
> if the solution is simply to not auto bind any drivers to TDISP capable
> devices, and userspace is responsible for manually binding a driver once
> it has reached T=1.

The example I have front of mind (confirmed by 2 vendors) is deferring the
loading of guest-side device/state security capable firmware to the guest
driver when the full device is assigned. In that scenario the default
device power-on firmware is capable of link/transport security, enough to
get the device assigned. The guest needs to get the device/state security
firmware loaded before TDISP state transitions are possible.

I do think RAS recovery needs it too, but like you say below that should
come with conditions.

> This seems like the minimum possible simplicity for the kernel, as
> simply everything is managed by userspace, and there is really no
> special kernel behavior beyond switching the DMA API of an unbound
> driver on the T=0/1 change.
>
> > The concern is that neither userspace nor the PCI core has everything
> > it needs to get the device to T=1.
>
> Disagree, I think userspace can have everything. It may need some
> per-device userspace support in difficult cases, but userspace can deal
> with it..

I do think userspace can / must deal with it. Let me come back with actual
patches and a sample test case. I see a potential path to support the above
"prep" scenario without the mess of TDISP setup drivers, or the ugly
complexity of driver toggles or a usermodehelper.

> > The PCI core knows that the device is T=1 capable, but does not know
> > how to preconfigure the device-specific lock state,
>
> Userspace can do this. Can we define exactly what is needed to do this
> "pre-configure the device specific lock state"? At the very worst, for
> the most poorly designed device, userspace would have to bind a T=0
> driver and then unbind it.
>
> Again, I am trying to make something simple for the kernel that gets us
> to a working solution before we jump ahead to far more complex in-kernel
> models, like aware drivers that can toggle themselves between T=0/1.

Agree. When I talked about wishing for the simple TDISP case, that is:
userspace can always "just lock" and "driver bind" without needing to
worry about "prep", i.e. any "prep" is always implied by "lock".
That should be the baseline.

> > Userspace might be able to bind a new driver that leaves the device in
> > a lockable state on unbind, but that is not "just works". That is,
>
> I wouldn't have the kernel leave the device in the locked state. That
> should always be userspace. The special driver may do whatever special
> setup is needed, then unbind and leave a normal unlocked device
> "prepped" for userspace locking without doing an FLR or something.
> Realistically I expect this to be a very rare requirement; I think this
> coming up just reflects the HW immaturity of some early TDISP devices.
>
> Sensible mature devices should have no need of a pre-locking step. I
> think we should design toward that goal as the stable future and only
> try to enable a hacky workaround for the problematic early devices. I
> certainly am not keen on seeing significant permanent kernel complexity
> to support this device design defect.

Yeah, that is the nightmare I had last night. I completed the thought
exercise about the driver toggle and said, "whoops, nope, Jason is right,
we can't design for that without leaving a permanent mess to clean up".

The end goal needs to look like the straight-line, typical driver probe
path for TDISP capable devices.

> > driver that expects the device to arrive already running. Also, that
> > main driver needs to be careful not to trigger typically benign
> > actions, like touching the command register, that would trip the
> > device into ERROR state, or any device-specific actions that trip
> > ERROR state but would otherwise be benign outside of TDISP."
>
> As I said below, I disagree with this. You can't touch the *physical*
> command register, but the cVM can certainly touch the *virtualized*
> command register. It is up to the VMM to ensure this doesn't cause the
> device to fall out of RUN as part of virtualization.
>
> I'd also say that the VMM should be responsible for setting pBME=1 even
> if vBME=0? Shouldn't it? That simplifies even more things for the guest.

True. Although, now I am going back on my PCI core burden concern to
wonder if *it* should handle a vBME on behalf of the driver, if only
because it may want to force the device out of the RUN state on driver
unbind to meet typical pci_disable_device() expectations. Alexey had this;
I thought it was burdensome, but now I am coming around.

> > > From that principle the kernel should NOT auto probe drivers to T=0
> > > devices that can be made T=1. Userspace should handle attaching HW
> > > to such devices, and userspace can sequence whatever is required,
> > > including the attestation and verifying.
> >
> > Agree, for PCI it would be simple to set a no-auto-probe policy for
> > T=1 capable devices.
>
> So then it is just a question of what a userspace component needs to do.
>
> > I do not want to burden the PCI core with TDISP compatibility hacks
> > and workarounds if it turns out only a small handful of devices ever
> > deploy a first generation TDISP Device Security Manager (DSM). L1
> > aiding L2, or TDISP simplicity improvements that allow the PCI core to
> > handle this in a non-broken way, are what I expect if secure device
> > assignment takes off.
>
> Same feeling about pre-configuration :)
>
> > > The starting point must have the core code do this sequence for
> > > every driver. Once that is working we can talk about whether other
> > > flows are needed.
> >
> > Do you agree that "device-specific-prep+lock" is the problem to solve?
>
> Not "the" problem, but a design issue we need to accommodate but not
> endorse.
I hear you, let me walk back from the cliff with patches.

> > > But I think we can start with the idea that such RAS failures have
> > > to reload the driver too, and work on improvements. Realistically
> > > few drivers have the sort of RAS features to consume this anyhow,
> > > and maybe we introduce some "enhanced" driver mode to opt into down
> > > the road.
> >
> > Hmm, having trouble not reading that back as supporting my argument
> > above:
> >
> > Realistically few devices support TDISP, so let's require enhanced
> > drivers to opt into TDISP for the time being.
>
> I would be comfortable if hitless RAS recovery for TDISP devices
> requires some kernel opt-in. But also I'm not sure how this should work
> from a security perspective. Should userspace also have to re-attest
> before allowing the device back to RUN? Clearly this is complicated.
>
> Also, I would be comfortable supporting this only for devices that do
> not require pre-configuration.

That seems reasonable. You want hitless RAS? Give us hitless init.
On Sat, Aug 02, 2025 at 04:50:50PM -0700, dan.j.williams@intel.com wrote:
> > Do you have some examples? I don't really see what complexity there is
> > if the solution is simply to not auto bind any drivers to TDISP
> > capable devices, and userspace is responsible for manually binding a
> > driver once it has reached T=1.
>
> The example I have front of mind (confirmed by 2 vendors) is deferring
> the loading of guest-side device/state security capable firmware to the
> guest driver when the full device is assigned. In that scenario the
> default device power-on firmware is capable of link/transport security,
> enough to get the device assigned. The guest needs to get the
> device/state security firmware loaded before TDISP state transitions are
> possible.

Yeah, those are the only cases I know of too, and IMHO they are just early
devices.

Clearly the clean answer is to put enough boot FW on the device's flash to
get to T=1 mode, then have the trusted OS driver load the operating
firmware from the trusted OS filesystem through the trusted bootloader T=1
device. You effectively attest the bootloader, and then if you trust the
bootloader you know that when the device gets to T=1 it can be trusted to
properly run the FW the trusted driver provides.

Think about this more broadly: does the prep FW load idea make sense for
something like SRIOV? No, it really doesn't. The hypervisor-loaded FW that
is running the PF should definitely be strong enough to get to T=1 on the
VM/VF side as well.

The non-SRIOV cases are quite often whole machine assignment scenarios. But
I'm sensing a lot of that space is moving toward bare metal machines
instead of VMs. I wonder if you can use all the CC machinery to attest and
secure a bare metal host?

> I do think RAS recovery needs it too, but like you say below that should
> come with conditions.

Especially RAS becomes simple because it basically follows the normal
flows that existed prior to TDISP, with the exception of needing some
attestation step.

I don't know a lot about CC attestation, but maybe we can have userspace
provide the kernel with the accepted measurement, and then for RAS the
kernel can FLR, remeasure, and if the measurement is exactly the same go
back into T=1 automatically as part of the PCI core FLR logic.

> I do think userspace can / must deal with it. Let me come back with
> actual patches and a sample test case. I see a potential path to support
> the above "prep" scenario without the mess of TDISP setup drivers, or
> the ugly complexity of driver toggles or a usermodehelper.

I don't see how; something nasty has to be done in the kernel to allow an
attached driver to switch between T=1 and T=0 "views" of the device and
lockstep those changes with userspace. This is not so simple and is really
basically exactly the same as driver binding.

I don't think we should be afraid of T=0 prep drivers in these early days.
Something more complex could come later if it is really warranted and
people really insist on continuing this unclean device design strategy.

> Yeah, that is the nightmare I had last night. I completed the thought
> exercise about the driver toggle and said, "whoops, nope, Jason is
> right, we can't design for that without leaving a permanent mess to
> clean up".
>
> The end goal needs to look like the straight-line, typical driver probe
> path for TDISP capable devices.
Yeah, maybe it is worthwhile to someday try to figure out an alternative -
keep in mind that, critically, this requires someone to also come with an
in-tree driver that will use all these new APIs and capabilities!!! So
let's get walking first, and then someone can come with some proposal,
complete with a driver implementing it, and it can be judged.

This project is already so big, and I'm pretty sure that if you start to
also need entirely new operating modes for drivers, the basics will just
get bogged down in that discussion, and very likely killed anyhow due to a
lack of user. Even if we decide that is preferred, it is better to separate
it and discuss it after the basics are merged.

At least where I sit, getting basic guest support is a big priority, so I
strongly want to strip it down to as minimal as possible to make consistent
progress steps.

> True. Although, now I am going back on my PCI core burden concern to
> wonder if *it* should handle a vBME on behalf of the driver, if only
> because it may want to force the device out of the RUN state on driver
> unbind to meet typical pci_disable_device() expectations.

Hiding some vBME in the PCI core might make sense if we can't get the VMM
owners to agree to do it on the hypervisor side. It works better on the VMM
side because there is always an IOMMU and the VMM can emulate BME by
blocking DMA with the IOMMU.

But I would not allow/expect kernel device drivers to have anything to do
with the TDISP states. Getting into RUN is fully sequenced by userspace;
getting out of RUN should also be sequenced only by userspace. Removing a
driver does not change the trust state of the PCI device, so it shouldn't
drop out of RUN. If userspace wishes to FLR the device after userspace
asked to unbind, it can; there are already sysfs controls for this IIRC.

Basically, all this says that Linux drivers that want to be used with T=1
should be well behaved, fully quiet all their DMA on remove, and have no
*functional* need for BME to do anything. We pretty much already expect
this of drivers today, so I don't see an issue with strongly requiring it
for T=1.

Keep in mind the flip side: almost no drivers are structured properly to
forcibly quiet any DMA before pci_enable_device(). Some HW, like mlx5,
can't do this at all without either using DMA to send a reset command or
going through FLR.

> > I would be comfortable if hitless RAS recovery for TDISP devices
> > requires some kernel opt-in. But also I'm not sure how this should
> > work from a security perspective. Should userspace also have to
> > re-attest before allowing the device back to RUN? Clearly this is
> > complicated.
> >
> > Also, I would be comfortable supporting this only for devices that do
> > not require pre-configuration.
>
> That seems reasonable. You want hitless RAS? Give us hitless init.

Yeah.. Realistically there are few drivers that can even do this today;
mlx5 for example has such code (and it is hard!). There is a lot of
investment required in the driver's core subsystem to make this work.
netdev and RDMA can support a 'rebirth' sort of flow where the driver can
disconnect the SW APIs, FLR the device, then reconnect in some way.

However, for example, I recently had a discussion with DRM guys about RAS
and they are not even doing the basic locking/etc. to be able to do this. :\

Jason
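[Jason's earlier remeasure-on-FLR idea can be sketched as below. Only
pci_reset_function() is a real kernel API here; TSM_MEAS_MAX and the tsm_*
helpers are hypothetical stand-ins for plumbing that does not exist yet:]

#include <linux/pci.h>
#include <linux/string.h>

#define TSM_MEAS_MAX 64		/* hypothetical bound on measurement size */

/* hypothetical helpers standing in for real TSM plumbing */
int tsm_collect_measurement(struct pci_dev *pdev, u8 *buf, size_t len);
int tsm_lock_and_run(struct pci_dev *pdev);

static int tsm_ras_recover(struct pci_dev *pdev,
			   const u8 *accepted_meas, size_t len)
{
	u8 meas[TSM_MEAS_MAX];
	int rc;

	if (len > sizeof(meas))
		return -EINVAL;

	rc = pci_reset_function(pdev);		/* FLR the device */
	if (rc)
		return rc;

	rc = tsm_collect_measurement(pdev, meas, len);
	if (rc)
		return rc;

	/* unchanged measurement: safe to go back to T=1 automatically */
	if (memcmp(meas, accepted_meas, len))
		return -EPERM;			/* userspace must re-attest */

	return tsm_lock_and_run(pdev);		/* hypothetical: LOCK, RUN */
}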
<dan.j.williams@intel.com> writes:

> Jason Gunthorpe wrote:
>> On Thu, Jul 31, 2025 at 07:07:17PM -0700, dan.j.williams@intel.com wrote:
>> > Aneesh Kumar K.V (Arm) wrote:
>> > > Host:
>> > > step 1.
>> > > echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
>> > > echo vfio-pci > /sys/bus/pci/devices/${DEVICE}/driver_override
>> > > echo ${DEVICE} > /sys/bus/pci/drivers_probe
>> > >
>> > > step 2.
>> > > echo 1 > /sys/bus/pci/devices/$DEVICE/tsm/connect
>> >
>> > Just for my own understanding... presumably there is no ordering
>> > constraint for ARM CCA between step 1 and step 2, right? I.e. the
>> > connect state is independent of the bind state.
>> >
>> > In the v4 PCI/TSM scheme the connect command is now:
>> >
>> > echo $tsm_dev > /sys/bus/pci/devices/$DEVICE/tsm/connect
>>
>> What does this do on the host? It seems to somehow prep it for VM
>> assignment? Seems pretty strange this is here in sysfs and not part of
>> creating the vPCI function in the VM through VFIO and iommufd?
>
> vPCI is out of the picture at this phase.
>
> On the host this establishes an SPDM session and sets up link encryption
> (IDE) with the physical device. Leave VMs out of the picture; this
> capability in isolation is a useful property. It addresses a similar
> threat model to the one Intel Total Memory Encryption (TME) or AMD
> Secure Memory Encryption (SME) go after, i.e. an interposer on a
> physical link capturing data in flight.
>
> With that established, one can then go further to do the full TDISP
> dance.
>
>> Frankly, I'm nervous about making any uAPI whatsoever for the
>> hypervisor side at this point. I don't think we have enough of the
>> solution even in draft format. I'd really like your first merged TSM
>> series to only have uAPI for the guest side where things are hopefully
>> closer to complete.
>
> Aligned. I am not comfortable merging any of this until we have that end
> to end reliably stable for a kernel cycle or 2. The proposal is to soak
> all the vendor solutions together in tsm.git#staging.
>
> Now, if the guest side graduates out of that staging before the host
> side, I am ok with that.
>
>> > > step 1:
>> > > echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
>> > >
>> > > step 2: Move the device to TDISP LOCK state
>> > > echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/lock
>> >
>> > Ok, so my stance has recently picked up some nuance here. As Jason
>> > mentions here:
>> >
>> > http://lore.kernel.org/20250410235008.GC63245@ziepe.ca
>> >
>> > "However it works, it should be done before the driver is probed and
>> > remain stable for the duration of the driver attachment. From the
>> > iommu side the correct iommu domain, on the correct IOMMU instance to
>> > handle the expected traffic should be setup as the DMA API's iommu
>> > domain."
>>
>> I think it is not just the DMA API; the MMIO registers may also move
>> location (from shared to protected IPA space, for example), meaning any
>> attached driver is completely wrecked.
>
> True.
>
>> > I agree with that up until the point where the implication is userspace
>> > control of the UNLOCKED->LOCKED transition. That transition requires
>> > enabling bus-mastering (BME),
>>
>> Why? That's sad. BME should be controlled by the VM driver, not the
>> TSM, and it should be set only when a VM driver is probed against the
>> RUN-state device?
>
> To me it is an unfortunate PCI specification wrinkle that writing to the
> command register drops the device from RUN to ERROR. So you can LOCK
> without setting BME, but then no DMA.
This is only w.r.t. clearing BME, isn't it?

According to section 11.2.6 "DSM Tracking and Handling of Locked TDI
Configurations":

	Clearing any of the following bits causes the TDI hosted
	by the Function to transition to ERROR:

	• Memory Space Enable
	• Bus Master Enable

Which implies the flow described in the cover letter, where the driver
enables BME, works? However, clearing BME may be problematic. I did have
a FIXME!!/comment about this in [1]:

vfio_pci_core_close_device():

#if 0
	/*
	 * Destroy the vdevice, which involves TSM unbind, before we do the
	 * PCI disable. An MSE/BME clear will transition the device to the
	 * error state.
	 */
	if (core_vdev->iommufd_device)
		iommufd_device_tombstone_vdevice(core_vdev->iommufd_device);
#endif

	vfio_pci_core_disable(vdev);

Currently, we destroy (TSM unbind) the vdevice after calling
vfio_pci_core_disable(), which means BME is cleared before unbinding,
and the TDI transitions to the ERROR state.

[1] https://lore.kernel.org/all/20250728135216.48084-9-aneesh.kumar@kernel.org/

-aneesh
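A minimal sketch of the reordered teardown the FIXME points at; this
just follows the #if 0 fragment above rather than any merged code, and
assumes the iommufd_device_tombstone_vdevice() helper from [1]:

	#include <linux/vfio_pci_core.h>

	static void vfio_pci_core_close_device_sketch(struct vfio_device *core_vdev)
	{
		struct vfio_pci_core_device *vdev =
			container_of(core_vdev, struct vfio_pci_core_device, vdev);

		/*
		 * Destroy the vdevice (TSM unbind) first, while MSE/BME are
		 * still set, so the clear in vfio_pci_core_disable() hits an
		 * already-unbound TDI instead of kicking a bound one to ERROR.
		 */
		if (core_vdev->iommufd_device)
			iommufd_device_tombstone_vdevice(core_vdev->iommufd_device);

		vfio_pci_core_disable(vdev);
	}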
On Tue, Aug 05, 2025 at 10:37:01AM +0530, Aneesh Kumar K.V wrote:
> > To me it is an unfortunate PCI specification wrinkle that writing to the
> > command register drops the device from RUN to ERROR. So you can LOCK
> > without setting BME, but then no DMA.
>
> This is only w.r.t. clearing BME, isn't it?
>
> According to section 11.2.6 "DSM Tracking and Handling of Locked TDI
> Configurations":
>
> 	Clearing any of the following bits causes the TDI hosted
> 	by the Function to transition to ERROR:
>
> 	• Memory Space Enable
> 	• Bus Master Enable

Oh that's nice, yeah!

> Which implies the flow described in the cover letter, where the driver
> enables BME, works? However, clearing BME may be problematic. I did have
> a FIXME!!/comment about this in [1]:
>
> vfio_pci_core_close_device():
>
> #if 0
> 	/*
> 	 * Destroy the vdevice, which involves TSM unbind, before we do the
> 	 * PCI disable. An MSE/BME clear will transition the device to the
> 	 * error state.
> 	 */
> 	if (core_vdev->iommufd_device)
> 		iommufd_device_tombstone_vdevice(core_vdev->iommufd_device);
> #endif
>
> 	vfio_pci_core_disable(vdev);

Here is where I feel the VMM should be trapping this and NOPing it, or
failing that the guest PCI core should NOP it.

With the ideal version being that the TSM and VMM would be able to block
the IOMMU as a functional stand-in for BME.

> Currently, we destroy (TSM unbind) the vdevice after calling
> vfio_pci_core_disable(), which means BME is cleared before unbinding,
> and the TDI transitions to the ERROR state.

I don't think this ordering is deliberate; we can destroy the vdevice
much earlier??

Jason
Jason Gunthorpe wrote:
> On Tue, Aug 05, 2025 at 10:37:01AM +0530, Aneesh Kumar K.V wrote:
> > > To me it is an unfortunate PCI specification wrinkle that writing to the
> > > command register drops the device from RUN to ERROR. So you can LOCK
> > > without setting BME, but then no DMA.
> >
> > This is only w.r.t. clearing BME, isn't it?
> >
> > According to section 11.2.6 "DSM Tracking and Handling of Locked TDI
> > Configurations":
> >
> > 	Clearing any of the following bits causes the TDI hosted
> > 	by the Function to transition to ERROR:
> >
> > 	• Memory Space Enable
> > 	• Bus Master Enable
>
> Oh that's nice, yeah!

That is useful, but an unmodified PCI driver is going to make separate
calls to pci_set_master() and pci_enable_device(), so it should still be
the case that those need to be trapped, out of the concern that writing
back zero in a read-modify-write also trips the error state on some
device that fails the Robustness Principle.

I guess we could wait to solve that problem until we encounter the first
device that trips ERROR when writing zero to an already zeroed bit.

> > Which implies the flow described in the cover letter, where the driver
> > enables BME, works? However, clearing BME may be problematic. I did have
> > a FIXME!!/comment about this in [1]:
> >
> > vfio_pci_core_close_device():
> >
> > #if 0
> > 	/*
> > 	 * Destroy the vdevice, which involves TSM unbind, before we do the
> > 	 * PCI disable. An MSE/BME clear will transition the device to the
> > 	 * error state.
> > 	 */
> > 	if (core_vdev->iommufd_device)
> > 		iommufd_device_tombstone_vdevice(core_vdev->iommufd_device);
> > #endif
> >
> > 	vfio_pci_core_disable(vdev);
>
> Here is where I feel the VMM should be trapping this and NOPing it, or
> failing that the guest PCI core should NOP it.

At this point (vfio shutdown path) the VMM is committed to stopping
guest operations with the device. So it is ok not to NOP in this
specific path, right?

> With the ideal version being that the TSM and VMM would be able to block
> the IOMMU as a functional stand-in for BME.

The TSM block for BME is the LOCKED or ERROR state. That would be in
conflict with the proposal that the device stays in the RUN state on
guest driver unbind.

I feel like either the device stays in RUN state and BME leaks, or the
device is returned to LOCKED on driver unbind. Otherwise a functional
stand-in for BME that also keeps the device in RUN state feels like a
TSM feature request for a "RUN but BLOCKED" state.
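A sketch of what that trap could look like in a guest PCI core, assuming
a hypothetical pci_dev_is_tdisp_locked() predicate; the config accessors
and PCI_COMMAND bits are existing kernel APIs:

	#include <linux/pci.h>

	bool pci_dev_is_tdisp_locked(struct pci_dev *pdev);	/* assumed */

	/*
	 * Swallow command register writes that leave MSE/BME unchanged,
	 * so a read-modify-write of an already-zero bit cannot trip a
	 * fragile device into the ERROR state.
	 */
	static int pci_tdisp_filter_command_write(struct pci_dev *pdev, u16 new_cmd)
	{
		u16 cur;

		if (!pci_dev_is_tdisp_locked(pdev))
			return pci_write_config_word(pdev, PCI_COMMAND, new_cmd);

		pci_read_config_word(pdev, PCI_COMMAND, &cur);

		/* NOP: nothing in MSE/BME actually changes */
		if (!((cur ^ new_cmd) & (PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER)))
			return 0;

		/* A real MSE/BME change is the policy question discussed above */
		return pci_write_config_word(pdev, PCI_COMMAND, new_cmd);
	}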
On Tue, Aug 05, 2025 at 11:27:36AM -0700, dan.j.williams@intel.com wrote:
> > > Clearing any of the following bits causes the TDI hosted
> > > by the Function to transition to ERROR:
> > >
> > > • Memory Space Enable
> > > • Bus Master Enable
> >
> > Oh that's nice, yeah!
>
> That is useful, but an unmodified PCI driver is going to make separate
> calls to pci_set_master() and pci_enable_device(), so it should still be
> the case that those need to be trapped, out of the concern that writing
> back zero in a read-modify-write also trips the error state on some
> device that fails the Robustness Principle.

I hope we don't RMW BME and MSE in some weird way like that :(

> > Here is where I feel the VMM should be trapping this and NOPing it, or
> > failing that the guest PCI core should NOP it.
>
> At this point (vfio shutdown path) the VMM is committed to stopping
> guest operations with the device. So it is ok not to NOP in this
> specific path, right?

What I said in my other mail was that the T=1 state should have nothing
to do with driver binding. So unbinding vfio should leave the device in
the RUN state just fine.

> > With the ideal version being that the TSM and VMM would be able to block
> > the IOMMU as a functional stand-in for BME.
>
> The TSM block for BME is the LOCKED or ERROR state. That would be in
> conflict with the proposal that the device stays in the RUN state on
> guest driver unbind.

This is a different thing. Leaving RUN says the OS (especially
userspace) does not trust the device.

Disabling DMA, on explicit trusted request from the cVM, is entirely
fine to do inside the T=1 state. PCI made it so the only way to do this
is with the IOMMU; oh well, so be it.

> I feel like either the device stays in RUN state and BME leaks, or the
> device is returned to LOCKED on driver unbind.

Stay in RUN is my vote. I can't really defend the other choice from a
Linux driver model perspective.

> Otherwise a functional stand-in for BME that also keeps the device
> in RUN state feels like a TSM feature request for a "RUN but
> BLOCKED" state.

Yes, and probably not necessary; more of a defence-in-depth against bugs
kind of request. For Linux we would like it if the device can be in RUN
and have DMA blocked off during all times when no driver is attached.

Jason
Jason Gunthorpe wrote:
> On Tue, Aug 05, 2025 at 11:27:36AM -0700, dan.j.williams@intel.com wrote:
> > > > Clearing any of the following bits causes the TDI hosted
> > > > by the Function to transition to ERROR:
> > > >
> > > > • Memory Space Enable
> > > > • Bus Master Enable
> > >
> > > Oh that's nice, yeah!
> >
> > That is useful, but an unmodified PCI driver is going to make separate
> > calls to pci_set_master() and pci_enable_device(), so it should still be
> > the case that those need to be trapped, out of the concern that writing
> > back zero in a read-modify-write also trips the error state on some
> > device that fails the Robustness Principle.
>
> I hope we don't RMW BME and MSE in some weird way like that :(

Yeah, I would like to say, "device, you get to keep the pieces if you
transition to ERROR state on re-writing an already zeroed bit."

> > > Here is where I feel the VMM should be trapping this and NOPing it, or
> > > failing that the guest PCI core should NOP it.
> >
> > At this point (vfio shutdown path) the VMM is committed to stopping
> > guest operations with the device. So it is ok not to NOP in this
> > specific path, right?
>
> What I said in my other mail was that the T=1 state should have nothing
> to do with driver binding.

Guest driver unbind, agree.

> So unbinding vfio should leave the device in the RUN state just fine.

Perhaps my vfio inexperience is showing, but at the point where the VMM
is unbinding vfio it is committed to destroying the guest's assigned
device context, no? So should that not be the point where continuing to
maintain the RUN state ends?

> > > With the ideal version being that the TSM and VMM would be able to block
> > > the IOMMU as a functional stand-in for BME.
> >
> > The TSM block for BME is the LOCKED or ERROR state. That would be in
> > conflict with the proposal that the device stays in the RUN state on
> > guest driver unbind.
>
> This is a different thing. Leaving RUN says the OS (especially
> userspace) does not trust the device.
>
> Disabling DMA, on explicit trusted request from the cVM, is entirely
> fine to do inside the T=1 state. PCI made it so the only way to do this
> is with the IOMMU; oh well, so be it.
>
> > I feel like either the device stays in RUN state and BME leaks, or the
> > device is returned to LOCKED on driver unbind.
>
> Stay in RUN is my vote. I can't really defend the other choice from a
> Linux driver model perspective.
>
> > Otherwise a functional stand-in for BME that also keeps the device
> > in RUN state feels like a TSM feature request for a "RUN but
> > BLOCKED" state.
>
> Yes, and probably not necessary; more of a defence-in-depth against bugs
> kind of request. For Linux we would like it if the device can be in RUN
> and have DMA blocked off during all times when no driver is attached.

Ok, defense in depth, but in the meantime rely on unbound driver == DMA
unmapped and device should be quiescent. Combine that with the fact that
userspace PCI drivers should be disabled in cVMs, and the guest can
expect that an unbound TDI in the RUN state will remain quiet.
On Tue, Aug 05, 2025 at 12:06:11PM -0700, dan.j.williams@intel.com wrote:
> > So unbinding vfio should leave the device in the RUN state just fine.
>
> Perhaps my vfio inexperience is showing, but at the point where the VMM
> is unbinding vfio it is committed to destroying the guest's assigned
> device context, no? So should that not be the point where continuing to
> maintain the RUN state ends?

Oh, sorry, it gets so confusing..

VFIO *in the guest* should behave as above; like any other driver,
unbind leaves it in RUN.

VFIO *in the host* should leave the RUN state at the soonest of:

 - cVM's KVM is destroyed
 - iommufd vdevice is destroyed
 - vfio device is closed

And maybe more cases I didn't think of..

BME should happen strictly after all of the above and should not be the
trigger that drops it out of RUN.

> > Yes, and probably not necessary; more of a defence-in-depth against bugs
> > kind of request. For Linux we would like it if the device can be in RUN
> > and have DMA blocked off during all times when no driver is attached.
>
> Ok, defense in depth, but in the meantime rely on unbound driver == DMA
> unmapped and device should be quiescent. Combine that with the fact that
> userspace PCI drivers should be disabled in cVMs, and the guest can
> expect that an unbound TDI in the RUN state will remain quiet.

"userspace PCI drivers" is VFIO in the guest, which means you get FLRs
to fence the DMA.

If we end up where I suggested earlier for RAS, that a FLR can check the
attestation and, if it matches exactly, re-accept automatically, then it
would maintain the 'once accepted we stay in T=1 RUN state' idea.

Jason
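For concreteness, a minimal sketch of that "soonest of" rule on the host
side, assuming a tsm_unbind() entry point along the lines of the
tsm_bind/unbind helpers and an invented TDI_UNBOUND flag; the three
callers listed are the teardown events, not literal functions:

	#include <linux/bitops.h>

	#define TDI_UNBOUND 0			/* assumed flag bit */

	struct pci_tdi;				/* from the PCI/TSM series */
	void tsm_unbind(struct pci_tdi *tdi);	/* assumed prototype */

	struct tdi_state {
		struct pci_tdi *tdi;
		unsigned long flags;
	};

	/*
	 * Idempotent exit from RUN: whichever event fires first (cVM's KVM
	 * destroyed, iommufd vdevice destroyed, vfio device closed) does
	 * the unbind; later callers find the bit already set and return.
	 */
	static void tdi_leave_run(struct tdi_state *s)
	{
		if (test_and_set_bit(TDI_UNBOUND, &s->flags))
			return;

		/* Unbind strictly before any BME/MSE clear in PCI teardown */
		tsm_unbind(s->tdi);
	}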
<dan.j.williams@intel.com> writes:

> Aneesh Kumar K.V (Arm) wrote:
>> This patch series implements support for Device Assignment in the ARM CCA
>> architecture. The code changes are based on Alp12 specification published here
>> [1].
>>
>> The code builds on the TSM framework patches posted at [2]. We add extension to
>> that framework so that TSM is now used in both the host and the guest.
>>
>> A DA workflow can be summarized as below:
>>
>> Host:
>> step 1.
>> echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
>> echo vfio-pci > /sys/bus/pci/devices/${DEVICE}/driver_override
>> echo ${DEVICE} > /sys/bus/pci/drivers_probe
>>
>> step 2.
>> echo 1 > /sys/bus/pci/devices/$DEVICE/tsm/connect
>
> Just for my own understanding... presumably there is no ordering
> constraint for ARM CCA between step 1 and step 2, right? I.e. the
> connect state is independent of the bind state.
>
> In the v4 PCI/TSM scheme the connect command is now:
>
> echo $tsm_dev > /sys/bus/pci/devices/$DEVICE/tsm/connect
>
>> Now in the guest we follow the below steps
>
> I assume a significant amount of kvmtool magic happens here to get the
> TDI into a "bind capable" state, can you share that command?
>

lkvm run --realm -c 2 -m 256 -k /kselftest/Image -p "$KERNEL_PARAMS" \
	-d ./rootfs-guest.ext2 --iommufd-vdevice \
	--vfio-pci $DEVICE1 --vfio-pci $DEVICE2

> I had been assuming that everyone was prototyping with QEMU. Not a
> problem per se, but the memory management for shared device assignment /
> bounce buffering has had quite a bit of work on the QEMU side, so just
> curious about the difference in approach here. Like, does kvmtool
> support operating the device in shared mode with bounce buffering and
> page conversion (shared <=> private) support? In any event, happy to see
> multiple simultaneous consumers of this new kernel infrastructure.

-aneesh